Results 1 - 10 of 67
Matching words and pictures
- Journal of Machine Learning Research, 2003
Cited by 391 (33 self)
We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several which explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann’s hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to mixture of latent Dirichlet allocation
A Phrase-Based, Joint Probability Model for Statistical Machine Translation
- In Proceedings of EMNLP, 2002
Cited by 135 (2 self)
We present a joint probability model for statistical machine translation, which automatically learns word and phrase equivalents from bilingual corpora. Translations produced with parameters estimated using the joint model are more accurate than translations produced using IBM Model 4.
Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources
- In Proceedings of the 20th International Conference on Computational Linguistics, 2004
Cited by 89 (1 self)
We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, with AERs of 13.2% and 14.7% respectively. Analysis of 100 pairs of sentences from each set reveals that the edit distance data lacks many of the complex lexical and syntactic alternations that characterize monolingual paraphrase. The summary sentences, while less readily alignable, retain more of the non-trivial alternations that are of greatest interest in learning paraphrase relationships.
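As a rough illustration of the first technique named in the abstract above, the sketch below pairs sentences whose edit distance is small but non-zero. It is a minimal sketch, not the authors' procedure: the abstract does not say whether distance is counted over characters or words, nor what cutoff was used, so the word-level tokenization, the `max_distance=3` threshold, and both function names are assumptions made for the example.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sentences, counted in word tokens.

    Assumption: lowercased whitespace tokenization stands in for whatever
    preprocessing a real system would apply.
    """
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def candidate_pairs(sentences, max_distance=3):
    """Return (s, t, distance) for sentence pairs that differ by
    at least one and at most max_distance word edits (hypothetical cutoff)."""
    pairs = []
    for i, s in enumerate(sentences):
        for t in sentences[i + 1:]:
            d = word_edit_distance(s, t)
            if 0 < d <= max_distance:
                pairs.append((s, t, d))
    return pairs
```

Identical sentences are skipped because they carry no paraphrase information. The quadratic pairwise loop is only practical within a single cluster of related articles, which matches the clustered-news setting the abstract describes.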
Bitext Maps and Alignment via Pattern Recognition
- Computational Linguistics, 1999
Cited by 68 (0 self)
This article advances the state of the art of bitext mapping by formulating the problem in terms of pattern recognition. From this point of view, the success of a bitext mapping algorithm hinges on how well it performs three tasks: signal generation, noise filtering, and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR's accuracy is consistently high for language pairs as diverse as French/English and Korean/English. If necessary, SIMR's bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here. SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.
Statistical Machine Translation
- Final Report, JHU Summer Workshop, 1999
Cited by 67 (9 self)
Automatic translation from one human language to another using computers, better known as machine translation (MT), is a longstanding goal of computer science. In order to be able to perform such a task, the computer must "know" the two languages---synonyms for words and phrases, grammars of the two languages, and semantic or world knowledge. One way to incorporate such knowledge into a computer is to use bilingual experts to hand-craft the necessary information into the computer program. Another is to let the computer learn some of these things automatically by examining large amounts of parallel text: documents which are translations of each other. The Canadian government produces one such resource, for example, in the form of parliamentary proceedings which are recorded in both English and French. Recently, statistical data analysis has been used to gather MT knowledge automatically from parallel bilingual text. Unfortunately, these techniques and tools have not been dissem...
Monolingual machine translation for paraphrase generation
- In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004
Cited by 51 (4 self)
We apply statistical machine translation (SMT) tools to generate novel paraphrases of input sentences in the same language. The system is trained on large volumes of sentence pairs automatically extracted from clustered news articles available on the World Wide Web. Alignment Error Rate (AER) is measured to gauge the quality of the resulting corpus. A monotone phrasal decoder generates contextual replacements. Human evaluation shows that this system outperforms baseline paraphrase generation techniques and, in a departure from previous work, offers better coverage and scalability than the current best-of-breed paraphrasing approaches.
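The abstract above reports Alignment Error Rate without defining it. The sketch below uses the definition commonly given in the word-alignment literature, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the gold "sure" links, and P the gold "possible" links (with S contained in P). Whether the listed papers follow exactly this variant is an assumption; the function name and the tuple representation of links are illustrative only.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """AER for one set of word-alignment links.

    Each link is a (source_index, target_index) tuple. `possible` may be
    omitted, in which case only the sure links count as acceptable.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible or ()) | s  # sure links are always possible links
    if not a and not s:
        return 0.0               # nothing predicted, nothing expected
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

Lower is better: a prediction that reproduces the sure links exactly scores 0, and a prediction sharing no link with the gold standard scores 1.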
Exploiting Parallel Texts for Word Sense Disambiguation: An Empirical Study
- In Proceedings of ACL03, 2003
Cited by 45 (5 self)
A central problem of word sense disambiguation (WSD) is the lack of manually sense-tagged data required for supervised learning. In this paper, we evaluate an approach to automatically acquire sense-tagged training data from English-Chinese parallel corpora, which are then used for disambiguating the nouns in the SENSEVAL-2 English lexical sample task. Our investigation reveals that this method of acquiring sense-tagged data is promising. On a subset of the most difficult SENSEVAL-2 nouns, the accuracy difference between the two approaches is only 14.0%, and the difference could narrow further to 6.5% if we disregard the advantage that manually sense-tagged data have in their sense coverage. Our analysis also highlights the importance of the issue of domain dependence in evaluating WSD programs.
The effects of segmentation and feature choice in a translation model of object recognition
- In IEEE Conf. on Computer Vision and Pattern Recognition, 2003
Cited by 27 (6 self)
We work with a model of object recognition where words must be placed on image regions. This approach means that large scale experiments are relatively easy, so we can evaluate the effects of various early and mid-level vision algorithms on recognition performance. We evaluate various image segmentation algorithms by determining word prediction accuracy for images segmented in various ways and represented by various features. We take the view that good segmentations respect object boundaries, and so word prediction should be better for a better segmentation. However, it is usually very difficult in practice to obtain segmentations that do not break up objects, so most practitioners attempt to merge segments to get better putative object representations. We demonstrate that our paradigm of word prediction easily allows us to predict potentially useful segment merges, even for segments that do not look similar (for example, merging the black and white ...
Figure 1. Illustration of labeling. Each region is labeled with the maximally probable word, but a probability distribution over all words is available for each region.
The effect of bilingual term list size on dictionary-based cross-language information retrieval
- 2003
Cited by 18 (6 self)
Bilingual term lists are extensively used as a resource for dictionary-based Cross-Language Information Retrieval (CLIR), in which the goal is to find documents written in one natural language based on queries that are expressed in another. This paper identifies eight types of terms that affect retrieval effectiveness in CLIR applications through their coverage by general-purpose bilingual term lists, and reports results from an experimental evaluation of the coverage of 35 bilingual term lists in a news retrieval application. Retrieval effectiveness was found to be strongly influenced by term list size for lists that contain between 3,000 and 30,000 unique terms per language. Supplemental techniques for named entity translation were found to be useful with even the largest lexicons. The contribution of named entity translation was evaluated in a cross-language experiment involving English and Chinese. Smaller effects were observed from deficiencies in the coverage of domain-specific terminology when searching news stories.
Revealing translators' knowledge: statistical methods in constructing practical translation lexicons for language and speech processing
- In International Journal of Speech Technology, 2002
Cited by 17 (7 self)
Parallel corpora encode extremely valuable linguistic knowledge about paired languages, both in terms of vocabulary and syntax. A professional translation of a text represents a series of linguistic decisions made by the translator in order to convey as faithfully as possible the meaning of the original text and to produce a “natural” text from the perspective of a native speaker of the target language. The “naturalness” of a translation implies not only the grammaticality of the translated text, but also style and cultural or social specificity. We describe a program that exploits the knowledge embedded in the parallel corpora and produces a set of translation equivalents (a translation lexicon). The program uses almost no linguistic knowledge, relying on statistical evidence and some simplifying assumptions. Our experiments were conducted on the MULTEXT-EAST multilingual parallel corpus (Orwell’s “1984”), and the evaluation of the system performance is presented in some detail in terms of precision, recall and processing time. We conclude by briefly mentioning some applications of the automatically extracted lexicons for text and speech processing. Keywords: alignment, bitext, lemmatization, tagging, translation lexicon

