Semantic Evidence for Automatic Identification of Cognates
Cited by 1 (0 self)
The identification of cognate word pairs has recently started to attract the attention of NLP research, but it is still a rather unexplored area requiring more focused attention. This paper builds on a purely orthographic approach to this task by introducing semantic evidence in the form of monolingual thesauri and corpora to support the identification process. The proposed method is easily portable between languages and specialisation domains, since it does not depend on the availability of parallel texts or extensive knowledge resources, requiring only monolingual corpora and a bilingual dictionary encoding correspondences only for the core vocabularies of both languages. Our evaluation of the method on four different language pairs suggests that the introduction of semantic evidence in cognate detection helps to substantially increase the precision of cognate identification.
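The abstract above does not say which orthographic measure the paper builds on. As a minimal sketch only, assuming the longest common subsequence ratio (LCSR), a measure commonly used for cognate candidates, the orthographic step could look like this. The function names `lcs_length` and `lcsr` are hypothetical and not taken from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ended before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either word.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCSR: LCS length divided by the length of the longer word (assumed measure)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # English "night" vs. German "nacht" share the subsequence n-h-t.
    print(lcsr("night", "nacht"))
```

A score near 1 only marks a pair as an orthographic candidate; the semantic evidence described in the abstract is what is meant to filter out look-alike pairs with unrelated meanings.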
Constraint Driven Transliteration Discovery
This paper introduces a novel constraint-driven learning framework for identifying named-entity (NE) transliterations. Traditional approaches to the problem of discovering transliterations depend heavily on correctly segmenting the target and the transliteration candidate and on aligning these segments. In this work we propose to formulate the process of aligning segments as a constrained optimization problem. We consider the aligned segments as a latent feature representation and show how to infer an optimal latent representation and how to use it in order to learn an improved discriminative transliteration classifier. Our algorithm is an EM-like iterative algorithm that alternates between an optimization step for the latent representation and a learning step for the classifier’s parameters. We apply this method both in supervised and unsupervised settings, and show that our model can significantly outperform previous methods trained using considerably more resources.
Improving Named Entity Translation by Exploiting Comparable and Parallel Corpora
Translation of named entities (NEs), such as person, organization, country, and location names, is very important for several natural language processing applications. It plays a vital role in applications like cross-lingual information retrieval and machine translation. Web and news documents introduce new named entities on a regular basis. Those new names cannot be captured by ordinary machine translation systems. In this paper, we introduce a framework for extracting named entity translation pairs. The framework contains methods for exploiting both comparable and parallel corpora to generate a regularly updated list of named entity translation pairs. We evaluate the quality of the extracted translation pairs by showing that they improve the performance of a named entity translation system.
NEWS 2009 Machine Transliteration Shared Task System Description: Transliteration with Letter-to-Phoneme Technology
We interpret the problem of transliterating English named entities into Hindi or Japanese Katakana as a variant of the letter-to-phoneme (L2P) subtask of text-to-speech processing. Therefore, we apply a re-implementation of a state-of-the-art, discriminative L2P system (Jiampojamarn et al., 2008) to the problem, without further modification. In doing so, we hope to provide a baseline for the NEWS 2009 Machine Transliteration Shared Task (Li et al., 2009), indicating how much can be achieved without transliteration-specific technology. This paper briefly summarizes the original work and our re-implementation. We also describe a bug in our submitted implementation, and provide updated results on the development and test sets.
Discriminative Methods for Transliteration
We present two discriminative methods for name transliteration. The methods correspond to local and global modeling approaches in modeling structured output spaces. Neither method requires alignment of names in different languages; their features are computed directly from the names themselves. We perform an experimental evaluation of the methods for name transliteration from three languages (Arabic, Korean, and Russian) into English, and compare the methods experimentally to a state-of-the-art joint probabilistic modeling approach. We find that the discriminative methods outperform probabilistic modeling, with the global discriminative modeling approach achieving the best performance in all languages.
Cross-lingual Slot Filling from Comparable Corpora
This paper introduces a new task of cross-lingual slot filling, which aims to discover attributes for entity queries from cross-lingual comparable corpora and then present answers in a desired language. It is a very challenging task which suffers from both information extraction and machine translation errors. In this paper we analyze the types of errors produced by five different baseline approaches, and present a novel supervised rescoring-based validation approach to incorporate global evidence from very large bilingual comparable corpora. Without using any additional labeled data, this new approach obtained a 38.5% relative improvement in precision and an 86.7% relative improvement in recall over several state-of-the-art approaches. The final system outperformed monolingual slot filling pipelines built on much larger monolingual corpora.
Learning Bilingual Lexicons Using the Visual Similarity of Labeled Web Images (Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence)
Speakers of many different languages use the Internet. A common activity among these users is uploading images and associating these images with words (in their own language) as captions, filenames, or surrounding text. We use these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations. Bilingual pairs of words are proposed as translations if their corresponding images have similar visual features. We generate bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects. The use of visual similarity substantially improves performance over standard approaches based on string similarity: for generated lexicons with 1000 translations, including visual information leads to an absolute improvement in accuracy of 8-12% over string edit distance alone.
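The abstract names string edit distance as the baseline but gives no further detail. A minimal sketch, assuming plain Levenshtein distance with unit costs and a simple ranking of candidate translations by that distance, is shown here. The helper names `edit_distance` and `rank_candidates` and the example words are illustrative assumptions, not the paper's implementation or data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def rank_candidates(source: str, candidates: list[str]) -> list[str]:
    """Order candidate translations by edit distance to the source word.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    return sorted(candidates, key=lambda w: (edit_distance(source, w), w))


if __name__ == "__main__":
    # Hypothetical example: English "tomato" against three candidate words.
    print(rank_candidates("tomato", ["banana", "tomaat", "tomate"]))
```

A baseline of this kind can only succeed for word pairs that are spelled alike, which is the limitation the visual-similarity signal in the abstract is meant to address.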
Improved Transliteration Mining Using Graph Reinforcement
Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross-language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at the character sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in four alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs, respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs, respectively.

