Multipath translation lexicon induction via bridge languages (2001)

by G Mann, D Yarowsky

Results 11 - 20 of 27

Automatic Prediction of Cognate Orthography Using Support Vector Machines

by Andrea Mulloni
Cited by 2 (0 self)
This paper describes an algorithm to automatically generate a list of cognates in a target language by means of Support Vector Machines. While Levenshtein distance was used to align the training file, no knowledge repository other than an initial list of cognates used for training purposes was input into the algorithm. Evaluation was set up in a cognate production scenario which mimicked a real-life situation where no word lists were available in the target language, delivering the ideal environment to test the feasibility of a more ambitious project that will involve language portability. An overall improvement of 50.58% over the baseline showed promising horizons.

Cognate or false friend? Ask the Web

by Svetlin Nakov, Preslav Nakov - In Proceedings of the RANLP’2007 workshop: Acquisition and management of multilingual lexicons, 2007
Cited by 2 (2 self)
We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic “bridges”, and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.

Word-based dialect identification with georeferenced rules

by Yves Scherrer, Owen Rambow
Cited by 2 (1 self)
We present a novel approach for (written) dialect identification based on the discriminative potential of entire words. We generate Swiss German dialect words from a Standard German lexicon with the help of hand-crafted phonetic/graphemic rules that are associated with occurrence maps extracted from a linguistic atlas created through extensive empirical fieldwork. In comparison with a character n-gram approach to dialect identification, our model is more robust to individual spelling differences, which are frequently encountered in non-standardized dialect writing. Moreover, it covers the whole Swiss German dialect continuum, which trained models struggle to achieve due to sparsity of training data.

Multilingual Cognate Identification using Integer Linear Programming

by Shane Bergsma, Grzegorz Kondrak
Cited by 1 (0 self)
The identification of cognates in natural languages is a crucial part of automatic translation lexicon construction and other multilingual lexical tasks. We present new methods for multilingual cognate identification using the global inference framework of Integer Linear Programming. While previous approaches to cognate identification have focused on pairs of natural languages, we provide a methodology that directly forms sets of cognates across groups of languages. We show improvements over simple clustering techniques that do not inherently consider the transitivity of cognate relations. Furthermore, we show that formulations that jointly link cognates across groups of natural languages achieve higher performance than traditional pairwise approaches. We also describe applications of our technique to other important problems in multilingual natural language processing.

Finding Cognate Groups using Phylogenies

by David Hall, Dan Klein
Cited by 1 (1 self)
A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.

Semantic Evidence for Automatic Identification of Cognates

by Andrea Mulloni, Viktor Pekar
Cited by 1 (0 self)
The identification of cognate word pairs has recently started to attract the attention of NLP research, but it is still a rather unexplored area requiring more focused attention. This paper builds on a purely orthographic approach to this task by introducing semantic evidence in the form of monolingual thesauri and corpora to support the identification process. The proposed method is easily portable between languages and specialisation domains, since it does not depend on the availability of parallel texts or extensive knowledge resources, requiring only monolingual corpora and a bilingual dictionary encoding correspondences only between the core vocabularies of both languages. Our evaluation of the method on four different language pairs suggests that the introduction of semantic evidence in cognate detection helps to substantially increase the precision of cognate identification.

Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL),

by Dayne Freitag, Matthias Blume, John Byrnes, Edmond Chow, Sadik Kapadia, Richard Rohwer, Zhiqiang Wang - In Proceedings of CoNLL2005, 2005
Recent work on the problem of detecting synonymy through corpus analysis has used the Test of English as a Foreign Language (TOEFL) as a benchmark. However, this test involves as few as 80 questions, prompting questions regarding the statistical significance of reported results.

Adaptive String Distance Measures for Bilingual Dialect Lexicon Induction

by Yves Scherrer
This paper compares different measures of graphemic similarity applied to the task of bilingual lexicon induction between a Swiss German dialect and Standard German. The measures have been adapted to this particular language pair by training stochastic transducers with the Expectation-Maximisation algorithm or by using handmade transduction rules. These adaptive metrics show up to 11% F-measure improvement over a static metric like Levenshtein distance.
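
For reference, the static baseline named in this abstract can be sketched as plain unit-cost Levenshtein distance. This is only an illustration of the baseline, not the paper's adaptive transducer-based metrics. The function names and the length normalisation are assumptions introduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between two strings."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical helper: map the distance to [0, 1] by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("night", "nacht"))  # 2: i->a and g->c
```

A static metric of this kind charges every substitution the same cost, whereas an adaptive metric, as the abstract describes, learns or hand-specifies costs for a particular language pair.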

Building Strong Multilingual Aligned Corpora

by Reza Bosagh Zadeh
Recent advances have allowed algorithms that learn from aligned natural language texts to exploit aligned sentences in more than two languages. We investigate ways of combining $\binom{N}{2}$ bilingual aligned corpora together to create a multilingual aligned corpus across N languages. As a result of the combination of several corpora, our algorithms output a multilingual corpus, with each aligned tuple assigned a quality score called ‘strength’ that may be used when learning from the multilingual corpus. We show that the addition of bilingual corpora used with alignment strengths can significantly improve Statistical Machine Translation quality on an Arabic→English task.

String Similarity Measures and PAM-like Matrices for Cognate Identification

by Antonella Delmestri, Nello Cristianini
We present a new automatic learning system for the identification of cognates, words that derive from a common ancestor and share the same etymological origin. Our approach combines and adapts several techniques developed for biological sequence analysis to the natural language processing environment. We design a linguistically inspired matrix to sensibly align our training dataset. We introduce a PAM-like technique, similar to the one successfully used in biological sequence alignment, in order to produce substitution matrices. We propose a novel family of parameterised string similarity measures and we apply them together with the PAM-like matrices to the task of cognate identification. We develop and test our proposal on standard datasets of Indo-European languages in orthographic format based on the Latin alphabet, but it could easily be adjusted to datasets using any other alphabet, including the phonetic alphabet if data in phonetic transcription were available. We compare our system with other models reported in the literature, and the results show that our method outperforms, in terms of precision, both the orthographic and the phonetic approaches previously presented.