UCB system description for the WMT 2007 shared task
- In Proceedings of the ACL-2007 Workshop on Statistical Machine Translation (WMT-07)
, 2007
Cited by 5 (2 self)
For the WMT 2007 shared task, the UC Berkeley team employed three techniques of interest. First, we used monolingual syntactic paraphrases to provide syntactic variety to the source training set sentences. Second, we trained two language models: a small in-domain model and a large out-of-domain model. Finally, we made use of results from prior research that shows that cognate pairs can improve word alignments. We contributed runs translating English to Spanish, French, and German using various combinations of these techniques.
Automatic Identification of Cognates, False Friends, and Partial Cognates
, 2006
Cited by 5 (0 self)
Cognates are words in different languages that have similar spelling and meaning. They can help second-language learners with vocabulary expansion and reading comprehension tasks. Special attention needs to be paid to pairs of words that appear similar but are in fact false friends: they have different meanings in all contexts. Partial cognates are pairs of words in two languages that have the same meaning in some, but not all, contexts. Detecting the actual meaning of a partial cognate in context can be useful for Machine Translation and Computer-Assisted Language Learning tools. Our research on cognates and false friends between a pair of languages (French and English in our case) consists of automatically classifying a pair of words from two languages as cognates or false friends. We use Machine Learning techniques with several measures of orthographic similarity as features for classification. We study the impact of selecting different features, averaging them, and combining them through Machine Learning techniques. The methods work on different pairs of languages as long as a small number of annotated word pairs is provided as training data. In addition to the work done on cognate and false-friend identification we propose a
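The abstract above does not name its orthographic similarity measures. As a rough illustration only, the sketch below computes two measures commonly used for this purpose (longest common subsequence ratio and a bigram Dice coefficient) plus their average, and returns them as a feature dictionary that a classifier could consume. The choice of measures and all function names are assumptions, not the authors' feature set.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer word."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams of a and b."""
    x = {a[i:i + 2] for i in range(len(a) - 1)}
    y = {b[i:i + 2] for i in range(len(b) - 1)}
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def orthographic_features(a: str, b: str) -> dict:
    """Hypothetical feature vector for one candidate word pair."""
    a, b = a.lower(), b.lower()
    measures = {"lcsr": lcsr(a, b), "dice": dice(a, b)}
    # Averaging the individual measures is one of the combinations
    # the abstract mentions; the unweighted mean here is an assumption.
    measures["average"] = sum(measures.values()) / len(measures)
    return measures


if __name__ == "__main__":
    print(orthographic_features("colour", "color"))
```

Feature dictionaries like these, computed for a small set of annotated pairs, could then be passed to any standard classifier.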
Combining Evidence in Cognate Identification
, 2004
Cited by 4 (0 self)
Cognates are words of the same origin that belong to distinct languages. The problem of automatic identification of cognates arises in language reconstruction and bitext-related tasks. The evidence of cognation may come from various information sources, such as phonetic similarity, semantic similarity, and recurrent sound correspondences. I discuss ways of defining the measures of the various types of similarity and propose a method of combining them into an integrated cognate identification program. The new method requires no manual parameter tuning and performs well when tested on the Indo-European and Algonquian lexical data.
Improved word alignments using the web as a corpus
- In Proceedings of RANLP’07
, 2007
Cited by 3 (3 self)
We propose a novel method for improving word alignments in a parallel sentence-aligned bilingual corpus based on the idea that if two words are translations of each other then so should be many words in their local contexts. The idea is formalised using the Web as a corpus, a glossary of known word translations (dynamically augmented from the Web using bootstrapping), the vector space model, linguistically motivated weighted minimum edit distance, competitive linking, and the IBM models. Evaluation results on a Bulgarian-Russian corpus show a sizable improvement both in word alignment and in translation quality.
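Competitive linking, one of the ingredients listed in the abstract above, is usually described as a greedy one-to-one matching: candidate word pairs are sorted by score, the best pair is linked, and both words are then removed from further consideration. The sketch below shows that generic procedure only. The scores are made-up toy values, and how the paper derives its scores or combines linking with the other components is not stated in the abstract.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking.

    scores maps (source_word, target_word) -> similarity score.
    Returns a list of (source_word, target_word, score) links in
    descending score order, each word being linked at most once.
    """
    links = []
    used_src, used_tgt = set(), set()
    # Highest score first; ties are broken by the pair itself for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for (src, tgt), score in ranked:
        if src in used_src or tgt in used_tgt:
            continue  # one of the two words is already linked
        links.append((src, tgt, score))
        used_src.add(src)
        used_tgt.add(tgt)
    return links


if __name__ == "__main__":
    # Toy scores for a two-word source and two-word target sentence.
    toy_scores = {
        ("a", "x"): 0.9,
        ("a", "y"): 0.8,
        ("b", "x"): 0.7,
        ("b", "y"): 0.4,
    }
    print(competitive_linking(toy_scores))
```

Note that the greedy choice links "b" to "y" even though "b" scores higher with "x", because "x" has already been claimed by the stronger pair.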
Cognate or false friend? Ask the Web
- In Proceedings of the RANLP’2007 workshop: Acquisition and management of multilingual lexicons
, 2007
Cited by 2 (2 self)
We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic “bridges”, and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.
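The intuition in the abstract above can be sketched with a toy vector space model: each word's local context is turned into a vector indexed by glossary entries, so that position i counts the i-th glossary word on the appropriate side, and the two vectors are compared by cosine similarity. This is a minimal sketch under stated assumptions. The placeholder tokens, the raw-count weighting, and the function names are all hypothetical, and the paper's Web-based context collection is not reproduced.

```python
import math
from collections import Counter


def bridge_vector(context_words, glossary_side):
    """Count occurrences of each glossary word in a word's local context."""
    counts = Counter(context_words)
    return [counts[w] for w in glossary_side]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_lingual_similarity(context_a, context_b, glossary):
    """Compare contexts from two languages via glossary 'bridges'.

    glossary is a list of (word_in_language_a, word_in_language_b) pairs;
    both vectors share the same dimensions, one per glossary entry.
    """
    vec_a = bridge_vector(context_a, [a for a, _ in glossary])
    vec_b = bridge_vector(context_b, [b for _, b in glossary])
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    # Placeholder tokens standing in for real words of two languages.
    glossary = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
    similar = cross_lingual_similarity(
        ["a1", "a1", "a2", "other"], ["b1", "b1", "b2"], glossary
    )
    different = cross_lingual_similarity(["a1", "a2"], ["b3"], glossary)
    print(similar, different)
```

Because the comparison uses only context words, two identically spelled words can still receive a low score, which is the property the abstract highlights.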
Multilingual Cognate Identification using Integer Linear Programming
Cited by 1 (0 self)
The identification of cognates in natural languages is a crucial part of automatic translation lexicon construction and other multilingual lexical tasks. We present new methods for multilingual cognate identification using the global inference framework of Integer Linear Programming. While previous approaches to cognate identification have focused on pairs of natural languages, we provide a methodology that directly forms sets of cognates across groups of languages. We show improvements over simple clustering techniques that do not inherently consider the transitivity of cognate relations. Furthermore, we show that formulations that jointly link cognates across groups of natural languages achieve higher performance than traditional pairwise approaches. We also describe applications of our technique to other important problems in multilingual natural language processing.
Mining Postal Addresses
- In IADIS European Conference Data Mining 2008
This paper presents FuMaS (Fuzzy Matching System), a system capable of efficiently retrieving postal addresses from noisy queries. Fuzzy retrieval of postal addresses has many possible applications, ranging from data warehouse deduplication to the correction of input forms and integration within online street directories. The paper describes the system architecture along with a series of experiments performed using FuMaS. The experimental results show that FuMaS is effective at retrieving noisy postal addresses: it retrieves almost 85% of them, an improvement of 15% over the other systems tested in this set of experiments.
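The abstract above does not describe how FuMaS matches queries, so the following is only a generic stand-in for the task it addresses: retrieving the closest stored address for a misspelled query. It uses the standard-library `difflib.get_close_matches`; the address list, the `retrieve_address` helper, and the 0.6 cutoff are illustrative assumptions.

```python
import difflib


def retrieve_address(query, addresses, cutoff=0.6):
    """Return the stored address closest to a noisy query, or None.

    Matching is case-insensitive; cutoff is the minimum similarity
    ratio (0..1) that difflib requires before accepting a candidate.
    """
    normalized = {addr.lower(): addr for addr in addresses}
    matches = difflib.get_close_matches(
        query.lower(), list(normalized), n=1, cutoff=cutoff
    )
    return normalized[matches[0]] if matches else None


if __name__ == "__main__":
    # Made-up addresses for illustration only.
    known = [
        "12 Main Street, Springfield",
        "48 Oak Avenue, Riverton",
        "7 High Road, Lakeside",
    ]
    print(retrieve_address("12 Mian Stret, Springfeld", known))
    print(retrieve_address("zzzz", known))
```

A linear scan like this is adequate for small lists; a production system would need indexing to stay efficient on large address databases.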
String Similarity Measures and PAM-like Matrices for Cognate Identification
We present a new automatic learning system for the identification of cognates, words that derive from a common ancestor and share the same etymological origin. Our approach combines and adapts several techniques developed for biological sequence analysis to the natural language processing environment. We design a linguistically inspired matrix to sensibly align our training dataset. We introduce a PAM-like technique, similar to the one successfully used in biological sequence alignment, in order to produce substitution matrices. We propose a novel family of parameterised string similarity measures and apply them together with the PAM-like matrices to the task of cognate identification. We develop and test our proposal on standard datasets of Indo-European languages in orthographic format based on the Latin alphabet, but it could easily be adjusted to datasets using any other alphabet, including the phonetic alphabet if data in phonetic transcription were available. We compare our system with other models reported in the literature, and the results show that our method outperforms, in terms of precision, both the orthographic and phonetic approaches previously presented.

