Results 1 - 9 of 9
Can corpus based measures be used for comparative study of languages
- In Proceedings of the ACL Workshop Computing and Historical Phonology, 2007
Cited by 6 (4 self)
Quantitative measurement of inter-language distance is a useful technique for studying diachronic and synchronic relations between languages. Such measures have been used successfully for purposes like deriving language taxonomies and language reconstruction, but they have mostly been applied to handcrafted word lists. Can we instead use corpus based measures for comparative study of languages? In this paper we try to answer this question. We use three corpus based measures and present the results obtained from them and show how these results relate to linguistic and historical knowledge. We argue that the answer is yes and that such studies can provide or validate linguistic and computational insights.
Automatic Identification of Cognates, False Friends, and Partial Cognates
- 2006
Cited by 5 (0 self)
Cognates are words in different languages that have similar spelling and meaning. They can help second-language learners with vocabulary expansion and reading comprehension tasks. Special attention needs to be paid to pairs of words that appear similar but are in fact false friends: they have different meanings in all contexts. Partial cognates are pairs of words in two languages that have the same meaning in some, but not all, contexts. Detecting the actual meaning of a partial cognate in context can be useful for Machine Translation and Computer-Assisted Language Learning tools. Our research on cognates and false friends between a pair of languages (French and English in our case) consists in automatically classifying a pair of words from the two languages as cognates or false friends. We use Machine Learning techniques with several measures of orthographic similarity as features for classification. We study the impact of selecting different features, averaging them, and combining them through Machine Learning techniques. The methods work on different pairs of languages as long as a small number of annotated word pairs is provided as training data. In addition to the work done on cognate and false-friend identification we propose a
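As an illustration of what "measures of orthographic similarity as features" can look like, the sketch below computes two widely used measures, the longest common subsequence ratio (LCSR) and the Dice coefficient over character bigrams, and packs them into a feature vector. The abstract does not list which measures the authors used, so the choice of these two and all function names here are assumptions made for illustration only.

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over multisets of character bigrams."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Words too short to have bigrams: fall back to exact match.
        return 1.0 if a == b else 0.0
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total


def orthographic_features(a: str, b: str) -> list[float]:
    """Hypothetical feature vector for a candidate word pair."""
    return [lcsr(a, b), bigram_dice(a, b)]
```

A vector such as `orthographic_features("colour", "color")` could then be passed, together with a cognate/false-friend label, to any standard classifier.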
Cognate or false friend? Ask the Web
- In Proceedings of the RANLP’2007 workshop: Acquisition and management of multilingual lexicons, 2007
Cited by 2 (2 self)
We propose a novel unsupervised semantic method for distinguishing cognates from false friends. The basic intuition is that if two words are cognates, then most of the words in their respective local contexts should be translations of each other. The idea is formalised using the Web as a corpus, a glossary of known word translations used as cross-linguistic “bridges”, and the vector space model. Unlike traditional orthographic similarity measures, our method can easily handle words with identical spelling. The evaluation on 200 Bulgarian-Russian word pairs shows this is a very promising approach.
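The intuition in this abstract can be sketched in a few lines: represent each word by counts of glossary entries found in its local contexts, then compare the two vectors with the cosine measure. This is only a toy sketch under stated assumptions. The glossary is taken to be a list of (source, target) translation pairs, the contexts are plain lists of tokens, and all names are hypothetical. How contexts are gathered from the Web is not modelled.

```python
import math
from collections import Counter


def context_vector(context_words, glossary_side):
    """Count how often each glossary word occurs in the given context."""
    counts = Counter(context_words)
    return [float(counts[w]) for w in glossary_side]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cross_lingual_similarity(src_context, tgt_context, glossary):
    """Compare two words through glossary "bridges".

    Dimension i of both vectors corresponds to glossary pair i: the source
    vector counts the source-language member of the pair, the target vector
    counts its translation.
    """
    src_side = [s for s, _ in glossary]
    tgt_side = [t for _, t in glossary]
    return cosine(context_vector(src_context, src_side),
                  context_vector(tgt_context, tgt_side))
```

Under this sketch, a pair with a high score would be treated as likely cognates and a pair with a low score as likely false friends. Where to place the threshold is left open here.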
Dialect Pronunciation Comparison and Spoken Word Recognition
Cited by 1 (1 self)
Two adaptations of the regular Levenshtein distance algorithm are proposed based on psycholinguistic work on spoken word recognition. The first adaptation is inspired by the Cohort model, which assumes that the word-initial part is more important for word recognition than the word-final part. The second adaptation is based on the notion that stressed syllables contain more information and are more important for word recognition than unstressed syllables. The adapted algorithms are evaluated on a large contemporary collection of Dutch dialect material, the Goeman-Taeldeman-Van Reenen-Project (GTRP, collected 1980–1995), and a relatively small Norwegian dataset for which dialect speakers’ judgments of proximity are available.
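One way to picture the first adaptation is an edit distance whose operation costs shrink toward the end of the word, so that word-initial differences count for more. The sketch below is an assumption-laden illustration, not the weighting scheme from the paper: the weight function `1 + bias / (index + 1)` and the `bias` parameter are hypothetical choices.

```python
def position_weight(index: int, bias: float = 1.0) -> float:
    """Hypothetical cost of an edit at a 0-based position; largest at index 0."""
    return 1.0 + bias / (index + 1)


def cohort_weighted_distance(a: str, b: str, bias: float = 1.0) -> float:
    """Levenshtein-style distance with position-dependent operation costs."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # Deleting a prefix of a, or inserting a prefix of b.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + position_weight(i - 1, bias)
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + position_weight(j - 1, bias)
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                substitution = 0.0
            else:
                # Weight a substitution by the earlier of the two positions.
                substitution = position_weight(min(i - 1, j - 1), bias)
            dist[i][j] = min(
                dist[i - 1][j] + position_weight(i - 1, bias),  # deletion
                dist[i][j - 1] + position_weight(j - 1, bias),  # insertion
                dist[i - 1][j - 1] + substitution,              # match or substitution
            )
    return dist[-1][-1]
```

With `bias=0` every operation costs 1 and the function reduces to the regular Levenshtein distance. With a positive `bias`, a mismatch in the first segment (as in "pat" versus "bat") costs more than one in the last segment (as in "pat" versus "pad").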
Phonological Reconstruction of a Dead Language Using the Gradual Learning Algorithm
This paper discusses the reconstruction of the Elamite language’s phonology from its orthography using the Gradual Learning Algorithm, which was re-purposed to “learn” underlying phonological forms from surface orthography. Practical issues are raised regarding the difficulty of mapping between orthography and phonology, and Optimality Theory’s neglected Lexicon Optimization module is highlighted.
String Similarity Measures and PAM-like Matrices for Cognate Identification
We present a new automatic learning system for the identification of cognates, words that derive from a common ancestor and share the same etymological origin. Our approach combines and adapts several techniques developed for biological sequence analysis to the natural language processing environment. We design a linguistically inspired matrix to sensibly align our training dataset. We introduce a PAM-like technique, similar to the one successfully used in biological sequence alignment, in order to produce substitution matrices. We propose a novel family of parameterised string similarity measures and we apply them together with the PAM-like matrices to the task of cognate identification. We develop and test our proposal on standard datasets of Indo-European languages in orthographic format based on the Latin alphabet, but it could easily be adjusted to datasets using any other alphabet, including the phonetic alphabet if data in phonetic transcription were available. We compare our system with other models reported in the literature, and the results show that our method outperforms, in terms of precision, both the orthographic and the phonetic approaches previously presented.
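The general idea of learning a substitution matrix from aligned word pairs can be sketched as a log-odds computation: a character pair scores positively when it is aligned more often than chance would predict. The code below is a simplified illustration in the spirit of PAM-style scoring. It is not the authors' procedure and omits the extrapolation step used to build true PAM matrices. The input format (equal-length strings with `-` marking gaps) and the function name are assumptions.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Return {(x, y): log2(observed / expected)} from aligned string pairs."""
    pair_counts = Counter()
    for left, right in aligned_pairs:
        if len(left) != len(right):
            raise ValueError("aligned strings must have equal length")
        for x, y in zip(left, right):
            if x == "-" or y == "-":
                continue  # gaps carry no substitution evidence
            # Count both directions so the matrix is symmetric.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1

    total = sum(pair_counts.values())
    symbol_counts = Counter()
    for (x, _), count in pair_counts.items():
        symbol_counts[x] += count

    scores = {}
    for (x, y), count in pair_counts.items():
        observed = count / total
        expected = (symbol_counts[x] / total) * (symbol_counts[y] / total)
        scores[(x, y)] = math.log2(observed / expected)
    return scores
```

Pairs never observed in the training alignments are simply absent from the result. A practical system would need smoothing or a default penalty for them.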
Levenshtein Distances Fail to Identify Language Relationships Accurately
The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of the time. Standardizing the orthography increases the performance, but only to a maximum of 65% accuracy within language subgroups. The accuracy of the Levenshtein classification decreases rapidly with phylogenetic distance, failing to discriminate homology and chance similarity across distantly related languages. This poor performance suggests the need for more linguistically nuanced methods for automated language classification tasks.
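For readers unfamiliar with the metric, a minimal sketch of the plain Levenshtein distance follows, with unit cost for insertion, deletion, and substitution. The normalization by the length of the longer string is one common convention and is an assumption here. The abstract does not say which normalization, if any, the article uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion. A language-level distance is typically obtained by averaging such word-level distances over a list of meanings, which is where the sensitivity to orthography reported in the abstract comes in.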
Dialectometry++
Correspondence: John Nerbonne, Center for Language and Cognition, University of Groningen
Dialectology is one of the sub-disciplines in the humanities that embraced digital techniques early on. The use of computational and quantitative techniques in dialectology is known as ‘dialectometry’. The present collection of articles contains several which proudly continue working within dialectometry’s usual assumptions and toward its established goals, honing existing techniques, and experimenting with novel ones, but also, significantly, several articles that depart deliberately from earlier modes, returning to individual phenomena (as opposed to aggregates), examining new sources of data (not taken from atlases), applying dialectometric techniques to sociolinguistic and diachronic research questions,

