Evaluation of several phonetic similarity algorithms on the task of cognate identification
In COLING-ACL Workshop on Linguistic Distances, 2006
We investigate the problem of measuring phonetic similarity, focusing on the identification of cognates, words of the same origin in different languages. We compare representatives of two principal approaches to computing phonetic similarity: manually-designed metrics, and learning algorithms. In particular, we consider a stochastic transducer, a Pair HMM, several DBN models, and two constructed schemes. We test those approaches on the task of identifying cognates among Indo-European languages, in both the supervised and unsupervised contexts. Our results suggest that the averaged context DBN model and the Pair HMM achieve the highest accuracy given a large training set of positive examples.
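To make the idea of a manually designed similarity metric concrete, here is a minimal sketch of a length-normalised edit distance used to rank candidate cognates. It is a generic baseline written for illustration, not one of the models evaluated in the paper; the function names and the unit edit costs are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def rank_candidates(word: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Sort candidate cognates by descending similarity to `word`."""
    scored = [(c, similarity(word, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A learned model such as a Pair HMM would replace the fixed unit costs with probabilities estimated from training pairs; the ranking step stays the same.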
OOV Detection by Joint Word/Phone Lattice Alignment
We propose a new method for detecting out-of-vocabulary (OOV) words for large vocabulary continuous speech recognition (LVCSR) systems. Our method is based on performing a joint alignment between independently generated word and phone lattices, where the word lattice is aligned via a recognition lexicon. Based on a similarity measure between phones, we can locate highly misaligned regions of time, and then specify those regions as candidate OOVs. This novel approach is implemented using the framework of graphical models (GMs), which enable fast, flexible integration of different scores from word lattices, phone lattices, and the similarity measures. We evaluate our method on Switchboard data using RT-04 as the test set. Experimental results show that our approach provides a promising and scalable new way to detect OOVs for LVCSR. Index Terms: out-of-vocabulary, OOV, lattices, graphical models, Bayesian networks, dynamic Bayesian networks
Genetic triangulation of graphical models for speech and language processing
In Proc. Eurospeech, 2005
Graphical models are an increasingly popular approach for speech and language processing. As researchers design ever more complex models it becomes crucial to find triangulations that make inference problems tractable. This paper presents a genetic algorithm for triangulation search that is well-suited for speech and language graphical models. It is unique in two ways: First, it can find triangulations appropriate for graphs with a mix of stochastic and deterministic dependencies. Second, the search is guided by optimizing the inference speed (CPU runtime) on real data. We show results on 10 real-world speech and language graphs and demonstrate inference speed-ups over standard triangulation methods.
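For readers unfamiliar with triangulation, the sketch below shows greedy min-fill elimination, one commonly used heuristic of the kind a search method would be compared against. It is not the genetic algorithm from the paper, and it ignores the stochastic/deterministic distinction and runtime-guided search described above; the function name and the adjacency-dict input format are assumptions.

```python
from itertools import combinations


def min_fill_triangulate(adjacency: dict[str, set[str]]):
    """Greedy min-fill elimination on an undirected graph.

    Returns (elimination order, fill-in edges). Adding the fill-in edges to
    the input graph yields a triangulated (chordal) graph.
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    order, fill_edges = [], []

    def missing_edges(v):
        # Pairs of neighbours of v that are not yet connected to each other.
        return [(a, b) for a, b in combinations(sorted(graph[v]), 2)
                if b not in graph[a]]

    while graph:
        # Pick the vertex whose elimination adds the fewest edges
        # (ties broken by name so the result is deterministic).
        v = min(sorted(graph), key=lambda u: len(missing_edges(u)))
        for a, b in missing_edges(v):
            graph[a].add(b)
            graph[b].add(a)
            fill_edges.append((a, b))
        # Remove v from the remaining graph.
        for n in graph[v]:
            graph[n].discard(v)
        del graph[v]
        order.append(v)
    return order, fill_edges
```

On a four-node cycle this adds a single chord; a search-based method would instead explore many elimination orders and keep the one with the best measured inference cost.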
String Similarity Measures and PAM-like Matrices for Cognate Identification
We present a new automatic learning system for the identification of cognates, words that derive from a common ancestor and share the same etymological origin. Our approach combines and adapts several techniques developed for biological sequence analysis to the natural language processing environment. We design a linguistically inspired matrix to sensibly align our training dataset. We introduce a PAM-like technique, similar to the one successfully used in biological sequence alignment, in order to produce substitution matrices. We propose a novel family of parameterised string similarity measures and we apply them together with the PAM-like matrices to the task of cognate identification. We develop and test our proposal on standard datasets of Indo-European languages in orthographic format based on the Latin alphabet, but it could easily be adjusted to datasets using any other alphabet, including the phonetic alphabet if data in phonetic transcription were available. We compare our system with other models reported in the literature, and the results show that our method outperforms, in terms of precision, both the orthographic and phonetic approaches previously presented.
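As a rough illustration of how a substitution matrix can be learned from aligned word pairs, the sketch below estimates log-odds scores from character co-occurrence counts. This is a simplified stand-in for the general log-odds idea behind PAM-style matrices, not the paper's procedure: it omits the extrapolation to larger evolutionary distances, skips gaps, and the function name, pseudocount, and toy word pairs are all assumptions.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs, pseudocount=1.0):
    """Estimate symmetric log-odds substitution scores from aligned pairs.

    aligned_pairs: iterable of (x, y) equal-length strings; '-' marks a gap.
    Returns {(a, b): score in bits}. Positive scores mean the two characters
    are aligned more often than chance would predict.
    """
    pair_counts = Counter()
    for x, y in aligned_pairs:
        if len(x) != len(y):
            raise ValueError("aligned strings must have equal length")
        for a, b in zip(x, y):
            if a == "-" or b == "-":
                continue  # gaps are skipped in this sketch
            # Count both directions so the matrix is symmetric.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

    alphabet = sorted({a for a, _ in pair_counts})
    total = sum(pair_counts.values()) + pseudocount * len(alphabet) ** 2
    # Smoothed joint probability of seeing a aligned with b.
    joint = {(a, b): (pair_counts[(a, b)] + pseudocount) / total
             for a in alphabet for b in alphabet}
    marginal = {a: sum(joint[(a, b)] for b in alphabet) for a in alphabet}
    return {(a, b): math.log2(joint[(a, b)] / (marginal[a] * marginal[b]))
            for a in alphabet for b in alphabet}


# Toy aligned pairs, invented for illustration only.
scores = log_odds_matrix([("pater", "fader"), ("pes", "fot"), ("pisk", "fisk")])
```

With these toy pairs, the recurring p/f correspondence receives a positive score while an unobserved pairing such as p/t scores below zero, which is the behaviour a similarity measure built on such a matrix relies on.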

