Results 1 - 10 of 26
Unsupervised models for morpheme segmentation and morphology learning
- ACM Trans. Speech Lang. Process
, 2007
Cited by 32 (6 self)
We present a model family called Morfessor for the unsupervised induction of a simple morphology from raw text data. The model is formulated in a probabilistic maximum a posteriori framework. Morfessor can handle highly inflecting and compounding languages where words can consist of lengthy sequences of morphemes. A lexicon of word segments, called morphs, is induced from the data. The lexicon stores information about both the usage and form of the morphs. Several instances of the model are evaluated quantitatively in a morpheme segmentation task on different sized sets of Finnish as well as English data. Morfessor is shown to perform very well compared to a widely known benchmark algorithm, in particular on Finnish data.
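For orientation, the maximum a posteriori (MAP) framework named in this abstract has the generic form below. The symbols are notational assumptions for illustration (\mathcal{M} for the model, i.e. the morph lexicon and its parameters, and "corpus" for the raw text); the abstract does not give the specific priors and likelihoods used in Morfessor.

```latex
\[
\hat{\mathcal{M}}
  = \arg\max_{\mathcal{M}} P(\mathcal{M} \mid \text{corpus})
  = \arg\max_{\mathcal{M}} P(\text{corpus} \mid \mathcal{M}) \, P(\mathcal{M})
\]
```

The second equality follows from Bayes' rule, because P(corpus) does not depend on the model being chosen.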
Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
- In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)
, 2005
Cited by 23 (7 self)
This work presents an algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. A probabilistic maximum a posteriori model is utilized, which builds hierarchical representations for a set of morphs, which are morpheme-like units discovered from unannotated text corpora. The induced morph lexicon stores parameters related to both the “meaning” and “form” of the morphs it contains. These parameters affect the role of the morphs in words. The model is implemented in a task of unsupervised morpheme segmentation of Finnish and English words. Very good results are obtained for Finnish and almost as good results are obtained in the English task.
Minimized models for unsupervised part-of-speech tagging
- In Proc. ACL
, 2009
Cited by 17 (1 self)
We describe a novel method for the task of unsupervised POS tagging with a dictionary, one that uses integer programming to explicitly search for the smallest model that explains the data, and then uses EM to set parameter values. We evaluate our method on a standard test corpus using different standard tagsets (a 45-tagset as well as a smaller 17-tagset), and show that our approach performs better than existing state-of-the-art systems in both settings.
High-performance, language-independent morphological segmentation
- In Proceedings of NAACL-HLT
, 2007
Cited by 9 (2 self)
This paper introduces an unsupervised morphological segmentation algorithm that shows robust performance for four languages with different levels of morphological complexity. In particular, our algorithm outperforms Goldsmith’s Linguistica and Creutz and Lagus’s Morfessor for English and Bengali, and achieves performance that is comparable to the best results for all three PASCAL evaluation datasets. Improvements arise from (1) the use of relative corpus frequency and suffix level similarity for detecting incorrect morpheme attachments and (2) the induction of orthographic rules and allomorphs for segmenting words where roots exhibit spelling changes during morpheme attachments.
Unsupervised Part-of-Speech Acquisition for Resource-Scarce Languages
, 2007
Cited by 8 (0 self)
This paper proposes a new bootstrapping approach to unsupervised part-of-speech induction. In comparison to previous bootstrapping algorithms developed for this problem, our approach aims to improve the quality of the seed clusters by employing seed words that are both distributionally and morphologically reliable. In particular, we present a novel method for combining morphological and distributional information for seed selection. Experimental results demonstrate that our approach works well for English and Bengali, thus providing suggestive evidence that it is applicable to both morphologically impoverished languages and highly inflectional languages.
Unsupervised segmentation of words into morphemes – Challenge 2005: An introduction and evaluation report
- In Proc. of 2nd Pascal Challenges Workshop
, 2006
Morph-Based Speech Recognition and Modeling of Out-of-Vocabulary Words Across Languages
Cited by 5 (1 self)
We explore the use of morph-based language models in large-vocabulary continuous speech recognition systems across four so-called “morphologically rich” languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception, since here the standard word model outperforms the morph model. Differences in the data sets and the amount of data are discussed as a plausible explanation.
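The coverage argument in this abstract (unseen word forms can still be built by concatenating known morphs) can be illustrated with a small sketch. Everything here is a hypothetical toy setup, not the paper's system: the word vocabulary, the morph lexicon, and the helper names `segment_with_morphs` and `oov_rate` are assumptions, and the morph lexicon is written by hand rather than learned by Morfessor.

```python
def segment_with_morphs(word, morph_lexicon):
    """Return a segmentation of `word` into lexicon morphs (fewest morphs), or None."""
    # best[i] holds the shortest known segmentation of word[:i], or None if unreachable.
    best = [None] * (len(word) + 1)
    best[0] = []
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if best[start] is not None and piece in morph_lexicon:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(word)]


def oov_rate(test_words, is_covered):
    """Fraction of test words for which `is_covered(word)` is False."""
    missing = [w for w in test_words if not is_covered(w)]
    return len(missing) / len(test_words)


# Toy Finnish data (hypothetical): talo = house, auto = car, -ssa = in, -ni = my.
WORD_VOCAB = {"talo", "talossa", "auto"}
MORPHS = {"talo", "auto", "ssa", "ni"}
TEST_WORDS = ["talossa", "autossa", "taloni", "autossani"]

word_oov = oov_rate(TEST_WORDS, lambda w: w in WORD_VOCAB)
morph_oov = oov_rate(TEST_WORDS, lambda w: segment_with_morphs(w, MORPHS) is not None)

print(word_oov)    # 0.75: only "talossa" is in the word vocabulary
print(morph_oov)   # 0.0: every test word is a concatenation of known morphs
print(segment_with_morphs("autossani", MORPHS))  # ['auto', 'ssa', 'ni']
```

In a recognizer the n-gram model would then be estimated over morph sequences such as the one printed last, which is what lets it assign probability to word forms never seen whole in training.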
Unsupervised knowledge-free morpheme boundary detection
- In Proc. of the RANLP’05
, 2005
Cited by 4 (1 self)
This paper presents a revised version of an unsupervised and knowledge-free morpheme boundary detection algorithm based on letter successor variety (LSV) and a trie classifier [5]. Additionally, a morphemic analysis based on contextual similarity provides knowledge about the relatedness of the found morphs. For the boundary detection, the challenge of increasing recall of found morphs while retaining a high precision is tackled by adding a compound splitter, iterating the LSV analysis and dividing the trie classifier into two distinctly applied classifiers. The result is a significantly improved overall performance and a decreased reliance on corpus size. Further possible improvements and analyses are discussed.
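Letter successor variety, the core signal named in this abstract, counts how many distinct letters can follow a given word prefix in the corpus; a sharp rise suggests a morpheme boundary. The sketch below shows only that basic idea on a toy word list. It is not the paper's algorithm: the trie classifier, compound splitter, and iteration are omitted, and the function names and the `threshold` value are hypothetical.

```python
from collections import defaultdict


def build_successor_table(words):
    """Map each prefix to the set of distinct symbols that follow it in the word list."""
    followers = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            followers[word[:i]].add(word[i])
        followers[word].add("#")  # "#" marks the end of a word
    return followers


def lsv_profile(word, followers):
    """Successor variety after each proper prefix word[:1], word[:2], ..."""
    return [len(followers.get(word[:i], ())) for i in range(1, len(word))]


def segment(word, followers, threshold=3):
    """Cut after every prefix whose successor variety reaches the (assumed) threshold."""
    profile = lsv_profile(word, followers)
    cuts = [i + 1 for i, variety in enumerate(profile) if variety >= threshold]
    bounds = [0] + cuts + [len(word)]
    return [word[a:b] for a, b in zip(bounds, bounds[1:])]


WORDS = ["walk", "walks", "walked", "walking", "talk", "talks", "talked", "talking"]
TABLE = build_successor_table(WORDS)

print(lsv_profile("walking", TABLE))  # [1, 1, 1, 4, 1, 1]
print(segment("walking", TABLE))      # ['walk', 'ing']
```

After the prefix "walk" four continuations are observed (end of word, "s", "e", "i"), so the variety peaks there and the word is split into "walk" and "ing".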
Morphologically motivated language models in speech recognition
- In Proceedings of International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning
, 2005
Cited by 4 (2 self)
Language modelling in large vocabulary speech recognition has traditionally been based on words. A lexicon of the most common words of the language in question is created and the recogniser is limited to consider only the words in the lexicon. In Finnish, however, it is more difficult to create an extensive lexicon, since the compounding of words, numerous inflections and suffixes increase the number of commonly used word forms considerably. The problem is that reasonably sized lexica lack many common words, and for very large lexica, it is hard to estimate a reliable language model. We have previously reported a new approach for improving the recognition of inflecting or compounding languages in large vocabulary continuous speech recognition tasks. Significant reductions in error rates have been obtained by replacing a traditional word lexicon with a lexicon based on morpheme-like word fragments learnt directly from data. In this paper, we evaluate these so-called statistical morphs further, and compare them to grammatical morphs and very large word lexica using n-gram language models of different orders. When compared to the best word model, the morph models seem to be clearly more effective with respect to entropy, and give 30 % relative error-rate reductions in a Finnish recognition task. Furthermore, the statistical morphs seem to be slightly better than the rule-based grammatical morphs.
Unsupervised morpheme analysis evaluation by IR experiments -- Morpho Challenge 2007
- In Working Notes for the CLEF 2007 Workshop
, 2007
Cited by 4 (4 self)
This paper presents the evaluation of Morpho Challenge Competition 2 (information retrieval). Competition 1 (linguistic gold standard) is described in a companion paper. In Morpho Challenge 2007, the objective was to design statistical machine learning algorithms that discover which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling. In this paper, the morpheme analyses submitted by the Challenge participants were evaluated by performing information retrieval (IR) experiments, where the words in the documents and queries were replaced by their proposed morpheme representations and the search was based on morphemes instead of words. The IR evaluations were provided for three languages: Finnish, German, and English, and the participants were encouraged to apply their algorithm to all of them. The challenge organizers performed the IR experiments using the queries, texts, and relevance judgments available in the CLEF forum and the morpheme analysis methods submitted by the challenge participants. The results show that morpheme analysis has a significant effect on IR performance in all languages, and that the performance of the best unsupervised methods can be superior to the supervised reference methods.

