An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
- In COLING/ACL 2006, 2006
Abstract - Cited by 18 (6 self)
Morphological disambiguation is the process of assigning one set of morphological features to each individual word in a text. When a word is ambiguous (it has several possible analyses), a disambiguation procedure based on the word's context must be applied. This paper deals with morphological disambiguation of Hebrew, which combines morphemes into a word in both agglutinative and fusional ways. We present an unsupervised stochastic model (the only resource we use is a morphological analyzer) which deals with the data sparseness problem caused by the affixational morphology of Hebrew. We present a text encoding method for languages with affixational morphology in which knowledge of word formation rules (which are quite restricted in Hebrew) helps in the disambiguation. We adapt HMM algorithms for learning and searching this text representation in such a way that segmentation and tagging can be learned in parallel in one step. Results of a large-scale evaluation indicate that this learning improves disambiguation for complex tag sets. Our method is applicable to other languages with affixational morphology.
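To make the idea of context-based disambiguation concrete, here is a minimal sketch of Viterbi decoding over the candidate analyses that an analyzer proposes for each word. It is not the paper's morpheme-based model: it works at the word level, the English toy sentence, the tag names, and all probabilities are made up for illustration, and the probabilities are set by hand rather than learned without annotated data as the abstract describes.

```python
import math

def disambiguate(words, analyses, start, trans, emit, floor=1e-6):
    """Pick the most probable tag sequence, restricted to the
    candidate tags the (hypothetical) analyzer lists for each word."""
    # best[tag] = (log-probability of best path ending in tag, that path)
    best = {}
    for tag in analyses[words[0]]:
        p = start.get(tag, floor) * emit.get((tag, words[0]), floor)
        best[tag] = (math.log(p), [tag])
    for word in words[1:]:
        nxt = {}
        for tag in analyses[word]:
            # Best predecessor for this tag, scored by path + transition.
            score, path = max(
                ((s + math.log(trans.get((prev, tag), floor)), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            e = math.log(emit.get((tag, word), floor))
            nxt[tag] = (score + e, path + [tag])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]

# Toy data (all values are illustrative assumptions).
words = ["the", "can", "rusts"]
analyses = {"the": ["DET"], "can": ["NOUN", "MODAL"], "rusts": ["VERB", "NOUN"]}
start = {"DET": 0.8, "NOUN": 0.1, "MODAL": 0.1}
trans = {("DET", "NOUN"): 0.9, ("DET", "MODAL"): 0.01,
         ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.2,
         ("MODAL", "VERB"): 0.8, ("MODAL", "NOUN"): 0.05}
emit = {("DET", "the"): 0.5, ("NOUN", "can"): 0.01, ("MODAL", "can"): 0.3,
        ("VERB", "rusts"): 0.01, ("NOUN", "rusts"): 0.001}

result = disambiguate(words, analyses, start, trans, emit)
print(result)  # ['DET', 'NOUN', 'VERB']
```

Although "can" is more likely a modal in isolation under these toy emission values, the preceding determiner makes the noun reading win, which is the kind of contextual decision the abstract refers to.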
A semisupervised learning approach for morpheme segmentation for an Arabic dialect
- Proceedings of Interspeech
Abstract - Cited by 2 (1 self)
We present a semi-supervised learning approach which utilizes a heuristic model for learning morpheme segmentation for Arabic dialects. We evaluate our approach by applying morpheme segmentation to the training data of a statistical machine translation (SMT) system. Experiments show that our approach is less sensitive to the availability of annotated stems than a previous rule-based approach and learns 12% more segmentations on our Iraqi Arabic data. When applied in an SMT system, our approach yields an 8% relative reduction in the training vocabulary size and a 0.8% relative reduction in the out-of-vocabulary (OOV) rate on the test set, again compared to the rule-based approach. Finally, our approach also results in a modest increase in BLEU scores. Index Terms: Iraqi Arabic, morpheme segmentation
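The two metrics reported above, vocabulary size and OOV rate, and their relative reductions can be computed as in the following sketch. The prefix-stripping segmenter, the stem list, and the transliterated tokens are hypothetical toy stand-ins, not the paper's heuristic model or data.

```python
def segment(word, stems, prefixes=("w", "b")):
    """Toy segmenter: split off a one-letter prefix when the
    remainder is a known stem (an illustrative assumption)."""
    if word[:1] in prefixes and word[1:] in stems:
        return [word[0] + "+", word[1:]]
    return [word]

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

def relative_reduction(baseline, new):
    """Reduction expressed as a fraction of the baseline value."""
    return (baseline - new) / baseline

# Made-up toy corpus.
stems = {"ktab", "byt"}
train = ["wktab", "bktab", "wbyt", "bbyt", "ktab"]
test = ["byt", "wktab"]

train_seg = [piece for w in train for piece in segment(w, stems)]
test_seg = [piece for w in test for piece in segment(w, stems)]

vocab_before = len(set(train))      # 5 distinct surface forms
vocab_after = len(set(train_seg))   # 4 distinct morphemes
oov_before = oov_rate(train, test)
oov_after = oov_rate(train_seg, test_seg)

print(relative_reduction(vocab_before, vocab_after))
print(oov_before, oov_after)
```

In this toy setting, segmentation lets the unseen surface form "byt" be covered by a stem already observed inside "wbyt" and "bbyt", which is why both the vocabulary and the OOV rate shrink.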
Building an International Corpus of Arabic (ICA): Progress of Compilation Stage
Abstract
This paper focuses on three axes. The first axis surveys the importance of corpora in language studies (e.g., lexicography, grammar, semantics, Natural Language Processing, and other areas). The second axis demonstrates how the Arabic language lacks textual resources, such as corpora and tools for corpus analysis, and the effect of this lack on the quality of Arabic language applications. Successful attempts at compiling Arabic corpora are rare; therefore, the third axis presents the technical design of the International Corpus of Arabic (ICA), a newly established representative corpus of Arabic that is intended to cover the Arabic language as used all over the Arab world. The corpus is planned to support various Arabic studies that depend on authentic data, in addition to building Arabic Natural Language Processing applications.
Methods for Amharic Part-of-Speech Tagging
Abstract
The paper describes a set of experiments involving the application of three state-of-the-art part-of-speech taggers to Ethiopian Amharic, using three different tagsets. The taggers performed worse than previously reported results for English, in particular having problems with unknown words. The best results were obtained using a Maximum Entropy approach, while HMM-based and SVM-based taggers got comparable results.
Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation
Abstract
This paper is about improving the quality of Arabic-English statistical machine translation (SMT) on dialectal Arabic text using morphological knowledge. We present a light-weight rule-based approach to producing Modern Standard Arabic (MSA) paraphrases of dialectal Arabic out-of-vocabulary (OOV) words and low-frequency words. Our approach extends an existing MSA analyzer with a small number of morphological clitics, and uses transfer rules to generate paraphrase lattices that are input to a state-of-the-art phrase-based SMT system. This approach improves BLEU scores on a blind test set by 0.56 absolute BLEU (or 1.5% relative). A manual error analysis of translated dialectal words shows that our system produces correct translations 74% of the time for OOVs and 60% of the time for low-frequency words.
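The absolute and relative BLEU figures above are linked by the baseline score, since a relative gain is the absolute gain divided by the baseline. The short calculation below only rearranges the two reported numbers; the baseline it yields is an inference, not a value stated in the abstract, and because 1.5% is rounded the result is approximate.

```python
absolute_gain = 0.56   # BLEU points, as reported
relative_gain = 0.015  # 1.5% relative, as reported (rounded)

# relative_gain = absolute_gain / baseline  =>  baseline = absolute / relative
implied_baseline = absolute_gain / relative_gain

# Range of baselines consistent with "1.5%" after rounding to one decimal.
lowest = absolute_gain / 0.0155
highest = absolute_gain / 0.0145

print(round(implied_baseline, 1))
print(round(lowest, 1), round(highest, 1))
```

Under these assumptions the baseline system would score somewhere in the mid-to-high 30s in BLEU.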

