Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop
, 2005
Cited by 47 (4 self)
We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.
Arabic preprocessing schemes for statistical machine translation
- in Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
, 2006
Cited by 29 (3 self)
Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes resulting in improved translation quality.
Part-of-Speech Tagging of Modern Hebrew Text
- Journal of Natural Language Engineering
, 2007
Cited by 10 (0 self)
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a Part-of-Speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we will aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew
Combination of Arabic Preprocessing Schemes for Statistical Machine Translation
- Proceedings of COLING/ACL
, 2006
Cited by 9 (3 self)
Statistical machine translation is quite robust when it comes to the choice of input representation. It only requires consistency between training and testing. As a result, there is a wide range of possible preprocessing choices for data used in statistical machine translation. This is even more so for morphologically rich languages such as Arabic. In this paper, we study the effect of different word-level preprocessing schemes for Arabic on the quality of phrase-based statistical machine translation. We also present and evaluate different methods for combining preprocessing schemes resulting in improved translation quality.
Challenges in Building an Arabic-English GHMT System with SMT Components
- in Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT-2006)
, 2006
Cited by 5 (1 self)
The research context of this paper is developing hybrid machine translation (MT) systems that exploit the advantages of linguistic rule-based and statistical MT systems. Arabic, as a morphologically rich language, is especially challenging even without addressing the hybridization question. In this paper, we describe the challenges in building an Arabic-English generation-heavy machine translation (GHMT) system and boosting it with statistical machine translation (SMT) components. We present an extensive evaluation of multiple system variants and report positive results on the advantages of hybridization.
Diacritization: A Challenge to Arabic Treebank Annotation and Parsing
- in Proceedings of the British Computer Society Arabic NLP/MT Conference
, 2006
Improved Arabic Base Phrase Chunking with a new enriched POS tag set
Cited by 2 (0 self)
Base Phrase Chunking (BPC), or shallow syntactic parsing, is proving to be a task of interest to many natural language processing applications. In this paper, a BPC system is introduced that improves over state-of-the-art performance in BPC using a new part-of-speech (POS) tag set. The new POS tag set, ERTS, reflects some of the morphological features specific to Modern Standard Arabic. ERTS explicitly encodes definiteness, number and gender information, increasing the number of tags from 25 in the standard LDC reduced tag set to 75 tags. For the BPC task, we introduce a more language-specific set of definitions for the base phrase annotations. We employ a support vector machine approach for both the POS tagging and the BPC processes. The POS tagging performance using this enriched tag set, ERTS, is at 96.13% accuracy. In the BPC experiments, we vary the feature set along two factors: the POS tag set and a set of explicitly encoded morphological features. Using the ERTS POS tag set, BPC achieves the highest overall F-score (β=1) of 96.33% on 10 different chunk types, outperforming the use of the standard POS tag set even when explicit morphological features are present.
Overcoming Vocabulary Sparsity in MT Using Lattices
Cited by 2 (0 self)
Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation.
Arabic Diacritization in the Context of Statistical Machine Translation
Cited by 1 (0 self)
Diacritics in Arabic are optional orthographic symbols typically representing short vowels. Most Arabic text is underspecified for diacritics. However, we do observe partial diacritization depending on genre and domain. In this paper, we investigate the impact of Arabic diacritization on statistical machine translation (SMT). We define several diacritization schemes ranging from full to partial diacritization. We explore the impact of the defined schemes on SMT in two different modes which tease apart the effect of diacritization on the alignment and its consequences on decoding. Our results show that none of the partial diacritization schemes significantly varies in performance from the no-diacritization baseline despite the increase in the number of types in the data. However, a full diacritization scheme performs significantly worse than no diacritization. Crucially, our research suggests that the SMT performance is positively correlated with the increase in the number of tokens correctly affected by a diacritization scheme and the high F-score of the automatic assignment of the particular diacritic.
SemEval-2007 Task 18: Arabic Semantic Labeling
Cited by 1 (0 self)
In this paper, we present the details of the Arabic Semantic Labeling task. We describe some of the features of Arabic that are relevant for the task. The task comprises two subtasks: Arabic word sense disambiguation and Arabic semantic role labeling. The task focuses on Modern Standard Arabic.

