Results 1 - 10
of
11
An unsupervised morpheme-based hmm for hebrew morphological disambiguation
- In COLING/ACL2006
, 2006
"... Morphological disambiguation is the process of assigning one set of morphological features to each individual word in a text. When the word is ambiguous (there are several possible analyses for the word), a disambiguation procedure based on the word context must be applied. This paper deals with mor ..."
Abstract
-
Cited by 29 (6 self)
- Add to MetaCart
Morphological disambiguation is the process of assigning one set of morphological features to each individual word in a text. When the word is ambiguous (there are several possible analyses for the word), a disambiguation procedure based on the word context must be applied. This paper deals with morphological disambiguation of the Hebrew language, which combines morphemes into a word in both agglutinative and fusional ways. We present an unsupervised stochastic model – the only resource we use is a morphological analyzer – which deals with the data sparseness problem caused by the affixational morphology of the Hebrew language. We present a text encoding method for languages with affixational morphology in which the knowledge of word formation rules (which are quite restricted in Hebrew) helps in the disambiguation. We adapt HMM algorithms for learning and searching this text representation, in such a way that segmentation and tagging can be learned in parallel in one step. Results on a large scale evaluation indicate that this learning improves disambiguation for complex tag sets. Our method is applicable to other languages with affix morphology. 1
Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach
, 2007
"... ..."
Choosing an optimal architecture for segmentation and POS-tagging of Modern Hebrew
- In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
, 2005
"... A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech (POS) tagging in Semitic languages concerns the choice of the input-output terminal symbols over which the probability distributions are defined. In this paper we develop a segmenter and a tagger f ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
A major architectural decision in designing a disambiguation model for segmentation and Part-of-Speech (POS) tagging in Semitic languages concerns the choice of the input-output terminal symbols over which the probability distributions are defined. In this paper we develop a segmenter and a tagger for Hebrew based on Hidden Markov Models (HMMs). We start out from a morphological analyzer and a very small morphologically annotated corpus. We show that a model whose terminal symbols are word segments (=morphemes), is advantageous over a word-level model for the task of POS tagging. However, for segmentation alone, the morpheme-level model has no significant advantage over the word-level model. Error analysis shows that both models are not adequate for resolving a common type of segmentation ambiguity in Hebrew – whether or not a word in a written text is prefixed by a definiteness marker. Hence, we propose a morphemelevel model where the definiteness morpheme is treated as a possible feature of morpheme terminals. This model exhibits the best overall performance, both in POS tagging and in segmentation. Despite the small size of the annotated corpus available for Hebrew, the results achieved using our best model are on par with recent
Hebrew computational linguistics: Past and future
- Artificial Intelligence Review
, 2004
"... This paper reviews the current state of the art in Natural Language Processing for Hebrew, both theoretical and practical. The Hebrew language, like other Semitic languages, poses special challenges for developers of programs for natural language processing: the writing system, rich morphology, uniq ..."
Abstract
-
Cited by 14 (4 self)
- Add to MetaCart
This paper reviews the current state of the art in Natural Language Processing for Hebrew, both theoretical and practical. The Hebrew language, like other Semitic languages, poses special challenges for developers of programs for natural language processing: the writing system, rich morphology, unique word formation process of roots and patterns, lack of linguistic corpora that document language usage, all contribute to making computational approaches to Hebrew challenging. The paper briefly reviews the field of computational linguistics and the problems it addresses, describes the special difficulties inherent to Hebrew (as well as to other Semitic languages), surveys a wide variety of past and ongoing works and attempts to characterize future needs and possible solutions. 1
Can you tag the modal? you should
- In Proceeding of COLING-ACL-07
, 2007
"... Computational linguistics methods are typically first developed and tested in English. When applied to other languages, assumptions from English data are often applied to the target language. One of the most common such assumptions is that a “standard” part-of-speech (POS) tagset can be used across ..."
Abstract
-
Cited by 10 (6 self)
- Add to MetaCart
(Show Context)
Computational linguistics methods are typically first developed and tested in English. When applied to other languages, assumptions from English data are often applied to the target language. One of the most common such assumptions is that a “standard” part-of-speech (POS) tagset can be used across languages with only slight variations. We discuss in this paper a specific issue related to the definition of a POS tagset for Modern Hebrew, as an example to clarify the method through which such variations can be defined. It is widely assumed
Tagging a Hebrew Corpus: The Case of Participles
"... We report on an effort to build a corpus of Modern Hebrew tagged with parts of speech and morphology. We designed a tagset specific to Hebrew while focusing on 4 aspects: the tagset should be consistent with common linguistic knowledge; there should be maximal agreement among taggers as to the tags ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
We report on an effort to build a corpus of Modern Hebrew tagged with parts of speech and morphology. We designed a tagset specific to Hebrew while focusing on 4 aspects: the tagset should be consistent with common linguistic knowledge; there should be maximal agreement among taggers as to the tags assigned to maintain consistency; the tagset should be useful for machine taggers and learning algorithms; and the tagset should be effective for applications relying on the tags as input features. In this paper, we illustrate these issues by explaining our decision to introduce a tag for participles in Hebrew. We explain how this tag is defined, and how it helped us improve the manual tagging accuracy to a high-level, while improving automatic tagging and helping in the task of syntactic chunking. 1
Morphological Disambiguation of Hebrew
, 2007
"... (Chairman of M.A Committee) 1 Contents 1 ..."
(Show Context)
unknown title
"... Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a m ..."
Abstract
- Add to MetaCart
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a part-of-speech (POS) category. Semitic words may be ambiguous with regard to their segmentation as well as to the POS tags assigned to each segment. When designing POS taggers for Semitic languages, a major architectural decision concerns the choice of the atomic input tokens (terminal symbols). If the tokenization is at the word level, the output tags must be complex, and represent both the segmentation of the word and the POS tag assigned to each word segment. If the tokenization is at the segment level, the input itself must encode the different alternative segmentations of the words, while the output consists of standard POS tags. Comparing these two alternatives is not trivial, as the choice between them may have global effects on the grammatical model. Moreover, intermediate levels of tokenization between these two extremes are conceivable, and, as we aim to show, beneficial. To the best of our knowledge, the problem of tokenization for POS tagging of Semitic languages has not been addressed before in full generality. In this paper, we study this problem for the purpose of POS tagging of Modern Hebrew texts. After extensive error analysis of the two simple tokenization