Results 1 - 4 of 4
Specific Features to Enhance Arabic Named Entity Recognition
2008
Abstract: The named entity recognition (NER) task has been garnering significant attention because it has been shown to improve the performance of many natural language processing applications. More recently, we are starting to see a surge in the development of NER systems for languages other than English. With the relative abundance of resources for the Arabic language and a certain degree of maturation in the state of the art for processing Arabic, it is natural to see interest in developing NER systems for the language. In this paper, we investigate the impact of using different sets of features, both language independent and language specific, in a discriminative machine learning framework, namely Support Vector Machines. We explore lexical, contextual and morphological features on nine data sets of different genres and annotations. We systematically measure the impact of the different features in isolation and in combination. We achieve the highest performance using a combination of all features, F1=82.71. Essentially, combining language-independent features with language-specific ones yields the best performance on all the genres of text we investigate. However, at the class level, we observe that the different classes of named entities benefit differently from the morphological features employed.
Adapting a resource-light highly multilingual Named Entity Recognition system to Arabic
We present a working Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabic; instead, a highly multilingual, almost language-independent NER system was adapted to also cover Arabic. Arabic, a Semitic language, differs substantially from the Indo-European and Finno-Ugric languages currently covered. This paper thus describes which Arabic language-specific resources had to be developed and which changes had to be made to the otherwise language-independent rule set to make it applicable to Arabic. The evaluation results achieved are generally satisfactory, but could be improved for certain entity types.
Personal Name Resolution in Email: A Heuristic Approach
2008
Much of the work to date on searching email has focused on personal information management. Archival access poses new challenges, including automatic association of references to unfamiliar individuals using whatever information is available about those people. This paper describes a computational approach to that task motivated by intuitions about the ways people might explore an email collection to find that information. The proposed approach makes use of context in a flexible and adaptive manner. Two techniques for context expansion are presented: a mixture model that combines evidence from each context to rank candidates, and a cutoff model that ranks candidates based on the closest context in which any suitable evidence was found. Both models rely on mentions that could be resolved to a common identity as evidence for the resolution. Results on three relatively small collections indicate that the accuracy of our approach compares favorably with that of the best known technique, and results on the full CMU Enron collection indicate that the approach presented in this paper scales well to larger email collections.
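The difference between the two context-expansion techniques named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the context ordering, the weights, the evidence scores, and the addresses are all hypothetical values chosen only to show that the two models can rank the same candidates differently.

```python
from collections import defaultdict

def mixture_rank(contexts, weights):
    """Rank candidates by a weighted sum of the evidence from every context."""
    scores = defaultdict(float)
    for evidence, weight in zip(contexts, weights):
        for candidate, value in evidence.items():
            scores[candidate] += weight * value
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))

def cutoff_rank(contexts):
    """Rank candidates using only the closest context that has any evidence."""
    for evidence in contexts:  # contexts are ordered from closest to widest
        if evidence:
            return sorted(evidence, key=lambda c: (-evidence[c], c))
    return []

# Hypothetical evidence for one ambiguous mention, closest context first.
contexts = [
    {},                                                   # no evidence found
    {"alice@example.com": 0.6, "alan@example.com": 0.4},  # nearer context
    {"alan@example.com": 0.9, "alice@example.com": 0.1},  # wider context
]
weights = [0.5, 0.3, 0.2]  # assumed mixture weights, not taken from the paper

mixture_order = mixture_rank(contexts, weights)
cutoff_order = cutoff_rank(contexts)
```

With these made-up numbers the mixture model prefers the candidate favored by the wider context, while the cutoff model stops at the first context that offers any evidence and prefers the other candidate.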
A Rule-Based Approach for Tagging Non-Vocalized Arabic Words
2008
Abstract: In this work, we present a tagging system that assigns tags to the words of a non-vocalized Arabic text. The proposed tagging system passes through three levels of analysis. The first level is a lexical analyzer composed of a lexicon containing all fixed words and particles, such as prepositions and pronouns. The second level is a morphological analyzer, which relies on word structure, using patterns and affixes to determine the word class. The third level is a syntax analyzer, or grammatical tagger, which assigns grammatical tags to words based on their context or their position in the sentence. The syntax analyzer consists of two stages: the first stage depends on specific keywords that determine the tag of the following word; the second stage is a reversed parsing technique, which scans the available grammars of the Arabic language to obtain the class of a single ambiguous word in the sentence. We tested the proposed system on a corpus of 2355 words. Experimental results showed that the proposed system achieved a success rate approaching 94% of the total number of words in the sample used in the study. Keywords: Part-of-speech tagging, lexical analyzer, morphological analyzer, Arabic language processing.
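The layered design described in this abstract can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the lexicon entries, affix rules, keyword rules, tag names, and transliterated example words are all hypothetical, and the reversed parsing stage of the syntax analyzer is not modeled.

```python
# Level 1: toy lexicon of fixed words and particles (hypothetical entries).
LEXICON = {"fy": "PARTICLE", "mn": "PARTICLE"}

# Level 2: toy prefix patterns standing in for the morphological analyzer.
PREFIX_RULES = [("al", "NOUN"), ("y", "VERB")]

# Level 3, stage 1: keywords that determine the tag of the following word.
KEYWORD_NEXT_TAG = {"fy": "NOUN", "mn": "NOUN"}

def lexical(word):
    """Return the tag of a fixed word or particle, or None if not listed."""
    return LEXICON.get(word)

def morphological(word):
    """Return a tag suggested by a known prefix, or None if no rule fires."""
    for prefix, tag in PREFIX_RULES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return tag
    return None

def tag_sentence(words):
    """Tag each word by trying the three levels in order."""
    tags = []
    for i, word in enumerate(words):
        tag = lexical(word) or morphological(word)
        if tag is None and i > 0:
            # Fall back to the keyword rule triggered by the previous word.
            tag = KEYWORD_NEXT_TAG.get(words[i - 1])
        tags.append(tag or "UNKNOWN")
    return tags

def success_rate(predicted, gold):
    """Fraction of words whose predicted tag matches the reference tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

tags = tag_sentence(["yktb", "fy", "ktab"])
rate = success_rate(tags, ["VERB", "PARTICLE", "NOUN"])
```

In this toy run the first word is resolved by the prefix rule, the second by the lexicon, and the third only by the keyword rule of the preceding particle, which mirrors the order of the three levels.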

