Results 1 - 10 of 11
Head-Driven Statistical Models for Natural Language Parsing
, 2003
Cited by 780 (13 self)
This article describes three statistical models for natural language parsing. The models extend methods from probabilistic context-free grammars to lexicalized grammars, leading to approaches in which a parse tree is represented as the sequence of decisions corresponding to a head-centered, top-down derivation of the tree. Independence assumptions then lead to parameters that encode the X-bar schema, subcategorization, ordering of complements, placement of adjuncts, bigram lexical dependencies, wh-movement, and preferences for close attachment. All of these preferences are expressed by probabilities conditioned on lexical heads. The models are evaluated on the Penn Wall Street Journal Treebank, showing that their accuracy is competitive with other models in the literature. To gain a better understanding of the models, we also give results on different constituent types, as well as a breakdown of precision/recall results in recovering various types of dependencies. We analyze various characteristics of the models through experiments on parsing accuracy, by collecting frequencies of various structures in the treebank, and through linguistically motivated examples. Finally, we compare the models to others that have been applied to parsing the treebank, aiming to give some explanation of the difference in performance of the various models.
Automating the Construction of Internet Portals with Machine Learning
- Information Retrieval
, 2000
Cited by 141 (3 self)
Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are ...
Multi-label text classification with a mixture model trained by EM
- AAAI 99 Workshop on Text Learning
, 1999
Cited by 95 (3 self)
In many important document classification tasks, documents may each be associated with multiple class labels. This paper describes a Bayesian classification approach in which the multiple classes that comprise a document are represented by a mixture model. While the labeled training data indicates which classes were responsible for generating a document, it does not indicate which class was responsible for generating each word. Thus we use EM to fill in this missing value, learning both the distribution over mixture weights and the word distribution in each class's mixture component. We describe the benefits of this model and present preliminary results with the Reuters-21578 data set.
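The EM idea in this abstract can be illustrated with a much-simplified sketch. It is not the paper's model: it assumes per-document mixture weights over the document's labeled classes, a fixed vocabulary that covers every word, and a small smoothing constant. The names `em_multilabel`, `smooth`, and `n_iter` are hypothetical. At least one single-label document is needed to break the symmetry of the uniform initialization.

```python
from collections import defaultdict

def em_multilabel(docs, vocab, n_iter=20, smooth=0.01):
    """Simplified EM sketch. docs is a list of (words, labels) pairs.

    Returns (word_dist, doc_weights): per-class word distributions and
    per-document mixture weights over that document's labeled classes.
    Assumes every word appears in vocab.
    """
    classes = sorted({c for _, labels in docs for c in labels})
    # Uniform starting point for both sets of parameters.
    word_dist = {c: {w: 1.0 / len(vocab) for w in vocab} for c in classes}
    doc_weights = [{c: 1.0 / len(labels) for c in labels} for _, labels in docs]

    for _ in range(n_iter):
        counts = {c: defaultdict(float) for c in classes}
        new_weights = []
        for (words, labels), weights in zip(docs, doc_weights):
            if not words:
                new_weights.append(weights)
                continue
            resp_sum = {c: 0.0 for c in labels}
            for w in words:
                # E-step: posterior over which labeled class generated w.
                joint = {c: weights[c] * word_dist[c][w] for c in labels}
                z = sum(joint.values())
                for c in labels:
                    r = joint[c] / z
                    counts[c][w] += r
                    resp_sum[c] += r
            # M-step (weights): average responsibility per class.
            new_weights.append({c: resp_sum[c] / len(words) for c in labels})
        # M-step (word distributions): smoothed expected counts.
        for c in classes:
            total = sum(counts[c].values()) + smooth * len(vocab)
            word_dist[c] = {w: (counts[c][w] + smooth) / total for w in vocab}
        doc_weights = new_weights
    return word_dist, doc_weights

# Toy data, invented for illustration only.
docs = [
    (["goal", "match"], ["sports"]),
    (["vote", "election"], ["politics"]),
    (["goal", "vote"], ["sports", "politics"]),
]
vocab = ["goal", "match", "vote", "election"]
word_dist, doc_weights = em_multilabel(docs, vocab)
```

In the third toy document, the labels say only that both classes are present. EM fills in the missing word-level assignment, so "goal" ends up more probable under "sports" than under "politics".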
Extracting the Names of Genes and Gene Products with a Hidden Markov Model
, 2000
Cited by 92 (4 self)
We report the results of a study into the use of a linear interpolating hidden Markov model (HMM) for the task of extracting technical terminology from MEDLINE abstracts and texts in the molecular-biology domain. This is the first stage in a system that will extract event information for automatically updating biology databases. We trained the HMM entirely with bigrams based on lexical and character features in a relatively small corpus of 100 MEDLINE abstracts that were marked up by domain experts with term classes such as proteins and DNA. Using cross-validation methods we achieved an F-score of 0.73 and we examine the contribution made by each part of the interpolation model to overcoming data sparseness.
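As a rough illustration of how linear interpolation counters data sparseness, the sketch below mixes a bigram estimate with a unigram estimate. It is a generic textbook form, not the paper's interpolation model, which also uses character features. The function names and the fixed weight `lam=0.7` are assumptions.

```python
from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def interpolated_prob(word, prev, unigrams, bigrams, lam=0.7):
    """Return lam * P_ML(word | prev) + (1 - lam) * P_ML(word).

    lam is a hypothetical fixed weight, not a value from the paper.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    # Number of times prev occurs as the first element of a bigram.
    context = sum(c for (first, _), c in bigrams.items() if first == prev)
    p_bi = bigrams[(prev, word)] / context if context else 0.0
    return lam * p_bi + (1 - lam) * p_uni

# Toy sentence, invented for illustration only.
tokens = "the p53 protein binds the p53 gene".split()
unigrams, bigrams = train_counts(tokens)
seen = interpolated_prob("protein", "p53", unigrams, bigrams)
unseen = interpolated_prob("gene", "the", unigrams, bigrams)
```

The bigram ("the", "gene") never occurs in the toy data, so a pure bigram estimate would be zero. The interpolated estimate stays positive because the unigram term contributes.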
Information Extraction with HMM Structures Learned by Stochastic Optimization
- In Proceedings of the Seventeenth National Conference on Artificial Intelligence
, 2000
Cited by 89 (2 self)
Recent research has demonstrated the strong performance of hidden Markov models applied to information extraction -- the task of populating database slots with corresponding phrases from text documents. A remaining problem, however, is the selection of state-transition structure for the model. This paper demonstrates that extraction accuracy strongly depends on the selection of structure, and presents an algorithm for automatically finding good structures by stochastic optimization. Our algorithm begins with a simple model and then performs hill-climbing in the space of possible structures by splitting states and gauging performance on a validation set. Experimental results show that this technique finds HMM models that almost always outperform a fixed model, and have superior average performance across tasks.
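The search loop described in this abstract can be sketched generically as greedy hill-climbing. This is an illustration under assumptions, not the authors' algorithm. A structure is reduced to a tuple of state counts, a "split" adds one state to one position, and `toy_validation_score` is an invented stand-in for training an HMM and scoring it on a validation set.

```python
def hill_climb(initial, neighbors, score, max_steps=20):
    """Greedy hill-climbing: move to the best neighbor while it improves."""
    current, current_score = initial, score(initial)
    for _ in range(max_steps):
        candidates = [(score(s), s) for s in neighbors(current)]
        if not candidates:
            break
        best_score, best = max(candidates, key=lambda pair: pair[0])
        if best_score <= current_score:
            break  # no neighbor improves the validation score
        current, current_score = best, best_score
    return current, current_score

def split_neighbors(structure):
    """Each neighbor splits one state, adding one state at one position."""
    return [structure[:i] + (n + 1,) + structure[i + 1:]
            for i, n in enumerate(structure)]

def toy_validation_score(structure):
    """Hypothetical stand-in for validation performance; peaks at (2, 3, 1)."""
    target = (2, 3, 1)
    return -sum((a - b) ** 2 for a, b in zip(structure, target))

# Start from the simplest structure: one state per position.
result = hill_climb((1, 1, 1), split_neighbors, toy_validation_score)
```

On this toy score the search moves from `(1, 1, 1)` to `(2, 3, 1)` and then stops, because every further split lowers the score.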
A Machine Learning Approach to Building Domain-Specific Search Engines
- In Proceedings of the 16th International Joint Conference on Artificial Intelligence
, 1999
Cited by 68 (3 self)
Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers available at www.cora.justresearch.com.
Applying Machine Learning for High Performance Named-Entity Extraction
, 1999
Cited by 25 (0 self)
This paper describes a machine learning approach to build an efficient, accurate and fast name spotting system. Finding names in free text is an important task in addressing real-world text-based applications. Most previous approaches have been based on carefully hand-crafted modules encoding linguistic knowledge specific to the language and document genre. Such approaches have two drawbacks: they require large amounts of time and linguistic expertise to develop, and they are not easily portable to new languages and genres. This paper describes an extensible system which automatically combines weak evidence for name extraction. This evidence is gathered from easily available sources: part-of-speech tagging, dictionary lookups, and textual information such as capitalization and punctuation. Individually, each piece of evidence is insufficient for robust name detection. However, the combination of evidence, through standard machine learning techniques, yields a system that achieves performance equivalent to the best existing hand-crafted approaches.
A Survey of Emerging Trend Detection in Textual Data Mining
, 2003
Cited by 13 (2 self)
In this chapter we describe several systems that detect emerging trends in textual data. Some of the systems are semi-automatic, requiring user input to begin processing; others are fully automatic, producing output from the input corpus without guidance. For each Emerging Trend Detection (ETD) system we describe components including linguistic and statistical features, learning algorithms, training and test set generation, visualization and evaluation. We also provide a brief overview of several commercial products with capabilities for detecting trends in textual data, followed by an industrial viewpoint describing the importance of trend detection tools, and an overview of how such tools are used.
A Rule-Based Named Entity Recognition System for Speech Input
- In Proceedings of the International Conference on Spoken Language Processing
, 2000
Cited by 10 (5 self)
In this paper, we propose a rule-based (transformation-based) named entity recognition system which uses the Brill rule inference approach. To measure its performance, we compare the performance of the rule-based system and IdentiFinder, one of the most successful stochastic systems. In the baseline case (no punctuation and no capitalisation), both systems show almost equal performance. They also have similar performance in the case of additional information such as punctuation, capitalisation and name lists. The performance of both systems degrades linearly with added speech recognition errors, and their rates of degradation are almost equal. These results show that automatic rule inference is a viable alternative to the HMM-based approach to named entity recognition, and that it retains the advantages of a rule-based approach.
Normalization of Non-Standard Words: WS '99 Final Report
- Hopkins University
, 1999
Cited by 5 (0 self)
All areas of language and speech technology must deal, in one way or another, with real text. Real text is messy: many things one finds in text (numbers, abbreviations, dates, currency amounts, acronyms, ...) are not standard words in that one cannot find their properties by looking them up in a dictionary or deriving them morphologically from words that are in a dictionary, nor can one find their pronunciation by an application of "letter-to-sound" rules. For many applications, such non-standard words (NSWs) need to be normalized, or in other words converted into standard words. Since the correct normalization of a given token often depends upon both the local context and the type (genre) of text one is dealing with, "text-normalization" is in general a very hard problem. Typical technology for text-normalization mostly involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text), with the expected result that the techniques, do...

