Automatic Combination of Multiple Ranked Retrieval Systems
1994
Cited by 130 (5 self)
Retrieval performance can often be improved significantly by using a number of different retrieval algorithms and combining the results, in contrast to using just a single retrieval algorithm. This is because different retrieval algorithms, or retrieval experts, often emphasize different document and query features when determining relevance and therefore retrieve different sets of documents. However, it is unclear how the different experts are to be combined, in general, to yield a superior overall estimate. We propose a method by which the relevance estimates made by different experts can be automatically combined to result in superior retrieval performance. We apply the method to two expert combination tasks. The applications demonstrate that the method can identify high performance combinations of experts and also is a novel means for determining the combined effectiveness of experts. 1 Introduction In text retrieval, two heads are definitely better than one. Retrieval performanc...
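The abstract above does not say how the experts' relevance estimates are combined. As a minimal sketch only, assuming a simple weighted sum of per-expert scores (the function name, the weights, and the sample scores are all hypothetical and not taken from the paper):

```python
from collections import defaultdict


def combine_experts(expert_scores, weights):
    """Rank documents by a weighted sum of per-expert relevance scores.

    expert_scores: list of dicts mapping doc id -> score, one dict per expert.
    weights: one float per expert (hypothetical; the abstract does not say
             how such weights would be chosen).
    """
    if len(expert_scores) != len(weights):
        raise ValueError("need exactly one weight per expert")
    combined = defaultdict(float)
    for scores, weight in zip(expert_scores, weights):
        for doc_id, score in scores.items():
            # A document an expert did not retrieve contributes nothing.
            combined[doc_id] += weight * score
    # Highest combined score first; ties broken by doc id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))


# Two experts that retrieve overlapping but different document sets.
expert_a = {"d1": 0.9, "d2": 0.2}
expert_b = {"d2": 0.8, "d3": 0.7}
ranking = combine_experts([expert_a, expert_b], [0.5, 0.5])
```

Here d2 is ranked first because both experts retrieve it, which illustrates the point that experts emphasizing different features retrieve different sets of documents.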
Part-of-Speech Tagging and Partial Parsing
Corpus-Based Methods in Language and Speech, 1996
Cited by 85 (0 self)
m we can carve off next. 'Partial parsing' is a cover term for a range of different techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but non-zero error rate. 1 Tagging The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Feature Selection and Feature Extraction for Text Categorization
In Proceedings of Speech and Natural Language Workshop, 1992
Cited by 84 (2 self)
The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features) despite the large training sets. The extraction of new text features by syntactic analysis and feature clustering was investigated on the Reuters data set. Syntactic indexing phrases, clusters of these phrases, and clusters of words were all found to provide less effective representations than individual words. 1. Introduction Text categorization---the automated assigning of natural language texts to predefined categories based on their content---is a task of increasing importance. Its applications include indexing texts to support document retrieval [1], extracting data from texts [2], and ai...
Feature generation for text categorization using world knowledge
In IJCAI’05, 2005
Cited by 62 (13 self)
We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field.
Discovering Trends in Text Databases
1997
Cited by 47 (0 self)
We describe a system we developed for identifying trends in text documents collected over a period of time. Trends can be used, for example, to discover that a company is shifting interests from one domain to another. Our system uses several data mining techniques in novel ways and demonstrates a method in which to visualize the trends. We also give experiences from applying this system to the IBM Patent Server, a database of U.S. patents. Introduction We address the problem of discovering trends in text databases. We are given a database D of documents. Each document consists of one or more text fields and a timestamp. The unit of text is a word and a phrase is a list of words. (We defer the discussion of more complex structures till the "Methodology" section.) Associated with each phrase is a history of the frequency of occurrence of the phrase, obtained by partitioning the documents based upon their timestamps. The frequency of occurrence in a particular time period is the number o...
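The phrase history described above can be sketched as follows. Because the excerpt is cut off before it finishes defining the frequency, this sketch assumes it means the number of documents in a period that contain the phrase; the function name, the whitespace tokenization, and the sample documents are likewise hypothetical.

```python
from collections import Counter
from datetime import date


def phrase_history(documents, phrase, period_of):
    """Count, per time period, the documents whose text contains `phrase`.

    documents: iterable of (timestamp, text) pairs.
    phrase: a list of words written as a space-separated string.
    period_of: function mapping a timestamp to a period key (e.g. a year).
    Assumption: frequency = number of documents containing the phrase.
    """
    words = phrase.lower().split()
    n = len(words)
    history = Counter()
    for timestamp, text in documents:
        tokens = text.lower().split()
        # The phrase must occur as a contiguous, ordered run of words.
        found = any(tokens[i:i + n] == words
                    for i in range(len(tokens) - n + 1))
        if found:
            history[period_of(timestamp)] += 1
    # Periods with no matching document are simply absent from the result.
    return dict(sorted(history.items()))


docs = [
    (date(1995, 3, 1), "a method for data mining in text"),
    (date(1996, 5, 9), "data mining of patent text"),
    (date(1996, 8, 2), "visualizing data mining results"),
    (date(1996, 9, 1), "mining data"),
]
history = phrase_history(docs, "data mining", lambda ts: ts.year)
```

Partitioning by year is only one choice of period; any function of the timestamp can be passed as `period_of`.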
Natural Language Information Retrieval: TREC-3 Report
In Proceedings of the Fifth Text REtrieval Conference (TREC-5
Cited by 26 (1 self)
In this paper we report on the recent developments in NYU's natural language information retrieval system, especially as related to the 3rd Text Retrieval Conference (TREC-3). The main characteristic of this system is the use of advanced natural language processing to enhance the effectiveness of term-based document retrieval. The system is designed around a traditional statistical backbone consisting of the indexer module, which builds inverted index files from pre-processed documents, and a retrieval engine which searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process user's natural language requests into effective search queries. For the present TREC-3 effort, the total of 3.3 GBytes of text articles have been processed (Tipster disks 1 through 3), i...
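The statistical backbone described above (an indexer that builds inverted index files, plus an engine that searches and ranks) can be illustrated with a toy sketch. This is not NYU's system: the whitespace tokenization, the summed term-frequency score, and all names are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to {doc id: term frequency}.

    documents: dict mapping doc id -> already pre-processed text.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Rank documents by the summed frequency of the query terms they contain.

    The scoring rule is a placeholder; a real engine would use a weighted
    scheme rather than raw counts.
    """
    scores = defaultdict(int)
    for term in query.lower().split():
        # .get avoids creating empty postings for unseen terms.
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {
    "d1": "natural language retrieval",
    "d2": "retrieval engine ranks retrieval results",
    "d3": "parsing text",
}
index = build_inverted_index(docs)
results = search(index, "retrieval engine")
```

Only the postings of the query terms are visited at search time, which is why the index is built once ahead of the queries.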
Optimizing Ranking Functions: A Connectionist Approach to Adaptive Information Retrieval
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, THE UNIVERSITY OF CALIFORNIA, SAN DIEGO, 1994
Cited by 26 (5 self)
This dissertation examines the use of adaptive methods to automatically improve the performance of ranked text retrieval systems. The goal of a ranked retrieval system is to manage a large collection of text documents and to order documents for a user based on the estimated relevance of the documents to the user's information need (or query). The ordering enables the user to quickly find documents of interest. Ranked retrieval is a difficult problem because of the ambiguity of natural language, the large size of the collections, and because of the varying needs of users and varying collection characteristics. We propose and empirically validate general adaptive methods which improve the ability of a large class of retrieval systems to rank documents effectively. Our main adaptive method is to numerically optimize free parameters in a retrieval system by minimizing a non-metric criterion function. The criterion measures how well the system is ranking documents relative to a target ordering, defined by a set of training queries which include the users' desired document orderings. Thus, the system learns parameter settings which better enable it to rank relevant documents before irrelevant. The non-metric approach is interesting because it is a general adaptive method, an alternative to supervised methods for training neural networks in domains in which rank order or prioritization is important. A second adaptive method is also examined, which is applicable to a restricted class of retrieval systems but which permits an analytic solution. The adaptive methods are applied to a number of problems in text retrieval to validate their utility and practical efficiency. The applications include: A dimensionality reduction of vector-based document representations to a vector spa...
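The abstract does not give the form of its criterion function. As a stand-in only, one simple way to measure how well a system's scores agree with a target ordering is the fraction of document pairs ranked in the wrong order; the function name and the sample data below are hypothetical, and this is not claimed to be the dissertation's criterion.

```python
from itertools import combinations


def misordered_fraction(system_scores, target_order):
    """Fraction of document pairs the system fails to order as the target does.

    system_scores: dict mapping doc id -> score produced by the system.
    target_order: list of doc ids, most preferred first.
    A pair counts as wrong when the preferred document does not get a
    strictly higher score than the less preferred one.
    """
    # combinations() preserves list order, so each pair is (better, worse).
    pairs = list(combinations(target_order, 2))
    if not pairs:
        return 0.0
    wrong = sum(1 for better, worse in pairs
                if system_scores[better] <= system_scores[worse])
    return wrong / len(pairs)


target = ["d1", "d2", "d3"]
scores = {"d1": 0.4, "d2": 0.9, "d3": 0.1}
criterion = misordered_fraction(scores, target)
```

A value of 0.0 means the scores reproduce the target ordering exactly. Because this count changes in steps, numerical optimization of free parameters would need a smooth surrogate rather than this measure itself.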
Learning in Intelligent Information Retrieval
In Proceedings of the Eighth International Workshop on Machine Learning, 1991
Cited by 26 (2 self)
Information retrieval (IR) systems are used for finding, within a large text database, those documents containing information needed by a user. The complex and poorly understood semantics of documents and user queries has made feedback and adaptation important characteristics of IR systems. In this paper we briefly survey previous research on machine learning in IR systems and discuss promising areas for future research at the intersection of these two fields. 1 Introduction The goal of information retrieval (IR) techniques is to find, within a large database of documents, those documents which satisfy a user information need. Typically the stored documents are composed of natural language text, though IR techniques have also been applied to databases of stored speech, images, computer source code, and other forms of information. In contrast to conventional database techniques, IR techniques are most useful when the semantics of the objects to be retrieved is unclear, and the relation...
Information Retrieval Using Robust Natural Language Processing
1992
Cited by 21 (5 self)
We developed a prototype information retrieval system which uses advanced natural language processing techniques to enhance the effectiveness of traditional key-word based document retrieval. The backbone of our system is a statistical retrieval engine which performs automated indexing of documents, then search and ranking in response to user queries.

