Results 1 - 10 of 19
INQUERY System Overview
In Proceedings of the TIPSTER Text Program (Phase I), 1994
Cited by 37 (0 self)
such as words, phrases, paragraphs, or manually assigned keywords) and different versions of the query (such as natural language and Boolean) can be combined in a consistent probabilistic framework. This type of "data fusion" has been known to be effective in the information retrieval context for a number of years, and was one of the primary motivations for developing the inference net approach. Another feature of the inference net approach is the ability to capture complex structure in the network representing the information need (i.e. the query). A practical consequence of this is that complex Boolean queries can be evaluated as easily as natural language queries and produce ranked output. It is also possible to represent "rule-based" or "concept-based" queries in the same probabilistic framework. This has led to us concentrating on automatic analysis of queries and techniques for enhancing queries rather than on in-depth analysis of the documents in the database. In general, it is
Recognizing Acronyms and their Definitions
ISRI (Information Science Research Institute) UNLV, 1999
Cited by 35 (0 self)
This paper introduces an automatic method for finding acronyms and their definitions in free text. The method is based on an inexact pattern matching algorithm applied to text surrounding the possible acronym. Evaluation shows both high recall and precision for a set of documents randomly selected from a larger set of full text documents.
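To make the idea of matching an acronym against its surrounding text concrete, here is a minimal Python sketch. It is not the algorithm from the paper: it assumes the definition directly precedes a parenthesized acronym, matches acronym letters to word initials from right to left, and tolerates unmatched stopwords. The stopword list, the window size, and the function names are all assumptions made for illustration.

```python
import re

# Assumed stopword list; a real system would use a larger one.
STOPWORDS = {"of", "the", "and", "for", "in", "on", "to", "from", "a", "an"}


def find_definition(words, acronym):
    """Match acronym letters to word initials, scanning right to left.

    Stopwords that do not match the current letter are skipped (the
    "inexact" part of this sketch). Returns the matching words, or None.
    """
    letters = acronym.lower()
    i = len(letters) - 1   # index of the acronym letter still to match
    j = len(words) - 1     # index of the word being examined
    while i >= 0 and j >= 0:
        word = words[j].lower()
        if word[0] == letters[i]:
            i -= 1                      # this word accounts for one letter
        elif word not in STOPWORDS:
            return None                 # a content word breaks the match
        j -= 1
    if i >= 0:
        return None                     # ran out of words before letters
    return words[j + 1:]


def acronyms_with_definitions(text):
    """Collect acronyms written as 'Long Form (LF)' and their definitions."""
    results = {}
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        acronym = match.group(1)
        preceding = re.findall(r"[A-Za-z]+", text[:match.start()])
        # Assumed window: at most twice as many words as acronym letters.
        window = preceding[-2 * len(acronym):]
        definition = find_definition(window, acronym)
        if definition:
            results[acronym] = " ".join(definition)
    return results
```

Calling `acronyms_with_definitions` on a sentence that contains "Knowledge Discovery from Databases (KDD)" would pair KDD with the four preceding words, because "from" is skipped as a stopword.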
Text Mining with Information Extraction
AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, 2002
Cited by 21 (0 self)
The popularity of the Web and the large number of documents available in electronic form have motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems. We present a general text-mining framework called DiscoTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other
An exploration of entity models, collective classification and relation description
In Proceedings of KDD Workshop on Link Analysis and Group Detection, 2004
Cited by 13 (1 self)
Traditional information retrieval typically represents data using a bag of words; data mining typically uses a highly structured database representation. This paper explores the middle ground using a representation which we term entity models, in which questions about structured data may be posed and answered, but the complexities and task-specific restrictions of ontologies are avoided. An entity model is a language model or word distribution associated with an entity, such as a person, place or organization. Using these per-entity language models, entities may be clustered, links may be detected or described with a short summary, entities may be collectively classified, and question answering may be performed. On a corpus of entities extracted from newswire and the Web, we group entities by profession with 90% accuracy, improve accuracy further on the task of classifying politicians as liberal or conservative using collective classification and conditional random fields, and answer questions about “who a person is” with mean reciprocal rank (MRR) of 0.52.
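For readers unfamiliar with the MRR figure quoted in this abstract, the following Python sketch shows the standard definition: for each question, take the reciprocal of the rank at which the first correct answer appears (0 if it never appears), then average over questions. The function name and the single-correct-answer input format are assumptions for illustration, not the evaluation code used in the paper.

```python
def mean_reciprocal_rank(ranked_lists, correct_answers):
    """Average of 1/rank of the correct answer over all questions.

    ranked_lists[i] is the system's ranked answers for question i and
    correct_answers[i] is its single correct answer (an assumption of
    this sketch). A question whose answer is missing contributes 0.
    """
    if not ranked_lists:
        raise ValueError("at least one question is required")
    total = 0.0
    for ranking, answer in zip(ranked_lists, correct_answers):
        if answer in ranking:
            total += 1.0 / (ranking.index(answer) + 1)  # ranks start at 1
    return total / len(ranked_lists)
```

Under this definition, an MRR of 0.52 means the correct answer sits, loosely speaking, around the second position on average.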
Capturing Term Dependencies using a Sentence Tree based Language Model
2002
Cited by 9 (2 self)
We describe a new probabilistic Sentence Tree Language Modeling approach that captures term dependency patterns in Topic Detection and Tracking's (TDT) Story Link Detection task. New features of the approach include modeling the syntactic structure of sentences in documents by a sentence-bin approach and a computationally efficient algorithm for capturing the most significant sentence level term dependencies using a Maximum Spanning Tree approach, similar to Van Rijsbergen's modeling of document-level term dependencies.
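As a rough illustration of the maximum spanning tree step mentioned in this abstract, the Python sketch below weights each pair of terms by the number of sentences in which both occur and then keeps the heaviest edges that do not form a cycle (Kruskal's algorithm with a union-find structure). The co-occurrence count is a stand-in chosen for simplicity; the abstract does not say which dependency statistic the authors use, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(sentences):
    """Count, for each unordered term pair, the sentences containing both."""
    weights = Counter()
    for sentence in sentences:
        terms = sorted(set(sentence.lower().split()))
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    return weights


def maximum_spanning_tree(weights):
    """Kruskal's algorithm, taking the heaviest edges first."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Sort by descending weight; ties are broken alphabetically.
    for (a, b), w in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:            # adding the edge creates no cycle
            parent[root_a] = root_b
            tree.append((a, b, w))
    return tree
```

The resulting tree keeps one dependency link per term (minus one), which is what makes this family of models cheaper than modeling all pairwise dependencies.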
Applications of Machine Learning in Information Retrieval
1997
Cited by 8 (0 self)
Information retrieval systems provide access to collections of thousands, or millions, of documents, from which, by providing an appropriate description, users can recover any one. Typically, users iteratively refine the descriptions they provide to satisfy their needs, and retrieval systems can utilize user feedback on selected documents to indicate the accuracy of
Dynamic Composition of Information Retrieval Techniques
Journal of Intelligent Information Systems, 2004
Cited by 5 (4 self)
This paper presents a new approach to information retrieval (IR) based on run-time selection of the best set of techniques to respond to a given query. A technique is selected based on its projected effectiveness with respect to the specific query, the load on the system, and a time-dependent utility function. The paper examines two fundamental questions: (1) can the selection of the best IR techniques be performed at run-time with minimal computational overhead? and (2) is it possible to construct a reliable probabilistic model of the performance of an IR technique that is conditioned on the characteristics of the query? We show that both of these questions can be answered positively. These results suggest a new system design that carries a great potential to improve the quality of service of future IR systems.
TREC2005 Enterprise Track Experiments at BUPT
Cited by 3 (1 self)
This paper introduces and analyzes experiments to find valid methods and features for enterprise search. Two main experiments were conducted: one retrieves emails that contain the required information from all the emails of an enterprise, and the other tries to find experts who are helpful in a particular field. Some features of the intranet dataset, such as the subject, the author, the date and the thread, proved useful when searching for an email. A new two-stage ranking method, which differs from traditional IR, is introduced for expert search.
A comparison of feature selection methods for an evolving RSS feed corpus
Information Processing & Management, 2006
Cited by 3 (0 self)
Previous researchers have attempted to detect significant topics in news stories and blogs through the use of word frequency-based methods applied to RSS feeds. In this paper, three statistical feature selection methods, χ², Mutual Information (MI) and Information Gain (I), are proposed as alternative approaches for ranking term significance in an evolving RSS feed corpus. The extent to which the three methods agree with each other on determining the degree of the significance of a term on a certain date is investigated, as well as the assumption that larger values tend to indicate more significant terms. An experimental evaluation was carried out with 39 different levels of data reduction to evaluate the three methods for differing degrees of significance. The three methods showed a significant degree of disagreement for a number of terms assigned an extremely large value. Hence, the assumption that the larger a value, the higher the degree of the significance of a term should be treated cautiously. Moreover, MI and I show significant disagreement. This suggests that MI is different in the way it ranks significant terms, as MI does not take the absence of a term into account, although I does. I, however, has a higher degree of term reduction than MI and χ². This can result in losing some significant terms. In summary, χ² seems to be the best method to determine term significance for RSS feeds, as χ² identifies both types of significant behavior. The χ² method, however, is far from perfect, as an extremely high value can be assigned to relatively insignificant terms.
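The three measures compared in this abstract can all be computed from a 2x2 contingency table. The Python sketch below uses the textbook formulations common in the text categorization literature; the paper may define them slightly differently. The cell names are assumptions: `a` counts documents on the target date that contain the term, `b` documents on other dates that contain it, `c` documents on the target date without it, and `d` documents on other dates without it.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for term/date independence in a 2x2 table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Pointwise mutual information between 'term present' and 'on date'.

    Only the term-present, on-date cell enters the numerator, so the
    absence of a term is not rewarded.
    """
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + c) * (a + b)))


def information_gain(a, b, c, d):
    """Expected mutual information over all four cells of the table.

    Unlike pointwise MI, the term-absent cells (c, d) also contribute.
    """
    n = a + b + c + d
    cells = [
        (a, a + b, a + c),  # term present, on date
        (b, a + b, b + d),  # term present, other dates
        (c, c + d, a + c),  # term absent, on date
        (d, c + d, b + d),  # term absent, other dates
    ]
    total = 0.0
    for joint, term_marginal, date_marginal in cells:
        if joint:
            total += joint / n * math.log2(joint * n / (term_marginal * date_marginal))
    return total
```

The difference between the last two functions mirrors the point made in the abstract: MI looks only at the cell where the term is present, while I also counts evidence from the term's absence.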
Entity Models: Construction and Applications
Cited by 1 (0 self)
We propose entity language models, a probabilistic representation of the language used to describe a named entity (person, organization, or location). The model is purely statistical and constructed from snippets of text surrounding mentions of an entity. We evaluate the effectiveness of entity models in three tasks: fact-based question answering, classification into pre-defined groups, and description of the relationship between two entities. The results on all tasks are promising.
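One simple way to picture an entity language model built from snippets is a smoothed unigram distribution over the words that surround mentions of the entity. The Python sketch below makes several assumptions that are not stated in the abstract: whitespace tokenization, add-alpha smoothing, and classification by picking the entity whose model gives a piece of text the highest log-likelihood. All function names are hypothetical.

```python
import math
from collections import Counter


def build_entity_model(snippets):
    """Word counts over all text snippets that mention one entity."""
    counts = Counter()
    for snippet in snippets:
        counts.update(snippet.lower().split())
    return counts


def log_likelihood(text, model, vocab_size, alpha=1.0):
    """Log-probability of `text` under an add-alpha smoothed unigram model."""
    total = sum(model.values())
    score = 0.0
    for word in text.lower().split():
        score += math.log((model[word] + alpha) / (total + alpha * vocab_size))
    return score


def most_likely_entity(text, models):
    """Return the entity name whose model best explains `text`."""
    vocab = set()
    for model in models.values():
        vocab.update(model)
    return max(models, key=lambda name: log_likelihood(text, models[name], len(vocab)))
```

The same scoring function could in principle rank candidate entities for a question, which is one way to read the question answering task mentioned in the abstract.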

