Results 1 - 10 of 217
Latent dirichlet allocation
Journal of Machine Learning Research, 2003
Cited by 4365 (92 self)
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
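The generative view described in this abstract can be illustrated with a small sketch. It assumes the commonly stated form of the LDA generative process (per-document topic proportions drawn from a Dirichlet distribution, then one topic and one word per position). The fixed document length, the toy vocabulary, and all function names here are illustrative assumptions, not taken from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, topic_word_probs, vocab, length, rng):
    """Generate one document as a mixture over topics.

    alpha            -- Dirichlet parameters, one per topic (assumed values)
    topic_word_probs -- one word distribution per topic (assumed values)
    length           -- fixed here for simplicity (a simplifying assumption)
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        z = sample_categorical(theta, rng)                # topic for this position
        w = sample_categorical(topic_word_probs[z], rng)  # word from that topic
        words.append(vocab[w])
    return theta, words

rng = random.Random(0)
vocab = ["gene", "dna", "ball", "team"]
topic_word_probs = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "genetics" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "sports" topic
]
theta, doc = generate_document([0.5, 0.5], topic_word_probs, vocab, 8, rng)
```

Here `theta` plays the role of the topic probabilities that represent the document. Inference, which the abstract addresses with variational methods, runs in the opposite direction and is not shown.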
An Information-Theoretic Definition of Similarity
In Proceedings of the 15th International Conference on Machine Learning, 1998
Cited by 1243 (0 self)
Similarity is an important and widely used concept. Previous definitions of similarity are tied to a particular application or a form of knowledge representation. We present an information-theoretic definition of similarity that is applicable as long as there is a probabilistic model. We demonstrate how our definition can be used to measure the similarity in a number of different domains.
TextTiling: Segmenting text into multi-paragraph subtopic passages
Computational Linguistics, 1997
Cited by 458 (2 self)
TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of 12 texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including information retrieval and summarization.
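The idea of using lexical co-occurrence to find subtopic shifts can be sketched roughly as follows. This is a much simplified illustration, not the published algorithm: it compares adjacent text blocks by cosine similarity of their word counts and marks a boundary wherever the similarity drops below a fixed threshold. The threshold value, the whitespace tokenization, and the function names are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def gap_similarities(blocks):
    """Similarity across each gap between adjacent text blocks."""
    bags = [Counter(block.lower().split()) for block in blocks]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]

def boundaries(blocks, threshold=0.1):
    """Indices of blocks that start a new segment (assumed fixed threshold)."""
    return [i + 1 for i, s in enumerate(gap_similarities(blocks)) if s < threshold]

blocks = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply",
    "stock markets and prices recovered",
]
print(boundaries(blocks))  # prints [2]
```

In this toy input the second and third blocks share no vocabulary, so the only boundary falls before block 2.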
Query expansion using lexical-semantic relations
In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1994
Cited by 395 (1 self)
Applications such as office automation, news filtering, help facilities in complex systems, and the like require the ability to retrieve documents from full-text databases where vocabulary problems can be particularly severe. Experiments performed on small collections with single-domain thesauri suggest that expanding query vectors with words that are lexically related to the original query words can ameliorate some of the problems of mismatched vocabularies. This paper examines the utility of lexical query expansion in the large, diverse TREC collection. Concepts are represented by WordNet synonym sets and are expanded by following the typed links included in WordNet. Experimental results show this query expansion technique makes little difference in retrieval effectiveness if the original queries are relatively complete descriptions of the information being sought even when the concepts to be expanded are selected by hand. Less well developed queries can be significantly improved by expansion of hand-chosen concepts. However, an automatic procedure that can approximate the set of hand picked synonym sets has yet to be devised, and expanding by the synonym sets that are automatically generated can degrade retrieval performance.
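Expanding a query vector with lexically related words might look like the sketch below. The `related` table is a hypothetical stand-in for WordNet synonym sets and typed links. The down-weighting factor for added terms is an assumption made for illustration, not a value reported in the paper.

```python
def expand_query(query_weights, related, expansion_weight=0.5):
    """Add related terms to a weighted query vector.

    query_weights    -- dict mapping original query terms to weights
    related          -- dict mapping a term to its related terms
                        (hypothetical stand-in for WordNet links)
    expansion_weight -- assumed down-weighting factor for added terms
    """
    expanded = dict(query_weights)
    for term, weight in query_weights.items():
        for rel in related.get(term, ()):
            # Never lower the weight of a term that is already present.
            expanded[rel] = max(expanded.get(rel, 0.0), weight * expansion_weight)
    return expanded

related = {"car": ["automobile", "auto"]}
expanded = expand_query({"car": 1.0, "rental": 1.0}, related)
```

With this input, `expanded` keeps `car` and `rental` at weight 1.0 and adds `automobile` and `auto` at weight 0.5.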
TileBars: Visualization of Term Distribution Information in Full Text Information Access
1995
Cited by 341 (10 self)
The field of information retrieval has traditionally focused on textbases consisting of titles and abstracts. As a consequence, many underlying assumptions must be altered for retrieval from full-length text collections. This paper argues for making use of text structure when retrieving from full text documents, and presents a visualization paradigm, called TileBars, that demonstrates the usefulness of explicit term distribution information in Boolean-type queries. TileBars simultaneously and compactly indicate relative document length, query term frequency, and query term distribution. The patterns in a column of TileBars can be quickly scanned and deciphered, aiding users in making judgments about the potential relevance of the retrieved documents.
KEYWORDS: Information retrieval, Full-length text, Visualization.
INTRODUCTION Information access systems have traditionally focused on retrieval of documents consisting of titles and abstracts. As a consequence, the underlying assumpt...
Modern information retrieval: a brief overview
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2001
Cited by 236 (0 self)
For thousands of years people have realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information; and finding useful information from such collections became a necessity. The field of Information Retrieval (IR) was born in the 1950s out of this necessity. Over the last forty years, the field has matured considerably. Several IR systems are used on an everyday basis by a wide variety of users. This article is a brief overview of the key advances in the field of Information Retrieval, and a description of where the state-of-the-art is at in the field.
A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems
ACM Transactions on Information Systems, 1994
Cited by 212 (34 self)
We present a probabilistic relational algebra (PRA) which is a generalization of standard relational algebra. Here tuples are assigned probabilistic weights giving the probability that a tuple belongs to a relation. Based on intensional semantics, the tuple weights of the result of a PRA expression always conform to the underlying probabilistic model. We also show for which expressions extensional semantics yields the same results. Furthermore, we discuss complexity issues and indicate possibilities for optimization. With regard to databases, the approach allows for representing imprecise attribute values, whereas for information retrieval, probabilistic document indexing and probabilistic search term weighting can be modelled. As an important extension, we introduce the concept of vague predicates which yields a probabilistic weight instead of a Boolean value, thus allowing for queries with vague selection conditions. So PRA implements uncertainty and vagueness in combination with the...
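To make the notion of weighted tuples concrete, here is a minimal sketch of relational operators over relations stored as `tuple -> probability` dictionaries. It uses a purely extensional evaluation that treats all tuples as independent events, which is an assumption of this sketch. It does not implement the intensional semantics the abstract describes, and so it agrees with the underlying probabilistic model only for some expressions. The relation contents and function names are hypothetical.

```python
def select(relation, predicate):
    """Keep the weighted tuples that satisfy a Boolean predicate."""
    return {t: p for t, p in relation.items() if predicate(t)}

def join(left, right, left_idx, right_idx):
    """Equi-join; weights multiply (assumes independent tuples)."""
    out = {}
    for lt, lp in left.items():
        for rt, rp in right.items():
            if lt[left_idx] == rt[right_idx]:
                out[lt + rt] = lp * rp
    return out

def project(relation, indices):
    """Projection; merged duplicates combine as a disjunction of
    independent events: 1 - (1 - p1)(1 - p2)..."""
    out = {}
    for t, p in relation.items():
        key = tuple(t[i] for i in indices)
        out[key] = 1.0 - (1.0 - out.get(key, 0.0)) * (1.0 - p)
    return out

# Hypothetical probabilistic document index: (document, term) -> weight.
doc_term = {("d1", "ir"): 0.8, ("d1", "db"): 0.7, ("d2", "ir"): 0.5}

# Probability that each document is indexed with at least one term.
docs = project(doc_term, [0])
```

For `d1` the projection yields 1 - (0.2 × 0.3) = 0.94, while `d2` keeps its single weight of 0.5.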
An Association Thesaurus for Information Retrieval
In RIAO 94 Conference Proceedings, 1994
Cited by 182 (11 self)
Although commonly used in both commercial and experimental information retrieval systems, thesauri have not demonstrated consistent benefits for retrieval performance, and it is difficult to construct a thesaurus automatically for large text databases. In this paper, an approach, called PhraseFinder, is proposed to construct collection-dependent association thesauri automatically using large full-text document collections. The association thesaurus can be accessed through natural language queries in INQUERY, an information retrieval system based on the probabilistic inference network. Experiments are conducted in INQUERY to evaluate different types of association thesauri, and thesauri constructed for a variety of collections.
1 Introduction
A thesaurus is a set of items (phrases or words) plus a set of relations between these items. Although thesauri are commonly used in both commercial and experimental IR systems, experiments have shown inconsistent effects on retrieval effectiven...
The Effect of Adding Relevance Information in a Relevance Feedback Environment
1994
Cited by 180 (6 self)
The effects of adding information from relevant documents are examined in the TREC routing environment. A modified Rocchio relevance feedback approach is used, with a varying number of relevant documents retrieved by an initial SMART search, and a varying number of terms from those relevant documents used to expand the initial query. Recall-precision evaluation reveals that as the amount of expansion of the query due to adding terms from relevant documents increases, so does the effectiveness. It is observed for this particular experiment that there seems to be a linear relationship between the log of the number of terms added and the recall-precision effectiveness. There also seems to be a linear relationship between the log of the number of known relevant documents and the recall-precision effectiveness.
1 Introduction
Relevance feedback is a commonly accepted method of improving interactive retrieval effectiveness [1, 2]. An initial search is made by the system with a user-supplied ...
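Rocchio-style query expansion of the kind this abstract refers to can be sketched as below. The sketch uses the textbook form of the Rocchio update (original query plus the mean relevant-document vector, minus the mean non-relevant vector) rather than the paper's modified variant, whose details the abstract does not give. The coefficient values, the `max_new_terms` cap, and the toy vectors are assumptions.

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant=(), alpha=1.0, beta=0.75,
            gamma=0.15, max_new_terms=None):
    """Return an expanded query vector (term -> weight).

    query, and each document in relevant / nonrelevant, are dicts of
    term weights. alpha, beta, gamma are assumed textbook-style
    coefficients; max_new_terms caps how many new terms are added.
    """
    scores = defaultdict(float)
    for term, w in query.items():
        scores[term] += alpha * w
    for docs, coef in ((relevant, beta), (nonrelevant, -gamma)):
        for doc in docs:
            for term, w in doc.items():
                scores[term] += coef * w / len(docs)  # mean over the set
    # Rank candidate expansion terms by their feedback score.
    new_terms = sorted((t for t in scores if t not in query),
                       key=lambda t: scores[t], reverse=True)
    if max_new_terms is not None:
        new_terms = new_terms[:max_new_terms]
    keep = set(query) | set(new_terms)
    return {t: s for t, s in scores.items() if t in keep and s > 0}

query = {"retrieval": 1.0}
relevant = [
    {"retrieval": 1.0, "feedback": 2.0},
    {"feedback": 2.0, "query": 1.0},
]
expanded = rocchio(query, relevant, max_new_terms=1)
```

Varying `max_new_terms` and the number of documents in `relevant` corresponds to the two quantities whose effect the abstract reports.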
Indexing with WordNet synsets can improve text retrieval
1998
Cited by 174 (4 self)
The classical, vector space model for text retrieval is shown to give better results (up to 29% better in our experiments) if WordNet synsets are chosen as the indexing space, instead of word forms. This result is obtained for a manually disambiguated test collection (of queries and documents) derived from the SEMCOR semantic concordance. The sensitivity of retrieval performance to (automatic) disambiguation errors when indexing documents is also measured. Finally, it is observed that if queries are not disambiguated, indexing by synsets performs (at best) only as good as standard word indexing.