Statistical language modeling for information retrieval (2005)

by WB Croft, X Liu

Results 11 - 20 of 119

Corpus Structure, Language Models, and Ad Hoc Information Retrieval

by Oren Kurland, Lillian Lee
Cited by 43 (12 self)
Most previous work on the recently developed language-modeling approach to information retrieval focuses on document-specific characteristics, and therefore does not take into account the structure of the surrounding corpus. We propose a novel algorithmic framework in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Using this framework, we develop a suite of new algorithms. Even the simplest typically outperforms the standard language-modeling approach in precision and recall, and our new interpolation algorithm posts statistically significant improvements for both metrics over all three corpora tested.

A GENERATIVE THEORY OF RELEVANCE

by Victor Lavrenko, 2004
Cited by 38 (1 self)
Abstract not found

Linear discriminant model for information retrieval

by Jianfeng Gao, Haoliang Qi, Xinsong Xia, Jian-yun Nie - In the Proceedings of SIGIR’2005, 2005
Cited by 38 (12 self)
This paper presents a new discriminative model for information retrieval (IR), referred to as linear discriminant model (LDM), which provides a flexible framework to incorporate arbitrary features. LDM is different from most existing models in that it takes into account a variety of linguistic features that are derived from the component models of HMM that is widely used in language modeling approaches to IR. Therefore, LDM is a means of melding discriminative and generative models for IR. We present two algorithms of parameter learning for LDM. One is to optimize the average precision (AP) directly using an iterative procedure. The other is a perceptron-based algorithm that minimizes the number of discordant document-pairs in a rank list. The effectiveness of our approach has been evaluated on the task of ad hoc retrieval using six English and Chinese TREC test sets. Results show that (1) in most test sets, LDM significantly outperforms the state-of-the-art language modeling approaches and the classical probabilistic retrieval model; (2) it is more appropriate to train LDM using a measure of AP rather than likelihood if the IR system is graded on AP; and (3) linguistic features (e.g. phrases and dependences) are effective for IR if they are incorporated properly.

A Risk Minimization Framework for Information Retrieval

by ChengXiang Zhai, John Lafferty - In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR. ACM, 2003
Cited by 36 (1 self)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.

Using Temporal Profiles of Queries for Precision Prediction

by Fernando Diaz, Rosie Jones, 2004
Cited by 35 (5 self)
A key missing component in information retrieval systems is self-diagnostic tests to establish whether the system can provide reasonable results for a given query on a document collection. If we can measure properties of a retrieved set of documents which allow us to predict average precision, we can automate the decision of whether to elicit relevance feedback, or modify the retrieval system in other ways. We use meta-data attached to documents in the form of time stamps to measure the distribution of documents retrieved in response to a query, over the time domain, to create a temporal profile for a query. We define some useful features over this temporal profile. We find that using these temporal features, together with the content of the documents retrieved, we can improve the prediction of average precision for a query.

From frequency to meaning: Vector space models of semantics

by Peter D. Turney, Patrick Pantel - Journal of Artificial Intelligence Research, 2010
Cited by 34 (0 self)
Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term–document, word–context, and pair–pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

Regularizing ad hoc retrieval scores

by Fernando Diaz, 2005
Cited by 31 (1 self)
The cluster hypothesis states: closely related documents tend to be relevant to the same request. We exploit this hypothesis directly by adjusting ad hoc retrieval scores from an initial retrieval so that topically related documents receive similar scores. We refer to this process as score regularization. Score regularization can be presented as an optimization problem, allowing the use of results from semi-supervised learning. We demonstrate that regularized scores consistently and significantly rank documents better than un-regularized scores, given a variety of initial retrieval algorithms. We evaluate our method on two large corpora across a substantial number of topics.

Integrating DB and IR technologies: What is the sound of one hand clapping

by Surajit Chaudhuri, Raghu Ramakrishnan, Gerhard Weikum - In CIDR, 2005
Cited by 29 (0 self)
Databases (DB) and information retrieval (IR) have evolved as separate fields. However, modern applications such as customer support, health care, and digital libraries require capabilities for both data and text management. In such settings, traditional DB queries, in SQL or XQuery, are not flexible enough to handle application-specific scoring and ranking. IR systems, on the other hand, lack efficient support for handling structured parts of the data and metadata, and do not give the application developer adequate control over the ranking function. This paper analyzes the requirements of advanced text- and data-rich applications for an integrated platform. The core functionality must be manageable, and the API should be easy to program against. A particularly important issue that we highlight is how to reconcile flexibility in scoring and ranking models with optimizability, in order to accommodate a wide variety of target applications efficiently. We discuss whether such a system needs to be designed from scratch, or can be incrementally built on top of existing architectures. The results of our analyses are cast into a series of challenges to the DB and IR communities.

Embedding web-based statistical translation models in cross-language information retrieval

by Wessel Kraaij, Jian-yun Nie, Michel Simard - Computational Linguistics, 2003
Cited by 29 (3 self)
Although more and more language pairs are covered by machine translation (MT) services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag of words. The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this article, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.

Risk Minimization and Language Modeling in Text Retrieval

by ChengXiang Zhai, 2002
Cited by 29 (5 self)
Abstract not found