Parsimonious Language Models for Information Retrieval
- In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 2004
Cited by 216 (37 self)
We systematically investigate a new approach to estimating the parameters of language models for information retrieval, called parsimonious language models. Parsimonious language models explicitly address the relation between levels of language models that are typically used for smoothing. As such, they need fewer (non-zero) parameters to describe the data. We apply parsimonious models at three stages of the retrieval process: 1) at indexing time; 2) at search time; 3) at feedback time. Experimental results show that we are able to build models that are significantly smaller than standard models, but that still perform at least as well as the standard approaches.
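The abstract above does not give the estimation procedure. As a rough illustration only, the sketch below shows one way a document model can be made sparser: an EM-style loop that discounts terms already well explained by a background collection model and prunes terms whose probability falls below a threshold. The function name `parsimonious_model`, the parameters `lam`, `iters`, and `threshold`, and the toy collection probabilities are all assumptions made for this example, not values taken from the paper.

```python
from collections import Counter

def parsimonious_model(doc_tokens, collection_probs, lam=0.1, iters=20, threshold=1e-4):
    """Estimate a sparse document model against a background collection model.

    Hypothetical sketch: `lam` is the assumed weight of the document model in
    the mixture lam * P(t|D) + (1 - lam) * P(t|C).
    """
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    # Start from the maximum-likelihood estimate of P(t|D).
    p_doc = {t: c / total for t, c in tf.items()}
    for _ in range(iters):
        # E-step: expected count of each term attributed to the document model.
        expected = {
            t: tf[t] * lam * p / ((1 - lam) * collection_probs[t] + lam * p)
            for t, p in p_doc.items()
        }
        norm = sum(expected.values())
        # M-step: renormalise, then drop terms below the pruning threshold.
        p_doc = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
        kept = sum(p_doc.values())
        p_doc = {t: p / kept for t, p in p_doc.items()}
    return p_doc

# Toy data: "the" and "of" are common in the collection, the other terms are rare.
doc = "the the the retrieval model the of retrieval".split()
background = {"the": 0.5, "of": 0.3, "retrieval": 0.01, "model": 0.01}
model = parsimonious_model(doc, background)
```

In this toy run, probability mass moves away from "the" and "of" towards "retrieval" and "model", so fewer terms need a non-zero parameter.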
Embedding web-based statistical translation models in cross-language information retrieval
- Computational Linguistics
, 2003
Cited by 29 (3 self)
Although more and more language pairs are covered by machine translation (MT) services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag of words. The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this article, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.
UMass at TREC 2002: Cross Language and Novelty Tracks
, 2002
Cited by 27 (4 self)
this report. Recall that one of the two sub-runs that made up UMassX2 and UMassX2n, and three of the six sub-runs that made up UMassX6 and UMassX6n, used the standard parallel corpus dictionary and stemmer. The Standard Resources column of Table 3 shows the results when the sub-runs based on the UMass resources and acronym expansion were excluded. Only the sub-runs based on the standard resources were included. Thus, in the Standard Resources column, the UMassX2 and UMassX2n rows show the results of a single sub-run, and the UMassX6 and UMassX6n rows each show results based on a combination of three, rather than six, sub-runs. Relative to these three-way combinations, the additional resources increased average precision by 3 percentage points for title+description queries, and by 4 points for title+description+narrative queries.
Statistical Language Models for Information Retrieval. Tutorial Presentation at the
- 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR
, 2006
Cited by 22 (3 self)
Statistical language models have recently been successfully applied to many information retrieval problems. A great deal of recent work has shown that statistical language models not only lead to superior empirical performance, but also facilitate parameter tuning and open up possibilities for modeling nontraditional retrieval problems. In general, statistical language models provide a principled way of modeling various kinds of retrieval problems. The purpose of this survey is to systematically and critically review the existing work in applying statistical language models to information retrieval, summarize their contributions, and point out outstanding challenges.
Statistical Cross-Language Information Retrieval using N-Best Query Translations
, 2002
Cited by 20 (2 self)
This paper presents a novel statistical model for cross-language information retrieval. Given a written query in the source language, documents in the target language are ranked by integrating probabilities computed by two statistical models: a query-translation model, which generates the most probable term-by-term translations of the query, and a query-document model, which evaluates the likelihood of each document and translation. Integration of the two scores is performed over the set of N most probable translations of the query. Experimental results with values N = 1, 5, 10 are presented on the Italian-English bilingual track data used in the CLEF 2000 and 2001 evaluation campaigns.
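The integration over N-best translations described above can be illustrated with a small sketch. It assumes that each translation arrives with a probability from some query-translation model, and it scores documents with a linearly smoothed unigram query likelihood. Both choices are assumptions for the example. The names `doc_likelihood`, `nbest_score`, the weight `lam`, and the toy data are hypothetical and are not the paper's actual models.

```python
from collections import Counter

def doc_likelihood(terms, doc_tokens, collection_tokens, lam=0.5):
    """Likelihood of the translated terms under a smoothed unigram document model."""
    doc_tf, col_tf = Counter(doc_tokens), Counter(collection_tokens)
    score = 1.0
    for t in terms:
        p_doc = doc_tf[t] / len(doc_tokens)
        p_col = col_tf[t] / len(collection_tokens)
        # Linear interpolation of document and collection estimates (assumed).
        score *= lam * p_doc + (1 - lam) * p_col
    return score

def nbest_score(nbest, doc_tokens, collection_tokens):
    """Sum translation probability times document likelihood over the N-best list."""
    return sum(
        p_trans * doc_likelihood(terms, doc_tokens, collection_tokens)
        for terms, p_trans in nbest
    )

# Toy target-language collection and a 2-best list of (translation, probability).
docs = {
    "d1": "the bank raised interest rates".split(),
    "d2": "the river bank was flooded".split(),
}
collection = [tok for tokens in docs.values() for tok in tokens]
nbest = [(["bank", "rates"], 0.7), (["shore", "rates"], 0.3)]

ranking = sorted(
    docs, key=lambda d: nbest_score(nbest, docs[d], collection), reverse=True
)
```

With N = 1 this reduces to ranking by the single best translation; larger N lets lower-ranked translations contribute in proportion to their probability.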
Mining Correlated Bursty Topic Patterns from Coordinated Text Streams
- KDD'07
, 2007
Cited by 19 (4 self)
Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all the news articles published by different agencies in different languages tend to cover the same event for a certain period, exhibiting a correlated bursty topic pattern in all the news article streams. In general, mining correlated bursty topic patterns from coordinated text streams can reveal interesting latent associations or events behind these streams. In this paper, we define and study this novel text mining problem. We propose a general probabilistic algorithm which can effectively discover correlated bursty patterns and their bursty periods across text streams even if the streams have completely different vocabularies (e.g., English vs Chinese). Evaluation of the proposed method on a news data set and a literature data set shows that it can effectively discover quite meaningful topic patterns from both data sets: the patterns discovered from the news data set accurately reveal the major common events covered in the two streams of news articles (in English and Chinese, respectively), while the patterns discovered from two database publication streams match well with the major research paradigm shifts in database research. Since the proposed method is general and does not require the streams to share vocabulary, it can be applied to any coordinated text streams to discover correlated topic patterns that burst in multiple streams in the same period.
Simple translation models for sentence retrieval in factoid question answering
- in Proceedings of the Special Interest Group on Information Retrieval (SIGIR) 2004
, 2004
Cited by 13 (2 self)
Many question-answering systems start with a passage retrieval system to facilitate the answer extraction process. The richer the set of passages, in terms of answer content, the more accurate the answer extraction. We present a simple translation model for passage retrieval at the sentence level. We demonstrate this framework on TREC data, and show that it performs better than retrieval based on query likelihood, and on par with other systems.
A database approach to content-based XML retrieval
- In (Fuhr et al
Cited by 7 (4 self)
This paper describes a first prototype system for content-based retrieval from XML data. The system's design supports both XPath queries and complex information retrieval queries based on a language modelling approach to information retrieval. Evaluation using the INEX benchmark shows that it is beneficial if the system is biased to retrieve large XML fragments over small fragments.
Hindi CLIR in thirty days
- ACM Transactions on Asian Language Information Processing (TALIP
, 2003
Cited by 6 (0 self)
As participants in the TIDES Surprise Language exercise, researchers at the University of Massachusetts helped collect Hindi-English resources and developed a cross-language information retrieval system. Components included normalization, stop-word removal, transliteration, structured query translation, and language modeling using a probabilistic dictionary derived from a parallel corpus. Existing technology was successfully applied to Hindi. The biggest stumbling blocks were collection of parallel English and Hindi text and dealing with numerous proprietary encodings.

