Results 1 -
7 of
7
Linguistically motivated probabilistic model of information retrieval
- In Proceedings of European Conference on Digital Libraries
, 1998
"... Abstract. This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well known existing models of information retrieval, but is e ..."
Abstract
-
Cited by 63 (15 self)
- Add to MetaCart
Abstract. This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing will be used in this paper to formulate a probabilistic justification for using tf×idf term weighting. The paper shows that the new probabilistic interpretation of tf×idf term weighting might lead to better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the Cranfield test collection indicates that the presented model outperforms the vector space model with classical tf×idf and cosine length normalisation.
Empirical Development of an Exponential Probabilistic Model for Text Retrieval: Using Textual Analysis to Build a Better Model
- In Proceedings of the 26th Annual ACM Conference on Research and Development in Information Retrieval
, 2003
"... Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined and refined independent of the particular retrieval algorithm. We explore the explicit assumptions underlying the naive Bayesian framework by performing computational analysis of actual corpora and queries to devise a generative document model that closely matches text. Our thesis is that a model so developed will be more accurate than existing models, and thus more useful in retrieval, as well as other applications. We test this by learning from a corpus the best document model. We find the learned model better predicts the existence of text data and has improved performance on certain IR tasks.
Relating the New Language Models of Information Retrieval to the Traditional Retrieval Models
, 2000
"... During the last two years, exciting new approaches to information retrieval were introduced by a number of different research groups that use statistical language models for retrieval. ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
During the last two years, exciting new approaches to information retrieval were introduced by a number of different research groups that use statistical language models for retrieval.
Speech-driven text retrieval: Using target IR collections for statistical language model adaptation in speech recognition
- Information Retrieval Techniques for Speech Applications (LNCS 2273
, 2002
"... Abstract. Speech recognition has of late become a practical technology for real world applications. Aiming at speech-driven text retrieval, which facilitates retrieving information with spoken queries, we propose a method to integrate speech recognition and retrieval methods. Since users speak conte ..."
Abstract
-
Cited by 7 (4 self)
- Add to MetaCart
Abstract. Speech recognition has of late become a practical technology for real world applications. Aiming at speech-driven text retrieval, which facilitates retrieving information with spoken queries, we propose a method to integrate speech recognition and retrieval methods. Since users speak contents related to a target collection, we adapt statistical language models used for speech recognition based on the target collection, so as to improve both the recognition and retrieval accuracy. Experiments using existing test collections combined with dictated queries showed the effectiveness of our method. 1
Using Self-Supervised Word Segmentation in Chinese Information Retrieval
"... We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performanc ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performance to both the dictionary based and the character based approaches. However, our method is language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual information retrieval systems that are flexible and adaptive.
Applying Machine Learning to Text Segmentation for Information Retrieval
, 2002
"... We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary based, character based and mutual information based approaches, while overcoming many of their shortcomings. Experiments o ..."
Abstract
- Add to MetaCart
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary based, character based and mutual information based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are exible and adaptive. We nd that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give comparable (in some cases even better) retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we nd that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this eect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach.

