A Probabilistic Model of Information Retrieval: Development and Status
, 1998
Cited by 206 (16 self)
The paper combines a comprehensive account of the probabilistic model of retrieval with new systematic experiments on TREC Programme material. It presents the model from its foundations through its logical development to cover more aspects of retrieval data and a wider range of system functions. Each step in the argument is matched by comparative retrieval tests, to provide a single coherent account of a major line of research. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations.
Content-Based Book Recommending Using Learning for Text Categorization
- IN PROCEEDINGS OF THE FIFTH ACM CONFERENCE ON DIGITAL LIBRARIES
, 1999
Cited by 141 (6 self)
Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use collaborative filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations.
Limitations of Co-Training for Natural Language Learning from Large Datasets
- In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing
, 2001
Cited by 72 (3 self)
Co-Training is a weakly supervised learning paradigm in which the redundancy of the learning task is captured by training two classifiers using separate views of the same data. This enables bootstrapping from a small set of labeled training data via a large set of unlabeled data. This study examines the learning behavior of co-training on natural language processing tasks that typically require large numbers of training instances to achieve usable performance levels. Using base noun phrase bracketing as a case study, we find that co-training reduces by 36% the difference in error between co-trained classifiers and fully supervised classifiers trained on a labeled version of all available data. However, degradation in the quality of the bootstrapped data arises as an obstacle to further improvement. To address this, we propose a moderately supervised variant of co-training in which a human corrects the mistakes made during automatic labeling. Our analysis suggests that corrected co-training and similar moderately supervised methods may help co-training scale to large natural language learning tasks.
An effective approach to document retrieval via utilizing wordnet and recognizing phrases
- In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
, 2004
Cited by 50 (9 self)
Noun phrases in queries are identified and classified into four types: proper names, dictionary phrases, simple phrases and complex phrases. A document has a phrase if all content words in the phrase are within a window of a certain size. The window sizes for different types of phrases are different and are determined using a decision tree. Phrases are more important than individual terms. Consequently, documents in response to a query are ranked with matching phrases given a higher priority. We utilize WordNet to disambiguate word senses of query terms. Whenever the sense of a query term is determined, its synonyms, hyponyms, words from its definition and its compound words are considered for possible additions to the query. Experimental results show that our approach yields between 23% and 31% improvements over the best-known results on the TREC 9, 10 and 12 collections for short (title only) queries, without using Web data.
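The window criterion in this abstract ("a document has a phrase if all content words in the phrase are within a window of a certain size") can be illustrated with a minimal sketch. The function name `has_phrase`, the tokenization by whitespace, and the window values used below are assumptions made for illustration; the abstract states only that window sizes differ per phrase type and are chosen by a decision tree, which is not reproduced here.

```python
from collections import defaultdict


def has_phrase(doc_tokens, phrase_words, window):
    """Return True if all phrase words occur within some span of at most
    `window` consecutive tokens of the document (hypothetical helper)."""
    needed = set(phrase_words)
    if not needed:
        return True
    counts = defaultdict(int)  # occurrences of each needed word in the span
    covered = 0                # number of distinct needed words in the span
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] += 1
            if counts[token] == 1:
                covered += 1
        # Shrink the span from the left until it fits inside the window.
        while right - left + 1 > window:
            old = doc_tokens[left]
            if old in needed:
                counts[old] -= 1
                if counts[old] == 0:
                    covered -= 1
            left += 1
        if covered == len(needed):
            return True
    return False


doc = "the new york city stock exchange opened".split()
print(has_phrase(doc, ["stock", "exchange"], window=2))
print(has_phrase(doc, ["new", "exchange"], window=4))
```

A sliding window keeps the check linear in document length, so it could be applied once per phrase type with that type's window size.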
Phrase Recognition and Expansion for Short, Precision-biased Queries based on a Query Log
Cited by 30 (0 self)
In this paper we examine the question of query parsing for World Wide Web queries and present a novel method for phrase recognition and expansion. Given a training corpus of approximately 16 million Web queries and a handwritten context-free grammar, the EM algorithm is used to estimate the parameters of a probabilistic context-free grammar (PCFG) with a system developed by Carroll [5]. We use the PCFG to compute the most probable parse for a user query, reflecting linguistic structure and word usage of the domain being parsed. The optimal syntactic parse for a user query thus obtained is employed for phrase recognition and expansion. Phrase recognition is used to increase retrieval precision; phrase expansion is applied to make the best use possible of very short Web queries.
Feature Weighting in k-Means Clustering
- Machine Learning
, 2002
Cited by 26 (0 self)
Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable (and possibly different) distortion measure to each feature space, (iii) to combine distortions on different feature spaces, in a convex fashion, by assigning (possibly) different relative weights to each, (iv) for a fixed weighting, to cluster using the proposed convex k-means algorithm, and (v) to determine the optimal feature weighting to be the one that yields the clustering that simultaneously minimizes the average within-cluster dispersion and maximizes the average between-cluster dispersion along all the feature spaces. Using precision/recall evaluations and known ground truth classifications, we empirically demonstrate the effectiveness of feature weighting in clustering on several different application domains.
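Ideas (i) to (iii) above, together with the assignment step that a k-means-style algorithm would need for a fixed weighting, can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the two distortion measures (squared Euclidean and a cosine-based distortion), the function names, and the toy data are assumptions, and the search for the optimal weighting in (v) is not shown.

```python
import math


def squared_euclidean(x, c):
    """Squared Euclidean distortion between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def cosine_distortion(x, c):
    """One minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(x, c))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_c = math.sqrt(sum(b * b for b in c))
    return 1.0 - dot / (norm_x * norm_c)


def combined_distortion(obj, centroid, measures, weights):
    """Convex combination of per-feature-space distortions.

    obj and centroid are tuples of feature vectors, one per feature space;
    weights are non-negative and sum to 1.
    """
    return sum(w * d(x, c)
               for w, d, x, c in zip(weights, measures, obj, centroid))


def assign(objects, centroids, measures, weights):
    """Assign each object to the centroid with the smallest combined distortion."""
    return [
        min(range(len(centroids)),
            key=lambda j: combined_distortion(o, centroids[j], measures, weights))
        for o in objects
    ]


measures = (squared_euclidean, cosine_distortion)
obj = ((0.0, 0.0), (1.0, 0.0))            # one object, two feature spaces
centroids = [((1.0, 0.0), (0.0, 1.0)),     # close in space 1, far in space 2
             ((2.0, 0.0), (1.0, 0.0))]     # far in space 1, identical in space 2
print(assign([obj], centroids, measures, (1.0, 0.0)))
print(assign([obj], centroids, measures, (0.1, 0.9)))
```

In this toy case, shifting weight from the first feature space to the second changes which centroid the object is assigned to, which is why the choice of weighting matters for the resulting clustering.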
Boosting web retrieval through query operations
- Proceedings ECIR 2005
, 2005
Cited by 24 (2 self)
We explore the use of phrase and proximity terms in the context of web retrieval, which is different from traditional ad-hoc retrieval both in document structure and in query characteristics. We show that for this type of task, the usage of both phrase and proximity terms is highly beneficial for early precision as well as for overall retrieval effectiveness. We also analyze why phrase and proximity terms are far more effective for web retrieval than for ad-hoc retrieval.
Term Extraction and Automatic Indexing
, 2003
Cited by 22 (0 self)
This chapter presents a new domain of research and development in Natural Language Processing (NLP) that is concerned with the representation, acquisition, and recognition of terms. Terms are pervasive in scientific and technical documents; their identification is a crucial issue for any application dealing with the analysis, understanding, generation, or translation of such documents. In particular, the ever-growing mass of specialized documentation available on-line, in industrial and governmental archives or in digital libraries, calls for advances in terminology processing for such purposes as information retrieval, cross-language querying, indexing of multimedia documents, translation aids, document routing and summarization, etc. This chapter introduces the basic linguistic characteristics of terms. It presents the main methods in NLP for recognizing or discovering terms and their interrelationships in large corpora. It is divided into three sections: an introduction to the bas...
Biterm Language Models for Document Retrieval
- In Proceedings of SIGIR
, 2002
Cited by 18 (2 self)
Statistical language models (LMs) have been used in many natural language processing tasks, including speech recognition and machine translation [5, 2]. Recently, language models have been explored as a framework for information retrieval [9, 4, 7, 1, 6]. The basic idea is to view each document as having its own language model and to model querying as a generative process. Documents are ranked by the probability of their language model generating the given query. Since documents are fixed entities in information retrieval, language models for documents suffer from the sparse data problem. Smoothed unigram models have been used to demonstrate better performance of language models against vector space or probabilistic retrieval models for document retrieval. Song and Croft [10] proposed a general language model that combined bigram language models with Good-Turing estimate and corpus-based smoothing of unigram probabilities. Improved performance was observed with combined bigram l
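The ranking idea described in this abstract, scoring each document by the probability that its language model generates the query, can be sketched with a smoothed unigram model. The sketch below assumes linear interpolation with the collection model (Jelinek-Mercer smoothing) and an arbitrary interpolation weight `lam`; this is a stand-in chosen for brevity, not the Good-Turing or biterm estimates the abstract refers to, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """Log probability of the query under a smoothed unigram document model.

    Mixes the document's maximum-likelihood estimate with the collection
    estimate so that terms missing from the document do not zero the score.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_col = collection_counts[term] / collection_len
        p = (1.0 - lam) * p_doc + lam * p_col
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return document indices ordered by decreasing query log-likelihood."""
    collection_counts = Counter(t for d in docs for t in d)
    collection_len = sum(len(d) for d in docs)
    scores = [query_log_likelihood(query, d, collection_counts, collection_len, lam)
              for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = [
    ["language", "model", "retrieval"],
    ["speech", "recognition", "model"],
    ["machine", "translation"],
]
print(rank(["language", "model"], docs))
```

Without the collection term, the second and third documents would both receive zero probability for "language", which is the sparse data problem the abstract mentions.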
Language models for searching in Web corpora
- THE THIRTEENTH TEXT RETRIEVAL CONFERENCE (TREC 2004). NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY. NIST SPECIAL PUBLICATION
, 2005
"... We describe our participation in the ..."

