RCV1: A new benchmark collection for text categorization research
- Journal of Machine Learning Research
, 2004
Cited by 312 (5 self)
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of the real world constraints under which the data was produced. Drawing on interviews with Reuters personnel and access to Reuters documentation, we describe the coding policy and quality control procedures used in producing the RCV1 data, the intended semantics of the hierarchical category taxonomies, and the corrections necessary to remove errorful data. We refer to the original data as RCV1-v1, and the corrected data as RCV1-v2. We benchmark several widely used supervised learning methods on RCV1-v2, illustrating the collection’s properties, suggesting new directions for research, and providing baseline results for future studies. We make available detailed, per-category experimental results, as well as
A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
, 1997
Cited by 285 (1 self)
The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier are compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are more well-founded, but also because they achieve better performance.
Training Algorithms for Linear Text Classifiers
, 1996
Cited by 216 (12 self)
Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data is presented showing Widrow-Hoff and EG to be more effective than the widely used Rocchio algorithm on several categorization and routing tasks.
1 Introduction
Document retrieval, categorization, routing, and filtering systems often are based on classification. That is, the IR system decides for each document which of two or more classes it belongs to, or how strongly it belongs to a class, in order to accomplish the IR task of interest. For instance, the two classes may be the documents relevant to and not relevant to a particular user, and the system may rank documents based on how likely it i...
Context-Sensitive Learning Methods for Text Categorization
- ACM Transactions on Information Systems
, 1996
Cited by 213 (12 self)
In this article, we will investigate the performance of two recently implemented machine-learning algorithms on a number of large text categorization problems. The two algorithms considered are set-valued RIPPER, a recent rule-learning algorithm [Cohen ...
An earlier version of this article appeared in Proceedings of the 19th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 307--315.
Interactive Deduplication using Active Learning
, 2002
Cited by 161 (3 self)
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists.
We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
A comparison of classifiers and document representations for the routing problem
- ANNUAL ACM CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL - ACM SIGIR
, 1995
Cited by 147 (2 self)
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
Automatic Query Expansion Using SMART : TREC 3
- In Proceedings of the Third Text REtrieval Conference (TREC-3)
Cited by 139 (2 self)
The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 3, performing runs in the routing, ad-hoc, and foreign language environments. Our major focus is massive query expansion: adding from 300 to 530 terms to each query. These terms come from known relevant documents in the case of routing, and from just the top retrieved documents in the case of ad-hoc and Spanish. This approach improves effectiveness from 7% to 25% in the various experiments. Other ad-hoc work extends our investigations into combining global similarities, giving an overall indication of how a document matches a query, with local similarities identifying a smaller part of the document which matches the query. Using an overlapping text window definition of "local", we achieve a 16% improvement.
Introduction
For over 30 years, the Smart project at Cornell University has been interested in the analy...
Incremental Relevance Feedback for Information Filtering
, 1996
Cited by 90 (4 self)
We use data from the TREC routing experiments to explore how relevance feedback can be applied incrementally --- using a few judged documents each time --- to achieve results that are as good as if the feedback occurred in one pass. We show that relatively few judgments are needed to get high-quality results. We also demonstrate methods that reduce the amount of information archived from past judged documents without adversely affecting effectiveness. A novel simulation shows that such techniques are useful for handling long-standing queries with drifting notions of relevance.
Text categorization of low quality images
- In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval
, 1995
Cited by 52 (2 self)
Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be used to categorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. Despite this, we show for one data set that fax quality images can be categorized with nearly the same accuracy as the original text. Further, the categorization system can be trained on noisy OCR output, without need for the true text of any image, or for editing of OCR output. The use of a vector space classifier and training method robust to large feature sets, combined with discarding of low frequency OCR output strings, are the key to our approach.
Learning Routing Queries in a Query Zone
, 1997
Cited by 50 (4 self)
Word usage is domain dependent. A common word in one domain can be quite infrequent in another. In this study we exploit this property of word usage to improve document routing. We show that routing queries (profiles) learned only from the documents in a query domain are better than the routing profiles learned when query domains are not used. We approximate a query domain by a query zone. Experiments show that routing profiles learned from a query zone are 8--12% more effective than the profiles generated when no query zoning is used.
1 Background
Document routing is an important problem in the field of information retrieval. [12] When a user has marked several articles as relevant to his/her information need, a system should be able to automatically learn the user's "profile" and should be able to route (send) new, potentially interesting, articles to the user. This problem has also been called selective dissemination of information or information filtering. [4] Most current st...

