Multi-label text classification with a mixture model trained by EM
AAAI 99 Workshop on Text Learning, 1999
Cited by 95 (3 self)
In many important document classification tasks, documents may each be associated with multiple class labels. This paper describes a Bayesian classification approach in which the multiple classes that comprise a document are represented by a mixture model. While the labeled training data indicates which classes were responsible for generating a document, it does not indicate which class was responsible for generating each word. Thus we use EM to fill in this missing value, learning both the distribution over mixture weights and the word distribution in each class's mixture component. We describe the benefits of this model and present preliminary results with the Reuters-21578 data set.
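The word-level EM step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes one mixture weight per (document, class) pair restricted to the document's known labels, uniform initialization, and a smoothing constant `alpha`. The function name `em_multilabel` and all parameters are hypothetical.

```python
from collections import defaultdict


def em_multilabel(docs, n_iter=25, alpha=0.01):
    """Hypothetical helper. docs is a list of (tokens, labels) pairs.

    Returns (word_probs, doc_weights):
      word_probs[c][w]  ~ P(w | class c)
      doc_weights[d][c] ~ mixture weight of class c in document d
    """
    vocab = sorted({w for tokens, _ in docs for w in tokens})
    classes = sorted({c for _, labels in docs for c in labels})
    # Assumption: start from uniform word distributions and mixture weights.
    word_probs = {c: {w: 1.0 / len(vocab) for w in vocab} for c in classes}
    doc_weights = [{c: 1.0 / len(labels) for c in labels} for _, labels in docs]

    for _ in range(n_iter):
        counts = {c: defaultdict(float) for c in classes}
        new_weights = []
        for (tokens, labels), weights in zip(docs, doc_weights):
            totals = dict.fromkeys(labels, 0.0)
            for w in tokens:
                # E-step: which of this document's classes generated word w?
                joint = {c: weights[c] * word_probs[c][w] for c in labels}
                norm = sum(joint.values())
                for c in labels:
                    resp = joint[c] / norm
                    counts[c][w] += resp
                    totals[c] += resp
            # M-step (per document): re-estimate the mixture weights.
            n = len(tokens)
            new_weights.append(
                {c: totals[c] / n for c in labels} if n else dict(weights)
            )
        # M-step (per class): smoothed word distributions.
        for c in classes:
            denom = sum(counts[c].values()) + alpha * len(vocab)
            word_probs[c] = {w: (counts[c][w] + alpha) / denom for w in vocab}
        doc_weights = new_weights
    return word_probs, doc_weights


docs = [
    (["goal", "team"], {"sports"}),
    (["stock", "bank"], {"finance"}),
    (["goal", "stock", "team"], {"sports", "finance"}),
]
word_probs, doc_weights = em_multilabel(docs)
```

In this toy data the third document carries both labels, and the single-label documents are what let EM attribute "goal" and "team" to one class and "stock" to the other.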
Boosting and Rocchio Applied to Text Filtering
In Proceedings of ACM SIGIR, 1998
Cited by 91 (2 self)
We discuss two learning algorithms for text filtering: modified Rocchio and a boosting algorithm called AdaBoost. We show how both algorithms can be adapted to maximize any general utility matrix that associates cost (or gain) for each pair of machine prediction and correct label. We first show that AdaBoost significantly outperforms another highly effective text filtering algorithm. We then compare AdaBoost and Rocchio over three large text filtering tasks. Overall both algorithms are comparable and are quite effective. AdaBoost produces better classifiers than Rocchio when the training collection contains a very large number of relevant documents. However, on these tasks, Rocchio runs much faster than AdaBoost.
A Study of Approaches to Hypertext Categorization
Journal of Intelligent Information Systems, 2002
Cited by 78 (3 self)
Hypertext poses new research challenges for text classification. Hyperlinks, HTML tags, category labels distributed over linked documents, and metadata extracted from related web sites all provide rich information for classifying hypertext documents. How to appropriately represent that information and automatically learn statistical patterns for solving hypertext classification problems is an open question. This paper seeks a principled approach to providing the answers. Specifically, we define five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier. Using three hypertext datasets and three well-known learning algorithms (Naive Bayes, Nearest Neighbor, and First Order Inductive Learner), we examine these regularities in different domains, and compare alternative ways to exploit them. Our experimental results suggest that a naive use of linked pages, such as treating the words in the linked neighborhood of a page as local to that page, can be more harmful than helpful when the linked neighborhood is highly "noisy". This is especially true if the classifier is not sufficiently robust in discriminating informative words from noisy ones. It is also evident in our results that extracting metadata (when available) from related web sites can be extremely useful for improving classification accuracy. Finally, the relative performance of the classifiers being tested provides insights into their strengths and limitations for solving classification problems involving diverse and often noisy web pages.
Keywords: hypertext classification, machine learning, web mining
A Personal News Agent that Talks, Learns and Explains
In Proceedings of the Third International Conference on Autonomous Agents, 1999
Cited by 77 (2 self)
Most work on intelligent information agents has thus far focused on systems that are accessible through the World Wide Web. As demanding schedules prohibit people from continuous access to their computers, there is a clear demand for information systems that do not require workstation access or graphical user interfaces. We present a personal news agent that is designed to become part of an intelligent, IP-enabled radio, which uses synthesized speech to read news stories to a user. Based on voice feedback from the user, the system automatically adapts to the user's preferences and interests. In addition to time-coded feedback, we explore two components of the system that facilitate the automated induction of accurate interest profiles. First, we motivate the use of a multistrategy machine learning approach that allows for the induction of user models that consist of separate models for long-term and short-term interests. Second, we investigate the use of "concept feedback", a novel fo...
Centroid-Based Document Classification: Analysis and Experimental Results
2000
Cited by 73 (0 self)
In this paper we present a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of datasets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities and accounts for dependencies between the terms in the different classes.
1 Introduction
We have seen a tremendous growth in the volume of online text documents available on the Internet, digital libraries, news sources, and company-wide intranets. It has been forecasted that these docu...
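A centroid-based classifier of the general kind this abstract describes might be sketched as follows. The abstract does not specify the term weighting, so this sketch assumes raw term frequencies with unit-length normalization and cosine similarity; the names `train_centroids` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def unit(vec):
    """Scale a sparse term vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)


def train_centroids(docs):
    """docs: list of (tokens, label). One pass over the data (linear time)."""
    sums = defaultdict(Counter)
    counts = Counter()
    for tokens, label in docs:
        for term, weight in unit(Counter(tokens)).items():
            sums[label][term] += weight
        counts[label] += 1
    # Average the class's document vectors, then normalize the centroid
    # so that a dot product with a unit document vector is a cosine.
    return {
        label: unit({t: w / counts[label] for t, w in s.items()})
        for label, s in sums.items()
    }


def classify(tokens, centroids):
    """Return the label whose centroid is most similar to the document."""
    doc = unit(Counter(tokens))

    def cosine(centroid):
        return sum(w * centroid.get(t, 0.0) for t, w in doc.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


train = [
    (["ball", "goal", "team"], "sports"),
    (["goal", "match"], "sports"),
    (["stock", "market"], "finance"),
    (["market", "bank", "stock"], "finance"),
]
centroids = train_centroids(train)
label = classify(["goal", "team", "market"], centroids)
```

Classification cost depends only on the number of classes, not on the number of training documents, which is one reason the approach is attractive compared with k-nearest-neighbors.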
Boosting trees for anti-spam email filtering
In Proceedings of RANLP-01, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG, 2001
Cited by 73 (0 self)
This paper describes a set of comparative experiments for the problem of automatically filtering unwanted electronic mail messages. Several variants of the AdaBoost algorithm with confidence-rated predictions (Schapire & Singer 99) have been applied, which differ in the complexity of the base learners considered. Two main conclusions can be drawn from our experiments: a) The boosting-based methods clearly outperform the baseline learning algorithms (Naive Bayes and Induction of Decision Trees) on the PU1 corpus, achieving very high levels of the F1 measure; b) Increasing the complexity of the base learners allows one to obtain better "high-precision" classifiers, which is a very important issue when misclassification costs are considered.
Evaluating Topic-Driven Web Crawlers
2001
Cited by 72 (19 self)
Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
Mining the Biomedical Literature in the Genomic Era: An Overview
Journal of Computational Biology, 2003
Cited by 72 (2 self)
The past decade has seen a tremendous growth in the amount of experimental and computational biomedical data, specifically in the areas of Genomics and Proteomics. This growth is accompanied by an accelerated increase in the number of biomedical publications discussing the findings. In the last few years there is a lot of interest within the scientific community in literature-mining tools to help sort through this abundance of literature, and find the nuggets of information most relevant and useful for specific analysis tasks. This paper
Hierarchical Text Classification and Evaluation
2001
Cited by 71 (2 self)
Hierarchical Classification refers to the assignment of one or more suitable categories from a hierarchical category space to a document. While previous work in hierarchical classification focused on virtual category trees where documents are assigned only to the leaf categories, we propose a top-down level-based classification method that can classify documents to both leaf and internal categories. As the standard performance measures assume independence between categories, they have not considered the documents incorrectly classified into categories that are similar or not far from the correct ones in the category tree. We therefore propose the Category-Similarity Measures and Distance-Based Measures to consider the degree of misclassification in measuring the classification performance. An experiment has been carried out to measure the performance of our proposed hierarchical classification method. The results showed that our method performs well on the Reuters text collection when enough training documents are given, and that the new measures have indeed considered the contributions of misclassified documents.
A Study on Thresholding Strategies for Text Categorization
Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, 2001
Cited by 69 (8 self)
Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the testbeds, three common thresholding methods were investigated, including rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut); in addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly.
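Two of the strategies named in this abstract can be illustrated with a minimal sketch. It covers only rank-based thresholding (RCut) and score-based local optimization (SCut). Using F1 on a validation set as the SCut criterion is an assumption here, as are the function names `rcut` and `scut_threshold`.

```python
def rcut(scores, t=1):
    """RCut: for one document, keep the t top-ranked categories.

    scores maps category name -> classifier score for that document.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])


def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut_threshold(val_scores, val_truth):
    """SCut: for one category, pick the score threshold that maximizes
    F1 on validation data (assumed criterion).

    val_scores: list of floats, one per validation document.
    val_truth:  list of bools, True if the document is in the category.
    """
    best_thr, best_f1 = float("inf"), 0.0
    for thr in sorted(set(val_scores)):
        pairs = list(zip(val_scores, val_truth))
        tp = sum(1 for s, y in pairs if s >= thr and y)
        fp = sum(1 for s, y in pairs if s >= thr and not y)
        fn = sum(1 for s, y in pairs if s < thr and y)
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_thr, best_f1 = thr, score
    return best_thr


top_two = rcut({"earn": 0.2, "grain": 0.9, "trade": 0.5}, t=2)
threshold = scut_threshold(
    [0.9, 0.8, 0.4, 0.3, 0.1],
    [True, True, False, True, False],
)
```

Because SCut fits one threshold per category to a finite validation set, a category with few positive examples can end up with a threshold tuned to noise, which matches the overfitting risk the abstract mentions.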

