BoosTexter: A Boosting-based System for Text Categorization
- MACHINE LEARNING
, 2000
Cited by 373 (20 self)
This work focuses on algorithms which learn from examples to perform multiclass text and speech categorization tasks. Our approach is based on a new and improved family of boosting algorithms. We describe in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks. We present results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks. We conclude by describing the application of our system to automatic call-type identification from unconstrained spoken customer responses.
Context-Sensitive Learning Methods for Text Categorization
- ACM Transactions on Information Systems
, 1996
Cited by 213 (12 self)
In this article, we will investigate the performance of two recently implemented machine-learning algorithms on a number of large text categorization problems. The two algorithms considered are set-valued RIPPER, a recent rule-learning algorithm [Cohen ... An earlier version of this article appeared in Proceedings of the 19th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 307--315.
Heterogeneous Uncertainty Sampling for Supervised Learning
- In Proceedings of the Eleventh International Conference on Machine Learning
, 1994
Cited by 194 (3 self)
Uncertainty sampling methods iteratively request class labels for training instances whose classes are uncertain despite the previous labeled instances. These methods can greatly reduce the number of instances that an expert need label. One problem with this approach is that the classifier best suited for an application may be too expensive to train or use during the selection of instances. We test the use of one classifier (a highly efficient probabilistic one) to select examples for training another (the C4.5 rule induction program). Despite being chosen by this heterogeneous approach, the uncertainty samples yielded classifiers with lower error rates than random samples ten times larger. 1 Introduction Machine learning algorithms have been used to build classification rules from data sets consisting of hundreds of thousands of instances [4]. In some applications unlabeled training instances are abundant but the cost of labeling an instance with its class is high. In the informatio...
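The selection step described in this abstract can be illustrated with a minimal sketch. It assumes a binary task and that some inexpensive probabilistic classifier has already scored each unlabeled instance with an estimated probability of the positive class; the function name `select_most_uncertain` and the sample scores are hypothetical and are not taken from the paper.

```python
def select_most_uncertain(pool_probs, k):
    """Return indices of the k pool instances whose estimated
    P(positive) lies closest to 0.5, i.e. the least certain ones."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: abs(pool_probs[i] - 0.5))
    return ranked[:k]


# Hypothetical scores from the cheap classifier for five unlabeled instances.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.80]

# These instances would be sent to the expert for labeling and then used
# to train the second, more expensive learner.
chosen = select_most_uncertain(pool_probs, 2)
```

In the heterogeneous setting, the classifier that produces the scores and the classifier trained on the selected instances are different programs; the sketch covers only the selection.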
Learning Trees and Rules with Set-valued Features
, 1996
Cited by 163 (2 self)
In most learning systems examples are represented as fixed-length "feature vectors", the components of which are either real numbers or nominal values. We propose an extension of the feature-vector representation that allows the value of a feature to be a set of strings; for instance, to represent a small white and black dog with the nominal features size and species and the set-valued feature color, one might use a feature vector with size=small, species=canis-familiaris and color={white,black}. Since we make no assumptions about the number of possible set elements, this extension of the traditional feature-vector representation is closely connected to Blum's "infinite attribute" representation. We argue that many decision tree and rule learning algorithms can be easily extended to set-valued features. We also show by example that many real-world learning problems can be efficiently and naturally represented with set-valued features; in particular, text categorization problems and probl...
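The dog example from this abstract can be written down directly. The sketch below is only an illustration of the representation: the membership test `contains` and the helper `rule_matches` are hypothetical names, and treating "element is in the set" as a rule condition is an assumption about how a learner might use such a feature, not a description of the paper's algorithms.

```python
# One example: two nominal features and one set-valued feature.
example = {
    "size": "small",
    "species": "canis-familiaris",
    "color": frozenset({"white", "black"}),
}


def contains(feature, element):
    """Build a condition that holds when `element` is a member of the
    set-valued `feature` of an example."""
    return lambda ex: element in ex[feature]


def rule_matches(conditions, ex):
    """A rule fires only if every one of its conditions holds."""
    return all(cond(ex) for cond in conditions)


# Hypothetical rule: the color set contains "white" and size is "small".
rule = [contains("color", "white"), lambda ex: ex["size"] == "small"]
matched = rule_matches(rule, example)
```

For text categorization, the same shape would hold a document's words as the set, so no fixed vocabulary has to be chosen in advance.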
Learning Rules that Classify E-Mail
- In Papers from the AAAI Spring Symposium on Machine Learning in Information Access
Cited by 138 (1 self)
Two methods for learning text classifiers are compared on classification problems that might arise in filtering and filing personal e-mail messages: a "traditional IR" method based on TF-IDF weighting, and a new method for learning sets of "keyword-spotting rules" based on the RIPPER rule learning algorithm. It is demonstrated that both methods obtain significant generalizations from a small number of examples; that both methods are comparable in generalization performance on problems of this type; and that both methods are reasonably efficient, even with fairly large training sets. However, the greater comprehensibility of the rules may be advantageous in a system that allows users to extend or otherwise modify a learned classifier. Introduction Perhaps the most-discussed technical phenomenon of recent years has been the rapid growth of the Internet---or more generally, the rapid growth in the number of on-line documents. This has led to increased interest in intelligent methods for ...
Committee-Based Sampling For Training Probabilistic Classifiers
- In Proceedings of the Twelfth International Conference on Machine Learning
, 1995
Cited by 93 (3 self)
In many real-world learning tasks, it is expensive to acquire a sufficient number of labeled examples for training. This paper proposes a general method for efficiently training probabilistic classifiers, by selecting for training only the more informative examples in a stream of unlabeled examples. The method, committee-based sampling, evaluates the informativeness of an example by measuring the degree of disagreement between several model variants. These variants (the committee) are drawn randomly from a probability distribution conditioned by the training set selected so far (Monte-Carlo sampling). The method is particularly attractive because it evaluates the expected information gain from a training example implicitly, making the model both easy to implement and generally applicable. We further show how to apply committee-based sampling for training Hidden Markov Model classifiers, which are commonly used for complex classification tasks. The method was implemented and tested for ...
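One way to quantify the "degree of disagreement" mentioned in this abstract is the entropy of the committee's votes. The sketch below assumes each committee member outputs a single class label per example; `vote_entropy` is a hypothetical helper, and vote entropy is used here as one common disagreement measure rather than as the measure the paper necessarily adopts.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the label distribution among committee votes.
    Zero when all members agree; larger when the votes are more evenly split."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical votes from a four-member committee on two unlabeled examples.
agree = vote_entropy(["A", "A", "A", "A"])
split = vote_entropy(["A", "B", "A", "B"])

# The example with the higher score would be the better candidate for labeling.
more_informative = "second" if split > agree else "first"
```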
Moving Up the Information Food Chain: Deploying Softbots on the World Wide Web
- AI Magazine
, 1996
Cited by 85 (2 self)
I view the World Wide Web as an information food chain (figure 1). The maze of pages and hyperlinks that comprise the Web are at the very bottom of the chain. The WebCrawlers and Alta Vistas of the world are information herbivores; they graze on Web pages and regurgitate them as searchable indices. Today, most Web users feed near the bottom of the information food chain, but the time is ripe to move up. Since 1991, we have been building information carnivores, which intelligently hunt and feast on herbivores in Unix (Etzioni, Lesh, & Segal 1993), on the Internet (Etzioni & Weld 1994), and on the Web (Doorenbos, Etzioni, & Weld 1996; Selberg & Etzioni 1995; Shakes, Langheinrich, & Etzioni 1996). Motivation Today's Web is populated by a panoply of primitive but popular information services. Consider, for example, an information cow such as Alta Vista. Alta Vista requires massive memory resources (to store an index of the Web) and tremendous network bandwidth (to create and continually ...
Knowledge Discovery in Textual Databases (KDT)
- In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95)
, 1995
Cited by 80 (2 self)
The information age is characterized by a rapid growth in the amount of information available in electronic media. Traditional data handling methods are not adequate to cope with this information flood. Knowledge Discovery in Databases (KDD) is a new paradigm that focuses on computerized exploration of large amounts of data and on discovery of relevant and interesting patterns within them. While most work on KDD is concerned with structured databases, it is clear that this paradigm is required for handling the huge amount of information that is available only in unstructured textual form. To apply traditional KDD on texts it is necessary to impose some structure on the data that would be rich enough to allow for interesting KDD operations. On the other hand, we have to consider the severe limitations of current text processing technology and define rather simple structures that can be extracted from texts fairly automatically and in a reasonable cost. We propose using a text categoriza...
The World Wide Web: quagmire or gold mine?
- COMMUNICATIONS OF THE ACM
, 1996
Cited by 62 (0 self)
This article considers the question: is effective Web mining possible?
Text Categorization and Relational Learning
- In Proceedings of the Twelfth International Conference on Machine Learning
, 1995
Cited by 53 (4 self)
We evaluate the first order learning system FOIL on a series of text categorization problems. It is shown that FOIL usually forms classifiers with lower error rates and higher rates of precision and recall with a relational encoding than with a propositional encoding. We show that FOIL's performance can be improved by relation selection, a first order analog of feature selection. Relation selection improves FOIL's performance as measured by any of recall, precision, F-measure, or error rate. With an appropriate level of relation selection, FOIL appears to be competitive with or superior to existing propositional techniques. 1 INTRODUCTION There is increasing interest in using intelligent systems to perform tasks like e-mail filtering, news filtering, and automatic indexing of documents. Many of these applications require the ability to classify text into one of several predefined categories, and in many of these applications, it would be highly advantageous to automatically learn such...

