A comparison of event models for Naive Bayes text classification
1998
Cited by 752 (26 self)
Recent work in text classification has used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multivariate Bernoulli model, that is, a Bayesian Network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a unigram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multivariate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multivariate Bernoulli model at any vocabulary size.
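To make the distinction between the two event models concrete, here is a minimal Python sketch. It is not the paper's code: the function names, the Laplace smoothing constant `alpha`, and the toy corpus in the usage note are all assumptions made for illustration. The Bernoulli scorer looks at every vocabulary word as present or absent, while the multinomial scorer adds one term per word occurrence.

```python
import math
from collections import Counter


def train(docs, labels, vocab, alpha=1.0):
    """Estimate class priors and per-class word parameters for both event models.

    alpha is an assumed Laplace smoothing constant, not a value from the paper.
    """
    prior, bern, multi = {}, {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = len(class_docs) / len(docs)
        # Bernoulli: smoothed fraction of class documents that contain the word.
        doc_freq = Counter(w for d in class_docs for w in set(d))
        bern[c] = {w: (doc_freq[w] + alpha) / (len(class_docs) + 2 * alpha)
                   for w in vocab}
        # Multinomial: smoothed share of all word occurrences in the class.
        term_freq = Counter(w for d in class_docs for w in d)
        total = sum(term_freq[w] for w in vocab)
        multi[c] = {w: (term_freq[w] + alpha) / (total + alpha * len(vocab))
                    for w in vocab}
    return prior, bern, multi


def bernoulli_log_score(doc, c, prior, bern, vocab):
    """Log joint score under the multivariate Bernoulli event model."""
    present = set(doc)
    score = math.log(prior[c])
    for w in vocab:  # absent words contribute log(1 - p) as well
        p = bern[c][w]
        score += math.log(p if w in present else 1.0 - p)
    return score


def multinomial_log_score(doc, c, prior, multi):
    """Log joint score under the multinomial event model (length term omitted)."""
    score = math.log(prior[c])
    for w in doc:  # repeated words are counted each time they occur
        if w in multi[c]:
            score += math.log(multi[c][w])
    return score


def classify(doc, model, vocab, event="multinomial"):
    """Return the highest-scoring class under the chosen event model."""
    prior, bern, multi = model
    if event == "bernoulli":
        return max(prior, key=lambda c: bernoulli_log_score(doc, c, prior, bern, vocab))
    return max(prior, key=lambda c: multinomial_log_score(doc, c, prior, multi))
```

On a four-document toy corpus (two "sport" documents, `["ball", "goal", "goal"]` and `["ball", "team"]`, and two "politics" documents, `["vote", "law"]` and `["vote", "vote", "tax"]`), the test document `["goal", "goal", "vote"]` is assigned to "sport" by the multinomial scorer, which counts "goal" twice, and to "politics" by the Bernoulli scorer, which also penalizes the absence of "ball" and "team". The example only illustrates that the two models can disagree; it says nothing about which is more accurate in general.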
On the optimality of the simple Bayesian classifier under zero-one loss
Machine Learning, 1997
Cited by 600 (25 self)
The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive. This article shows that, although the Bayesian classifier’s probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality. This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption. Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain. This article’s results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
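The gap between quadratic-loss and zero-one-loss optimality can be seen in a very small case. The sketch below is an illustration constructed for this listing, not an experiment from the article: it assumes a binary class with equal priors and one binary attribute that is duplicated, so the independence assumption is violated as strongly as possible. The probabilities 0.8 and 0.3 and both function names are arbitrary assumptions.

```python
def true_posterior(a, p1=0.8, p0=0.3, prior=0.5):
    """Exact P(class=1 | A=a); duplicates of A carry no extra information."""
    l1 = p1 if a else 1.0 - p1
    l0 = p0 if a else 1.0 - p0
    return prior * l1 / (prior * l1 + (1.0 - prior) * l0)


def naive_bayes_posterior(a, copies=2, p1=0.8, p0=0.3, prior=0.5):
    """Naive Bayes estimate when A appears `copies` times and is treated as independent."""
    l1 = (p1 if a else 1.0 - p1) ** copies
    l0 = (p0 if a else 1.0 - p0) ** copies
    return prior * l1 / (prior * l1 + (1.0 - prior) * l0)


def same_decision(a, copies=2):
    """True if naive Bayes picks the same class as the exact posterior."""
    return (true_posterior(a) > 0.5) == (naive_bayes_posterior(a, copies) > 0.5)
```

With these assumed numbers the naive Bayes estimate is pushed toward 0 or 1 (roughly 0.88 instead of 0.73 when the attribute is 1), so its squared error is nonzero, yet `same_decision` holds for both attribute values: the predicted class, and hence the misclassification rate, is unchanged.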
Inductive Learning Algorithms and Representations for Text Categorization
1998
Cited by 501 (9 self)
Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, real-time classification speed, and classification accuracy. We also examine training set size, and alternative document representations. Very accurate text classifiers can be learned automatically from training examples. Linear Support Vector Machines (SVMs) are particularly promising because they are very accurate, quick to train, and quick to evaluate.
Keywords: text categorization, classification, support vector machines, machine learning, information management.
Hierarchically Classifying Documents Using Very Few Words
1997
Cited by 420 (9 self)
The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. Existing classification schemes which ignore the hierarchical structure and treat the topics as separate classes are often inadequate in text classification where there is a large number of classes and a huge number of relevant features needed to distinguish between them. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to util...
A Bayesian approach to filtering junk E-mail
In: Learning for Text Categorization: Papers from the 1998 Workshop, AAAI, 1998
Cited by 385 (6 self)
In addressing the growing problem of junk E-mail on the Internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user’s mail stream. By casting this problem in a decision-theoretic framework, we are able to make use of probabilistic learning methods in conjunction with a notion of differential misclassification cost to produce filters which are especially appropriate for the nuances of this task. While this may appear, at first, to be a straightforward text classification problem, we show that by considering domain-specific features of this problem in addition to the raw text of E-mail messages, we can produce much more accurate filters. Finally, we show the efficacy of such filters in a real-world usage scenario, arguing that this technology is mature enough for deployment.
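The role of differential misclassification cost can be sketched with a standard expected-cost rule. This is a generic decision-theoretic illustration rather than the paper's filter: `cost_ratio` is a hypothetical parameter meaning "discarding a legitimate message is this many times worse than letting a junk message through", and the function names are invented for the example.

```python
def junk_threshold(cost_ratio):
    """Posterior above which filtering has lower expected cost than keeping.

    Derived from (1 - p) * cost_ratio < p, i.e. p > cost_ratio / (1 + cost_ratio).
    """
    return cost_ratio / (1.0 + cost_ratio)


def should_filter(p_junk, cost_ratio):
    """Filter a message only if that minimizes expected misclassification cost.

    p_junk     -- classifier's posterior probability that the message is junk
    cost_ratio -- assumed cost of losing a legitimate message, in units of the
                  cost of delivering one junk message (hypothetical parameter)
    """
    expected_cost_if_filtered = (1.0 - p_junk) * cost_ratio  # legit mail lost
    expected_cost_if_kept = p_junk * 1.0                     # junk delivered
    return expected_cost_if_filtered < expected_cost_if_kept
```

With equal costs (`cost_ratio=1`) the rule reduces to the usual 0.5 threshold. As the assumed cost of losing legitimate mail grows, the threshold moves toward 1, so the filter acts only on messages it is very sure about.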
Using Maximum Entropy for Text Classification
1999
Cited by 261 (5 self)
This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally nonuniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes significantly better, but also sometimes worse. Much future work remains, but the re...
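The conditional distribution that a maximum entropy classifier estimates has an exponential (softmax) form. The sketch below shows only that form, under stated assumptions: the feature definition (a word's relative frequency in the document, paired with a class) and the hand-set weights in the usage note are illustrative choices, and no training procedure such as improved iterative scaling is implemented.

```python
import math


def maxent_posterior(doc, classes, weights):
    """Return P(class | doc) for an exponential model with word-class features.

    weights maps (word, class) pairs to real-valued parameters; missing pairs
    count as zero. Each feature value is assumed to be the word's relative
    frequency in the document, which is an illustrative choice.
    """
    n = len(doc)
    scores = {}
    for c in classes:
        total = sum(weights.get((w, c), 0.0) for w in doc)
        scores[c] = total / n if n else 0.0
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}
```

With no weights at all, `maxent_posterior(["goal", "ball"], ["sport", "politics"], {})` gives each class probability 0.5, which matches the principle that the model stays uniform in the absence of constraints. Setting a single hypothetical weight such as `{("goal", "sport"): 2.0}` shifts probability toward "sport" for documents that contain "goal".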
Attention-Sensitive Alerting
1998
Cited by 184 (24 self)
We introduce utility-directed procedures for mediating the flow of potentially distracting alerts and communications to computer users. We present models and inference procedures that balance the context-sensitive costs of deferring alerts with the cost of interruption. We describe the challenge of reasoning about such costs under uncertainty via an analysis of user activity and the content of notifications. After introducing principles of attention-sensitive alerting, we focus on the problem of guiding alerts about email messages. We dwell on the problem of inferring the expected criticality of email and discuss work on the Priorities system, centering on prioritizing email by criticality and modulating the communication of notifications to users about the presence and nature of incoming email.
1 Introduction
Multitasking computer systems provide great value to users by hosting numerous processes and applications simultaneously. However, the ongoing execution of mu...
Learning Bayesian network structure from massive datasets: The “sparse candidate” algorithm
1999
Cited by 180 (9 self)
Learning Bayesian networks is often cast as an optimization problem, where the computational task is to find a structure that maximizes a statistically motivated score. By and large, existing learning tools address this optimization problem using standard heuristic search techniques. Since the search space is extremely large, such search procedures can spend most of the time examining candidates that are extremely unreasonable. This problem becomes critical when we deal with data sets that are large either in the number of instances, or the number of attributes. In this paper, we introduce an algorithm that achieves faster learning by restricting the search space. This iterative algorithm restricts the parents of each variable to belong to a small subset of candidates. We then search for a network that satisfies these constraints. The learned network is then used for selecting better candidates for the next iteration. We evaluate this algorithm both on synthetic and real-life data. Our results show that it is significantly faster than alternative search procedures without loss of quality in the learned structures.
Not so naive Bayes: Aggregating one-dependence estimators
Machine Learning, 2005
Cited by 63 (8 self)
Of numerous proposals to improve the accuracy of naive Bayes by weakening its attribute independence assumption, both LBR and super-parent TAN have demonstrated remarkable error performance. However, both techniques obtain this outcome at a considerable computational cost. We present a new approach to weakening the attribute independence assumption by averaging all of a constrained class of classifiers. In extensive experiments this technique delivers comparable prediction accuracy to LBR and super-parent TAN with substantially improved computational efficiency at test time relative to the former and at training time relative to the latter. The new algorithm is shown to have low variance and is suited to incremental learning.
Dynamic Bayesian Multinets
2000
Cited by 59 (18 self)
In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and class-conditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters.
1 Introduction
While Markov chains are sometimes a useful model for sequences, such simple independence assumptions can lead...