Results 1-10 of 16
Discretization for naive-Bayes learning: managing discretization bias and variance
, 2003
Cited by 36 (8 self)
Quantitative attributes are usually discretized in naive-Bayes learning. We prove a theorem that explains why discretization can be effective for naive-Bayes learning. The use of different discretization techniques can be expected to affect the classification bias and variance of generated naive-Bayes classifiers, effects we name discretization bias and variance. We argue that by properly managing discretization bias and variance, we can effectively reduce naive-Bayes classification error. In particular, we propose proportional k-interval discretization and equal size discretization, two efficient heuristic discretization methods that are able to effectively manage discretization bias and variance by tuning discretized interval size and interval number. We empirically evaluate our new techniques against five key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical arguments by showing that naive-Bayes classifiers trained on data discretized by our new methods are able to achieve lower classification error than those trained on data discretized by alternative discretization methods.
A Comparative Study of Discretization Methods for Naive-Bayes Classifiers
 In Proceedings of PKAW 2002: The 2002 Pacific Rim Knowledge Acquisition Workshop
, 2002
Cited by 30 (0 self)
Discretization is a popular approach to handling numeric attributes in machine learning. We argue that the requirements for effective discretization differ between naive-Bayes learning and many other learning algorithms. We evaluate the effectiveness with naive-Bayes classifiers of nine discretization methods: equal width discretization (EWD), equal frequency discretization (EFD), fuzzy discretization (FD), entropy minimization discretization (EMD), iterative discretization (ID), proportional k-interval discretization (PKID), lazy discretization (LD), non-disjoint discretization (NDD) and weighted proportional k-interval discretization (WPKID). It is found that in general naive-Bayes classifiers trained on data preprocessed by LD, NDD or WPKID achieve lower classification error than those trained on data preprocessed by the other discretization methods. However, LD cannot scale to large data. This study leads to a new discretization method, weighted non-disjoint discretization (WNDD), that combines the advantages of WPKID and NDD. Our experiments show that among all the rival discretization methods, WNDD best helps naive-Bayes classifiers reduce average classification error.
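The two simplest methods named in this abstract, equal width (EWD) and equal frequency (EFD) discretization, can be illustrated with a minimal sketch. The function names, the choice of k, and the handling of tied values are illustrative assumptions, not taken from the paper.

```python
from bisect import bisect_right


def equal_width_cuts(values, k):
    """Cut points that split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_cuts(values, k):
    """Cut points chosen so each interval holds roughly len(values) / k values.

    Duplicate cut points (caused by tied values) are dropped, so fewer than
    k intervals may result. This tie handling is an assumption of the sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        cut = ordered[i * n // k]
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def interval_index(value, cuts):
    """Index of the interval containing value; intervals are [cut_i, cut_i+1)."""
    return bisect_right(cuts, value)


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
width_cuts = equal_width_cuts(values, 3)
freq_cuts = equal_frequency_cuts(values, 3)
```

On this skewed sample, the equal width cuts leave almost every value in the first interval, while the equal frequency cuts spread the values more evenly.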
Weighted Proportional k-Interval Discretization for Naive-Bayes Classifiers
 in: Proc. of the PAKDD
, 2003
Cited by 18 (1 self)
The use of different discretization techniques can be expected to affect the classification bias and variance of naive-Bayes classifiers. We call such an effect discretization bias and variance. Proportional k-interval discretization (PKID) tunes discretization bias and variance by adjusting discretized interval size and number proportional to the number of training instances. Theoretical analysis suggests that this is desirable for naive-Bayes classifiers. However, PKID is suboptimal when learning from training data of small size. We argue that this is because PKID equally weighs bias reduction and variance reduction. But for small data, variance reduction can contribute more to lower learning error and thus should be given greater weight than bias reduction. Accordingly, we propose weighted proportional k-interval discretization (WPKID), which establishes a more suitable bias and variance trade-off for small data while allowing additional training data to be used to reduce both bias and variance. Our experiments demonstrate that for naive-Bayes classifiers, WPKID improves upon PKID for smaller datasets with significant frequency; and WPKID delivers lower classification error significantly more often than not in comparison to three other leading alternative discretization techniques studied.
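As a rough sketch of how such a weighting could work, the code below assumes a formulation in which the interval size s and interval number t satisfy s * t = n and s - m = t, where n is the number of training instances and m is a minimum interval size. This formulation, the default m = 30, and the function name are assumptions for illustration; the abstract itself does not state them.

```python
import math


def wpkid_parameters(n, m=30):
    """Return (interval size s, interval number t) under the assumed rule
    s * t = n and s - m = t, i.e. t solves t**2 + m * t - n = 0.
    """
    t = (-m + math.sqrt(m * m + 4 * n)) / 2
    t = max(1, int(t))  # at least one interval, rounded down
    s = n / t           # average number of instances per interval
    return s, t


size, number = wpkid_parameters(1000)
small_size, small_number = wpkid_parameters(10)
```

Under these assumptions, small training sets collapse to a single interval holding all instances, which favours variance reduction, while larger sets let both the interval size and the interval number grow.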
On Why Discretization Works for Naive-Bayes Classifiers
 In Proceedings of the 16th Australian Joint Conference on Artificial Intelligence (AI
, 2003
Cited by 18 (4 self)
We investigate why discretization is effective in naive-Bayes learning. We prove a theorem that identifies particular conditions under which discretization will result in naive-Bayes classifiers delivering the same probability estimates as would be obtained if the correct probability density functions were employed.
Proportional k-interval discretization for naive-Bayes classifiers
 Proc. of the Twelfth European Conf. on Machine Learning
, 2001
Cited by 16 (6 self)
This paper argues that two commonly-used discretization approaches, fixed k-interval discretization and entropy-based discretization, have suboptimal characteristics for naive-Bayes classification. This analysis leads to a new discretization method, Proportional k-Interval Discretization (PKID), which adjusts the number and size of discretized intervals to the number of training instances, and thus seeks an appropriate trade-off between the bias and variance of the probability estimation for naive-Bayes classifiers. We justify PKID in theory, as well as test it on a wide cross-section of datasets. Our experimental results suggest that in comparison to its alternatives, PKID provides naive-Bayes classifiers competitive classification performance for smaller datasets and better classification performance for larger datasets.
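A minimal sketch of the idea follows, assuming that both the number of intervals and the number of instances per interval are set to roughly the square root of the number of training instances n. That rule, the function name, and the tie handling are assumptions for illustration; the abstract only says that both quantities are adjusted to n.

```python
import math


def pkid_cuts(values):
    """Equal-frequency cut points with about sqrt(n) intervals (assumed rule).

    With t = floor(sqrt(n)) intervals, each interval holds about sqrt(n)
    training values, so size and number both grow as n grows.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    t = max(1, math.isqrt(n))  # number of intervals
    cuts = []
    for i in range(1, t):
        cut = ordered[i * n // t]
        # Skip duplicate cut points caused by tied values.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


cuts_16 = pkid_cuts(list(range(16)))
cuts_100 = pkid_cuts(list(range(100)))
```

With 16 training values the sketch forms 4 intervals of 4 values each; with 100 values it forms 10 intervals of 10 values each.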
Segmented regression estimators for massive data sets
 In Second SIAM International Conference on Data Mining
, 2002
Cited by 12 (6 self)
We describe two methodologies for obtaining segmented regression estimators from massive training data sets. The first methodology, called Linear Regression Tree (LRT), is used for continuous response variables, and the second and complementary methodology, called Naive Bayes Tree (NBT), is used for categorical response variables. These are implemented in the IBM ProbE™ (Probabilistic Estimation) data mining engine, which is an object-oriented framework for building classes of segmented predictive models from massive training data sets. Based on this methodology, an application called ATMSE™ for direct-mail targeted marketing has been developed jointly with Fingerhut Business Intelligence [1].
Sawtooth: Learning from huge amounts of data
, 2004
Cited by 7 (0 self)
Data scarcity has been a problem in data mining up until recent times. Now, in the era of the Internet and the tremendous advances in both data storage devices and high-speed computing, databases are filling up at rates never imagined before. The machine learning problems of the past have been augmented by an increasingly important one, scalability. Extracting useful information from arbitrarily large data collections or data streams is now of special interest within the data mining community. In this research we find that mining from such large datasets may actually be quite simple. We address the scalability issues of previous widely-used batch learning algorithms and discretization techniques used to handle continuous values within the data. Then, we describe an incremental algorithm that addresses the scalability problem of Bayesian classifiers, and propose a Bayesian-compatible online discretization technique that handles continuous values, both with a “simplicity first” approach and very low memory (RAM) requirements.
Non-disjoint discretization for naive-Bayes classifiers
 Proc. Nineteenth International Conference on Machine Learning
, 2002
Cited by 7 (1 self)
Previous discretization techniques have discretized numeric attributes into disjoint intervals. We argue that this is neither necessary nor appropriate for naive-Bayes classifiers. The analysis leads to a new discretization method, Non-Disjoint Discretization (NDD). NDD forms overlapping intervals for a numeric attribute, always locating a value toward the middle of an interval to obtain more reliable probability estimation. It also adjusts the number and size of discretized intervals to the number of training instances, seeking an appropriate trade-off between bias and variance of probability estimation. We justify NDD in theory and test it on a wide cross-section of datasets. Our experimental results suggest that for naive-Bayes classifiers, NDD works better than alternative discretization approaches.
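One way to picture the overlapping intervals is the sketch below. It assumes the attribute is first split into small equal-frequency "atomic" intervals, and that the interval used for a value is its own atomic interval merged with its two neighbours, so the value sits near the middle. The merge of exactly three atomic intervals, the clamping at the ends of the range, and all function names are assumptions for illustration, not details given in the abstract.

```python
from bisect import bisect_right


def atomic_cuts(values, num_atomic):
    """Equal-frequency cut points for the small atomic intervals."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, num_atomic):
        cut = ordered[i * n // num_atomic]
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


def overlapping_interval(value, cuts):
    """Bounds (low, high) of the merged interval used for value.

    The value's atomic interval is merged with one neighbour on each side.
    At either end of the range the window is shifted inward so that three
    atomic intervals are still merged whenever that many exist.
    """
    bounds = [float("-inf")] + cuts + [float("inf")]
    last = len(bounds) - 2           # index of the last atomic interval
    atomic = bisect_right(cuts, value)
    centre = min(max(atomic, 1), last - 1) if last >= 2 else atomic
    first = max(centre - 1, 0)
    end = min(centre + 1, last)
    return bounds[first], bounds[end + 1]


cuts = atomic_cuts(list(range(15)), 5)
middle = overlapping_interval(7, cuts)
left_edge = overlapping_interval(1, cuts)
right_edge = overlapping_interval(13, cuts)
```

Neighbouring values share most of their interval, so two different values can be assigned intervals that overlap rather than a single fixed partition.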
Internet Traffic Classification Demystified: On the Sources of the Discriminative Power
Cited by 6 (0 self)
Recent research on Internet traffic classification has yielded a number of data mining techniques for distinguishing types of traffic, but no systematic analysis of why some algorithms achieve high accuracies. In pursuit of empirically grounded answers to the “Why” question, which is critical in understanding and establishing a scientific ground for traffic classification research, this paper reveals the three sources of the discriminative power in classifying Internet application traffic: (i) ports, (ii) the sizes of the first one-two (for UDP flows) or four-five (for TCP flows) packets, and (iii) discretization of those features. We find that C4.5 performs best under all circumstances, and we identify the reason: the algorithm discretizes input features during classification operations. We also find that entropy-based Minimum Description Length discretization of the ports and packet size features substantially improves the classification accuracy of every machine learning algorithm tested (by as much as 59.8%) and makes all of them achieve >93% accuracy on average without any algorithm-specific tuning processes. Our results indicate that treating the ports and packet size features as discrete nominal intervals, not as continuous numbers, is the essential basis for accurate traffic classification (i.e., the features should be discretized first), regardless of the classification algorithm used.
Multi-View 3D Object Description with Uncertain Reasoning and Machine Learning
, 2001