## A Comparative Study of Discretization Methods for Naive-Bayes Classifiers (2002)

Venue: Proceedings of PKAW 2002: The 2002 Pacific Rim Knowledge Acquisition Workshop

Citations: 20 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Yang02acomparative,
  author    = {Ying Yang and Geoffrey I. Webb},
  title     = {A Comparative Study of Discretization Methods for Naive-Bayes Classifiers},
  booktitle = {Proceedings of PKAW 2002: The 2002 Pacific Rim Knowledge Acquisition Workshop},
  year      = {2002},
  pages     = {159--173}
}
```

### Abstract

Discretization is a popular approach to handling numeric attributes in machine learning. We argue that the requirements for effective discretization differ between naive-Bayes learning and many other learning algorithms. We evaluate the effectiveness, for naive-Bayes classifiers, of nine discretization methods: equal width discretization (EWD), equal frequency discretization (EFD), fuzzy discretization (FD), entropy minimization discretization (EMD), iterative discretization (ID), proportional k-interval discretization (PKID), lazy discretization (LD), non-disjoint discretization (NDD) and weighted proportional k-interval discretization (WPKID). It is found that in general naive-Bayes classifiers trained on data preprocessed by LD, NDD or WPKID achieve lower classification error than those trained on data preprocessed by the other discretization methods. But LD cannot scale to large data. This study leads to a new discretization method, weighted non-disjoint discretization (WNDD), that combines the advantages of WPKID and NDD. Our experiments show that among all the rival discretization methods, WNDD best helps naive-Bayes classifiers reduce average classification error.
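The two simplest methods in the list above, EWD and EFD, can be sketched as follows. This is a minimal illustration of the general techniques, not the authors' implementation; the function names and rounding choices are ours.

```python
import math

def equal_width_discretize(values, k):
    """Equal width discretization (EWD): split the value range into k
    equal-width intervals; return each value's interval index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1  # guard against zero width when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_discretize(values, k):
    """Equal frequency discretization (EFD): sort the values and place
    roughly n/k of them in each interval."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = math.ceil(len(values) / k)
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank // size
    return bins
```

EWD is sensitive to outliers (one extreme value stretches the range), while EFD is sensitive to repeated values; the paper's point is that neither tunes interval size to the needs of naive-Bayes probability estimation.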

### Citations

5438 | C4.5: Programs for Machine Learning - Quinlan - 1993
Citation Context: ...nation condition to form categorical attributes with few values. For decision tree learning, it is important to minimize the number of values of an attribute, so as to avoid the fragmentation problem [28]. If an attribute has many values, a split on this attribute will result in many branches, each of which receives relatively few training instances, making it difficult to select appropriate subsequen...

3085 | UCI repository of machine learning databases - Blake, Merz - 1998
Citation Context: ...'s complexity is prohibitively high when the training data are large. 6 Experimental Validation 6.1 Experimental Setting We run experiments on 35 natural datasets from UCI machine learning repository [4] and KDD archive [2]. These datasets vary extensively in the number of instances and the dimension of the attribute space. Table 1 describes each dataset, including the number of instances (Size), num...

703 | Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning - Fayyad, Irani - 1993
Citation Context: ...ls dominated by a single class for naive-Bayes classifiers than for decision trees or decision rules. Thus discretization methods that pursue pure intervals (containing instances with the same class) [1, 5, 10, 11, 14, 15, 19, 29] might not suit naive-Bayes classifiers. Besides, naive-Bayes classifiers deem attributes conditionally independent of each other and do not use attribute combinations as predictors. There is no need t...

645 | On the optimality of the simple Bayesian classifier under zero-one loss - Domingos, Pazzani - 1997
Citation Context: ...ayes classifiers are simple, efficient and robust to noisy data. One limitation is that the attribute independence assumption in (3) is often violated in the real world. However, Domingos and Pazzani [8] suggest that this limitation has less impact than might be expected because classification under zero-one loss is only a function of the sign of the probability estimation; the classification accurac...

474 | Very Simple Classification Rules Perform Well on Most Commonly Used Datasets - Holte - 1993
Citation Context: ...ls dominated by a single class for naive-Bayes classifiers than for decision trees or decision rules. Thus discretization methods that pursue pure intervals (containing instances with the same class) [1, 5, 10, 11, 14, 15, 19, 29] might not suit naive-Bayes classifiers. Besides, naive-Bayes classifiers deem attributes conditionally independent of each other and do not use attribute combinations as predictors. There is no need t...

457 | Supervised and Unsupervised Discretization of Continuous Features - Dougherty, Kohavi, et al. - 1995
Citation Context: ...ibutes are often preprocessed by discretization as the classification performance tends to be better when numeric attributes are discretized than when they are assumed to follow a normal distribution [9]. For each numeric attribute X, a categorical attribute X* is created. Each value of X* corresponds to an interval of values of X. X* is used instead of X for training a classifier. ...

343 | Estimating continuous distributions in bayesian classifiers - John, Langley - 1995
Citation Context: ...etermined by a density function f which satisfies [30]: 1. f(xi) ≥ 0, ∀xi ∈ Si; 2. ∫_{Si} f(Xi)dXi = 1; 3. ∫_{ai}^{bi} f(Xi)dXi = p(ai < Xi ≤ bi), ∀(ai, bi] ∈ Si. p(Xi = xi | C = c) can be estimated from f [17]. But for real-world data, f is usually unknown. Under discretization, a categorical attribute X*i is formed for Xi. Each value x*i of X*i corresponds to an interval (ai, bi] of Xi. If xi ∈ (ai, ...

207 | On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery - Friedman - 1997
Citation Context: ...cretization (WPKID) WPKID [35] is an improved version of PKID. It is credible that for smaller datasets, variance reduction can contribute more to lower naive-Bayes learning error than bias reduction [12]. Thus fewer intervals each containing more instances would be of greater utility. Accordingly WPKID weighs discretization variance reduction more than bias reduction by setting a minimum interval siz...

176 | Estimating probabilities: A crucial task in machine learning - Cestnik - 1990
Citation Context: ...an be estimated with reasonable accuracy from the frequency of instances with C = c and the frequency of instances with Xi = xi ∧ C = c in the training data. In our experiments: – The Laplace-estimate [6] was used to estimate p(C = c): (nc + k)/(N + n×k), where nc is the number of instances satisfying C = c, N is the number of training instances, n is the number of classes and k = 1. – The M-estimate [6] was...
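The Laplace-estimate quoted in this excerpt translates directly into code. This is a sketch; the function name is ours.

```python
def laplace_estimate(n_c, N, n, k=1):
    """Laplace-estimate of p(C = c) as given in the excerpt:
    (n_c + k) / (N + n*k), where n_c = instances with C = c,
    N = number of training instances, n = number of classes, k = 1."""
    return (n_c + k) / (N + n * k)
```

For example, with 3 of 10 instances in class c and 2 classes, the estimate is 4/12 rather than the raw 3/10; sparse counts are pulled toward the uniform prior, and a zero count never yields a zero probability.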

176 | ChiMerge: Discretization of Numeric Attributes - Kerber - 1992
Citation Context: ...ls dominated by a single class for naive-Bayes classifiers than for decision trees or decision rules. Thus discretization methods that pursue pure intervals (containing instances with the same class) [1, 5, 10, 11, 14, 15, 19, 29] might not suit naive-Bayes classifiers. Besides, naive-Bayes classifiers deem attributes conditionally independent of each other and do not use attribute combinations as predictors. There is no need t...

168 | On Changing Continuous Attributes into ordered discrete Attributes - Catlett - 1991

155 | The UCI KDD Archive [http://kdd.ics.uci.edu] - Hettich, Bay - 1999
Citation Context: ...hibitively high when the training data are large. 6 Experimental Validation 6.1 Experimental Setting We run experiments on 35 natural datasets from UCI machine learning repository [4] and KDD archive [2]. These datasets vary extensively in the number of instances and the dimension of the attribute space. Table 1 describes each dataset, including the number of instances (Size), numeric attributes (Num...

100 | Multiboosting: a technique for combining boosting and wagging - Webb - 2000
Citation Context: ...Geometric mean error ratio. This method has been explained by Webb [32]. It allows for the relative difficulty of error reduction in different datasets and can be more reliable than the mean ratio of errors across datasets. Win/lose/tie record. The three values are respe...

75 | Inductive and Bayesian learning in medical diagnosis - Kononenko - 1993
Citation Context: ...ppropriately approximates the distribution of a numeric attribute [16]. 4.3 Fuzzy Discretization (FD) There are three versions of fuzzy discretization proposed by Kononenko for naive-Bayes classifiers [20, 21]. They differ in how the estimation of p(ai < Xi ≤ bi | C = c) in (5) is obtained. Because space limits, we present here only the version that, according to our experiments, best reduces the classific...

49 | Global discretization of continuous attributes as preprocessing for machine learning - Chmielewski, Grzymala-Busse - 1990
Citation Context: ...an be achieved by tuning the interval size and number to find a good trade-off between the bias and variance. Suppose the desired interval size is s and the desired interval number is t, PKID employs (7) to calculate s and t: s × t = n, s = t. (7) PKID discretizes the sorted values into intervals with size s. Thus PKID gives equal weight to discretization bias reduction and variance reduction by setti...
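The PKID rule quoted here (s × t = n with s = t) fixes both the interval size and the interval number at √n. The following is a minimal sketch; the rounding of √n is our assumption, since the excerpt does not show the paper's exact rule.

```python
import math

def pkid_bins(values):
    """PKID as excerpted: pick interval size s and interval number t with
    s * t = n and s = t, i.e. s = t = sqrt(n); then discretize the sorted
    values into intervals of size s (sqrt is rounded down here)."""
    n = len(values)
    s = max(int(math.sqrt(n)), 1)  # desired interval size
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank // s
    return bins
```

Growing both s and t with n is what lets PKID trade discretization bias against variance as the training set grows, instead of fixing the interval count as EWD/EFD do.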

26 | Zeta: A global method for discretization of continuous variable - Ho, Scott - 1997

24 | Class-Driven Statistical Discretization of Continuous Attributes - Richeldi, Rossotto - 1995

22 | Statistics: Principles and Methods - Johnson, Bhattacharyya - 1996
Citation Context: ...scretization for Naive-Bayes Classifiers An attribute is either categorical or numeric. Values of a categorical attribute are discrete. Values of a numeric attribute are either discrete or continuous [18]. p(Xi = xi | C = c) in (4) is modelled by a real number between 0 and 1, denoting the probability that the attribute Xi will take the particular value xi when the class is c. This assumes that attrib...

19 | An evolutionary algorithm using multivariate discretization for decision rule induction - Kwedlo, Kretowski - 1999

19 | A multivariate discretization method for learning Bayesian networks from mixed data - Monti, Cooper - 1998

17 | Multivariate discretization of continuous variables for set mining - Bay - 2000
Citation Context: ... C = c). So a simplification is made: if attributes X1, X2, · · · , Xk are conditionally independent of each other given the class, then p(X = x | C = c) = p(∧Xi = xi | C = c) = ∏ p(Xi = xi | C = c). (3) Combining (2) and (3), one can further estimate the probability by: p(C = c | X = x) ∝ p(C = c) ∏ p(Xi = xi | C = c). (4) Classifiers using (4) are called naive-Bayes classifiers. Naive-Bayes classif...
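Equation (4) in this excerpt is the standard naive-Bayes decision rule. A minimal sketch with precomputed probability tables follows; the function name and table layout are ours.

```python
def naive_bayes_classify(prior, cond, x):
    """Equation (4): choose the class c maximizing
    p(C = c) * prod_i p(X_i = x_i | C = c).
    `prior` maps class -> p(C = c); `cond` maps
    (attribute index, value, class) -> p(X_i = x_i | C = c)."""
    best_class, best_score = None, -1.0
    for c, p_c in prior.items():
        score = p_c
        for i, xi in enumerate(x):
            score *= cond.get((i, xi, c), 0.0)  # unseen combination -> 0
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

The zero probability for unseen (value, class) pairs is exactly what smoothing such as the Laplace-estimate (entry [6] below) is meant to avoid in the conditional tables.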

16 | Why discretization works for naive bayesian classifiers - Hsu, Huang, et al. - 2000
Citation Context: ...iscretized attributes have Dirichlet priors, and 'Perfect Aggregation' of Dirichlets can ensure that naive-Bayes with discretization appropriately approximates the distribution of a numeric attribute [16]. 4.3 Fuzzy Discretization (FD) There are three versions of fuzzy discretization proposed by Kononenko for naive-Bayes classifiers [20, 21]. They differ in how the estimation of p(ai < Xi ≤ bi | C = c)...

16 | An iterative improvement approach for the discretization of numeric attributes in bayesian classifiers - Pazzani - 1995
Citation Context: ...nherent in naive-Bayes classifiers, further reducing their capability to accurately classify in the context of violation of the attribute independence assumption. 4.5 Iterative Discretization (ID) ID [25] initially forms a set of intervals using EWD or MED, and then iteratively adjusts the intervals to minimize naive-Bayes classifiers' classification error on the training data. It defines two operator...

16 | Proportional k-Interval Discretization for Naïve-Bayes Classifiers - Yang, Webb - 2001
Citation Context: ...onsumption. Since our experiments include large datasets, we hereby only introduce ID for the integrality of our study, without implementing it. 4.6 Proportional k-Interval Discretization (PKID) PKID [33] adjusts discretization bias and variance by tuning the interval size and number, and further adjusts the naive-Bayes' probability estimation bias and variance to achieve lower classification error. T...

15 | Relative Unsupervised Discretization for Association Rule Mining - Ludl, Widmer

14 | Weighted Proportional k-Interval Discretization for Naive-Bayes Classifiers - Yang, Webb - 2003
Citation Context: ...timation. For simplicity, we take k = 3 for demonstration. (Fig. 1. Atomic Intervals Compose Actual Intervals) 4.9 Weighted Proportional k-Interval Discretization (WPKID) WPKID [35] is an improved version of PKID. It is credible that for smaller datasets, variance reduction can contribute more to lower naive-Bayes learning error than bias reduction [12]. Thus fewer intervals eac...

12 | Discretization of continuous attributes for learning classification rules - An, Cercone - 1999

11 | Naive bayesian classifier and continuous attributes - Kononenko - 1992
Citation Context: ...ppropriately approximates the distribution of a numeric attribute [16]. 4.3 Fuzzy Discretization (FD) There are three versions of fuzzy discretization proposed by Kononenko for naive-Bayes classifiers [20, 21]. They differ in how the estimation of p(ai < Xi ≤ bi | C = c) in (5) is obtained. Because space limits, we present here only the version that, according to our experiments, best reduces the classific...

9 | Multi-interval discretization methods for decision tree learning - Perner, Trautzsch - 1998

8 | Dynamic discretization of continuous attributes - Gama, Torgo, et al. - 1998

7 | Speeding up knowledge discovery in large relational databases by means of a new discretization algorithm - Freitas, Lavington - 1996
Citation Context: ...th high dimensional attribute spaces and huge numbers of instances are increasingly used in real-world applications, a study of these methods' performance on large datasets is necessary and desirable [11, 27]. Nine discretization methods are included in this comparative study, each of which is either designed especially for naive-Bayes classifiers or is in practice often used for naive-Bayes classifiers. ...

6 | Non-disjoint discretization for naive-bayes classifiers - Yang, Webb - 2002
Citation Context: ...tion tasks with large training or test data. In our experiments, we implement LD with the interval size equal to that created by PKID, as this has been shown to outperform the original implementation [16, 34]. 4.8 Non-Disjoint Discretization (NDD) NDD [34] forms overlapping intervals for Xi, always locating a value xi toward the middle of its corresponding interval (ai, bi]. The idea behind NDD is that wh...
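The key NDD idea in this excerpt is that intervals overlap so that each value sits toward the middle of its interval. The sketch below is speculative: it assumes each actual interval spans k = 3 consecutive atomic intervals (the k = 3 figure comes from the WPKID excerpt above), with the value's own atomic interval in the middle, clipped at the boundaries.

```python
def ndd_interval(atomic_index, k=3, n_atomic=None):
    """Sketch of NDD interval placement (our assumption, not the paper's
    exact construction): return the range of k consecutive atomic
    intervals centred on the value's atomic interval, shifted inward
    at the ends when the total number of atomic intervals is known."""
    lo = max(atomic_index - k // 2, 0)
    if n_atomic is not None:
        lo = min(lo, max(n_atomic - k, 0))
    return (lo, lo + k - 1)  # inclusive range of atomic intervals covered
```

Unlike disjoint schemes, two nearby values can thus share most of their interval, so a value near a cut point is no longer assigned to an interval whose boundary it sits on.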

5 | Concurrent discretization of multiple attributes - Wang, Liu - 1998

4 | Scaling up machine learning with massive parallelism - Provost, Aronis - 1996
Citation Context: ...th high dimensional attribute spaces and huge numbers of instances are increasingly used in real-world applications, a study of these methods' performance on large datasets is necessary and desirable [11, 27]. Nine discretization methods are included in this comparative study, each of which is either designed especially for naive-Bayes classifiers or is in practice often used for naive-Bayes classifiers. ...

2 | Probability and Statistics for Engineers, fourth ed - Scheaffer, McClave - 1995
Citation Context: ...space of Xi, for any particular xi ∈ Si, the probability p(Xi = xi) will be arbitrarily close to 0. The probability distribution of Xi is completely determined by a density function f which satisfies [30]: 1. f(xi) ≥ 0, ∀xi ∈ Si; 2. ∫_{Si} f(Xi)dXi = 1; 3. ∫_{ai}^{bi} f(Xi)dXi = p(ai < Xi ≤ bi), ∀(ai, bi] ∈ Si. p(Xi = xi | C = c) can be estimated from f [17]. But for real-world data, f is usually unknown. U...