## Discretization for naive-Bayes learning: managing discretization bias and variance (2003)

### Download Links

- [www.csse.monash.edu]
- [www.csse.monash.edu.au]
- DBLP

### Other Repositories/Bibliography

Citations: 19 (5 self)

### BibTeX

```bibtex
@MISC{Yang03discretizationfor,
  author = {Ying Yang and Geoffrey I. Webb},
  title  = {Discretization for naive-Bayes learning: managing discretization bias and variance},
  year   = {2003}
}
```


### Abstract

Quantitative attributes are usually discretized in naive-Bayes learning. We prove a theorem that explains why discretization can be effective for naive-Bayes learning. The use of different discretization techniques can be expected to affect the classification bias and variance of generated naive-Bayes classifiers, effects we name discretization bias and variance. We argue that by properly managing discretization bias and variance, we can effectively reduce naive-Bayes classification error. In particular, we propose proportional k-interval discretization and equal size discretization, two efficient heuristic discretization methods that are able to effectively manage discretization bias and variance by tuning discretized interval size and interval number. We empirically evaluate our new techniques against five key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical arguments by showing that naive-Bayes classifiers trained on data discretized by our new methods are able to achieve lower classification error than those trained on data discretized by alternative discretization methods.
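As background for the abstract's claim about tuning interval size and number: proportional k-interval discretization, as described in the authors' earlier work, sets both interval frequency and interval number to grow as roughly the square root of the number of training instances. A minimal sketch under that assumption (function name and the midpoint cut placement are our simplifications, not the paper's specification):

```python
import math

def pkid_cut_points(values):
    """Proportional k-interval discretization (sketch).

    Sets both the target interval count t and the per-interval
    frequency s to roughly sqrt(n), so each grows with training data.
    """
    values = sorted(values)
    n = len(values)
    t = max(1, int(math.sqrt(n)))  # desired number of intervals
    s = n // t                     # resulting interval frequency
    # Place a cut point midway between every s-th pair of sorted values.
    return [(values[i * s - 1] + values[i * s]) / 2 for i in range(1, t)]

cuts = pkid_cut_points(range(100))  # 100 instances -> 10 intervals of ~10 values
```

With 100 training values, both the interval number and the interval frequency come out at 10, illustrating how the method balances the two as data grows.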

### Citations

5438 | C4.5: Programs for Machine Learning - Quinlan - 1993 |

4178 | Pattern Classification and Scene Analysis - Duda, Hart - 1973 |
Citation Context: ...ner is asked to predict a test instance x’s class according to the evidence provided by the training data. Expected classification error can be minimized by choosing argmaxc(p(C=c | X=x)) for each x (Duda & Hart, 1973). Bayes theorem can be used to calculate: p(C=c | X=x) = p(C=c)p(X=x | C=c) / p(X=x). (1) Since the denominator in (1) is invariant across classes, it does not affect the final choice and can be dro... |
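The decision rule quoted above, with the invariant denominator dropped, can be sketched as follows (toy probabilities for illustration only; none of this comes from the paper):

```python
def naive_bayes_predict(instance, priors, cond_probs):
    """Choose argmax_c p(C=c) * prod_i p(X_i=x_i | C=c).

    priors: {class: p(C=c)}
    cond_probs: {class: [ {value: p(X_i=value | C=c)} per attribute ]}
    """
    def score(c):
        p = priors[c]
        for attr_index, value in enumerate(instance):
            # Unseen values get probability 0 in this bare-bones sketch.
            p *= cond_probs[c][attr_index].get(value, 0.0)
        return p
    return max(priors, key=score)

# Toy example: one discretized attribute, two classes.
priors = {"pos": 0.5, "neg": 0.5}
cond_probs = {"pos": [{"high": 0.9, "low": 0.1}],
              "neg": [{"high": 0.2, "low": 0.8}]}
print(naive_bayes_predict(["high"], priors, cond_probs))  # prints pos
```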

3085 | UCI repository of machine learning databases - Blake, Merz - 1998 |

2640 | Density estimation for statistical and data analysis. Monographs on statistics and applied probability - Silverman - 1986 |

835 | A comparison of event models for naive Bayes text classification - McCallum, Nigam - 1998 |

807 | The CN2 induction algorithm - Clark, Niblett - 1989 |
Citation Context: ...hns, 1960; Mitchell, 1997; Lewis, 1998). They were first introduced into machine learning as a straw man, against which new algorithms were compared and evaluated (Cestnik, Kononenko, & Bratko, 1987; Clark & Niblett, 1989; Cestnik, 1990). But it was soon realized that their classification performance was surprisingly good compared with other more complex classification algorithms (Kononenko, 1990; Langley, Iba, & Thom... |

716 | A re-examination of text categorization methods - Yang, Liu - 1999 |

703 | Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning - Fayyad, Irani - 1993 |

645 | On the optimality of the simple Bayesian classifier under zero-one loss - Domingos, Pazzani - 1997 |

638 | Bayesian network classifiers - Friedman, Geiger, et al. - 1997 |
Citation Context: ...here has also been considerable interest in developing variants of naive-Bayes learning that weaken the attribute independence assumption (Langley & Sage, 1994; Sahami, 1996; Singh & Provan, 1996; N. Friedman, Geiger, & Goldszmidt, 1997; Keogh & Pazzani, 1999; Zheng & Webb, 2000; Webb, Boughton, & Wang, 2005; Acid, Campos, & Castellano, 2005; Cerquides & Mántaras, 2005). Classification tasks of... |

582 | An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants - Bauer, Kohavi - 1999 |

507 | A sequential algorithm for training text classifiers - Lewis, Gale - 1994 |

497 | Statistical Inference - Casella, Berger - 2002 |
Citation Context: ... can be estimated from the frequency of instances with C=c and the frequency of instances with Xi=xi∧C=c. These estimates are strong consistent estimates according to the strong law of large numbers (Casella & Berger, 1990; John & Langley, 1995). When it is quantitative, Xi often has a large or even an infinite number of values (Bluman, 1992; Samuels & Witmer, 1999). Thus the probability of a particular value xi given ... |

457 | Supervised and Unsupervised Discretization of Continuous Features - Dougherty, Kohavi, et al. - 1995 |
Citation Context: ...ethod’s discretization bias and variance, which we believe illuminating. 6.1 Equal width discretization & Equal frequency discretization. Equal width discretization (EWD) (Catlett, 1991; Kerber, 1992; Dougherty et al., 1995) divides the number line between vmin and vmax into k intervals of equal width, where k is a user predefined parameter. Thus the intervals have width w = (vmax − vmin)/k and the cut points are at vmin ... |
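The EWD cut points described in this context follow directly from the formula w = (vmax − vmin)/k; a minimal sketch (the function name is ours):

```python
def equal_width_cut_points(vmin, vmax, k):
    """Equal width discretization (EWD).

    Divides [vmin, vmax] into k intervals of width w = (vmax - vmin) / k
    and returns the k - 1 interior cut points vmin + w, vmin + 2w, ...
    """
    w = (vmax - vmin) / k
    return [vmin + i * w for i in range(1, k)]

print(equal_width_cut_points(0.0, 10.0, 5))  # prints [2.0, 4.0, 6.0, 8.0]
```

Note that, as the surrounding discussion implies, the intervals contain equal ranges of the number line but generally unequal numbers of training instances.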

452 | Hierarchically classifying documents using very few words - Koller, Sahami - 1997 |

385 | Naive (Bayes) at forty: The independence assumption in information retrieval - Lewis - 1998 |
Citation Context: ...upport incremental training. These merits have seen them deployed in numerous classification tasks. They have long been a core technique in information retrieval (Maron & Kuhns, 1960; Mitchell, 1997; Lewis, 1998). They were first introduced into machine learning as a straw man, against which new algorithms were compared and evaluated (Cestnik, Kononenko, & Bratko, 1987; Clark & N... |

379 | A probabilistic analysis of the rocchio algorithm with tfidf for text categorization - Joachims - 1997 |

362 | An Analysis of Bayesian Classifiers - Langley, Iba, et al. - 1992 |
Citation Context: ...lark & Niblett, 1989; Cestnik, 1990). But it was soon realized that their classification performance was surprisingly good compared with other more complex classification algorithms (Kononenko, 1990; Langley, Iba, & Thompson, 1992; Domingos & Pazzani, 1996, 1997). In consequence, naive-Bayes classifiers have widespread deployment in applications including medical diagnosis (Lavrac, 1998; Lavrac, Keravnou, & Zupan, 2000; Konone... |

343 | Estimating continuous distributions in bayesian classifiers - John, Langley - 1995 |

328 | Statistical comparisons of classifiers over multiple data sets - Demšar |

323 | Learning and revising user profiles: The identification of interesting web sites. Machine Learning 27:313–331 - Pazzani, Muramatsu, et al. - 1997 |

323 | Syskill & webert: Identifying interesting web sites - Pazzani, Muramatsu, et al. - 1996 |

313 | Beyond independence: Conditions for the optimality of the simple Bayesian classifier - Domingos, Pazzani - 1996 |
Citation Context: ...990). But it was soon realized that their classification performance was surprisingly good compared with other more complex classification algorithms (Kononenko, 1990; Langley, Iba, & Thompson, 1992; Domingos & Pazzani, 1996, 1997). In consequence, naive-Bayes classifiers have widespread deployment in applications including medical diagnosis (Lavrac, 1998; Lavrac, Keravnou, & Zupan, 2000; Kononenko, 2001), email filterin... |

268 | Toward optimal active learning through sampling estimation of error reduction - Roy, McCallum |

252 | Improving text classification by shrinkage in a hierarchy of classes - McCallum, Rosenfeld, et al. - 1998 |

236 | Content-based book recommending using learning for text categorization - Mooney, Roy - 1999 |
Citation Context: ... 2001), email filtering (Androutsopoulos, Koutsias, Chandrinos, & Spyropoulos, 2000; Crawford, Kay, & Eric, 2002), and recommender systems (Starr, Ackerman, & Pazzani, 1996; Miyahara & Pazzani, 2000; Mooney & Roy, 2000). There has also been considerable interest in developing variants of naive-Bayes learning that weaken the attribute independence assumption (Langley & Sage, 1994; Sahami, 1996; Singh & Provan, 1996;... |

233 | Induction of selective Bayesian classifiers - Langley, Sage - 1994 |
Citation Context: ...zzani, 1996; Miyahara & Pazzani, 2000; Mooney & Roy, 2000). There has also been considerable interest in developing variants of naive-Bayes learning that weaken the attribute independence assumption (Langley & Sage, 1994; Sahami, 1996; Singh & Provan, 1996; N. Friedman, Geiger, & Goldszmidt, 1997; Keogh & Pazzani, 1999; Zheng & Webb, 2000; Webb, Boughton, & Wang, 2005; Acid, Campos, & Castellano, 2005; Cerquides & Má... |

229 | An Evaluation of Phrasal and Clustered Representations on a Text Categorisation Task - Lewis - 1992 |

207 | On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery - Friedman - 1997 |
Citation Context: ...is measured by their classification error. The error can be decomposed into a bias term, a variance term and an irreducible term (Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; J. H. Friedman, 1997; Webb, 2000). Bias describes the component of error that results from systematic error of the learning algorithm. Variance describes the component of error that results from random variation in the t... |

204 | On Relevance, Probabilistic Indexing and Information Retrieval - Maron, Kuhns - 1960 |
Citation Context: ...e, efficient and robust, as well as support incremental training. These merits have seen them deployed in numerous classification tasks. They have long been a core technique in information retrieval (Maron & Kuhns, 1960; Mitchell, 1997; Lewis, 1998). They were first introduced into machine learning as a straw man, against which new algorithms were compared and evaluated (Cestnik, Kononenko, & Bratko, 1987; Clark & N... |

189 | Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid - Kohavi - 1996 |

186 | Bias plus variance decomposition for zero-one loss functions - Kohavi, Wolpert - 1996 |
Citation Context: ...ifiers discussed in our study is measured by their classification error. The error can be decomposed into a bias term, a variance term and an irreducible term (Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; J. H. Friedman, 1997; Webb, 2000). Bias describes the component of error that results from systematic error of the learning algorithm. Variance describes the component of error that results from ran... |

176 | Estimating probabilities: A crucial task in machine learning - Cestnik - 1990 |
Citation Context: ...97; Lewis, 1998). They were first introduced into machine learning as a straw man, against which new algorithms were compared and evaluated (Cestnik, Kononenko, & Bratko, 1987; Clark & Niblett, 1989; Cestnik, 1990). But it was soon realized that their classification performance was surprisingly good compared with other more complex classification algorithms (Kononenko, 1990; Langley, Iba, & Thompson, 1992; Dom... |

176 | ChiMerge: Discretization of Numeric Attributes - Kerber - 1992 |
Citation Context: ...alyzing each method’s discretization bias and variance, which we believe illuminating. 6.1 Equal width discretization & Equal frequency discretization. Equal width discretization (EWD) (Catlett, 1991; Kerber, 1992; Dougherty et al., 1995) divides the number line between vmin and vmax into k intervals of equal width, where k is a user predefined parameter. Thus the intervals have width w = (vmax − vmin)/k and the... |

171 | Learning to classify text from labeled and unlabeled documents - Nigam, McCallum, et al. - 1998 |

170 | The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance - Friedman - 1937 |
Citation Context: ... obtains lower, higher or equal classification error, compared with the naive-Bayes classifier trained with another discretization method. • Mean rank. Following the practice of the Friedman test (M. Friedman, 1937, 1940), for each data set, we rank competing algorithms. The one that leads to the best naive-Bayes classification accuracy is ranked 1, the second best ranked 2, so on and so forth. A method’s mean ... |
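The mean-rank computation described in this context can be sketched as follows (a simplified version that, unlike the full Friedman test, does not average ranks over ties; the function name is ours):

```python
def mean_ranks(error_table):
    """Mean rank of each algorithm across data sets.

    error_table: one row per data set, one column per algorithm,
    entries are classification error rates (lower is better, rank 1 = best).
    """
    n_alg = len(error_table[0])
    totals = [0.0] * n_alg
    for row in error_table:
        # Sort algorithm indices by error; position in the order is the rank.
        order = sorted(range(n_alg), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    return [t / len(error_table) for t in totals]

# Two data sets, two algorithms: each wins once, so both average rank 1.5.
print(mean_ranks([[0.10, 0.20], [0.30, 0.10]]))  # prints [1.5, 1.5]
```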

168 | On Changing Continuous Attributes into Ordered Discrete Attributes - Catlett - 1991 |
Citation Context: ...nterested in analyzing each method’s discretization bias and variance, which we believe illuminating. 6.1 Equal width discretization & Equal frequency discretization. Equal width discretization (EWD) (Catlett, 1991; Kerber, 1992; Dougherty et al., 1995) divides the number line between vmin and vmax into k intervals of equal width, where k is a user predefined parameter. Thus the intervals have width w = (vmax − v... |

162 | Data mining using MLC++: a machine learning library in C - Kohavi - 1996 |

155 | The UCI KDD Archive, [http://kdd.ics.uci.edu - Hettich, Bay - 1999 |

155 | Error-correcting output coding corrects bias and variance - Kong, Dietterich - 1995 |
Citation Context: ...nce The performance of naive-Bayes classifiers discussed in our study is measured by their classification error. The error can be decomposed into a bias term, a variance term and an irreducible term (Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; J. H. Friedman, 1997; Webb, 2000). Bias describes the component of error that results from systematic error of the learning algorithm. Variance describes the c... |

146 | Combining classifiers in text categorization - Larkey, Croft - 1993 |

119 | Reducing Misclassification Costs - Pazzani, Merz, et al. - 1994 |
Citation Context: ...tribute values are drawn. For instance, a conventional approach is to assume that a quantitative attribute’s probability within a class has a normal distribution (Langley, 1993; Langley & Sage, 1994; Pazzani et al., 1994; Mitchell, 1997). However, Pazzani (1995) argued that in many real-world applications the attribute data did not follow a normal distribution; and as a result, the probability estimation of naive-Bay... |
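The normal-distribution approach mentioned in this context estimates p(Xi = xi | C = c) from each class's sample mean and standard deviation; a minimal sketch (not the cited implementations, and with no smoothing or degenerate-sigma handling):

```python
import math
import statistics

def gaussian_prob(x, class_values):
    """Density of x under a normal distribution fitted to the values of
    one quantitative attribute within one class (sample mean and stdev)."""
    mu = statistics.mean(class_values)
    sigma = statistics.stdev(class_values)
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
```

At the class mean the density is simply 1/(sigma * sqrt(2 * pi)); Pazzani's objection quoted above is precisely that real attribute data often violates this normality assumption, making such estimates unreliable.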

113 | Learning Limited Dependence Bayesian Classifiers - Sahami - 1996 |
Citation Context: ...& Pazzani, 2000; Mooney & Roy, 2000). There has also been considerable interest in developing variants of naive-Bayes learning that weaken the attribute independence assumption (Langley & Sage, 1994; Sahami, 1996; Singh & Provan, 1996; N. Friedman, Geiger, & Goldszmidt, 1997; Keogh & Pazzani, 1999; Zheng & Webb, 2000; Webb, Boughton, & Wang, 2005; Acid, Campos, & Castellano, 2005; Cerquides & Mántaras, 2005).... |

109 | An experimental comparison of naïve Bayesian and keyword-based anti-spam filtering with personal e-mail messages - Androutsopoulos, Koutsias, et al. - 2000 |
Citation Context: ...997). In consequence, naive-Bayes classifiers have widespread deployment in applications including medical diagnosis (Lavrac, 1998; Lavrac, Keravnou, & Zupan, 2000; Kononenko, 2001), email filtering (Androutsopoulos, Koutsias, Chandrinos, & Spyropoulos, 2000; Crawford, Kay, & Eric, 2002), and recommender systems (Starr, Ackerman, & Pazzani, 1996; Miyahara & Pazzani, 2000; Mooney & Roy, 2000). There has also been considerable interest in developing varian... |

108 | Assistant 86: A knowledge-elicitation tool for sophisticated users - Cestnik, Kononenko, et al. - 1987 |
Citation Context: ...in information retrieval (Maron & Kuhns, 1960; Mitchell, 1997; Lewis, 1998). They were first introduced into machine learning as a straw man, against which new algorithms were compared and evaluated (Cestnik, Kononenko, & Bratko, 1987; Clark & Niblett, 1989; Cestnik, 1990). But it was soon realized that their classification performance was surprisingly good compared with other more complex classification algorithms (Kononenko, 199... |

100 | MultiBoosting: A technique for combining boosting and wagging - Webb - 2000 |
Citation Context: ...ive-Bayes learning that weaken the attribute independence assumption (Langley & Sage, 1994; Sahami, 1996; Singh & Provan, 1996; N. Friedman, Geiger, & Goldszmidt, 1997; Keogh & Pazzani, 1999; Zheng & Webb, 2000; Webb, Boughton, & Wang, 2005; Acid, Campos, & Castellano, 2005; Cerquides & Mántaras, 2005). Classification tasks often involve quantitative attributes. For na... |

98 | Bias, variance and arcing classifiers - Breiman - 1996 |
Citation Context: ...ive-Bayes classifiers discussed in our study is measured by their classification error. The error can be decomposed into a bias term, a variance term and an irreducible term (Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; J. H. Friedman, 1997; Webb, 2000). Bias describes the component of error that results from systematic error of the learning algorithm. Variance describes the component of err... |

91 | Discretization: An Enabling Technique - Liu, Hussain, et al. - 2002 |
Citation Context: ...method can affect its discretization bias and variance. Such a relationship has been hypothesized also by a number of previous authors (Pazzani, 1995; Torgo & Gama, 1997; Gama, Torgo, & Soares, 1998; Hussain, Liu, Tan, & Dash, 1999; Mora, Fortes, Morales, & Triguero, 2000; Hsu et al., 2000, 2003). Thus we anticipate that one way to manage discretization bias and variance is to adjust interval frequency and interval number. C... |

81 | Automatic Indexing: An Experimental Inquiry - Maron - 1961 |

80 | Hidden Naive Bayes - Zhang, Jiang, et al. - 2005 |