## Supervised and Unsupervised Discretization of Continuous Features (1995)


### Download Links

- [ai.stanford.edu]
- [robotics.stanford.edu]
- [www.dsi.unive.it]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning: Proceedings of the Twelfth International Conference

Citations: 445 (10 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Dougherty95supervisedand,
  author    = {James Dougherty and Ron Kohavi and Mehran Sahami},
  title     = {Supervised and Unsupervised Discretization of Continuous Features},
  booktitle = {Machine Learning: Proceedings of the Twelfth International Conference},
  year      = {1995},
  pages     = {194--202},
  publisher = {Morgan Kaufmann}
}
```



### Abstract

Many supervised machine learning algorithms require a discrete feature space. In this paper, we review previous work on continuous feature discretization, identify defining characteristics of the methods, and conduct an empirical evaluation of several methods. We compare binning, an unsupervised discretization method, to entropy-based and purity-based methods, which are supervised algorithms. We found that the performance of the Naive-Bayes algorithm significantly improved when features were discretized using an entropy-based method. In fact, over the 16 tested datasets, the discretized version of Naive-Bayes slightly outperformed C4.5 on average. We also show that in some cases, the performance of the C4.5 induction algorithm significantly improved if features were discretized in advance; in our experiments, the performance never significantly degraded, an interesting phenomenon considering the fact that C4.5 is capable of locally discretizing features.
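As a minimal illustration of the unsupervised baseline compared in the abstract, the sketch below implements equal-width binning. The function name and the choice of k are illustrative assumptions, not taken from the authors' experimental setup:

```python
# Hypothetical sketch of equal-width binning, the simplest
# unsupervised discretization method: split the observed range of a
# feature into k intervals of identical width, ignoring class labels.

def equal_width_bins(values, k):
    """Discretize a continuous feature into k equal-width intervals.

    Returns the interior cut points and the bin index of each value.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # k-1 interior cut points; the outer boundaries are open-ended.
    cuts = [lo + width * i for i in range(1, k)]
    bins = [sum(v > c for c in cuts) for v in values]
    return cuts, bins

cuts, bins = equal_width_bins([0.1, 0.4, 0.5, 0.9, 1.0], k=2)
print(bins)  # [0, 0, 0, 1, 1]
```

Because the cut points depend only on the range of the feature, a single outlier can stretch the bins arbitrarily; this is one reason the supervised methods evaluated in the paper can do better.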

### Citations

1414 | Self-Organization and Associative Memory - Kohonen - 1989

Citation Context: ...itioning of a continuous feature in O(m(log m + k^2)) time, where k is the number of intervals and m is the number of instances. This method has yet to be tested experimentally. Vector Quantization (Kohonen 1989) is also related to the notion of discretization. This method attempts to partition an N-dimensional continuous space into a Voronoi Tessellation and then represent the set of points in each region b...

843 | C4.5: Programs for Machine Learning - Quinlan - 1993

Citation Context: ...transformations for a learning algorithm, and no careful study of how this discretization affects the learning process is performed (Weiss & Kulikowski 1991). In decision tree methods, such as C4.5 (Quinlan 1993), continuous values are discretized during the learning process. The advantages of discretizing during the learning process have not yet been shown. In this paper, we include such a comparison. Other...

830 | A study of the cross-validation and bootstrap for accuracy estimation and model selection, p. 1137–1143 - Kohavi - 1995

Citation Context: ...ne continuous feature. For the datasets that had more than 3000 test instances, we ran a single train/test experiment and report the theoretical standard deviation estimated using the Binomial model (Kohavi 1995). For the remaining datasets, we ran five-fold cross-validation and report the standard deviation of the cross-validation. Table 2 describes the datasets with the last column showing the accuracy of ...

772 | UCI Repository of machine learning databases - Murphy, Aha - 2001

Citation Context: ...istinct observed values for each attribute. The heuristic was chosen based on examining S-plus's histogram binning algorithm (Spector 1994). We chose sixteen datasets from the U.C. Irvine repository (Murphy & Aha 1994) that each had at least one continuous feature. For the datasets that had more than 3000 test instances, we ran a single train/test experiment and report the theoretical standard deviation estimated ...

693 | Multi-interval discretization of continuous-valued attributes for classification learning - Fayyad, Irani - 1993

Citation Context: ...ization methods require some parameter, k, indicating the maximum number of intervals to produce in discretizing a feature. Static methods, such as binning, entropy-based partitioning (Catlett 1991b, Fayyad & Irani 1993, Pfahringer 1995), and the 1R algorithm (Holte 1993), perform one discretization pass of the data for each feature and determine the value of k for each feature independent of the other features. Dyn...
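The core step behind the entropy-based partitioning cited above is picking the boundary that minimizes the weighted class entropy of the induced partition. A condensed sketch, with illustrative names, follows; the MDL stopping criterion of Fayyad & Irani's actual method is deliberately omitted:

```python
# Simplified sketch of the entropy-minimizing split underlying
# entropy-based discretization: try every boundary between adjacent
# distinct values and keep the one with the lowest weighted entropy.
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the cut point minimizing weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal feature values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (e, cut))
    return best[1]

print(best_cut([1.0, 2.0, 3.0, 10.0, 11.0], ["a", "a", "a", "b", "b"]))  # 6.5
```

In the multi-interval method this split is applied recursively to each resulting interval until a stopping criterion fires; only the single-split step is shown here.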

628 | Irrelevant features and the subset selection problem - John, Kohavi, et al. - 1994

469 | Very simple classification rules perform well on most commonly used datasets. Machine Learning 3:63–91 - Holte - 1993

Citation Context: ...imum number of intervals to produce in discretizing a feature. Static methods, such as binning, entropy-based partitioning (Catlett 1991b, Fayyad & Irani 1993, Pfahringer 1995), and the 1R algorithm (Holte 1993), perform one discretization pass of the data for each feature and determine the value of k for each feature independent of the other features. Dynamic methods conduct a search through the space of p...

359 | An analysis of Bayesian classifiers - Langley, Iba, et al. - 1992

258 | Learning from Observations: Conceptual Clustering. In: Machine Learning: An Artificial Intelligence Approach - Michalski, Stepp - 1983

Citation Context: ...non considering the fact that C4.5 is capable of locally discretizing features. 1 Introduction Many algorithms developed in the machine learning community focus on learning in nominal feature spaces (Michalski & Stepp 1983, Kohavi 1994). However, many real-world classification tasks exist that involve continuous features where such algorithms could not be applied unless the continuous features are first discretized. Conti...

204 | Boolean feature discovery and empirical learning - Pagallo, Haussler - 1990

Citation Context: ...tion method and did not significantly degrade on any dataset, although it did decrease slightly on some. The entropy-based discretization is a global method and does not suffer from data fragmentation (Pagallo & Haussler 1990). Since there is no significant...

174 | Chimerge discretization of numeric attributes - Kerber - 1991

Citation Context: ...tting partition boundaries, it is likely that classification information will be lost by binning as a result of combining values that are strongly associated with different classes into the same bin (Kerber 1992). In some cases this could make effective classification much more difficult. A variation of equal frequency intervals---maximal marginal entropy---adjusts the boundaries to decrease entropy at each...
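The equal-frequency intervals mentioned in the snippet above can be sketched in a few lines; the function name and the small example are illustrative assumptions, not the paper's code:

```python
# Hypothetical sketch of equal-frequency binning: choose cut points
# so that each of the k bins receives roughly the same number of
# observed values, regardless of how the values are spread.

def equal_frequency_cuts(values, k):
    """Return k-1 interior cut points giving ~equal bin counts."""
    vals = sorted(values)
    n = len(vals)
    cuts = []
    for j in range(1, k):
        i = j * n // k
        # Place each cut halfway between neighboring sorted values.
        cuts.append((vals[i - 1] + vals[i]) / 2)
    return cuts

print(equal_frequency_cuts([1, 2, 3, 4, 100, 200], k=3))  # [2.5, 52.0]
```

Unlike equal-width binning, the cut points here follow the data's quantiles, so the two outlying values land in their own bin instead of stretching every interval; the snippet's point still stands, since neither variant consults the class labels.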

166 | On changing continuous attributes into ordered discrete attributes - Catlett - 1991

Citation Context: ... In this paper, we include such a comparison. Other reasons for variable discretization, aside from the algorithmic requirements mentioned above, include increasing the speed of induction algorithms (Catlett 1991b) and viewing General Logic Diagrams (Michalski 1978) of the induced classifier. In this paper, we address the effects of discretization on learning accuracy by comparing a range of discretization me...

98 | MLC++: A machine learning library in C++ - Kohavi, John, et al. - 1994

89 | Megainduction: Machine Learning on Very Large Databases - Catlett - 1991

Citation Context: ... In this paper, we include such a comparison. Other reasons for variable discretization, aside from the algorithmic requirements mentioned above, include increasing the speed of induction algorithms (Catlett 1991b) and viewing General Logic Diagrams (Michalski 1978) of the induced classifier. In this paper, we address the effects of discretization on learning accuracy by comparing a range of discretization me...

49 | Induction of one-level decision trees - Iba, Langley

Citation Context: ...soever and is thus an unsupervised discretization method. 3.2 Holte's 1R Discretizer Holte (1993) describes a simple classifier that induces one-level decision trees, sometimes called decision stumps (Iba & Langley 1992). In order to properly deal with domains that contain continuous valued features, a simple supervised discretization method is given. This method, referred to here as 1RD (OneRule Discretizer), sorts...
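The 1RD pass described in the snippet above can be approximated as follows. This is a simplified sketch: `min_size` stands in for Holte's minimum-instances requirement, and the tie-handling details of the real 1R algorithm are omitted:

```python
# Simplified 1RD-style discretizer: sort by the feature value, grow
# an interval until it holds at least min_size instances of one
# class, and close it only where the class of the next instance
# differs from the interval's majority class.
from collections import Counter

def one_r_discretize(values, labels, min_size=6):
    """Return interior cut points chosen by a greedy 1RD-style pass."""
    pairs = sorted(zip(values, labels))
    cuts = []
    counts = Counter()
    for i, (v, _l) in enumerate(pairs):
        counts[_l] += 1
        majority, m = counts.most_common(1)[0]
        # Close the interval once the majority is established and the
        # next instance belongs to a different class.
        if m >= min_size and i + 1 < len(pairs) and pairs[i + 1][1] != majority:
            cuts.append((v + pairs[i + 1][0]) / 2)
            counts = Counter()
    return cuts

# Six 'a' instances below six 'b' instances yield one cut between them.
print(one_r_discretize(list(range(1, 13)), ["a"] * 6 + ["b"] * 6, min_size=3))  # [6.5]
```

The minimum-size requirement is what makes the method supervised but robust: without it, every class change in the sorted sequence would spawn a new interval.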

48 | Global discretization of continuous attributes as preprocessing for machine learning - Chmielewski, Grzymala-Busse

Citation Context: ...al vs. local, supervised vs. unsupervised, and static vs. dynamic. Local methods, as exemplified by C4.5, produce partitions that are applied to localized regions of the instance space. Global methods (Chmielewski & Grzymala-Busse 1994), such as binning, produce a mesh over the entire n-dimensional continuous instance space, where each feature is partitioned into regions independent of the other attributes. The mesh contains ∏_{i=1}^{n}...

47 | Bottom-up induction of oblivious, read-once decision graphs - Kohavi - 1994

Citation Context: ...that C4.5 is capable of locally discretizing features. 1 Introduction Many algorithms developed in the machine learning community focus on learning in nominal feature spaces (Michalski & Stepp 1983, Kohavi 1994). However, many real-world classification tasks exist that involve continuous features where such algorithms could not be applied unless the continuous features are first discretized. Continuous vari...

41 | Compression-based discretization of continuous attributes - Pfahringer - 1995

Citation Context: ...re some parameter, k, indicating the maximum number of intervals to produce in discretizing a feature. Static methods, such as binning, entropy-based partitioning (Catlett 1991b, Fayyad & Irani 1993, Pfahringer 1995), and the 1R algorithm (Holte 1993), perform one discretization pass of the data for each feature and determine the value of k for each feature independent of the other features. Dynamic methods cond...

37 | Efficient agnostic PAC-learning with simple hypotheses - Maass - 1994

34 | Efficient algorithms for finding multi-way splits for decision trees - Fulton, Kasif, et al. - 1995

25 | A Planar Geometrical Model for Representing Multi-Dimensional Discrete Spaces and Multiple-Valued Logic Functions - Michalski - 1978

Citation Context: ...reasons for variable discretization, aside from the algorithmic requirements mentioned above, include increasing the speed of induction algorithms (Catlett 1991b) and viewing General Logic Diagrams (Michalski 1978) of the induced classifier. In this paper, we address the effects of discretization on learning accuracy by comparing a range of discretization methods using C4.5 and a Naive Bayes classifier. The Na...



11 | Determination of quantization intervals in rule based model for dynamic systems - Chan, Batur, et al. - 1991

4 | An efficient algorithm for finding multi-way splits for decision trees, Unpublished paper - Fulton, Kasif, et al. - 1994

2 | Information synthesis based on hierarchical entropy discretization - Chiu, Cheung - 1990