## Discretizing Continuous Attributes While Learning Bayesian Networks (1996)

Venue: Proc. ICML

Citations: 64 (4 self)

### BibTeX

    @INPROCEEDINGS{Friedman96discretizingcontinuous,
      author    = {Nir Friedman and Moises Goldszmidt},
      title     = {Discretizing Continuous Attributes While Learning Bayesian Networks},
      booktitle = {Proc. ICML},
      year      = {1996},
      pages     = {157--165},
      publisher = {Morgan Kaufmann}
    }

### Abstract

We introduce a method for learning Bayesian networks that handles the discretization of continuous variables as an integral part of the learning process. The main ingredient in this method is a new metric based on the Minimal Description Length principle for choosing the threshold values for the discretization while learning the Bayesian network structure. This score balances the complexity of the learned discretization and the learned network structure against how well they model the training data. This ensures that the discretization of each variable introduces just enough intervals to capture its interaction with adjacent variables in the network. We formally derive the new metric, study its main properties, and propose an iterative algorithm for learning a discretization policy. Finally, we illustrate its behavior in applications to supervised learning.
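
The score the abstract describes can be sketched concretely for a single continuous attribute and a class variable: the total description length is the bits needed to name the thresholds and parameters, plus the bits needed to encode the data given them. The formula below is an illustrative assumption in the spirit of the paper's MDL metric, not the paper's exact derivation:

```python
import math
from collections import Counter, defaultdict

def discretize(values, thresholds):
    """Map each continuous value to the index of its interval."""
    return [sum(v > t for t in thresholds) for v in values]

def mdl_score(values, labels, thresholds):
    """Total description length (bits) of discretizing one attribute,
    scored against a class variable: model cost (cut points plus the
    conditional class distribution per interval) + data cost (bits to
    encode each label given its interval)."""
    n = len(values)
    by_interval = defaultdict(Counter)
    for i, y in zip(discretize(values, thresholds), labels):
        by_interval[i][y] += 1
    data_bits = 0.0
    for counter in by_interval.values():
        total = sum(counter.values())
        data_bits -= sum(c * math.log2(c / total) for c in counter.values())
    k = len(thresholds) + 1                       # number of intervals
    n_classes = len(set(labels))
    param_bits = k * (n_classes - 1) / 2 * math.log2(n)
    threshold_bits = len(thresholds) * math.log2(n)
    return data_bits + param_bits + threshold_bits
```

With a two-cluster attribute the score prefers exactly one cut: a second, useless cut raises the model cost without lowering the data cost, which is the balance the abstract refers to.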

### Citations

8563 | Elements of Information Theory
- Cover, Thomas
- 1991

Citation Context: "...sed version of D. The idea is as follows: a network B assigns a probability to each instance of U. Using these probabilities we can construct an efficient code. In particular, we use the Huffman code [5], which assigns shorter codes to frequent instances. The benefit of using the MDL as a scoring metric is that the best network for D optimally balances the complexity..."
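
The role of the Huffman code in this context is only that an optimal prefix code gives frequent instances short codewords (roughly -log2 p(x) bits for an instance of probability p(x)). A standard construction, included for illustration rather than taken from the paper:

```python
import heapq
from collections import Counter

def huffman_code(frequencies):
    """Build a Huffman prefix code: symbols with higher frequency
    receive shorter codewords."""
    # Each heap entry: (total_frequency, unique_tiebreak, {symbol: code}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

freqs = Counter("aaaaaaabbbccd")          # 'a' is by far the most frequent
code = huffman_code(freqs)
```

The most frequent symbol ends up with the shortest codeword, which is exactly why coding under the learned distribution turns "probability of the data" into "length of its encoding".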

7053 | Probabilistic Reasoning in Intelligent Systems
- Pearl
- 1988

Citation Context: "...sent direct dependencies between the variables. The graph structure G encodes the following set of independence assumptions: each node X_i is independent of its non-descendants given its parents in G [17]. The second component of the pair, namely Θ, represents the set of parameters that quantifies the network. It contains a parameter θ_{x_i|Π_{x_i}} = P(x_i | Π_{x_i}) for each possible value x..."
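
These independence assumptions are exactly what lets the joint distribution factor as a product of local conditionals, P(x_1, ..., x_n) = ∏_i P(x_i | Π_{x_i}). A toy sketch over a hypothetical three-node chain A → B → C, with made-up CPT entries:

```python
# Joint probability from a Bayesian network via the factorization
# P(a, b, c) = P(a) * P(b | a) * P(c | b) for the chain A -> B -> C.
# All probability values below are illustrative, not from the paper.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) as the product of the local conditionals."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Because each local table is normalized, the factorization is a
# proper distribution: summing over all assignments gives 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```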

4937 | C4.5: Programs for Machine Learning
- Quinlan
- 1993

Citation Context: "...on (i.e., the discretization with the empty threshold list), and search over the possible refinements. This approach, which is usually called top-down, is common in the supervised learning literature [18, 7]. The search strategy can be any of the well-known ones: greedy search (or hill-climbing search), beam search, etc. Carrying out this search can be very expensive. Even for a simple greedy search strateg..."
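
The top-down search sketched in this context starts from the empty threshold list and repeatedly adds the single cut that most improves the score. A minimal greedy version, assuming some `score` callable (lower is better) such as an MDL cost; the interface is a hypothetical illustration, not the paper's algorithm verbatim:

```python
def greedy_discretize(candidate_cuts, score):
    """Top-down greedy search over discretization policies.

    candidate_cuts: candidate threshold values (e.g. midpoints between
        consecutive observed values of the variable).
    score: callable mapping a sorted threshold list to a cost
        (lower is better), e.g. an MDL score.
    """
    thresholds = []                  # start from the empty discretization
    best = score(thresholds)
    improved = True
    while improved:
        improved = False
        for cut in candidate_cuts:
            if cut in thresholds:
                continue
            trial = sorted(thresholds + [cut])
            s = score(trial)
            if s < best:             # remember the single best refinement
                best, best_trial = s, trial
                improved = True
        if improved:
            thresholds = best_trial
    return thresholds, best

# Toy demonstration: one useful cut exists at 2.5, and the toy score
# penalizes both missing it and carrying extra thresholds.
toy_score = lambda ts: (0 if 2.5 in ts else 1) + abs(len(ts) - 1)
cuts, cost = greedy_discretize([1.0, 2.5, 4.0], toy_score)
```

The expense the context mentions comes from re-scoring every candidate refinement on each pass, which is why the paper's full algorithm has to be careful about how much work each evaluation costs.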

1075 | A Bayesian Method for the Induction of Probabilistic Networks from Data
- Cooper, Herskovits
- 1992

Citation Context: "...laborious and expensive process in large applications. Thus, learning Bayesian networks from data has become a rapidly growing field of research that has seen a great deal of activity in recent years [1, 2, 4, 10, 14]. The objective is to induce a network (or a set of networks) that "best describes" the probability distribution over the training data. This optimization process is implemented in practice using heur..."

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995

Citation Context: "...laborious and expensive process in large applications. Thus, learning Bayesian networks from data has become a rapidly growing field of research that has seen a great deal of activity in recent years [1, 2, 4, 10, 14]. The objective is to induce a network (or a set of networks) that "best describes" the probability distribution over the training data. This optimization process is implemented in practice using heur..."

741 | UCI repository of machine learning databases (http://www.ics.uci.edu/~mlearn/MLRepository.html)
- Murphy, Aha
- 1992

Citation Context: "...ELIMINARY EXPERIMENTAL RESULTS: This section describes preliminary experiments designed to test the soundness of the method proposed. The experiments were run on 13 datasets from the Irvine repository [15]. We estimated the accuracy of the learned classifiers using 5-fold cross-validation, except for the "shuttle-small" and "waveform-21" datasets where we used the hold-out method. We report the mean of..."
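
The evaluation protocol mentioned here (5-fold cross-validation with mean and standard deviation over folds) can be sketched without any library; `train_fn` and `accuracy_fn` below are placeholder callables for illustration, not MLC++ APIs:

```python
import statistics

def k_fold_accuracy(examples, train_fn, accuracy_fn, k=5):
    """Estimate predictive accuracy by k-fold cross-validation:
    train on k-1 folds, test on the held-out fold, and report the
    mean and standard deviation over the k folds."""
    folds = [examples[i::k] for i in range(k)]   # k interleaved folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return statistics.mean(scores), statistics.pstdev(scores)
```

The hold-out method used for the two larger datasets is the degenerate case: one fixed train/test split instead of k rotating ones.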

654 | Multi-interval discretization of continuous-valued attributes for classification learning
- Fayyad, Irani
- 1993

Citation Context: "...new metric provides a principled approach for selecting the threshold values in the discretization process. Our proposal can be regarded as a generalization of the method proposed by Fayyad and Irani [7]. Roughly speaking, their approach, which applies to supervised learning only, discretizes variables to increase the mutual information with respect to the class variable. In fact, as we show experime..."

409 | Supervised and Unsupervised Discretization of Continuous Features
- Dougherty, Kohavi, et al.

Citation Context: "...ed with an application to supervised learning and a comparison to the discretization method of Fayyad and Irani [7] (FI from now on). This method is considered state of the art in supervised learning [6]. In their approach, FI attempt to maximize the mutual information between each variable and the class variable. Although their method was not developed in the context of Bayesian networks, it is appli..."
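
The FI criterion described in these contexts picks cut points that maximize information about the class, i.e. minimize the weighted class entropy after the split. A single-cut simplification (the real method recurses on each interval and uses an MDL-based stopping test):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the boundary minimizing the weighted class entropy of the
    two resulting intervals (equivalently, maximizing information gain)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best, best_ent = None, float("inf")
    for i in range(1, n):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ent = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if ent < best_ent:
            best_ent = ent
            best = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint cut
    return best
```

Because the criterion only looks at the class variable, it ignores interactions among the attributes themselves, which is precisely the limitation the paper's network-aware MDL score is designed to remove.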

312 | Estimating continuous distributions in Bayesian classifiers
- John, Langley
- 1995

Citation Context: "...ten have continuous values. We have two basic approaches to deal with continuous variables: we can restrict ourselves to specific families of parametric distributions and use the methods described in [11, 16, 12, 3], or we can discretize these variables and learn a network over the discretized domain. There is a tradeoff between the two options. The first can model the conditional density of each variable in the n..."

298 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995

Citation Context: "...a network with respect to the data, we normally introduce a scoring function, and to solve the optimization problem we usually rely on heuristic search techniques over the space of possible networks [9]. Several different scoring functions have been proposed in the literature [4, 10, 14]. In this paper we focus our attention on the MDL score [14]. This score is simple, very intuitive, and has proven..."

247 | Operations for learning with graphical models
- Buntine
- 1994

Citation Context: "...laborious and expensive process in large applications. Thus, learning Bayesian networks from data has become a rapidly growing field of research that has seen a great deal of activity in recent years [1, 2, 4, 10, 14]. The objective is to induce a network (or a set of networks) that "best describes" the probability distribution over the training data. This optimization process is implemented in practice using heur..."

241 | AutoClass: A Bayesian classification system
- Cheeseman, Kelly, et al.
- 1988

Citation Context: "...ten have continuous values. We have two basic approaches to deal with continuous variables: we can restrict ourselves to specific families of parametric distributions and use the methods described in [11, 16, 12, 3], or we can discretize these variables and learn a network over the discretized domain. There is a tradeoff between the two options. The first can model the conditional density of each variable in the n..."

188 | Learning Bayesian Belief Networks: An Approach Based on the MDL Principle
- Lam, Bacchus
- 1994

181 | Connectionist learning of belief networks
- Neal
- 1992

Citation Context: "...ten have continuous values. We have two basic approaches to deal with continuous variables: we can restrict ourselves to specific families of parametric distributions and use the methods described in [11, 16, 12, 3], or we can discretize these variables and learn a network over the discretized domain. There is a tradeoff between the two options. The first can model the conditional density of each variable in the n..."

98 | Machine learning library in C++ (MLC++)
- Kohavi, Sommerfield
- 1996

Citation Context: "...an of the prediction accuracies over all cross-validation folds. We also report the standard deviation of the accuracies found in each fold. These computations were done using the MLC++ library (see [8, 13] for more details). Our first experiment is concerned with an application to supervised learning and a comparison to the discretization method of Fayyad and Irani [7] (FI from now on). This method is ..."

78 | Building classifiers using Bayesian networks
- Friedman, Goldszmidt
- 1996

Citation Context: "...an of the prediction accuracies over all cross-validation folds. We also report the standard deviation of the accuracies found in each fold. These computations were done using the MLC++ library (see [8, 13] for more details). Our first experiment is concerned with an application to supervised learning and a comparison to the discretization method of Fayyad and Irani [7] (FI from now on). This method is ..."

43 | Learning Bayesian networks: A unification for discrete and Gaussian domains
- Heckerman, Geiger
- 1995

12 | Properties of Bayesian network learning algorithms
- Bouckaert
- 1994