Results 1–10 of 20
Selectivity Estimation using Probabilistic Models
, 2001
Abstract

Cited by 82 (3 self)
Estimating the result size of complex queries that involve selection on multiple attributes and the join of several relations is a difficult but fundamental task in database query processing. It arises in cost-based query optimization, query profiling, and approximate query answering. In this paper, we show how probabilistic graphical models can be effectively used for this task as an accurate and compact approximation of the joint frequency distribution of multiple attributes across multiple relations. Probabilistic Relational Models (PRMs) are a recent development that extends graphical statistical models such as Bayesian networks to relational domains. They represent the statistical dependencies between attributes within a table, and between attributes across foreign-key joins. We provide an efficient algorithm for constructing a PRM from a database, and show how a PRM can be used to compute selectivity estimates for a broad class of queries. One of the major contributions of this work is a unified framework for the estimation of queries involving both select and foreign-key join operations. Furthermore, our approach is not limited to answering a small set of predetermined queries; a single model can be used to effectively estimate the sizes of a wide collection of potential queries across multiple tables. We present results for our approach on several real-world databases. For both single-table multi-attribute queries and a general class of select-join queries, our approach produces more accurate estimates than standard approaches to selectivity estimation, using comparable space and time.
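To make the estimation problem concrete, the following toy sketch (table contents and attribute names invented purely for illustration) contrasts the textbook attribute-value-independence estimate with one computed from the joint frequency distribution, which is what a PRM approximates compactly:

```python
from collections import Counter

# Toy table of (color, size) pairs; data and attribute names are
# invented for this example only.
rows = [("red", "S"), ("red", "S"), ("red", "L"),
        ("blue", "S"), ("blue", "L"), ("blue", "L")]
n = len(rows)

# Attribute-value-independence estimate (per-attribute histograms)
# for the selection: color = 'red' AND size = 'S'.
p_color = Counter(r[0] for r in rows)["red"] / n   # 0.5
p_size = Counter(r[1] for r in rows)["S"] / n      # 0.5
est_indep = p_color * p_size * n                   # 1.5 rows

# Estimate from the joint frequency distribution (exact on this toy
# table; a PRM approximates this joint compactly across relations).
est_joint = Counter(rows)[("red", "S")]            # 2 rows

print(est_indep, est_joint)
```

The independence estimate is wrong whenever the attributes are correlated, which is exactly the gap a joint model closes.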
A Comparative Study of Discretization Methods for NaiveBayes Classifiers
 In Proceedings of PKAW 2002: The 2002 Pacific Rim Knowledge Acquisition Workshop
, 2002
Abstract

Cited by 20 (0 self)
Discretization is a popular approach to handling numeric attributes in machine learning. We argue that the requirements for effective discretization differ between naive-Bayes learning and many other learning algorithms. We evaluate the effectiveness with naive-Bayes classifiers of nine discretization methods: equal width discretization (EWD), equal frequency discretization (EFD), fuzzy discretization (FD), entropy minimization discretization (EMD), iterative discretization (ID), proportional k-interval discretization (PKID), lazy discretization (LD), non-disjoint discretization (NDD), and weighted proportional k-interval discretization (WPKID). We find that, in general, naive-Bayes classifiers trained on data preprocessed by LD, NDD, or WPKID achieve lower classification error than those trained on data preprocessed by the other discretization methods. LD, however, cannot scale to large data. This study leads to a new discretization method, weighted non-disjoint discretization (WNDD), which combines the advantages of WPKID and NDD. Our experiments show that among all the rival discretization methods, WNDD best helps naive-Bayes classifiers reduce average classification error.
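The two simplest methods in this list, EWD and EFD, can be sketched in a few lines (a minimal illustration, not the experimental code behind the paper):

```python
def equal_width_bins(values, k):
    """Equal-width discretization (EWD): split [min, max] into k
    intervals of equal width and map each value to its interval index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Equal-frequency discretization (EFD): sort the values and cut
    them into k groups holding (roughly) equally many points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

data = [1.0, 2.0, 2.5, 3.0, 10.0, 11.0]
print(equal_width_bins(data, 2))      # [0, 0, 0, 0, 1, 1]
print(equal_frequency_bins(data, 2))  # [0, 0, 0, 1, 1, 1]
```

Note how the two methods already disagree on this tiny skewed sample: EWD cuts the range in half, while EFD balances the bin populations.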
A Latent Variable Model for Multivariate Discretization
, 1999
Abstract

Cited by 11 (1 self)
We describe a new method for multivariate discretization based on the use of a latent variable model. The method is proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only.

1 Introduction

The discretization of continuous variables arises as an issue in machine learning because of the availability of machine learning algorithms that can handle discrete variables only. Furthermore, even when the learning algorithm at hand can directly model continuous variables, it may be possible to improve its predictive performance, as well as its induction time, by using discretized variables [6, 10]. While many of the available discretization algorithms search for the best discretization of each continuous variable individually (an approach that we refer to as univariate discretization), ideally the discretization of a continuous variable should be carried out so as to minimize the loss of information that the given variable may con...
Mixnets: Factored Mixtures of Gaussians in Bayesian Networks with Mixed Continuous And Discrete Variables
, 2000
Abstract

Cited by 7 (2 self)
Recently developed techniques have made it possible to quickly learn accurate probability density functions from data in low-dimensional continuous spaces. In particular, mixtures of Gaussians can be fitted to data very quickly using an accelerated EM algorithm that employs multi-resolution kd-trees (Moore, 1999). In this paper, we propose a kind of Bayesian network in which low-dimensional mixtures of Gaussians over different subsets of the domain’s variables are combined into a coherent joint probability model over the entire domain. The network is also capable of modeling complex dependencies between discrete variables and continuous variables without requiring discretization of the continuous variables. We present efficient heuristic algorithms for automatically learning these networks from data, and perform comparative experiments illustrating how well these networks model real scientific data and synthetic data. We also briefly discuss some possible improvements to the networks, as well as possible applications.
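A plain, unaccelerated EM loop for a 1-D Gaussian mixture conveys the basic fitting step; the paper's speed comes from a kd-tree-accelerated E-step, which this sketch deliberately omits:

```python
import math

def em_gmm_1d(xs, k=2, iters=30):
    """Minimal EM for a 1-D mixture of k Gaussians (k >= 2).
    Illustrative only: no kd-tree acceleration, no restarts."""
    lo, hi = min(xs), max(xs)
    mu = [lo + j * (hi - lo) / (k - 1) for j in range(k)]  # spread initial means
    var = [1.0] * k
    w = [1.0 / k] * k
    n = len(xs)
    for _ in range(iters):
        # E-step: responsibility r[i][j] of component j for point i.
        r = []
        for x in xs:
            p = [w[j] / math.sqrt(2 * math.pi * var[j]) *
                 math.exp(-(x - mu[j]) ** 2 / (2 * var[j])) for j in range(k)]
            s = sum(p)
            r.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[i][j] for i in range(n))
            w[j] = nj / n
            mu[j] = sum(r[i][j] * xs[i] for i in range(n)) / nj
            var[j] = sum(r[i][j] * (xs[i] - mu[j]) ** 2
                         for i in range(n)) / nj + 1e-6  # floor avoids collapse
    return w, mu, var

# Two well-separated clusters near 0 and 10.
data = [0.0, 0.1, -0.2, 0.3, 9.8, 10.0, 10.1, 9.9]
w, mu, var = em_gmm_1d(data, k=2)
print(sorted(mu))  # means recovered near 0.05 and 9.95
```

Each E-step here touches every (point, component) pair; the kd-tree trick in the cited work prunes exactly this loop.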
Interpolating conditional density trees
 A. Darwiche, N. Friedman (Eds.), Uncertainty in Artificial Intelligence
, 2002
Abstract

Cited by 5 (0 self)
Joint distributions over many variables are frequently modeled by decomposing them into products of simpler, lower-dimensional conditional distributions, such as in sparsely connected Bayesian networks. However, automatically learning such models can be very computationally expensive when there are many data points and many continuous variables with complex nonlinear relationships, particularly when no good ways of decomposing the joint distribution are known a priori. In such situations, previous research has generally focused on the use of discretization techniques in which each continuous variable has a single discretization that is used throughout the entire network. In this paper, we present and compare a wide variety of tree-based algorithms for learning and evaluating conditional density estimates over continuous variables. These trees can be thought of as discretizations that vary according to the particular interactions being modeled; however, the density within a given leaf of the tree need not be assumed constant, and we show that such non-uniform leaf densities lead to more accurate density estimation. We have developed Bayesian network structure-learning algorithms that employ these tree-based conditional density representations, and we show that they can be used to practically learn complex joint probability models over dozens of continuous variables from thousands of data points. We focus on finding models that are simultaneously accurate, fast to learn, and fast to evaluate once they are learned.
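A toy 1-D density tree with piecewise-constant leaves illustrates the varying-resolution idea; note that the paper's trees go further and allow non-uniform densities within a leaf, which this sketch does not implement:

```python
def build(xs, lo, hi, total, depth):
    """Recursively split [lo, hi) at the midpoint; each leaf stores a
    constant density = (fraction of points in the leaf) / (leaf width).
    Splitting stops at max depth or when a region holds few points."""
    if depth == 0 or len(xs) < 4:
        return {"density": len(xs) / total / (hi - lo)}
    mid = (lo + hi) / 2
    return {"split": mid,
            "left":  build([x for x in xs if x < mid], lo, mid, total, depth - 1),
            "right": build([x for x in xs if x >= mid], mid, hi, total, depth - 1)}

def query(tree, x):
    """Descend to the leaf containing x and return its density."""
    while "split" in tree:
        tree = tree["left"] if x < tree["split"] else tree["right"]
    return tree["density"]

data = [0.1, 0.2, 0.3, 0.4, 0.5, 3.5]
tree = build(data, 0.0, 4.0, len(data), depth=2)
print(query(tree, 0.5), query(tree, 3.0))  # high density vs low density
```

The leaves form a discretization whose resolution adapts to the data: the dense region near 0 gets a narrow high-density leaf, while the sparse tail is covered by one wide leaf, and the leaf densities integrate to 1 over [0, 4).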
A Comparison of Bayesian Network Learning Algorithms from Continuous Data
 in AMIA
, 2005
Abstract

Cited by 3 (0 self)
Abstract. Learning a Bayesian network from data is an important problem in biomedicine for the automatic construction of decision support systems and inference of plausible causal relations. Most Bayesian network learning algorithms require discrete data; however, discretization may impact the quality of the learned structure. In this project, we present a comparison of different approaches for learning from continuous data to identify the most promising one and to quantify the impact of discretization in Bayesian network learning.

Problem Description. Despite the wide applicability of Bayesian networks in biomedicine, the fact that most Bayesian network structure learning algorithms require discrete data is a limitation, since biomedical and biological data are routinely continuous. Studies usually employ simple discretization techniques such as frequency-based partitions. By neglecting to adequately address the ramifications of discretization, researchers may unknowingly lose information, such as interactions and dependencies between variables, and affect the learned structure. Unfortunately, there is no consensus on a standard procedure for discretization. Consequently, how best to handle continuous data remains an unresolved research question. There are three typical approaches to learning network structure with continuous data. First, data can be discretized prior to and independently of the application of the learning algorithm. Second, the discretization can be integrated into the learning phase in an effort to exploit the synergies; algorithms following this approach output both a discretization of the input variables and the network structure. Third, learning can be done directly with continuous data, without committing to a specific discretization for the variables.

Purpose. This project has two major components. First, it comprehensively compares the three approaches in order to ascertain the relative strengths and weaknesses of each and to quantify the impact of discretization in network learning. Second, it presents a toolkit of discretization and learning techniques for use by biomedical researchers. The specific algorithms that are compared are:
Fast Factored Density Estimation and Compression with Bayesian Networks
Abstract

Cited by 3 (1 self)
Keywords: Gaussian mixture models, compression, interpolating density trees, conditional density estimation

Many important data analysis tasks can be addressed by formulating them as probability estimation problems. For example, a popular general approach to automatic classification problems is to learn a probabilistic model of each class from data in which the classes are known, and then use Bayes's rule with these models to predict the correct classes of other data for which they are not known. Anomaly detection and scientific discovery tasks can often be addressed by learning probability models over possible events and then looking for events to which these models assign low probabilities. Many data compression algorithms such as Huffman coding and arithmetic coding rely on probabilistic models of the data
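Huffman coding, mentioned above, is the simplest concrete example of a compressor driven by a probability model of the data; a minimal construction (illustrative only):

```python
import heapq

def huffman_codes(freqs):
    """Build a prefix-free Huffman code from symbol frequencies.
    The frequency table is exactly a (trivial) probability model:
    more probable symbols receive shorter codewords."""
    # Heap entries: (frequency, tie-breaker, {symbol: codeword}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

codes = huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1})
print(codes)  # 'a' gets a 1-bit codeword, the rare symbols longer ones
```

Arithmetic coding takes the same idea further by letting the model be conditional and adaptive, which is where the factored density estimators of this thesis come in.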
Dynamic Bayesian Networks for Classification of Business Cycles
, 1999
Abstract

Cited by 2 (0 self)
We use dynamic Bayesian networks to classify business cycle phases. We compare classifiers generated by learning the dynamic Bayesian network structure over different sets of admissible network structures. Included are sets of network structures corresponding to the Tree Augmented Naive Bayes (TAN) classifiers of Friedman, Geiger, and Goldszmidt (1997), adapted for dynamic domains. The performance of the developed classifiers on the given data was modest.
Predictive discretization during model selection
 In Pattern Recognition, Springer LNCS 3175
, 2004
Abstract

Cited by 1 (1 self)
Abstract. We present an approach to discretizing multivariate continuous data while learning the structure of a graphical model. We derive a joint scoring function from the principle of predictive accuracy, which inherently ensures the optimal trade-off between goodness of fit and model complexity, including the number of discretization levels. Using the so-called finest grid implied by the data, our scoring function depends only on the number of data points in the various discretization levels (independent of the metric used in the continuous space). Our experiments with artificial data as well as with gene expression data show that discretization plays a crucial role regarding the resulting network structure.
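One classic score of this count-only flavor is the Dirichlet-multinomial marginal likelihood. It is not the predictive score derived in the paper, but like that score it depends solely on how many points fall into each discretization level:

```python
from math import lgamma

def dirichlet_multinomial_log_score(counts, alpha=1.0):
    """Log marginal likelihood of the level counts under a symmetric
    Dirichlet(alpha) prior.  Illustrative stand-in for a count-based
    discretization score; not the paper's exact predictive score."""
    k = len(counts)
    n = sum(counts)
    score = lgamma(k * alpha) - lgamma(n + k * alpha)
    score += sum(lgamma(c + alpha) - lgamma(alpha) for c in counts)
    return score

# A peaked allocation of 10 points over 3 levels scores higher than a
# spread-out one: the levels carry more information about the data.
print(dirichlet_multinomial_log_score([9, 1, 0]))
print(dirichlet_multinomial_log_score([4, 3, 3]))
```

Because such scores are functions of the counts alone, candidate discretizations over the finest grid can be compared without ever revisiting the continuous coordinates.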