Results 1 - 5 of 5
Very Fast EM-based Mixture Model Clustering Using Multiresolution kd-trees
In Advances in Neural Information Processing Systems 11, 1998
"... Clustering is importantinmany fields including manufacturing, biology, finance, and astronomy. Mixture models are a popular approach due to their statistical foundations, and EM is a very popular method for finding mixture models. EM, however, requires many accesses of the data, and thus has bee ..."
Abstract

Cited by 89 (4 self)
Clustering is important in many fields including manufacturing, biology, finance, and astronomy. Mixture models are a popular approach due to their statistical foundations, and EM is a very popular method for finding mixture models. EM, however, requires many accesses of the data, and thus has been dismissed as impractical (e.g. (Zhang, Ramakrishnan, & Livny, 1996)) for data mining of enormous datasets.
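The EM procedure the abstract refers to can be sketched for a one-dimensional Gaussian mixture. This is a generic illustrative sketch, not the paper's implementation (the function name and the deterministic initialization are ours); note that every iteration makes a full pass over the data, which is exactly the per-iteration access cost the paper's kd-tree method targets.

```python
import math

def em_gmm_1d(data, k=2, iters=50):
    """Fit a one-dimensional Gaussian mixture with plain EM.

    Every iteration scans the full dataset in the E-step; this
    repeated data access is what makes naive EM costly on
    massive datasets.
    """
    lo, hi = min(data), max(data)
    # Deterministic init: spread the means across the data range.
    mus = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                    for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and stddevs.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(1e-6, math.sqrt(var))
    return weights, mus, sigmas
```

Each iteration here costs O(n·k) data accesses; the multiresolution kd-tree approach in the paper amortizes this by computing sufficient statistics over groups of nearby points at once.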
Bayesian Networks for Lossless Dataset Compression
In Conference on Knowledge Discovery in Databases (KDD), 1999
"... The recent explosion in research on probabilistic data mining algorithms such as Bayesian networks has been focussed primarily on their use in diagnostics, prediction and efficient inference. In this paper, we examine the use of Bayesian networks for a different purpose: lossless compression of larg ..."
Abstract

Cited by 7 (2 self)
The recent explosion in research on probabilistic data mining algorithms such as Bayesian networks has been focused primarily on their use in diagnostics, prediction and efficient inference. In this paper, we examine the use of Bayesian networks for a different purpose: lossless compression of large datasets. We present algorithms for automatically learning Bayesian networks and new structures called "Huffman networks" that model statistical relationships in the datasets, and algorithms for using these models to then compress the datasets. These algorithms often achieve significantly better compression ratios than achieved with common dictionary-based algorithms such as those used by programs like ZIP.
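As a point of reference for the "Huffman network" idea, plain Huffman coding of a single attribute column can be sketched as follows. This is the unconditional base case only; the paper's structures additionally condition codes on statistically related attributes, which this sketch omits, and all names here are illustrative.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-free Huffman code from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: partial code}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(symbols, code):
    """Concatenate the codeword for each symbol in order."""
    return "".join(code[s] for s in symbols)
```

Frequent values get short codewords, so skewed columns compress well; modeling dependencies between columns, as the paper does, shortens the codes further.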
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
, 1999
"... There many massive databases in industry and science. There are also many ways that decision makers, scientists, and the public need to interact with these data sources. Wide ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a singl ..."
Abstract
There are many massive databases in industry and science. There are also many ways that decision makers, scientists, and the public need to interact with these data sources. Wide-ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g. biotechnology databases, inventory management systems, astrophysics sky surveys, corporate sales information, science lab data repositories) this can be intractable using current algorithms. The Auton lab (at Carnegie Mellon University) and Schenley Park Research Inc. (a startup company), both jointly run by Andrew Moore and Jeff Schneider, are concerned with the fundamental computer science of making very advanced data analysis techniques computationally feasible for massive datasets.
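The notion of cached sufficient statistics can be illustrated with a naive memoized count cache: each conjunctive count query scans the data once and is thereafter served from the cache, so an algorithm issuing the same counting queries millions of times pays for each distinct query only once. This toy class is ours for illustration; the structures described in the paper are far more space- and time-efficient.

```python
class CountCache:
    """Memoized conjunctive counting queries over categorical records.

    A toy stand-in for cached sufficient statistics: the first time
    a query is seen the dataset is scanned; repeats hit the cache.
    """
    def __init__(self, records):
        self.records = records   # list of dicts: attribute -> value
        self.cache = {}
        self.scans = 0           # number of full passes actually made

    def count(self, **conditions):
        key = tuple(sorted(conditions.items()))
        if key not in self.cache:
            self.scans += 1
            self.cache[key] = sum(
                all(r.get(a) == v for a, v in conditions.items())
                for r in self.records)
        return self.cache[key]
```

A learning algorithm that repeatedly asks for counts such as count(color="red", size="S") then touches the raw data only once per distinct query.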
Bayesian Network Search by Proxy
, 2011
"... Existing methods to search for an optimum Bayesian network su er when the size of the data set grows to be too large. The number of possible networks grows superexponentially in the number of variables, and it becomes increasingly timeconsuming to get reasonable results; in fact, nding an exact opt ..."
Abstract
Existing methods to search for an optimum Bayesian network suffer when the size of the data set grows to be too large. The number of possible networks grows super-exponentially in the number of variables, and it becomes increasingly time-consuming to get reasonable results; in fact, finding an exact optimal network for a given data set is an NP-complete problem, so the question is often to find a network which is good enough. However, as the numbers of instances and variables in the data set grow, the time to take even a single search step can get very costly. Searching by proxy can alleviate this problem; by selecting a random set of training samples and constructing an approximator around those, we can greatly reduce the time it takes to find a network with a score comparable to that obtainable by the same search algorithm using exact scoring. Moreover, with enough training samples, we can obtain networks with significantly better scores in a fraction of the time. However, with too many samples, overfitting occurs and the results do not improve as the number of samples increases. We conjecture that this is because the approximator smooths out the search landscape, making it less likely to get stuck in local minima, and give experimental evidence to support this.
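The core of searching by proxy, scoring candidate structures on a random subsample rather than on the full dataset, can be sketched generically. Here score_fn stands in for whatever exact scoring function the search uses; this is our illustrative sketch, not the paper's approximator, and the rescaling step simply keeps proxy scores on the same scale as full-data scores.

```python
import random

def proxy_score(score_fn, data, n_samples, seed=0):
    """Approximate a structure score on a random subsample.

    score_fn(dataset) is assumed to be an expensive exact scorer
    (e.g. a log-likelihood-based network score). Scoring only
    n_samples records trades accuracy for a much cheaper search step.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(n_samples, len(data)))
    # Rescale so the proxy is comparable to the full-data score.
    return score_fn(sample) * len(data) / len(sample)
```

A structure search would call proxy_score once per candidate instead of the exact scorer; per the abstract, too large a subsample eventually stops helping, so n_samples is itself a tuning knob.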