Very Fast EM-based Mixture Model Clustering Using Multiresolution kd-trees
In Advances in Neural Information Processing Systems 11, 1998
Cited by 80 (4 self)
Clustering is important in many fields, including manufacturing, biology, finance, and astronomy. Mixture models are a popular approach due to their statistical foundations, and EM is a very popular method for finding mixture models. EM, however, requires many accesses of the data, and thus has been dismissed as impractical (e.g., Zhang, Ramakrishnan, & Livny, 1996) for data mining of enormous datasets.
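To illustrate why plain EM needs many accesses of the data, here is a minimal sketch of textbook EM for a one-dimensional Gaussian mixture. It is not the kd-tree method named in the title; the function name `em_gmm_1d`, the toy data, and the variance floor are assumptions made for illustration. Every iteration scans the full dataset once, which is the cost the abstract refers to.

```python
import math

def em_gmm_1d(data, means, n_iter=20):
    """Textbook EM for a 1-D Gaussian mixture (illustrative sketch only).

    Returns (weights, means, variances, passes), where `passes` counts
    the number of full scans over `data`.
    """
    k = len(means)
    means = list(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    passes = 0
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point (one full scan).
        resp = []
        for x in data:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        passes += 1
        # M-step: re-estimate parameters from the responsibility-weighted sums.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid a collapsed component
    return weights, means, variances, passes

# Toy data: two well-separated groups (hypothetical values).
points = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances, passes = em_gmm_1d(points, [1.0, 4.0])
print(means, passes)
```

With N points and K components, each pass costs on the order of N times K density evaluations, so the total cost grows with both the dataset size and the number of iterations.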
Bayesian Networks for Lossless Dataset Compression
In Conference on Knowledge Discovery in Databases (KDD), 1999
Cited by 7 (2 self)
The recent explosion in research on probabilistic data mining algorithms such as Bayesian networks has been focussed primarily on their use in diagnostics, prediction and efficient inference. In this paper, we examine the use of Bayesian networks for a different purpose: lossless compression of large datasets. We present algorithms for automatically learning Bayesian networks and new structures called "Huffman networks" that model statistical relationships in the datasets, and algorithms for using these models to then compress the datasets. These algorithms often achieve significantly better compression ratios than those achieved with common dictionary-based algorithms such as those used by programs like ZIP.

1 Introduction

It has long been understood that even when confronted with a ten-gigabyte file containing data to be statistically analyzed, the actual information-theoretic amount of information in the file might be much less, perhaps merely a few hundred megabytes. This insight is curren...
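The point about the information-theoretic size of a file can be made concrete with a small sketch. It computes the empirical Shannon entropy of a toy two-column table, first treating the columns independently and then jointly. The helper `entropy_bits` and the data are assumptions for illustration; this is not the paper's algorithm, only the reason that modelling statistical relationships between columns can reduce the number of bits needed.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Empirical Shannon entropy, in bits per item, of a sequence of hashable values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical table in which the second column always equals the first.
rows = [(0, 0), (1, 1)] * 4
xs = [r[0] for r in rows]
ys = [r[1] for r in rows]

independent_bits = entropy_bits(xs) + entropy_bits(ys)  # each column coded on its own
joint_bits = entropy_bits(rows)                         # columns coded together
print(independent_bits, joint_bits)
```

Here the second column is fully determined by the first, so a model that captures the dependence needs about half as many bits per record as one that codes each column separately.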
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
1999
There are many massive databases in industry and science. There are also many ways that decision makers, scientists, and the public need to interact with these data sources. Wide-ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g. biotechnology databases, inventory management systems, astrophysics sky surveys, corporate sales information, science lab data repositories) this can be intractable using current algorithms. The Auton lab (at Carnegie Mellon University) and Schenley Park Research Inc. (a startup company), both jointly run by Andrew Moore and Jeff Schneider, are concerned with the fundamental computer science of making very advanced data analysis techniques computationally feasible for massive datasets.
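The general idea suggested by the title, caching sufficient statistics so that repeated queries do not rescan the records, can be sketched as follows. This is a deliberately naive stand-in and not the data structure used in the work above: it stores a count for every combination of attribute values, which grows exponentially with the number of attributes. The names `build_count_cache` and `count`, and the toy records, are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_count_cache(records, attributes):
    """Scan the records once and cache a count for every attribute-value combination."""
    cache = Counter()
    for rec in records:
        for r in range(len(attributes) + 1):
            for subset in combinations(attributes, r):
                cache[tuple((a, rec[a]) for a in subset)] += 1
    return cache

def count(cache, attributes, **conditions):
    """Answer a conjunctive counting query from the cache, without touching the records."""
    key = tuple((a, conditions[a]) for a in attributes if a in conditions)
    return cache[key]

attributes = ["color", "size"]
records = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "S"},
]
cache = build_count_cache(records, attributes)
print(count(cache, attributes, color="red"), count(cache, attributes, color="red", size="S"))
```

After the single scan in `build_count_cache`, each query is a dictionary lookup, so issuing millions of counting queries no longer costs millions of passes over the data.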
Bayesian Network Search by Proxy
2011
Existing methods to search for an optimum Bayesian network suffer when the size of the data set grows to be too large. The number of possible networks grows superexponentially in the number of variables, and it becomes increasingly time-consuming to get reasonable results; in fact, finding an exact optimal network for a given data set is an NP-complete problem, so the question is often to find a network which is good enough. However, as the numbers of instances and variables in the data set grow, the time to take even a single search step can get very costly. Searching by proxy can alleviate this problem; by selecting a random set of training samples and constructing an approximator around those, we can greatly reduce the time it takes to find a network with a score comparable to that obtainable by the same search algorithm using exact scoring. Moreover, with enough training samples, we can obtain networks with significantly better scores in a fraction of the time. However, with too many samples, overfitting occurs and the results do not improve as the number of samples increases. We conjecture that this is because the approximator smooths out the search landscape, making it less likely to get stuck in local minima, and give experimental evidence to support this.
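The superexponential growth in the number of possible networks can be checked directly. The sketch below counts labeled directed acyclic graphs on n nodes using Robinson's recurrence, a standard result that is not taken from the abstract itself; the function name `count_dags` is an assumption.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's inclusion-exclusion recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 7):
    print(n, count_dags(n))
```

The counts for one to five variables are 1, 3, 25, 543 and 29281, which shows why exhaustive enumeration of structures becomes infeasible after only a handful of variables.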

