Results 1 - 7 of 7
Very Fast EM-based Mixture Model Clustering Using Multiresolution kd-trees
 In Advances in Neural Information Processing Systems 11
, 1998
Abstract

Cited by 90 (5 self)
Clustering is important in many fields including manufacturing, biology, finance, and astronomy. Mixture models are a popular approach due to their statistical foundations, and EM is a very popular method for finding mixture models. EM, however, requires many accesses of the data, and thus has been dismissed as impractical (e.g. (Zhang, Ramakrishnan, & Livny, 1996)) for data mining of enormous datasets.
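The data-access cost described in this abstract is easy to see in a direct implementation. Below is a minimal, illustrative EM sketch for a 1-D Gaussian mixture (not the paper's kd-tree algorithm; the function name and initialization scheme are assumptions for the example). Note that both the E-step and the M-step sweep the entire dataset on every iteration, which is exactly the cost the multiresolution kd-tree approach is designed to reduce.

```python
import math

def em_gmm_1d(data, k=2, iters=20):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch only).

    Every iteration makes two full passes over the data: one in the
    E-step and one in the M-step. This repeated full-data access is
    the bottleneck the paper's kd-tree method avoids.
    """
    lo, hi = min(data), max(data)
    mus = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]  # spread initial means
    sigmas = [1.0] * k                                        # unit initial std-devs
    weights = [1.0 / k] * k                                   # uniform mixing weights

    def pdf(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    for _ in range(iters):
        # E-step: responsibility of each component for each point (full pass)
        resp = []
        for x in data:
            p = [w * pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            z = sum(p)
            resp.append([pi / z for pi in p])
        # M-step: re-estimate means, variances, weights (another full pass)
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)  # guard against collapse
            weights[j] = nj / len(data)
    return mus, sigmas, weights
```

On a dataset of n points, each of the `iters` iterations costs O(n * k); the paper's contribution is to pay far less than n per iteration by caching sufficient statistics in a kd-tree.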
Bayesian Networks for Lossless Dataset Compression
 In Conference on Knowledge Discovery in Databases (KDD)
, 1999
Abstract

Cited by 7 (2 self)
The recent explosion in research on probabilistic data mining algorithms such as Bayesian networks has been focussed primarily on their use in diagnostics, prediction and efficient inference. In this paper, we examine the use of Bayesian networks for a different purpose: lossless compression of large datasets. We present algorithms for automatically learning Bayesian networks and new structures called "Huffman networks" that model statistical relationships in the datasets, and algorithms for using these models to then compress the datasets. These algorithms often achieve significantly better compression ratios than achieved with common dictionary-based algorithms such as those used by programs like ZIP.
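As a much simpler illustration of model-based compression than the paper's "Huffman networks", the sketch below builds a classical Huffman prefix code from zeroth-order symbol frequencies. The paper's approach, by contrast, derives its coding from a learned Bayesian network that captures dependencies between dataset attributes; this sketch models no such dependencies, and the helper names are illustrative.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix code from empirical symbol frequencies.

    A zeroth-order sketch of model-based compression: frequent symbols
    get short codes. The paper's "Huffman networks" go further by
    conditioning the model on other attributes via a Bayesian network.
    """
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tiebreak, {symbol: code_suffix}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(symbols, code):
    """Concatenate the code words for a symbol sequence."""
    return "".join(code[s] for s in symbols)
```

Because the code is prefix-free, the bit string decodes unambiguously; the better the probability model matches the data, the shorter the output.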
Bayesian Network Search by Proxy
, 2011
Abstract
Existing methods to search for an optimum Bayesian network suffer when the size of the data set grows to be too large. The number of possible networks grows super-exponentially in the number of variables, and it becomes increasingly time-consuming to get reasonable results; in fact, finding an exact optimal network for a given data set is an NP-complete problem, so the question is often to find a network which is good enough. However, as the numbers of instances and variables in the data set grow, the time to take even a single search step can get very costly. Searching by proxy can alleviate this problem; by selecting a random set of training samples and constructing an approximator around those, we can greatly reduce the time it takes to find a network with a score comparable to that obtainable by the same search algorithm using exact scoring. Moreover, with enough training samples, we can obtain networks with significantly better scores in a fraction of the time. However, with too many samples, overfitting occurs and the results do not improve as the number of samples increases. We conjecture that this is because the approximator smooths out the search landscape, making it less likely to get stuck in local minima, and give experimental evidence to support this.
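The general pattern the abstract describes, paying for a cheap approximate score per search step instead of an exact full-data score, can be caricatured as follows. This is a deliberately tiny stand-in, not the paper's approximator: here the "structure" is a single number, the "score" a sum of squared errors, and the proxy is simply scoring against a random subsample; all names are hypothetical.

```python
import random

def exact_score(state, data):
    """Stand-in for an exact score: one full pass over all the data."""
    return -sum((x - state) ** 2 for x in data)

def proxy_search(data, candidates, sample_size=50, seed=0):
    """Pick the best candidate, scored against a random subsample.

    The point is only the cost pattern: each candidate evaluation
    touches O(sample_size) items instead of O(len(data)), so many more
    search steps fit in the same time budget, at the price of an
    approximate (and potentially overfit) score.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    return max(candidates, key=lambda c: exact_score(c, sample))
```

With a representative sample, the proxy's argmax usually agrees with the exact one; the abstract's overfitting caveat corresponds to pushing `sample_size` up until the approximation stops paying for itself.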
Cached Sufficient Statistics for Automated Mining and Discovery from Massive Data Sources
, 1999
Abstract
There are many massive databases in industry and science. There are also many ways that decision makers, scientists, and the public need to interact with these data sources. Wide-ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g. biotechnology databases, inventory management systems, astrophysics sky surveys, corporate sales information, science lab data repositories) this can be intractable using current algorithms. The Auton lab (at Carnegie Mellon University) and Schenley Park Research Inc. (a startup company), both jointly run by Andrew Moore and Jeff Schneider, are concerned with the fundamental computer science of making very advanced data analysis techniques computationally feasible for massive datasets.
Data Mining at CALD-CMU: Tools, Experiences and Research Directions
, 1997
Abstract
We describe the data mining problems and solutions that we have encountered in the Center for Automated Learning and Discovery (CALD) at CMU. Specifically, we describe these settings and their operational characteristics, describe our proposed solutions, list the performance results, and finally outline future research directions. 1 Introduction The Center for Automated Learning and Discovery (CALD) is a cross-disciplinary center at CMU, focusing on the research question "How can historical data be best used to improve future decisions?" Participants in CALD are drawn from diverse backgrounds, such as Computer Science (and specifically, Artificial Intelligence, Databases, Theory), Robotics, Statistics, Neurology, Philosophy, Engineering (Electrical, Civil, and Mechanical), Information Retrieval, and Language Processing. The Center involves industrial partners with challenging data-mining problems. In this paper we describe some of these settings, recent research progress in our center...
Published in Knowledge Discovery from Databases (KDD '99). Accelerating Exact k-means Algorithms with Geometric Reasoning
, 1999
Abstract
We present new algorithms for the k-means clustering problem. They use the kd-tree data structure to reduce the large number of nearest-neighbor queries issued by the traditional algorithm. Sufficient statistics are stored in the nodes of the kd-tree. Then, an analysis of the geometry of the current cluster centers results in a great reduction of the work needed to update the centers. Our algorithms behave exactly as the traditional k-means algorithm. Proofs of correctness are included. The kd-tree can also be used to initialize the k-means starting centers efficiently. Our algorithms can be easily extended to provide fast ways of computing the error of a given cluster assignment, regardless of the method in which those clusters were obtained. We also show how to use them in a setting which allows approximate clustering results, with the benefit of running faster. We have implemented and tested our algorithms on both real and simulated data. Results show a speedup factor of up to 170 on real astrophysical data, and superiority over the naive algorithm on simulated data in up to 5 dimensions. Our algorithms scale well with respect to the number of points and number of centers, allowing for clustering with tens of thousands of centers.
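The baseline this paper accelerates is classical Lloyd's k-means, where every iteration issues one nearest-center query per data point. A minimal 2-D sketch of that baseline (the function name and the empty-cluster handling are illustrative assumptions, not the paper's code):

```python
def lloyd_kmeans(points, centers, iters=20):
    """Naive Lloyd's k-means in the plane -- the traditional baseline.

    Each iteration performs one nearest-center query per point; the
    paper's kd-tree algorithms prune most of these queries using
    sufficient statistics stored at the tree nodes, while producing
    exactly the same centers.
    """
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        # Assignment step: nearest-center query for every point.
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers,
                          key=lambda c: (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2)
            groups[nearest].append(p)
        # Update step: move each center to the mean of its points.
        new_centers = []
        for c in centers:
            pts = groups[c] or [c]           # leave empty clusters in place
            new_centers.append((sum(p[0] for p in pts) / len(pts),
                                sum(p[1] for p in pts) / len(pts)))
        centers = new_centers
    return centers
```

The assignment step is the O(n * k)-per-iteration cost: with millions of points, pruning those distance computations (as the paper does geometrically, without changing the output) is where the reported speedups come from.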