Results 1–10 of 36
A Framework for Clustering Evolving Data Streams
In VLDB, 2003
Cited by 242 (32 self)
The clustering problem is a difficult problem for the data stream domain. This is because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a...
Using the Triangle Inequality to Accelerate k-Means
2003
Cited by 100 (1 self)
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k >= 20 it is many times faster than the best previously known accelerated k-means method.
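The pruning described in this abstract rests on a center-to-center lemma: if d(c, c') >= 2 d(x, c), then c' cannot be closer to x than c. Below is a minimal Python sketch of an assignment step using only this lemma; Elkan's full algorithm additionally maintains per-point upper and lower bounds, and the function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def assign_with_triangle_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations that the triangle inequality proves unnecessary."""
    # Pairwise distances between centers, computed once per pass.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.empty(len(points), dtype=int)
    skipped = 0
    for i, x in enumerate(points):
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # If d(best, c_j) >= 2 d(x, best), then d(x, c_j) >= d(x, best),
            # so the distance to c_j need not be computed at all.
            if cc[best, j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```

The result is identical to a brute-force nearest-center assignment; only redundant distance evaluations are avoided.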
Learning the k in k-means
In Proc. 17th NIPS, 2003
Cited by 85 (6 self)
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does ...
A Framework for Projected Clustering of High Dimensional Data Streams
In Proc. of VLDB, 2004
Cited by 58 (10 self)
The data stream problem has been studied extensively in recent years, because of the great ease in collection of stream data. The nature of stream data makes it essential to use algorithms which require only one pass over the data. Recently, single-scan stream analysis methods have been proposed in this context. However, ...
SimPoint 3.0: Faster and More Flexible Program Analysis
In Journal of Instruction Level Parallelism, 2005
Cited by 57 (2 self)
This paper describes the new features available in the SimPoint 3.0 release. The release provides two techniques for drastically reducing the runtime of SimPoint: faster searching to find the best clustering, and efficiently clustering large numbers of intervals. SimPoint 3.0 also provides an option to output only the simulation points that represent the majority of execution, which can reduce simulation time without much increase in error. Finally, this release provides support for correctly clustering variable length intervals, taking into consideration the weight of each interval during clustering. This paper describes SimPoint 3.0’s new features, how to use them, and points out some common pitfalls.
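Taking interval weight into account during clustering, as described above for variable length intervals, amounts to replacing each centroid update with a weighted mean. A generic sketch, not SimPoint's implementation; names are illustrative:

```python
import numpy as np

def weighted_kmeans_step(vectors, weights, centers):
    """One assignment-and-update step of k-means where each interval
    carries a weight (e.g. its instruction count for variable length
    intervals)."""
    # Distances from every vector to every center.
    d = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new_centers = centers.copy()
    for j in range(len(centers)):
        mask = labels == j
        if mask.any():
            w = weights[mask]
            # Weighted mean: longer intervals pull the centroid harder.
            new_centers[j] = (w[:, None] * vectors[mask]).sum(axis=0) / w.sum()
    return labels, new_centers
```

With uniform weights this reduces to the ordinary k-means update, which is why fixed-length-interval clustering is a special case.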
Clustering Binary Data Streams with K-means
In 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003
Cited by 43 (1 self)
Clustering data streams is an interesting data mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams: Online K-means, Scalable K-means, and Incremental K-means, a newly proposed variant that finds higher quality solutions in less time. Higher quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained online. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
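A minimal sketch of the sufficient-statistics idea, assuming a per-cluster count N_j and vector sum M_j are enough to recover each centroid as M_j / N_j. The class and parameter names are hypothetical; the paper's actual variants add mean-based initialization and sparse-matrix operations.

```python
import numpy as np

class StreamingBinaryKMeans:
    """One-pass k-means sketch for binary (0/1) transaction streams that
    keeps only sufficient statistics per cluster: a count N_j and a
    vector sum M_j, so each centroid is M_j / N_j."""

    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.N = np.ones(k)              # counts; starting at 1 avoids 0/0
        self.M = rng.random((k, dim))    # sums; random start acts as random means

    def centroids(self):
        return self.M / self.N[:, None]

    def update(self, x):
        """Assign one incoming transaction and update the statistics."""
        d = np.sum((self.centroids() - x) ** 2, axis=1)
        j = int(np.argmin(d))
        self.N[j] += 1
        self.M[j] += x
        return j
```

Because each binary transaction only touches a count and a vector sum, the per-item cost is independent of how many transactions have already streamed past.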
Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000
In Knowledge Discovery and Data Mining, 2001
Cited by 27 (0 self)
CoIL Challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not properly appreciate the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.
Motivation for variable length intervals and hierarchical phase behavior
In IEEE International Symposium on Performance Analysis of Systems and Software, 2005
Cited by 26 (6 self)
Most programs are repetitive, where similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program’s periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program’s actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We finally conclude by providing an initial study into using variable length intervals to guide SimPoint.
FREM: Fast and Robust EM Clustering for Large Data Sets
In ACM CIKM Conference, 2002
Cited by 16 (2 self)
Clustering is a fundamental data mining technique. This article presents an improved EM algorithm to cluster large data sets with high dimensionality, noise, and zero-variance problems. The algorithm incorporates improvements to increase the quality of solutions and speed. In general, the algorithm can find a good clustering solution in three scans over the data set. Alternatively, it can be run until it converges. The algorithm has a few parameters that are easy to set and that have defaults for most cases. The proposed algorithm is compared against the standard EM algorithm and the Online EM algorithm.
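One EM iteration with diagonal covariances and a variance floor can be sketched as below. This is a generic illustration of the standard E and M steps plus a guard against the zero-variance degeneracy the abstract mentions; it is not FREM itself, and `var_floor` is an assumed parameter.

```python
import numpy as np

def em_step_diag(X, weights, means, variances, var_floor=1e-4):
    """One EM iteration for a Gaussian mixture with diagonal covariances.
    `var_floor` prevents any component's variance from collapsing to zero."""
    n, d = X.shape
    k = len(weights)
    # E-step: log responsibilities, log r[i, j] ∝ log w_j + log N(x_i | mu_j, diag(s2_j)).
    log_r = np.empty((n, k))
    for j in range(k):
        diff2 = (X - means[j]) ** 2 / variances[j]
        log_r[:, j] = (np.log(weights[j])
                       - 0.5 * np.sum(np.log(2.0 * np.pi * variances[j]))
                       - 0.5 * diff2.sum(axis=1))
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from weighted sufficient statistics.
    Nj = r.sum(axis=0)
    new_w = Nj / n
    new_mu = (r.T @ X) / Nj[:, None]
    new_var = np.empty_like(variances)
    for j in range(k):
        new_var[j] = (r[:, j] @ (X - new_mu[j]) ** 2) / Nj[j]
    new_var = np.maximum(new_var, var_floor)    # zero-variance guard
    return new_w, new_mu, new_var
```

Because only weighted sums enter the M-step, the same update can be driven from statistics accumulated during a scan, which is what makes few-pass variants over large data sets possible.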