Results 1 - 10 of 30
A Framework for Clustering Evolving Data Streams
In VLDB, 2003
Cited by 156 (15 self)
The clustering problem is difficult in the data stream domain because the large volumes of data arriving in a stream render most traditional algorithms too inefficient. In recent years, a...
Using the Triangle Inequality to Accelerate k-Means
2003
Cited by 65 (0 self)
The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k>=20 it is many times faster than the best previously known accelerated k-means method.
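One way the triangle inequality can rule out distance calculations is sketched below. It is a simplified illustration, not the paper's full algorithm: it applies only the bound "if d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c)" during a single assignment pass and does not maintain the per-point lower and upper bounds the abstract mentions. The function names `dist` and `assign_with_pruning` are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center.

    A candidate center j is skipped when d(best, j) >= 2 * d(point, best),
    because the triangle inequality then guarantees that j is not closer.
    Returns (assignments, number of point-to-center distances computed).
    """
    k = len(centers)
    # Center-to-center distances are computed once per pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    assignments = []
    computed = 0
    for p in points:
        best = 0
        best_d = dist(p, centers[0])
        computed += 1
        for j in range(1, k):
            if cc[best][j] >= 2 * best_d:
                continue  # center j cannot beat the current best
            d = dist(p, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        assignments.append(best)
    return assignments, computed


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.2, 9.9)]
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
assignments, computed = assign_with_pruning(points, centers)
```

The assignments match a brute-force nearest-center search, while fewer than len(points) * len(centers) point-to-center distances are evaluated.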
Learning the k in k-means
In Proc. 17th NIPS, 2003
Cited by 64 (5 self)
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does
A Framework for Projected Clustering of High Dimensional Data Streams
In Proc. of VLDB, 2004
Cited by 39 (7 self)
The data stream problem has been studied extensively in recent years, because of the great ease in collection of stream data. The nature of stream data makes it essential to use algorithms which require only one pass over the data. Recently, single-scan, stream analysis methods have been proposed in this context. However,
SimPoint 3.0: Faster and more flexible program analysis
Journal of Instruction Level Parallelism, 2005
Cited by 38 (2 self)
This paper describes the new features available in the SimPoint 3.0 release. The release provides two techniques for drastically reducing the run-time of SimPoint: faster searching to find the best clustering, and efficiently clustering large numbers of intervals. SimPoint 3.0 also provides an option to output only the simulation points that represent the majority of execution, which can reduce simulation time without much increase in error. Finally, this release provides support for correctly clustering variable length intervals, taking into consideration the weight of each interval during clustering. This paper describes SimPoint 3.0’s new features, explains how to use them, and points out some common pitfalls.
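Taking interval weights into account during clustering can be illustrated with a weighted centroid update. This is a minimal sketch, not SimPoint's implementation: it assumes each interval is a numeric feature vector and that its weight is its length (for example, an instruction count); the function name `weighted_centroid` is hypothetical.

```python
def weighted_centroid(vectors, weights):
    """Weighted mean of feature vectors.

    Assumption: each weight is the length of the corresponding interval,
    so longer intervals pull the centroid proportionally harder.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dims = len(vectors[0])
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dims)
    ]


# Two intervals; the first is three times as long as the second.
centroid = weighted_centroid([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

With equal weights this reduces to the ordinary k-means centroid, which is the fixed-length-interval case.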
Clustering Binary Data Streams with K-means
In 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2003
Cited by 33 (0 self)
Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include On-line K-means, Scalable K-means, and Incremental K-means, a proposed variant that finds higher-quality solutions in less time. Higher-quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
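The idea of sufficient statistics and sparse operations for binary data can be sketched as a one-pass, on-line K-means. This is an illustration under assumptions, not the paper's Incremental K-means: each transaction is given as a list of the indices that are 1, each cluster is summarized only by a point count and per-dimension counts of ones, and the initial centroids (`seeds`) are supplied by the caller. All names are hypothetical.

```python
def sq_dist_sparse(items, centroid, centroid_sq_norm):
    """Squared Euclidean distance from a binary point to a centroid.

    For a binary x with ones at `items`:
    ||x - c||^2 = ||c||^2 + sum over i in items of (1 - 2 * c[i]),
    so only the nonzero dimensions of x are visited.
    """
    return centroid_sq_norm + sum(1.0 - 2.0 * centroid[i] for i in items)


def online_binary_kmeans(stream, seeds):
    """Single pass over a stream of sparse binary transactions.

    Per-cluster sufficient statistics: a point count and the number of
    ones seen in each dimension; the centroid is their ratio.
    Returns (labels, counts, centroids).
    """
    k = len(seeds)
    dims = len(seeds[0])
    counts = [0] * k
    ones = [[0] * dims for _ in range(k)]
    centroids = [list(s) for s in seeds]
    norms = [sum(c * c for c in cen) for cen in centroids]
    labels = []
    for items in stream:
        j = min(range(k),
                key=lambda c: sq_dist_sparse(items, centroids[c], norms[c]))
        counts[j] += 1
        for i in items:
            ones[j][i] += 1
        # Refresh the centroid of the updated cluster from its statistics.
        centroids[j] = [v / counts[j] for v in ones[j]]
        norms[j] = sum(c * c for c in centroids[j])
        labels.append(j)
    return labels, counts, centroids


seeds = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
stream = [[0, 1], [2, 3], [0], [3], [0, 1]]
labels, counts, centroids = online_binary_kmeans(stream, seeds)
```

Because the squares of binary values equal the values themselves, no separate sum-of-squares statistic is needed in this sketch; the counts table doubles as the on-line cluster summary.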
Motivation for variable length intervals and hierarchical phase behavior
In IEEE International Symposium on Performance Analysis of Systems and Software, 2005
Cited by 21 (6 self)
Most programs are repetitive: similar behavior can be seen at different execution times. Proposed algorithms automatically group similar portions of a program’s execution into phases, where the intervals in each phase have homogeneous behavior and similar resource requirements. These prior techniques focus on fixed length intervals (such as a hundred million instructions) to find phase behavior. Fixed length intervals can make a program’s periodic phase behavior difficult to find, because the fixed interval length can be out of sync with the period of the program’s actual phase behavior. In addition, a fixed interval length can only express one level of phase behavior. In this paper, we graphically show that there exists a hierarchy of phase behavior in programs and motivate the need for variable length intervals. We describe the changes applied to SimPoint to support variable length intervals. We conclude by providing an initial study into using variable length intervals to guide SimPoint.
Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000
In Knowledge Discovery and Data Mining, 2001
Cited by 19 (0 self)
CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not appreciate properly the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering useful correlations.
FREM: Fast and Robust EM Clustering for Large Data Sets
In ACM CIKM Conference, 2002
Cited by 11 (1 self)
Clustering is a fundamental Data Mining technique. This article presents an improved EM algorithm to cluster large data sets having high dimensionality, noise and zero variance problems. The algorithm incorporates improvements to increase the quality of solutions and speed. In general the algorithm can find a good clustering solution in 3 scans over the data set. Alternatively, it can be run until it converges. The algorithm has a few parameters that are easy to set and have defaults for most cases. The proposed algorithm is compared against the standard EM algorithm and the On-Line EM algorithm.

