Results 1–10 of 50
Survey of clustering data mining techniques, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract

Cited by 247 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique ...
Scaling Clustering Algorithms to Large Databases, Microsoft Research Report, 1998
"... Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering. We require at most one scan of the database. In this wor ..."
Abstract

Cited by 244 (5 self)
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering algorithms, requiring at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of the current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate the framework on synthetic and publicly available data sets.
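The compress-and-discard idea behind the one-scan framework can be illustrated with a toy single-scan variant of K-Means: each point is folded into per-cluster sufficient statistics (count and per-dimension sum) and then dropped, so the raw data is never revisited. This is a hypothetical simplification, not the paper's actual framework; the function name and the first-k-points initialization are illustrative.

```python
def single_scan_kmeans(stream, k, dim):
    """Toy one-scan K-Means: each point updates sufficient statistics
    (count, per-dimension sum) for its nearest cluster and is then discarded."""
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    centroids = None
    buffer = []
    for point in stream:
        if centroids is None:
            # use the first k points as initial centroids
            buffer.append(point)
            if len(buffer) == k:
                centroids = [list(p) for p in buffer]
            continue
        # assign to the nearest centroid ...
        j = min(range(k),
                key=lambda c: sum((point[d] - centroids[c][d]) ** 2
                                  for d in range(dim)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
        # ... refresh that centroid from the statistics, and drop the raw point
        centroids[j] = [sums[j][d] / counts[j] for d in range(dim)]
    return centroids
```

With points drawn from two well-separated groups, the two centroids settle near the group means after a single pass.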
Refining Initial Points for K-Means Clustering, 1998
"... Practical approaches to clustering use an iterative procedure (e.g. KMeans, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition fro ..."
Abstract

Cited by 233 (5 self)
Practical approaches to clustering use an iterative procedure (e.g. K-Means, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition allows the iterative algorithm to converge to a "better" local minimum. The procedure is applicable to a wide class of clustering algorithms for both discrete and continuous data. We demonstrate the application of this method to the popular K-Means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. Refinement run time is considerably lower than the time required to cluster the full database. The method is scalable and can be coupled with a scalable clustering algorithm to address the large-scale cl...
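One way to realize this refinement idea is to cluster several small subsamples and then cluster the resulting centers; the consensus centers serve as the refined starting condition. The 1-D sketch below is illustrative only: `refine_initial`, its sample fraction, and the seed are assumptions, not the paper's exact procedure.

```python
import random

def kmeans(points, centers, iters=20):
    """Plain 1-D K-Means: alternate nearest-center assignment and mean update."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
            clusters[j].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def refine_initial(points, k, n_subsamples=5, frac=0.2, seed=0):
    """Cluster several small subsamples, then cluster the collected
    subsample centers; the result is a refined starting condition."""
    rng = random.Random(seed)
    solutions = []
    for _ in range(n_subsamples):
        sample = rng.sample(points, max(k, int(frac * len(points))))
        solutions.extend(kmeans(sample, rng.sample(sample, k)))
    return kmeans(solutions, solutions[:k])
```

Because each subsample is tiny, refinement costs far less than clustering the full data, matching the run-time claim in the abstract.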
Towards higher disk head utilization: extracting free bandwidth from busy disk drives, Symposium on Operating Systems Design and Implementation, 2000
"... Abstract Freeblock scheduling is a new approach to utilizing more of a disk's potential media bandwidth. By filling rotational latency periods with useful media transfers, 2050 % of a neveridle disk's bandwidth can often be provided to background applications with no effect on foreground response ..."
Abstract

Cited by 87 (18 self)
Freeblock scheduling is a new approach to utilizing more of a disk's potential media bandwidth. By filling rotational latency periods with useful media transfers, 20-50% of a never-idle disk's bandwidth can often be provided to background applications with no effect on foreground response times. This paper describes freeblock scheduling and demonstrates its value with simulation studies of two concrete applications: segment cleaning and data mining. Free segment cleaning often allows an LFS file system to maintain its ideal write performance when cleaning overheads would otherwise reduce performance by up to a factor of three. Free data mining can achieve over 47 full disk scans per day on an active transaction processing system, with no effect on its disk performance.
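The core accounting can be sketched with a greedy fill: each foreground request leaves a rotational-latency gap, and the scheduler packs as many pending background block transfers into each gap as fit. The function below is a hypothetical illustration of that bookkeeping, not the paper's scheduler.

```python
def freeblock_schedule(gaps_ms, block_ms, n_pending):
    """Greedy freeblock fill: each foreground rotational-latency gap absorbs
    as many pending background block transfers as fit, so the background
    work costs no foreground time."""
    served = 0
    for gap in gaps_ms:
        fits = int(gap // block_ms)
        served += min(fits, n_pending - served)
    return served
```

The returned count, times the block size, is the "free bandwidth" delivered over the window covered by the gaps.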
Mining Partially Periodic Event Patterns With Unknown Periods, Proc. ICDE, 2000
"... Periodic behavior is common in realworld applications. However, in many cases, periodicities are partial in that they are present only intermittently. Herein, we study such intermittent patterns, which we refer to as ppatterns. Our formulation of ppatterns takes into account imprecise time inf ..."
Abstract

Cited by 56 (1 self)
Periodic behavior is common in real-world applications. However, in many cases, periodicities are partial in that they are present only intermittently. Herein, we study such intermittent patterns, which we refer to as p-patterns. Our formulation of p-patterns takes into account imprecise time information (e.g., due to unsynchronized clocks in distributed environments), noisy data (e.g., due to extraneous events), and shifts in phase and/or periods. We structure mining for p-patterns as two subtasks: (1) finding the periods of p-patterns and (2) mining temporal associations. For (2), a level-wise algorithm is used. For (1), we develop a novel approach based on a chi-squared test, and study its performance in the presence of noise.
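The chi-squared idea for subtask (1) can be sketched as follows: count each inter-arrival gap between event occurrences and flag gaps that occur far more often than a uniform null would predict. This one-cell statistic and the uniform null are simplifying assumptions for illustration; the paper's test is more careful about noise and phase shifts.

```python
from collections import Counter

def candidate_periods(times, max_period, threshold=3.84):
    """Flag inter-arrival gaps whose counts exceed a uniform-null expectation
    by a one-cell chi-squared statistic (3.84 ~ 95th percentile, 1 dof)."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    expected = len(gaps) / max_period  # expected count per gap value under the null
    periods = []
    for p, obs in Counter(gaps).items():
        chi2 = (obs - expected) ** 2 / expected
        if p <= max_period and chi2 > threshold:
            periods.append(p)
    return periods
```

A stream of events every 5 time units, even with a few noise events interleaved, yields 5 as the single surviving candidate period.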
Mathematical Programming for Data Mining: Formulations and Challenges, INFORMS Journal on Computing, 1998
"... This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research ch ..."
Abstract

Cited by 47 (0 self)
This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research challenges, and outline opportunities for contributions by the optimization research communities. Towards these goals, we include formulations of the basic categories of data mining methods as optimization problems. We also provide examples of successful mathematical programming approaches to some data mining problems.
Keywords: data analysis, data mining, mathematical programming methods, challenges for massive data sets, classification, clustering, prediction, optimization.
To appear: INFORMS Journal on Computing, special issue on Data Mining, A. Basu and B. Golden (guest editors). Also appears as Mathematical Programming Technical Report 98-01, Computer Sciences Department, University of Wi...
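As a flavor of such formulations, K-Means clustering can be posed as minimizing the nonconvex objective Σᵢ ‖xᵢ − c_{a(i)}‖² jointly over assignments a and centers c, solved by alternating over the two blocks of variables. The two helper functions below are an illustrative sketch of that formulation, not code from the paper.

```python
def kmeans_objective(points, centers, assignment):
    """The nonconvex objective: sum of squared distances from each point
    to its assigned center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, centers[assignment[i]]))
               for i, p in enumerate(points))

def assign(points, centers):
    """The combinatorial block of the program: for fixed centers, pick the
    assignment minimizing the objective (each point to its nearest center)."""
    return [min(range(len(centers)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
            for p in points]
```

Alternating `assign` with a mean-update of the centers is exactly the block-coordinate-descent view of K-Means that the optimization framing makes explicit.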
Alternatives to the k-Means Algorithm That Find Better Clusterings
"... We investigate here the behavior of the standard kmeans clustering algorithm and several alternatives to it: the k harmonic means algorithm due to Zhang and colleagues, fuzzy kmeans, Gaussian expectationmaximization, and two new variants of kharmonic means. Our aim is to nd which aspect ..."
Abstract

Cited by 42 (5 self)
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions.
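The membership half of that unified framework can be sketched as a soft membership function: k-harmonic means spreads each point's mass over centers in proportion to d(x, c)^(−p−2), whereas hard k-means would put all mass on the nearest center. The function below is a minimal illustration under that assumption, not the paper's full framework (which also pairs memberships with data weights).

```python
def memberships(point, centers, p=2):
    """Soft memberships m(c|x) proportional to d(x, c)**(-p-2), in the
    k-harmonic-means style; the small floor avoids division by zero when
    a point coincides with a center."""
    d = [max(1e-12, sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5)
         for c in centers]
    raw = [dist ** (-p - 2) for dist in d]
    total = sum(raw)
    return [r / total for r in raw]
```

Memberships always sum to one, and a point near one center receives nearly all of the mass.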
Scaling EM (Expectation-Maximization) Clustering to Large Databases, 1999
"... Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require the ..."
Abstract

Cited by 40 (0 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
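The iterative refinement being scaled is ordinary EM. The toy version below, for a 1-D Gaussian mixture with fixed equal variances and weights, shows the E-step/M-step skeleton whose weighted sums are exactly the sufficient statistics a scalable variant would maintain per compressed data region; it is a minimal sketch, not the paper's implementation.

```python
import math

def em_means(data, mus, sigma=1.0, iters=25):
    """Toy EM for a 1-D Gaussian mixture with fixed, equal variances and
    mixing weights: the E-step computes responsibilities, the M-step
    re-estimates each mean from responsibility-weighted sums."""
    for _ in range(iters):
        resp = []
        for x in data:                                   # E-step
            dens = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            z = sum(dens)
            resp.append([d / z for d in dens])
        mus = [sum(r[j] * x for r, x in zip(resp, data)) /   # M-step
               sum(r[j] for r in resp) for j in range(len(mus))]
    return mus
```

Unlike a hard K-Means assignment, every point contributes fractionally to every mean, which is what lets EM model overlapping clusters and mixed data types.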
Relationship-Based Clustering and Visualization for High-Dimensional Data Mining, INFORMS Journal on Computing, 2002
"... In several reallife datamining... This paper proposes a relationshipbased approach that alleviates both problems, sidestepping the "curseofdimensionality" issue by working in a suitable similarity space instead of the original highdimensional attribute space. This intermediary similarity spac ..."
Abstract

Cited by 40 (10 self)
In several real-life data-mining... This paper proposes a relationship-based approach that alleviates both problems, sidestepping the "curse-of-dimensionality" issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance; and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters.
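The reorder-then-visualize step can be sketched with a stand-in for the paper's order-sensitive graph partitioning: sort points by cluster label and build the similarity matrix in that order, so each cluster appears as a dense diagonal block ("band"). The 1-D points and the reciprocal-distance similarity below are illustrative assumptions.

```python
def banded_order(points, labels):
    """Reorder 1-D points by cluster label so the permuted similarity
    matrix shows clusters as dense diagonal blocks ('bands')."""
    order = sorted(range(len(points)), key=lambda i: labels[i])
    sim = [[1.0 / (1.0 + abs(points[a] - points[b])) for b in order]
           for a in order]
    return order, sim
```

In the permuted matrix, within-cluster entries are large and between-cluster entries are small, which is what makes the bands visible regardless of the original attribute dimensionality.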
Data Cleansing: Beyond Integrity Analysis, 2000
"... The paper analyzes the problem of data cleansing and automatically identifying potential errors in data sets. An overview of the diminutive amount of existing literature concerning data cleansing is given. Methods for error detection that go beyond integrity analysis are reviewed and presented. The ..."
Abstract

Cited by 31 (0 self)
The paper analyzes the problem of data cleansing and automatically identifying potential errors in data sets. An overview of the small amount of existing literature concerning data cleansing is given. Methods for error detection that go beyond integrity analysis are reviewed and presented. The applicable methods include: statistical outlier detection, pattern matching, clustering, and data mining techniques. Some brief results supporting the use of such methods are given. The future research directions necessary to address the data cleansing problem are discussed.
Keywords: data cleansing, data cleaning, data quality, error detection.
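Of the listed methods, statistical outlier detection is the simplest to illustrate: flag any value lying more than a few standard deviations from the mean. This z-score sketch is a generic example of the category, not a method taken from the paper.

```python
def zscore_outliers(values, threshold=3.0):
    """Statistical outlier detection: return indices of values more than
    `threshold` standard deviations from the mean (the `or 1.0` guards
    against a zero standard deviation on constant data)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]
```

Flagged records are candidates for cleansing, not confirmed errors; the paper's point is that such detectors complement, rather than replace, integrity constraints.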