Results 1 - 7 of 7
Scaling Clustering Algorithms to Large Databases
 Microsoft Research Report
, 1998
Cited by 244 (5 self)
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering algorithms. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of the current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate on synthetic and publicly available data sets.
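The compress-or-retain idea in this abstract can be illustrated with a minimal Python sketch. The abstract does not say which statistics are kept or how compressible regions are detected, so everything here is an assumption: the `ClusterSummary` class, the count / sum / sum-of-squares statistics, and the fixed `radius` test are hypothetical stand-ins, not the paper's method.

```python
import math
from typing import List, Sequence


class ClusterSummary:
    """Hypothetical compressed region: count, per-dimension sum, sum of squares."""

    def __init__(self, dim: int) -> None:
        self.n = 0
        self.total = [0.0] * dim
        self.total_sq = [0.0] * dim

    def absorb(self, point: Sequence[float]) -> None:
        # Fold one point into the statistics; the raw point can then be dropped.
        self.n += 1
        for i, x in enumerate(point):
            self.total[i] += x
            self.total_sq[i] += x * x

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.total]


def compress_buffer(
    buffer: Sequence[Sequence[float]],
    summaries: Sequence[ClusterSummary],
    radius: float,
) -> List[Sequence[float]]:
    """Compress points near a centroid; return the points that stay in memory.

    Assumes every summary already holds at least one point.
    """
    retained = []
    for point in buffer:
        nearest = min(summaries, key=lambda s: math.dist(point, s.centroid()))
        if math.dist(point, nearest.centroid()) <= radius:
            nearest.absorb(point)   # compressible: keep statistics only
        else:
            retained.append(point)  # uncertain: keep the raw point
    return retained
```

Because each summary has a fixed size, memory use depends on the number of clusters and retained points rather than on the number of records scanned.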
Mathematical Programming for Data Mining: Formulations and Challenges
 INFORMS Journal on Computing
, 1998
Cited by 47 (0 self)
This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research challenges, and outline opportunities for contributions by the optimization research communities. Towards these goals, we include formulations of the basic categories of data mining methods as optimization problems. We also provide examples of successful mathematical programming approaches to some data mining problems. Keywords: data analysis, data mining, mathematical programming methods, challenges for massive data sets, classification, clustering, prediction, optimization. To appear: INFORMS Journal on Computing, special issue on Data Mining, A. Basu and B. Golden (guest editors). Also appears as Mathematical Programming Technical Report 9801, Computer Sciences Department, University of Wi...
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Cited by 40 (0 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
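To make the contrast with distance-based assignment concrete, here is a minimal Python sketch of the E-step of EM for a one-dimensional Gaussian mixture. It is a textbook illustration rather than the scalable implementation the abstract describes; the function names and the restriction to one continuous attribute are assumptions made for brevity.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(
    x: float,
    weights: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
) -> List[float]:
    """E-step for one record: posterior probability of each mixture component."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]
```

Where K-Means would assign the record to exactly one cluster, the returned list spreads membership across all components and sums to one.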
Taming the Giants and the Monsters: Mining Large Databases for Nuggets of Knowledge
, 1998
Cited by 7 (0 self)
enge. Consider a simple application that determines if two rows in a data table are likely to be the same, given that it is acceptable for "a few fields" to differ. While this "find-similar" problem appears simple, and one could think of several ways to achieve it, executing it on a massive data store is far from straightforward. Large data stores are now a fact of life for most organizations. A gigabyte is a quantity of information; it represents about 10^9 bytes of stored information. The word derives from the Latin giga, meaning "giant." The next unit up is the terabyte, from the Greek teras, meaning "monster", which represents 10^12 bytes. Quite appropriately, in certain database circles, the terabyte is also referred to as the "terrorbyte": a term I first heard used by Jim Gray. The modern information revolution is creating huge data stores which, instead of offering in
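The "find-similar" problem mentioned in this abstract can be sketched naively in Python. The abstract gives no algorithm, so the functions below, the `max_diff` threshold, and the exact-equality field comparison are all hypothetical choices for illustration.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def differing_fields(row_a: Sequence[object], row_b: Sequence[object]) -> int:
    """Count the positions at which two rows of equal length disagree."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of fields")
    return sum(1 for a, b in zip(row_a, row_b) if a != b)


def find_similar(rows: Sequence[Sequence[object]], max_diff: int = 1) -> List[Tuple[int, int]]:
    """Return index pairs of rows that differ in at most max_diff fields."""
    return [
        (i, j)
        for (i, row_i), (j, row_j) in combinations(enumerate(rows), 2)
        if differing_fields(row_i, row_j) <= max_diff
    ]
```

This version compares every pair of rows, so its cost grows quadratically with the table size, which is one reason the problem stops being simple on a massive data store.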
Knowledge Discovery From Distributed And Textual Data
 Hong Kong University of Science and Technology
, 1999
Bulletin of the Technical Committee on
this paper, as in [9], we draw a distinction between the latter, which we call KDD, and "data mining". The term data mining has been mostly used by statisticians, data analysts, and the database communities. The earliest uses of the term come from statistics and its usage in most settings was associated with negative connotations of blind exploration of data without a priori hypotheses to be verified. However, notable exceptions can be found. For example, as early as 1978 [16], the term is used in a positive sense in a demonstration of how generalized linear regression can be used to solve problems that are very difficult for humans and the traditional statistical techniques
RACHET: A New Algorithm for Clustering Multidimensional Distributed Datasets
This paper presents a hierarchical clustering method named RACHET (Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic) for analyzing multidimensional distributed data. A typical clustering algorithm requires bringing all the data into a centralized warehouse. This results in O(nd) transmission cost, where n is the number of items and d is the number of features. For massive datasets, this is prohibitively expensive. In contrast, RACHET runs with at most O(n) time, space, and communication costs to build a global hierarchy of comparable clustering quality by merging locally generated clustering hierarchies. RACHET employs the encircling tactic in which the merges at each stage are chosen to minimize the volume of a covering hypersphere. For each cluster centroid, RACHET maintains descriptive statistics of constant complexity to enable these choices. RACHET's framework is applicable to a wide class of centroid-based hierarchical clustering algorithms, such as centroid, medoid, and Ward.
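The merge criterion described in this abstract can be illustrated geometrically with a short Python sketch. It assumes each cluster is summarized by a centroid and a radius and uses the exact smallest sphere enclosing two spheres; the abstract does not spell out how RACHET computes or approximates the covering hypersphere, so `enclosing_sphere` and `best_merge` are hypothetical helpers, not the published procedure.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Sphere = Tuple[Sequence[float], float]  # (centroid, radius)


def enclosing_sphere(c1: Sequence[float], r1: float,
                     c2: Sequence[float], r2: float) -> Tuple[List[float], float]:
    """Smallest sphere that contains both input spheres."""
    d = math.dist(c1, c2)
    if d + r2 <= r1:          # second sphere lies inside the first
        return list(c1), r1
    if d + r1 <= r2:          # first sphere lies inside the second
        return list(c2), r2
    radius = (d + r1 + r2) / 2.0
    t = (radius - r1) / d     # fraction of the way from c1 to c2
    center = [a + t * (b - a) for a, b in zip(c1, c2)]
    return center, radius


def best_merge(spheres: Sequence[Sphere]) -> Tuple[int, int]:
    """Pick the pair whose covering sphere is smallest.

    In a fixed dimension, volume grows monotonically with radius,
    so minimizing the radius also minimizes the volume.
    """
    return min(
        combinations(range(len(spheres)), 2),
        key=lambda p: enclosing_sphere(*spheres[p[0]], *spheres[p[1]])[1],
    )
```

Only the centroid and radius of each cluster are needed to evaluate a candidate merge, which matches the abstract's point that constant-size statistics per centroid are enough to make these choices.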