Results 1 – 10 of 11
Scaling Clustering Algorithms to Large Databases
 Microsoft Research Report
, 1998
Abstract

Cited by 244 (5 self)
Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering algorithms. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of the current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empirically evaluate the method on synthetic and publicly available data sets.
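The compression idea can be illustrated with a toy single-scan K-Means in Python. This is a simplified sketch of the general idea, not the paper's actual framework: the chunking scheme, the fixed `compress_radius` threshold, and seeding the centers from the first chunk are all illustrative assumptions.

```python
import numpy as np

def one_pass_kmeans(chunks, k, compress_radius):
    """Toy single-scan K-Means: each chunk of the stream is seen once.
    Points that land close to a centroid are compressed into (count, sum)
    sufficient statistics and discarded; ambiguous points stay in a buffer."""
    centers = counts = sums = None
    buffer = []
    for chunk in chunks:
        X = np.asarray(chunk, dtype=float)
        if centers is None:                      # seed centers from the first chunk
            centers = X[:k].copy()
            counts = np.zeros(k)
            sums = np.zeros_like(centers)
        dist = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels, nearest = dist.argmin(axis=1), dist.min(axis=1)
        for j in range(k):                       # compress confident assignments
            m = (labels == j) & (nearest < compress_radius)
            counts[j] += m.sum()
            sums[j] += X[m].sum(axis=0)
        buffer.extend(X[nearest >= compress_radius])
        # centroid update from compressed statistics plus buffered points
        B = np.stack(buffer) if buffer else np.empty((0, X.shape[1]))
        bl = (np.linalg.norm(B[:, None] - centers[None, :], axis=2).argmin(axis=1)
              if len(B) else np.empty(0, dtype=int))
        for j in range(k):
            n = counts[j] + (bl == j).sum()
            if n > 0:
                centers[j] = (sums[j] + B[bl == j].sum(axis=0)) / n
    return centers
```

The memory footprint is the buffer plus 2k statistics vectors, regardless of database size; the paper's framework additionally distinguishes compressible from discardable regions and tunes the compression to the model fit.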
Refining Initial Points for K-Means Clustering
, 1998
Abstract

Cited by 233 (5 self)
Practical approaches to clustering use an iterative procedure (e.g. K-Means, EM) which converges to one of numerous local minima. It is known that these iterative techniques are especially sensitive to initial starting conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition allows the iterative algorithm to converge to a "better" local minimum. The procedure is applicable to a wide class of clustering algorithms for both discrete and continuous data. We demonstrate the application of this method to the popular K-Means clustering algorithm and show that refined initial starting points indeed lead to improved solutions. Refinement run time is considerably lower than the time required to cluster the full database. The method is scalable and can be coupled with a scalable clustering algorithm to address the large-scale cl...
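One common reading of this refinement scheme can be sketched as: run K-Means from the given start on several small random subsamples, pool the resulting centers, and cluster that pool to obtain a refined start. The subsample count and fraction below are illustrative assumptions, and the paper's actual mode-estimation procedure differs in detail.

```python
import numpy as np

def kmeans(X, centers, iters=20):
    """Plain Lloyd iterations from the given starting centers."""
    centers = centers.copy()
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None, :], axis=2).argmin(axis=1)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def refine_initial_points(X, init, n_subsamples=5, frac=0.1, seed=0):
    """Sketch of refinement via subsampling: cluster several cheap subsamples
    from `init`, then cluster the pooled solutions to smooth them."""
    rng = np.random.default_rng(seed)
    k = len(init)
    pool = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(X), max(k, int(frac * len(X))), replace=False)
        pool.append(kmeans(X[idx], init))
    pool = np.vstack(pool)            # k * n_subsamples candidate centers
    return kmeans(pool, init)         # refined starting points
```

Because each subsample is a small fraction of the data, the refinement cost stays well below the cost of clustering the full database, matching the run-time claim in the abstract.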
Initialization of iterative refinement clustering algorithms
 In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD–98)
, 1998
Abstract

Cited by 66 (2 self)
Iterative refinement clustering algorithms (e.g. K-Means, EM) converge to one of numerous local minima. It is known that they are especially sensitive to initial conditions. We present a procedure for computing a refined starting condition from a given initial one that is based on an efficient technique for estimating the modes of a distribution. The refined initial starting condition leads to convergence to "better" local minima. The procedure is applicable to a wide class of clustering algorithms for both discrete and continuous data. We demonstrate the application of this method to the Expectation-Maximization (EM) clustering algorithm and show that refined initial points indeed lead to improved solutions. Refinement run time is considerably lower than the time required to cluster the full database. The method is scalable and can be coupled with a scalable clustering algorithm to address large-scale clustering in data mining.
Mathematical Programming for Data Mining: Formulations and Challenges
 INFORMS Journal on Computing
, 1998
Abstract

Cited by 47 (0 self)
This paper is intended to serve as an overview of a rapidly emerging research and applications area. In addition to providing a general overview, motivating the importance of data mining problems within the area of knowledge discovery in databases, our aim is to list some of the pressing research challenges and outline opportunities for contributions by the optimization research communities. Towards these goals, we include formulations of the basic categories of data mining methods as optimization problems. We also provide examples of successful mathematical programming approaches to some data mining problems. Keywords: data analysis, data mining, mathematical programming methods, challenges for massive data sets, classification, clustering, prediction, optimization. To appear: INFORMS Journal on Computing, special issue on Data Mining, A. Basu and B. Golden (guest editors). Also appears as Mathematical Programming Technical Report 98-01, Computer Sciences Department, University of Wi...
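As a concrete instance of such a formulation (a standard one, stated here for illustration rather than quoted from the paper), K-Means clustering is the nonconvex mathematical program

```latex
\min_{c_1,\dots,c_k \in \mathbb{R}^d} \; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - c_j \rVert_2^2
```

where the $x_i$ are the $n$ data records and the $c_j$ are the cluster centers; Lloyd's iterations are a local descent method for this objective, which is why they converge only to local minima.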
Scaling EM (Expectation-Maximization) Clustering to Large Databases
, 1999
Abstract

Cited by 40 (0 self)
Practical statistical clustering algorithms typically center upon an iterative refinement optimization procedure to compute a locally optimal clustering solution that maximizes the fit to the data. These algorithms typically require many database scans to converge, and within each scan they require access to every record in the data table. For large databases, the scans become prohibitively expensive. We present a scalable implementation of the Expectation-Maximization (EM) algorithm. The database community has focused on distance-based clustering schemes, and methods have been developed to cluster either numerical or categorical data. Unlike distance-based algorithms (such as K-Means), EM constructs proper statistical models of the underlying data source and naturally generalizes to cluster databases containing both discrete-valued and continuous-valued data. The scalable method is based on a decomposition of the basic statistics the algorithm needs: identifying regions of the data that...
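For orientation, the in-memory procedure being scaled looks like this minimal one-dimensional Gaussian-mixture EM. This is a textbook sketch, not the paper's scalable decomposition; note that every iteration rescans the data, which is exactly the cost the paper attacks. The quantile-based initialization is an illustrative choice.

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50):
    """Minimal in-memory EM for a 1-D Gaussian mixture.  Each iteration
    rescans the whole data set -- the bottleneck the scalable method removes."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)   # spread initial means
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        # (the 1/sqrt(2*pi) constant cancels in the normalization)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted data
        n = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / n
        sigma = np.maximum(
            np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n), 1e-6)
        pi = n / len(x)
    return mu, sigma, pi
```

The M-step consumes only responsibility-weighted sums (counts, weighted means, weighted squared deviations), which is precisely why the sufficient-statistics decomposition described in the abstract is possible.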
Taming the Giants and the Monsters: Mining Large Databases for Nuggets of Knowledge
, 1998
Abstract

Cited by 7 (0 self)
... Consider a simple application that determines whether two rows in a data table are likely to be the same, given that it is acceptable for "a few fields" to differ. While this "find-similar" problem appears simple, and one could think of several ways to achieve it, executing it on a massive data store is far from straightforward. Large data stores are now a fact of life for most organizations. A gigabyte is a quantity of information; it represents about 10^9 bytes of stored information. The word derives from the Greek gigas, meaning "giant." The next unit up is the terabyte, from the Greek teras, meaning "monster"; it represents 10^12 bytes. Quite appropriately, in certain database circles, the terabyte is also referred to as the "terrorbyte": a term I first heard used by Jim Gray. The modern information revolution is creating huge data stores which, instead of offering in ...
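The "find-similar" test itself really is simple; the hard part is running it over a massive store. A hedged sketch of just the test (the field layout and the `max_diff` threshold are invented for illustration):

```python
def find_similar(row_a, row_b, max_diff=2):
    """Two records count as 'the same' if at most max_diff fields differ.
    Evaluating this over all pairs of a billion-row table is the real problem."""
    differing = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return differing <= max_diff
```

A naive pairwise scan is quadratic in the number of rows, which is exactly what makes the problem non-trivial at gigabyte and terabyte scale.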
Knowledge Discovery From Distributed And Textual Data
 Hong Kong University of Science and Technology
, 1999
Improving the Performance of Audio-Based Similarity Queries with Clustering
 In ACM First International Workshop on Multimedia Intelligent Storage and Retrieval Management
, 1999
Abstract

Cited by 1 (1 self)
Many multimedia applications require the storage and retrieval of nontraditional data types such as audio, video and images. One important functionality required by these applications is the capability to find objects in a database that are similar to a given object. The comparison algorithms for multimedia data types are typically computationally expensive. Therefore, the performance of similarity queries can be improved significantly by reducing the number of invocations of these comparison algorithms. In this paper, we propose the utilization of clustering techniques in order to reduce the number of invocations of comparison algorithms. Although clustering improves the performance of similarity queries, it might introduce inaccuracy in the results. We propose a family of similarity query execution techniques to strike a compromise between accuracy and performance. To evaluate the performance (i.e., query response time) and accuracy (i.e., precision and recall) of our similarity qu...
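A rough sketch of the pruning idea follows. K-Means stands in for whatever clustering the system actually uses, and Euclidean distance stands in for the expensive audio comparison; the function names and the `n_probe` parameter are illustrative, not from the paper.

```python
import numpy as np

def build_clusters(objects, k, iters=10):
    """Plain K-Means over the stored objects (first k objects seed the
    centers); the resulting partition is what similarity queries probe."""
    centers = objects[:k].astype(float).copy()
    for _ in range(iters):
        labels = np.linalg.norm(
            objects[:, None] - centers[None, :], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = objects[labels == j].mean(axis=0)
    return centers, labels

def most_similar(query, objects, centers, labels, expensive_cmp, n_probe=1):
    """Invoke the expensive comparison only inside the n_probe clusters whose
    representatives are nearest the query -- trading some recall for speed."""
    probe = np.argsort(np.linalg.norm(centers - query, axis=1))[:n_probe]
    candidates = np.flatnonzero(np.isin(labels, probe))
    return min(candidates, key=lambda i: expensive_cmp(query, objects[i]))
```

Raising `n_probe` recovers accuracy at the cost of more comparison invocations, which is the accuracy/performance compromise the abstract describes.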
Improving the Performance of Audio-Based Similarity Queries with Clustering
, 1999
Abstract
Many multimedia applications require the storage and retrieval of nontraditional data types such as audio, video and images. One important functionality required by these applications is the capability to find objects in a database that are similar to a given object. The comparison algorithms for multimedia data types are typically computationally expensive. Therefore, the performance of similarity queries can be improved significantly by reducing the number of invocations of these comparison algorithms. In this paper, we propose the utilization of clustering techniques in order to reduce the number of invocations of comparison algorithms. We argue that our proposed approach is more suitable to be deployed at the database system level, specifically for object-relational database management systems (ORDBMS), as compared to special-purpose indexing techniques. This is because our approach is flexible enough to utilize any given comparison algorithm without knowing its detailed implementat...
Bulletin of the Technical Committee on
Abstract
this paper, as in [9], we draw a distinction between the latter, which we call KDD, and "data mining". The term data mining has been mostly used by statisticians, data analysts, and the database communities. The earliest uses of the term come from statistics and its usage in most settings was associated with negative connotations of blind exploration of data without a priori hypotheses to be verified. However, notable exceptions can be found. For example, as early as 1978 [16], the term is used in a positive sense in a demonstration of how generalized linear regression can be used to solve problems that are very difficult for humans and the traditional statistical techniques