Results 1–10 of 45
Model-Based Clustering, Discriminant Analysis, and Density Estimation
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
Cited by 319 (26 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology, a...
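The "How many clusters?" question is answered in this framework by fitting mixture models for several values of k and comparing a model-selection criterion such as BIC. A minimal 1-D sketch of that loop (our own illustration, not the authors' code; the EM routine, the quantile initialization, and the larger-is-better BIC sign convention are all assumptions):

```python
import math
import random

def em_gmm_1d(xs, k, iters=50):
    """Fit a k-component 1-D Gaussian mixture by EM; return the final log-likelihood."""
    xs_sorted = sorted(xs)
    mus = [xs_sorted[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]  # quantile init
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    ll = 0.0
    for _ in range(iters):
        resp, ll = [], 0.0
        for x in xs:  # E-step: posterior responsibility of each component for x
            dens = [w * math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
                    for w, m, s in zip(weights, mus, sigmas)]
            total = sum(dens) + 1e-300
            ll += math.log(total)
            resp.append([d / total for d in dens])
        for j in range(k):  # M-step: re-estimate weights, means, variances
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor to avoid degenerate components
    return ll

def bic(ll, n, k):
    p = 3 * k - 1                    # k means + k variances + (k - 1) mixing weights
    return 2 * ll - p * math.log(n)  # mclust-style convention: larger is better

random.seed(1)
data = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(6, 1) for _ in range(100)]
scores = {k: bic(em_gmm_1d(data, k), len(data), k) for k in (1, 2, 3)}
best_k = max(scores, key=scores.get)
```

On this clearly bimodal sample, the BIC score for k = 2 should dominate the single-component fit, illustrating how the criterion penalizes extra parameters while rewarding fit.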
Compressed Data Cubes for OLAP Aggregate Query Approximation on Continuous Dimensions
, 1988
Cited by 59 (4 self)
Efficiently answering decision support queries is an important problem. Most of the work in this direction has been in the context of the data cube. Queries are efficiently answered by precomputing large parts of the cube. Besides having large space requirements, such precomputation requires that the hierarchy along each dimension be fixed (hence dimensions are categorical or pre-discretized). Queries that take advantage of precomputation can thus only drill down or roll up along this fixed hierarchy. Another disadvantage of existing precomputation techniques is that the target measure, along with the aggregation function of interest, is fixed for each cube. Queries over more than one target measure, or using different aggregation functions, would require precomputing larger data cubes. In this paper, we propose a new compressed representation of the data cube that (a) drastically reduces storage requirements, (b) does not require the discretization hierarchy along each query dimension to be fixed beforehand, and (c) treats each dimension as a potential target measure and supports multiple aggregation functions without additional storage costs. The tradeoff is approximate, yet relatively accurate, answers to queries. We outline mechanisms to reduce the error in the approximation. Our performance evaluation indicates that our compression technique effectively addresses the limitations of existing approaches.
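A toy illustration of the underlying trade: keep only small per-bucket statistics over a continuous dimension and answer range aggregates approximately. The names and the per-bucket uniformity assumption below are ours; the paper's actual compression is more sophisticated.

```python
# Keep only (count, sum) per equal-width bucket of a continuous dimension;
# answer range COUNT queries approximately, assuming values are spread
# uniformly inside any partially covered bucket.
def build_buckets(values, lo, hi, nbuckets):
    width = (hi - lo) / nbuckets
    buckets = [[0, 0.0] for _ in range(nbuckets)]
    for v in values:
        i = min(int((v - lo) / width), nbuckets - 1)
        buckets[i][0] += 1   # count
        buckets[i][1] += v   # sum (would serve SUM/AVG queries the same way)
    return buckets, lo, width

def approx_count(buckets, lo, width, a, b):
    total = 0.0
    for i, (cnt, _sum) in enumerate(buckets):
        blo, bhi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, bhi) - max(a, blo))
        total += cnt * overlap / width   # uniformity assumption within the bucket
    return total

buckets, lo, width = build_buckets([float(v) for v in range(100)], 0.0, 100.0, 10)
```

No discretization hierarchy is fixed in advance: the query range `[a, b)` can fall anywhere, at the cost of interpolation error inside partially covered buckets.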
Efficient discovery of error-tolerant frequent itemsets in high dimensions
 In SIGKDD 2001
, 2001
Cited by 58 (0 self)
We present a generalization of frequent itemsets that allows for a notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation, and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.
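The core relaxation can be sketched in a few lines: a row supports an itemset even if a few of the items are missing. This is a simplified reading of the paper's error-tolerance parameter (stated there as a fraction; here an absolute count), and `et_support` is our own name:

```python
# Error-tolerant support: a row "supports" itemset I if it is missing at most
# max_missing of I's items (classical frequent-itemset support is max_missing=0).
def et_support(rows, items, max_missing):
    need = len(items) - max_missing
    return sum(1 for row in rows if len(row & items) >= need)

rows = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {4}]
```

With `max_missing = 0` only the first row supports `{1, 2, 3}`; allowing one missing item raises the support to four rows, which is how the relaxation recovers dense-but-imperfect column groups in sparse binary data.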
Clustering for approximate similarity search in high-dimensional spaces
 IEEE Transactions on Knowledge and Data Engineering
, 2002
Cited by 48 (0 self)
In this paper, we present a clustering and indexing paradigm (called Clindex) for high-dimensional search spaces. The scheme is designed for approximate similarity searches, where one would like to find many of the data points near a target point, but where one can tolerate missing a few near points. For such searches, our scheme can find near points with high recall in very few I/Os and perform significantly better than other approaches. Our scheme is based on finding clusters and then building a simple but efficient index for them. We analyze the tradeoffs involved in clustering and building such an index structure, and present extensive experimental results.
Index Terms: Approximate search, clustering, high-dimensional index, similarity search.
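The cluster-then-index idea reduces, in its simplest form, to scanning only the cluster whose centroid is closest to the query, trading a little recall for far fewer I/Os. A minimal in-memory sketch (our illustration, not Clindex's actual file layout or grid-based clustering):

```python
# Pre-cluster points by nearest seed centroid, then answer a similarity query
# by scanning only the single closest cluster (an approximate search: a true
# nearest neighbor sitting just across a cluster boundary can be missed).
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def build(points, centroids):
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        i = min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
        clusters[i].append(p)
    return clusters

def approx_nn(query, centroids, clusters):
    i = min(range(len(centroids)), key=lambda j: dist2(query, centroids[j]))
    return min(clusters[i], key=lambda p: dist2(query, p))

centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 1.0), (1.0, 0.0), (9.0, 10.0), (10.0, 9.0)]
clusters = build(points, centroids)
```

Scanning the few nearest clusters instead of just one is the usual knob for raising recall at the cost of more I/O.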
Modeling Spatial Dependencies for Mining Geospatial Data: An Introduction
 Geographic Data Mining and Knowledge Discovery (GKD)
, 2000
Cited by 33 (10 self)
Spatial data mining is the process of discovering interesting, potentially useful, and high-utility patterns embedded in spatial databases. Efficient tools for extracting information from spatial data sets can be of importance to organizations which own, generate, and manage large spatial data sets. The current approach to solving spatial data mining problems is to use classical data mining tools after "materializing" spatial relationships. However, the key property of spatial data is spatial autocorrelation: like temporal data, spatial data values are influenced by values in their immediate vicinity. Ignoring spatial autocorrelation in the modeling process leads to results which are a poor fit and unreliable. In this chapter we first review spatial statistical techniques which explicitly model spatial autocorrelation. Second, we propose PLUMS (Predicting Locations Using Map Similarity), a new approach for supervised spatial data mining problems. PLUMS searches the space of solutions using a map-similarity measure which is more appropriate in the context of spatial data. We show that, compared to state-of-the-art spatial statistics approaches, PLUMS achieves comparable accuracy but at a fraction of the computational cost. Furthermore, PLUMS provides a general framework for specializing other data mining techniques for mining spatial data.
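The standard statistic for the spatial autocorrelation the chapter emphasizes is Moran's I, which the spatial statistics literature defines as I = (n/W) · Σᵢⱼ wᵢⱼ(xᵢ − x̄)(xⱼ − x̄) / Σᵢ(xᵢ − x̄)². A small sketch on a 1-D chain with adjacent-neighbor weights (the chain layout is our own toy example):

```python
# Moran's I with binary weights w_ij = 1 for listed neighbor pairs:
# values near +1 indicate positive spatial autocorrelation (similar values
# cluster in space), near 0 indicates no spatial structure.
def morans_i(x, pairs):
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[j] - mean) for i, j in pairs)
    den = sum((v - mean) ** 2 for v in x)
    W = len(pairs)   # sum of all weights (each ordered pair counted once)
    return (n / W) * (num / den)

x = [1, 2, 3, 4, 5, 6]                       # smoothly increasing along the chain
pairs = [(i, j) for i in range(6) for j in range(6) if abs(i - j) == 1]
I = morans_i(x, pairs)
```

The monotone sequence yields a clearly positive I, which is exactly the situation where treating observations as i.i.d. (as classical data mining tools do) gives an unreliable fit.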
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Datasets
 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
, 2003
Cited by 26 (1 self)
We investigate the use of sampling biased according to the density of the data set to speed up general data mining tasks, such as clustering and outlier detection, in large multidimensional data sets. In density-biased sampling, the probability that a given point will be included in the sample depends on the local density of the data set. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present...
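One common way to realize density-biased sampling is to estimate local density on a coarse grid and weight each point by density raised to a tunable exponent. The grid estimator and the exponent parameterization below are our own sketch of the idea, not the paper's exact scheme:

```python
# Density-biased inclusion probabilities: weight each point by (local grid-cell
# density) ** e. With e < 0 sparse regions are favored (useful for outlier
# detection); with e > 0 dense regions are favored; e = 0 is uniform sampling.
from collections import Counter
import random

def inclusion_probs(points, cell, e, target):
    key = lambda p: (int(p[0] // cell), int(p[1] // cell))
    counts = Counter(key(p) for p in points)
    weights = [counts[key(p)] ** e for p in points]
    scale = target / sum(weights)            # aim for ~target points in the sample
    return [min(1.0, w * scale) for w in weights]

def density_biased_sample(points, cell, e, target, seed=0):
    rng = random.Random(seed)
    probs = inclusion_probs(points, cell, e, target)
    return [p for p, q in zip(points, probs) if rng.random() < q]

pts = [(0.5 + 0.01 * i, 0.5) for i in range(10)] + [(95.0, 95.0)]
probs = inclusion_probs(pts, 10.0, -1, 2.0)
```

With `e = -1` each occupied cell contributes equal total weight, so the isolated point is sampled with probability 1 while each of the ten crowded points gets probability 0.1: the lone outlier is essentially guaranteed to survive the sample.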
Knowledge Discovery from Sequential Data
, 2003
Cited by 21 (0 self)
A new framework for analyzing sequential or temporal data such as time series is proposed. It differs from other approaches in its special emphasis on the interpretability of the results, since interpretability is of vital importance for knowledge discovery, that is, the development of new knowledge (in the head of a human) from a list of discovered patterns. While traditional approaches try to model and predict all time series observations, the focus in this work is on modelling local dependencies in multivariate time series. This
Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation.
 Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-02)
, 2002
Cited by 18 (2 self)
The goal of clustering is to identify distinct groups in a dataset. Compared to nonparametric clustering methods like complete linkage, hierarchical model-based clustering has the advantage of offering a way to estimate the number of groups present in the data. However, its computational cost is quadratic in the number of items to be clustered, and it is therefore not applicable to large problems. We review an idea called Fractionation, originally conceived by Cutting, Karger, Pedersen, and Tukey for nonparametric hierarchical clustering of large datasets, and describe an adaptation of Fractionation to model-based clustering. A further extension, called Refractionation, leads to a procedure that can be successful even in the difficult situation where there are large numbers of small groups.
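The Fractionation loop itself is simple: split the data into fixed-size fractions, cluster each fraction, replace each cluster with a summary "meta-observation," and repeat until the data are small enough for the quadratic method. A schematic sketch with a stand-in clustering routine (the `// g` grouping and the mean summary are our placeholders, not the paper's model-based steps):

```python
# Fractionation sketch: cluster within fractions of size m, summarize clusters
# by their means, and recurse on the summaries until at most `stop` remain.
def cluster_fraction(items, g=10):
    groups = {}
    for v in items:
        groups.setdefault(v // g, []).append(v)   # stand-in for a real clusterer
    return list(groups.values())

def fractionate(items, m=50, stop=20):
    while len(items) > stop:
        summaries = []
        for i in range(0, len(items), m):         # each fraction clustered independently
            for grp in cluster_fraction(items[i:i + m]):
                summaries.append(sum(grp) / len(grp))   # meta-observation per cluster
        if len(summaries) >= len(items):
            break                                 # no compression achieved; bail out
        items = summaries
    return items

reduced = fractionate([5] * 100 + [105] * 100)
```

Each pass costs only O(N·m) rather than O(N²), which is the point: the expensive hierarchical model-based step runs only on the final, much smaller set of summaries.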
Scalable clustering algorithms with balancing constraints
 Data Mining Knowledge Discovery
Cited by 17 (1 self)
Clustering methods for data-mining problems must be extremely scalable. In addition, several data mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. In this paper, we propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling a small representative subset of the points, clustering the sampled data, and populating the initial clusters with the remaining data, followed by refinements. First, we show that a simple uniform sample of the original data is sufficient to obtain a representative subset with high probability. While the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. We then present algorithms to populate and refine the clusters. The algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. The complexity of the overall method is O(kN log N) for obtaining k balanced clusters from N data points, which compares favorably with other existing techniques for balanced clustering. In addition to providing balancing guarantees, the clustering performance obtained using the proposed framework is comparable to, and often better than, the corresponding unconstrained solution. Experimental results on several datasets, including
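The populating step can be approximated greedily: give every cluster a capacity of ⌈N/k⌉ and assign each point to its nearest cluster that still has room, processing points that fit their best cluster most tightly first. This is our simplified stand-in for the paper's stable-marriage-based scheme, shown on 1-D points for brevity:

```python
# Capacity-constrained populating sketch: each cluster holds at most
# ceil(N / k) points, enforcing balance while keeping assignments near.
import math

def populate_balanced(points, centroids):
    k = len(centroids)
    cap = math.ceil(len(points) / k)
    clusters = [[] for _ in range(k)]
    # points that fit their preferred cluster most tightly choose first
    order = sorted(points, key=lambda p: min(abs(p - c) for c in centroids))
    for p in order:
        for j in sorted(range(k), key=lambda j: abs(p - centroids[j])):
            if len(clusters[j]) < cap:   # spill to next-nearest when full
                clusters[j].append(p)
                break
    return clusters

clusters = populate_balanced([0, 1, 2, 3, 9, 11], [0, 10])
```

Here the point at 3 prefers the cluster at 0 but spills to the cluster at 10 once the first cluster reaches its capacity of three, yielding two equal-size clusters.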
DynDex: A Dynamic and Non-metric Space Indexer
, 2002
Cited by 17 (5 self)
To date, almost all research work in the Content-Based Image Retrieval (CBIR) community has used Minkowski-like functions to measure similarity between images. In this paper, we first present a non-metric distance function, the dynamic partial function (DPF), which works significantly better than Minkowski-like functions for measuring perceptual similarity, and we explain DPF's link to similarity theories in cognitive science. We then propose DynDex, an indexing method that deals with both the dynamic and non-metric aspects of the distance function. DynDex employs statistical methods, including distance-based classification and bagging, to enable efficient indexing with DPF. In addition to its efficiency for conducting similarity searches in very high-dimensional spaces, we show that DynDex remains effective when features are weighted dynamically to support personalized searches.
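What makes DPF "dynamic" and non-metric is that it aggregates only the m smallest per-dimension differences, so the dimensions that dominate the distance change from pair to pair and the triangle inequality no longer holds. A one-function sketch under that reading (the exact formulation in the paper may differ in detail):

```python
# Dynamic partial function: a Minkowski-style distance over only the m
# smallest per-dimension differences; the most dissimilar dimensions are
# dropped, and which dimensions those are varies per pair (hence "dynamic").
def dpf(x, y, m, r=2):
    diffs = sorted(abs(a - b) for a, b in zip(x, y))[:m]
    return sum(d ** r for d in diffs) ** (1 / r)
```

For `x = (0, 0, 0, 10)` and `y = (1, 1, 1, 0)` with `m = 3`, the large mismatch in the last dimension is discarded and the distance is √3, whereas full Euclidean distance would be dominated by that single dimension; this mirrors the cognitive-science view that perceived similarity rests on the respects in which two stimuli agree.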