Results 1 - 10 of 49
Survey of clustering data mining techniques
, 2002
Abstract

Cited by 247 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
A data clustering algorithm for mining patterns from event logs
 in IEEE IPOM’03 Proceedings
, 2003
Abstract

Cited by 35 (0 self)
This material is posted here with permission from IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to
Computing Clusters of Correlation Connected Objects
, 2004
Abstract

Cited by 34 (10 self)
The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or more features might be dependent on a combination of several other features. Well-known methods like principal component analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e. the same type of correlation is exhibited in all feature vectors. In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS.
Rapid Detection of Significant Spatial Clusters
 In KDD
, 2004
Abstract

Cited by 28 (9 self)
Given an $N \times N$ grid of squares, where each square has a count $c_{ij}$ and an underlying population $p_{ij}$, our goal is to find the rectangular region with the highest density, and to calculate its significance by randomization. An arbitrary density function $D$, dependent on a region's total count $C$ and total population $P$, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic $D_K$ to find the most significant spatial disease cluster. A naive approach to finding the maximum density region requires $O(N^4)$ time, and is generally computationally infeasible. We present a multiresolution algorithm which partitions the grid into overlapping regions using a novel overlap-kd tree data structure, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in $O((N \log N)^2)$ time, in practice resulting in significant (20-2000x) speedups on both real and simulated datasets.
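For orientation, the form of Kulldorff's statistic usually quoted for the Poisson model is written out below. The abstract itself does not state the formula, so treat this as an assumed standard form rather than the authors' exact definition. Here $C$ and $P$ are the region's total count and population as above, and $C_{tot}$, $P_{tot}$ (names introduced here) are the totals over the whole grid.

```latex
D_K(S) =
\begin{cases}
C \log\dfrac{C}{P}
  + (C_{tot} - C)\log\dfrac{C_{tot} - C}{P_{tot} - P}
  - C_{tot}\log\dfrac{C_{tot}}{P_{tot}}
  & \text{if } \dfrac{C}{P} > \dfrac{C_{tot} - C}{P_{tot} - P}, \\[2ex]
0 & \text{otherwise.}
\end{cases}
```

The $O(N^4)$ cost of the naive approach follows from counting: an axis-aligned rectangle is fixed by two row boundaries and two column boundaries, so an $N \times N$ grid contains $O(N^4)$ rectangles to score.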
A Fast Multi-Resolution Method for Detection of Significant Spatial Overdensities
 Advances in Neural Information Processing Systems 16
, 2003
Abstract

Cited by 24 (7 self)
Given an $N \times N$ grid of squares, where each square $s_{ij}$ has a count $c_{ij}$ and an underlying population $p_{ij}$, our goal is to find the square region $S$ with the highest density, and to calculate the significance of this region by Monte Carlo testing. Any density measure $D$, which depends on the total count and total population of the region, can be used. For example, if each count $c_{ij}$ represents the number of disease cases occurring in that square, we can use Kulldorff's spatial scan statistic $D_K$ to find the most significant spatial disease cluster. A naive approach to finding the region of maximum density would be to calculate the density measure for every square region: this requires $O(RN^3)$ calculations, where $R$ is the number of Monte Carlo replications, and hence is generally computationally infeasible. We present a novel multiresolution algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufficiently dense regions, this method finds the maximum density region in optimal $O(RN^2)$ time, and in practice it results in significant (10-200x) speedups as compared to the naive approach.
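The naive baseline described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of 2D prefix sums, the Poisson form of the score, and the Poisson null model for the replica grids are all assumptions made for the example. It scores every square region of an N x N grid and repeats the search on R randomized replicas, which gives the O(RN^3) cost of the naive approach.

```python
import math
import numpy as np

def kulldorff_score(C, P, C_tot, P_tot):
    """Log-likelihood ratio of a region (assumed Poisson form of the scan statistic)."""
    if C <= 0 or P <= 0 or P >= P_tot:
        return 0.0
    inside = C / P
    outside = (C_tot - C) / (P_tot - P)
    if inside <= outside:          # only overdensities are of interest
        return 0.0
    score = C * math.log(inside) - C_tot * math.log(C_tot / P_tot)
    if C_tot > C:                  # skip the 0 * log(0) term
        score += (C_tot - C) * math.log(outside)
    return score

def prefix_sums(grid):
    """2D prefix sums so that any rectangle total costs O(1)."""
    s = np.zeros((grid.shape[0] + 1, grid.shape[1] + 1))
    s[1:, 1:] = grid.cumsum(axis=0).cumsum(axis=1)
    return s

def naive_square_scan(counts, pops):
    """Score every k x k square; returns (best_score, (row, col, k))."""
    n = counts.shape[0]
    sc, sp = prefix_sums(counts), prefix_sums(pops)
    C_tot, P_tot = sc[n, n], sp[n, n]
    best_score, best_region = 0.0, None
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            for j in range(n - k + 1):
                C = sc[i + k, j + k] - sc[i, j + k] - sc[i + k, j] + sc[i, j]
                P = sp[i + k, j + k] - sp[i, j + k] - sp[i + k, j] + sp[i, j]
                score = kulldorff_score(C, P, C_tot, P_tot)
                if score > best_score:
                    best_score, best_region = score, (i, j, k)
    return best_score, best_region

def monte_carlo_p_value(counts, pops, replications=19, seed=0):
    """Compare the observed best score with the best scores of replica grids.

    Assumption: under the null, each cell count is Poisson with mean
    proportional to the cell's population.
    """
    rng = np.random.default_rng(seed)
    observed, region = naive_square_scan(counts, pops)
    expected = pops * counts.sum() / pops.sum()
    beaten = sum(
        naive_square_scan(rng.poisson(expected), pops)[0] >= observed
        for _ in range(replications)
    )
    return observed, region, (beaten + 1) / (replications + 1)
```

Calling `monte_carlo_p_value(counts, pops, replications=R)` runs one full scan per replica, so the triple loop dominates the running time. That repeated exhaustive scan is the cost the multiresolution method in the abstract is designed to avoid by pruning regions.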
Detecting significant multidimensional spatial clusters
 Advances in Neural Information Processing Systems 17
, 2005
Abstract

Cited by 17 (8 self)
Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count $c_i$ and a baseline $b_i$. Our goal is to find spatial regions ($d$-dimensional rectangles) where the $c_i$ are significantly higher than expected given $b_i$. We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency department visits, over-the-counter drug sales), and discovery of regions of increased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy.
FINDIT: a Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting
 PhD thesis, Korea Advanced Institute of Science and Technology
, 2002
Abstract

Cited by 15 (1 self)
The aim of this paper is to present a novel subspace clustering method named FINDIT. Clustering is the process of finding interesting patterns residing in the dataset by grouping similar data objects from dissimilar ones based on their dimensional values. Subspace clustering is a new area of clustering which achieves the clustering goal in high dimension by allowing clusters to be formed with their own correlated dimensions. In subspace clustering, selecting correct dimensions is very important because the distance between points is easily changed according to the selected dimensions. However, selecting dimensions correctly is difficult, because data grouping and dimension selection should be performed simultaneously. FINDIT determines the correlated dimensions for each cluster based on two key ideas: a dimension-oriented distance measure which fully utilizes dimensional difference information, and a dimension voting policy which determines important dimensions in a probabilistic way based on V nearest neighbors' information. Through various experiments on synthetic data, FINDIT is shown to be very successful in the high-dimensional clustering problem. FINDIT satisfies most requirements for good clustering methods such as accuracy of results, robustness to noise and cluster density, and scalability to the dataset size and the dimensionality. Moreover, it is gracefully scalable to full dimension without any modification to the algorithm.
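The two ideas named in this abstract can be illustrated with a small sketch. It is a simplified reading, not the thesis's procedure: here the dimension-oriented distance is assumed to count the dimensions in which two points differ by more than a tolerance `eps`, and voting is assumed to keep a dimension when at least `min_votes` of the `v` nearest neighbors agree with the point on it. The fixed `min_votes` threshold stands in for the probabilistic decision the abstract mentions, and all names are hypothetical.

```python
import numpy as np

def dod(p, q, eps):
    """Assumed dimension-oriented distance: number of dimensions
    in which p and q differ by more than eps."""
    return int(np.sum(np.abs(p - q) > eps))

def vote_dimensions(data, idx, v, eps, min_votes):
    """Let the v nearest neighbors of data[idx] (under dod) vote on dimensions.

    A neighbor votes for a dimension when it lies within eps of the point
    on that dimension. Returns (selected_dimensions, votes_per_dimension).
    """
    diffs = np.abs(data - data[idx])        # shape (n_points, n_dims)
    dists = (diffs > eps).sum(axis=1)       # dod from data[idx] to every point
    dists[idx] = data.shape[1] + 1          # exclude the point itself
    neighbors = np.argsort(dists, kind="stable")[:v]
    votes = (diffs[neighbors] <= eps).sum(axis=0)
    return np.flatnonzero(votes >= min_votes), votes
```

With points that agree on some dimensions and scatter on others, the votes concentrate on the agreeing dimensions, which is the intuition behind choosing a separate set of correlated dimensions for each cluster.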
Comparing subspace clusterings
 IEEE Transactions on Knowledge and Data Engineering
, 2004
Abstract

Cited by 12 (1 self)
We present the first framework for comparing subspace clusterings. We propose several distance measures for subspace clusterings, including generalizations of well-known distance measures for ordinary clusterings. We describe a set of important properties for any measure for comparing subspace clusterings and give a systematic comparison of our proposed measures in terms of these properties. We validate the usefulness of our subspace clustering distance measures by comparing clusterings produced by the algorithms FastDOC, HARP, PROCLUS, ORCLUS, and SSPC. We show that our distance measures can also be used to compare partial clusterings, overlapping clusterings, and patterns in binary data matrices. Index Terms: Subspace clustering, projected clustering, distance, feature selection, cluster validation.
Hypergraph Models and Algorithms for Data-Pattern Based Clustering
 DATA MINING AND KNOWLEDGE DISCOVERY
, 2004
Abstract

Cited by 11 (1 self)
In traditional approaches for clustering market basket type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns through modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph partitioning based algorithms in other domains depends on sparsity of the hypergraph and explicit objective metrics. For this, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.
The challenges of clustering high-dimensional data
 In New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition
, 2003
Abstract

Cited by 11 (0 self)
Cluster analysis divides data into groups (clusters) for the purposes of summarization or improved understanding. For example, cluster analysis has been used to group related documents for browsing, to find genes and proteins that have similar functionality, or as a means of data compression. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. In this chapter we provide a short introduction to cluster analysis, and then focus on the challenge of clustering high-dimensional data. We present a brief overview of several recent techniques, including a more detailed description of recent work of our own which uses a concept-based clustering approach.