Results 1-10 of 53
OPTICS: Ordering Points To Identify the Clustering Structure
1999
Cited by 340 (45 self)
Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only ‘traditional’ clustering information (e.g. representative points, arbitrarily shaped clusters), but also the intrinsic clustering structure. For medium-sized data sets, the cluster-ordering can be represented graphically, and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure, offering additional insights into the distribution and correlation of the data.
On Clustering Validation Techniques
Journal of Intelligent Information Systems, 2001
Cited by 180 (2 self)
Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business and social sciences. In recent years especially, the availability of huge transactional and experimental data sets and the arising requirements for data mining have created a need for clustering algorithms that scale and can be applied in diverse domains.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
1998
Cited by 156 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets
1999
Cited by 64 (0 self)
Clustering techniques are used in database mining for finding interesting patterns in high-dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets in terms of scalability, data distribution, understanding end results, and sensitivity to input order, have received attention in the recent past. Recent approaches attempt to find clusters embedded in subspaces of high-dimensional data. In this paper we propose the use of adaptive grids for efficient and scalable computation of clusters in subspaces for large data sets and large number of dimensions. The bottom-up algorithm for subspace clustering computes the dense units in all dimensions and combines these to generate the dense units in higher dimensions. Computation is heavily dependent on the choice of the partitioning parameter chosen to partition each dimension into intervals (bins) to be tested for density. The number of bins determine...
Clustering Validity Assessment: Finding the optimal partitioning of a data set
2001
Cited by 30 (6 self)
Clustering is a mostly unsupervised procedure and the majority of the clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation as regards its validity.
Efficient k-anonymization using clustering techniques
In DASFAA, 2007
Cited by 29 (2 self)
k-anonymization techniques have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications. In this paper we propose an approach that uses the idea of clustering to minimize information loss and thus ensure good data quality. The key observation here is that data records that are naturally similar to each other should be part of the same equivalence class. We thus formulate a specific clustering problem, referred to as the k-member clustering problem. We prove that this problem is NP-hard and present a greedy heuristic, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works for both numeric and categorical data.
CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data
In: Proc. of KDD’02, 2002
Cited by 21 (2 self)
This paper studies the problem of categorical data clustering, especially for transactional data characterized by high dimensionality and large volume. Starting from a heuristic method of increasing the height-to-width ratio of the cluster histogram, we develop a novel algorithm, CLOPE, which is very fast and scalable, while being quite effective. We demonstrate the performance of our algorithm on two real-world datasets, and compare CLOPE with the state-of-the-art algorithms.
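As a rough illustration of the height-to-width heuristic mentioned above, the sketch below treats a cluster's histogram as items on the horizontal axis and occurrence counts on the vertical axis, and takes the height as total occurrences divided by the number of distinct items. That reading, the function name and the toy transactions are assumptions made for this example; the abstract does not give the exact criterion CLOPE optimises.

```python
from collections import Counter

def height_to_width(transactions):
    """Height-to-width ratio of a cluster histogram.

    Assumed reading: width = number of distinct items in the cluster,
    height = total item occurrences / width, ratio = height / width.
    """
    counts = Counter(item for t in transactions for item in set(t))
    width = len(counts)
    if width == 0:
        return 0.0
    height = sum(counts.values()) / width
    return height / width

def histogram_height(transactions):
    """Average bar height: total item occurrences per distinct item."""
    counts = Counter(item for t in transactions for item in set(t))
    return sum(counts.values()) / len(counts) if counts else 0.0

# Hypothetical toy clusters of transactions (each transaction is a set of items).
overlapping = [{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}]
scattered = [{"a", "b"}, {"c", "d"}, {"e", "f", "g", "h"}]

h_overlapping = histogram_height(overlapping)  # 8 occurrences over 4 items
h_scattered = histogram_height(scattered)      # 8 occurrences over 8 items
```

Transactions that share many items give a taller, narrower histogram, so a larger ratio is a plausible signal of a more cohesive cluster.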
Unsupervised and semisupervised clustering: a brief survey
7th ACM SIGMM international workshop on Multimedia information retrieval
Cited by 19 (0 self)
Clustering (or cluster analysis) aims to organize a collection of data items into clusters, such that items within a cluster are more “similar” to each other than they are to items in the other clusters. This notion of similarity can be expressed in very different ways, according to the purpose of the study, to domain-specific assumptions and to prior knowledge of the problem. Clustering is usually performed when no information is available concerning the membership of data items to predefined classes. For this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to “guide” or “adjust” clustering (see section 2). To support the extensive use of clustering in computer vision, pattern recognition, information retrieval, data mining, etc., very many different methods were developed in several communities. Detailed surveys of this domain can be found in [25], [27] or [26]. In the following, we attempt to briefly review a few core concepts of cluster analysis and describe categories of clustering methods that are best represented in the literature. We also take this opportunity to provide some pointers to more recent work on clustering.
A Fast and Robust General Purpose Clustering Algorithm
In Pacific Rim International Conference on Artificial Intelligence, 2000
Cited by 16 (2 self)
General-purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-Means has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable and multidimensional, but is more robust to noise and outliers. We achieve this by using the discrete median rather than the mean as the estimator of the center of a cluster. Comparison with k-Means, Expectation Maximization and Gibbs sampling demonstrates the advantages of our algorithm.
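To see why a discrete median is less sensitive to outliers than a mean, consider the small Python sketch below. It takes the discrete median to be the data point with the smallest total Euclidean distance to all other points, found by brute force in O(n^2) time. The function names and the sample points are hypothetical, and the quadratic search only illustrates the estimator; it is not the efficient procedure the paper proposes.

```python
import math

def discrete_median(points):
    """Member of `points` with the smallest summed distance to all points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

def mean(points):
    """Coordinate-wise arithmetic mean, for comparison."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

# Four points close together plus one far outlier (hypothetical data).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (100, 0)]

center_median = discrete_median(points)  # stays inside the dense group
center_mean = mean(points)               # dragged towards the outlier
```

Here the discrete median is one of the four nearby points, while the mean lands far outside the dense group because of the single outlier.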
Cluster Analysis using Triangulation
1997
Cited by 14 (0 self)
This paper looks at clustering using tools from graph theory. It first triangulates the data, then partitions the edges of the resulting graph into inter- and intra-cluster edges. The technique is unaffected by the actual shape of the clusters, thus allowing a far more general version of the clustering problem to be solved. Section 2 of the paper is a general introduction to clustering, which includes a brief description of the commonly used k-means technique. Following this is a discussion of the problems which arise in the k-means (and related) methods and why there is a need for graph-based methods. Sections 4 and 6 explain the proposed new method, and give examples of its success. Section 5 discusses a few existing graph-based methods and why they can be improved upon. The test programs, which provide the results discussed in this paper, are currently written for two-dimensional data sets, but Section 7 explains how the same principles can be extended to higher-dimensional problems.