OPTICS: Ordering Points To Identify the Clustering Structure, 1999
Cited by 262 (42 self)
Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real-data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only ‘traditional’ clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure offering additional insights into the distribution and correlation of the data.
On Clustering Validation Techniques
- Journal of Intelligent Information Systems, 2001
Cited by 129 (1 self)
Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business and social sciences. In recent years especially, the availability of huge transactional and experimental data sets and the resulting data mining requirements have created a need for clustering algorithms that scale and can be applied in diverse domains.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, 1998
Cited by 109 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
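The three k-modes ingredients named in this abstract (simple matching dissimilarity, modes in place of means, frequency-based mode updates) can be illustrated with a minimal Python sketch. The function names and the toy records are hypothetical and chosen for illustration only; the paper's full algorithm, including initialisation and the k-prototypes extension to mixed data, is not reproduced here.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Frequency-based mode: the most frequent category in each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

def assign(records, modes):
    """Assign each record to the index of its nearest mode."""
    return [
        min(range(len(modes)), key=lambda j: matching_dissimilarity(r, modes[j]))
        for r in records
    ]

# Hypothetical categorical records: (colour, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
]

if __name__ == "__main__":
    modes = [cluster_mode(records[:3]), ("blue", "large", "square")]
    print(modes)
    print(assign(records, modes))
```

Alternating `assign` and `cluster_mode` until the assignments stop changing gives a loop analogous to k-means, which is the sense in which the abstract describes k-modes as working "in a fashion similar to k-means".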
MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets, 1999
Cited by 56 (0 self)
Clustering techniques are used in database mining for finding interesting patterns in high dimensional data. These are useful in various applications of knowledge discovery in databases. Some challenges in clustering for large data sets in terms of scalability, data distribution, understanding end-results, and sensitivity to input order, have received attention in the recent past. Recent approaches attempt to find clusters embedded in subspaces of high dimensional data. In this paper we propose the use of adaptive grids for efficient and scalable computation of clusters in subspaces for large data sets and large number of dimensions. The bottom-up algorithm for subspace clustering computes the dense units in all dimensions and combines these to generate the dense units in higher dimensions. Computation is heavily dependent on the choice of the partitioning parameter chosen to partition each dimension into intervals (bins) to be tested for density. The number of bins determine...
Clustering Validity Assessment: Finding the optimal partitioning of a data set, 2001
Cited by 21 (2 self)
Clustering is a mostly unsupervised procedure and the majority of the clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. As a consequence, in most applications the resulting clustering scheme requires some sort of evaluation as regards its validity.
Efficient k-anonymization using clustering techniques
- In DASFAA, 2007
Cited by 18 (2 self)
k-anonymization techniques have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications. In this paper we propose an approach that uses the idea of clustering to minimize information loss and thus ensure good data quality. The key observation here is that data records that are naturally similar to each other should be part of the same equivalence class. We thus formulate a specific clustering problem, referred to as the k-member clustering problem. We prove that this problem is NP-hard and present a greedy heuristic, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works for both numeric and categorical data.
CLOPE: A Fast and Effective Clustering Algorithm for Transactional Data
- In: Proc of KDD’02, 2002
Cited by 18 (2 self)
This paper studies the problem of categorical data clustering, especially for transactional data characterized by high dimensionality and large volume. Starting from a heuristic method of increasing the height-to-width ratio of the cluster histogram, we develop a novel algorithm -- CLOPE, which is very fast and scalable, while being quite effective. We demonstrate the performance of our algorithm on two real world datasets, and compare CLOPE with the state-of-the-art algorithms.
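The height-to-width ratio of a cluster histogram can be made concrete with a small Python sketch. It assumes the usual reading of the histogram: items on the horizontal axis, occurrence counts on the vertical axis, width as the number of distinct items, and height as total occurrences divided by width. The function names and toy transactions are hypothetical, and the sketch shows only the heuristic quantity, not the CLOPE algorithm itself.

```python
from collections import Counter

def histogram_stats(transactions):
    """Return (size, width, height) of a cluster's item histogram.

    size   = total number of item occurrences in the cluster
    width  = number of distinct items
    height = size / width (average bar height of the histogram)
    Assumes the cluster holds at least one non-empty transaction.
    """
    occurrences = Counter(item for t in transactions for item in t)
    size = sum(occurrences.values())
    width = len(occurrences)
    return size, width, size / width

def height_to_width(transactions):
    """Height-to-width ratio: larger values mean more item overlap."""
    _, width, height = histogram_stats(transactions)
    return height / width

# Hypothetical transactions over items a..f.
overlapping = [{"a", "b"}, {"a", "b", "c"}, {"a", "c", "d"}]
disjoint = [{"a", "b"}, {"d", "e"}]

if __name__ == "__main__":
    print(histogram_stats(overlapping), height_to_width(overlapping))
    print(histogram_stats(disjoint), height_to_width(disjoint))
```

Transactions that share many items give a tall, narrow histogram and a higher ratio than transactions with nothing in common, which is the intuition behind preferring clusterings that increase it.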
Cluster Analysis using Triangulation, 1997
Cited by 13 (0 self)
This paper looks at clustering using tools from graph theory. It first triangulates the data, then partitions the edges of the resulting graph into inter- and intra-cluster edges. The technique is unaffected by the actual shape of the clusters, thus allowing a far more general version of the clustering problem to be solved. Section 2 of the paper is a general introduction to clustering, which includes a brief description of the commonly used k-means technique. Following this is a discussion of the problems which arise in the k-means (and related) methods and why there is a need for graph-based methods. Sections 4 and 6 explain the proposed new method, and give examples of its success. Section 5 discusses a few existing graph-based methods and why they can be improved upon. The test programs, which provide the results discussed in this paper, are currently written for two dimensional data sets, but Section 7 explains how the same principles can be extended to higher dimensional problems.
A Fast and Robust General Purpose Clustering Algorithm
- In Pacific Rim International Conference on Artificial Intelligence, 2000
Cited by 12 (2 self)
General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-Means has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable, multi-dimensional but is more robust to noise and outliers. We achieve this by using the discrete median rather than the mean as the estimator of the center of a cluster. Comparison with k-Means, Expectation Maximization and Gibbs sampling demonstrates the advantages of our algorithm.
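A short Python sketch can show why a discrete median is less sensitive to outliers than a mean. It assumes the discrete median is the cluster member with the smallest total Euclidean distance to the other members; the function names and sample points are hypothetical, and the sketch illustrates only the estimator, not the paper's clustering algorithm.

```python
import math

def discrete_median(points):
    """Return the member of `points` with the smallest summed distance to all members."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

def mean(points):
    """Coordinate-wise arithmetic mean, the centre estimate used by k-means."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Four tightly grouped points plus one distant outlier (hypothetical data).
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (50.0, 50.0)]

if __name__ == "__main__":
    print("discrete median:", discrete_median(cluster))
    print("mean:           ", mean(cluster))
```

On this data the mean is dragged far from the four grouped points by the single outlier, while the discrete median remains one of them. The brute-force search above costs O(n^2) distance evaluations per cluster, so an efficient implementation would need a faster strategy.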
Unsupervised and Semi-supervised Clustering: a Brief Survey
- in ‘A Review of Machine Learning Techniques for Processing Multimedia Content’, Report of the MUSCLE European Network of Excellence (FP6
, 2004
Cited by 11 (0 self)
Clustering (or cluster analysis) aims to organize a collection of data items into clusters, such that items within a cluster are more "similar" to each other than they are to items in the other clusters. This notion of similarity can be expressed in very different ways, according to the purpose of the study, to domain-specific assumptions and to prior knowledge of the problem. Clustering is usually performed when no information is available concerning the membership of data items to predefined classes. For this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to "guide" or "adjust" clustering (see section 2). To support the extensive use of clustering in computer vision, pattern recognition, information retrieval, data mining, etc., very many different methods were developed in several communities. Detailed surveys of this domain can be found in [24], [26] or [25]. In the following, we attempt to briefly review a few core concepts of cluster analysis and describe categories of clustering methods that are best represented in the literature. We also take this opportunity to provide some pointers to more recent work on clustering.

