Results 1  10
of
33
Data Clustering: A Review
 ACM COMPUTING SURVEYS
, 1999
"... Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exp ..."
Abstract

Cited by 1284 (13 self)
 Add to MetaCart
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities has made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify crosscutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
Extensions to the kMeans Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
"... The kmeans algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the kmeans algorithm to categoric ..."
Abstract

Cited by 156 (2 self)
 Add to MetaCart
The kmeans algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the kmeans algorithm to categorical domains and domains with mixed numeric and categorical values. The kmodes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequencybased method to update modes in the clustering process to minimise the clustering cost function. With these extensions the kmodes algorithm enables the clustering of categorical data in a fashion similar to kmeans. The kprototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the kmeans and kmodes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
 In Research Issues on Data Mining and Knowledge Discovery
, 1997
"... Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The kmeans algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining ..."
Abstract

Cited by 82 (2 self)
 Add to MetaCart
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The kmeans algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called kmodes, to extend the kmeans paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well known soybean disease data set the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of ...
A music retrieval system based on userdriven similarity and its evaluation
 In Proc. International Symposium on Music Information Retrieval
, 2005
"... Large music collections require new ways to let users interact with their music. The concept of finding ‘similar’ songs, albums, or artists provides handles to users for easy navigation and instant retrieval. This paper presents the realization and user evaluation of a music retrieval music that sor ..."
Abstract

Cited by 34 (2 self)
 Add to MetaCart
Large music collections require new ways to let users interact with their music. The concept of finding ‘similar’ songs, albums, or artists provides handles to users for easy navigation and instant retrieval. This paper presents the realization and user evaluation of a music retrieval music that sorts songs on the basis of similarity to a given seed song. Similarity is based on a userweighted combination of timbre, genre, tempo, year, and mood. A conclusive user evaluation assessed the usability of the system in comparison to two control systems in which the user control of defining the similarity measure was diminished.
An alternative extension of the kmeans algorithm for clustering categorical data
 Int. J. Appl. Math. Comput. Sci
, 2004
"... Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computatio ..."
Abstract

Cited by 14 (0 self)
 Add to MetaCart
Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The kmeans algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits them from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers ” on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a kmeanslike algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two wellknown data sets, namely, soybean disease and nursery databases.
The BANGClustering System: GridBased Data Analysis
 Proc. Sec. Int. Symp. IDA97
, 1997
"... . For the analysis of large images the clustering of the data set is a common technique to identify correlation characteristics of the underlying value space. In this paper a new approach to hierarchical clustering of very large data sets is presented. The BANGClustering system presented in this pa ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
. For the analysis of large images the clustering of the data set is a common technique to identify correlation characteristics of the underlying value space. In this paper a new approach to hierarchical clustering of very large data sets is presented. The BANGClustering system presented in this paper is a novel approach to hierarchical data analysis. It is based on the BANGClustering method ([Sch96]) and uses a multidimensional grid data structure to organize the value space surrounding the pattern values. The patterns are grouped into blocks and clustered with respect to the blocks by a topological neighbor search algorithm. 1 Introduction Clustering methods are extremely important for explorative data analysis, which is an important approach for the analysis of images. Previously presented algorithms can be divided into hierarchical algorithms, e.g. singlelinkage, completelinkage, etc. and partitional algorithms, e.g. KMEANS, ISODATA, etc. (see [DJ80]). All of these methods suf...
Fuzzy Clustering of Quantitative and Qualitative Data
"... Abstract — In many applications the objects to cluster are described by quantitative as well as qualitative features. A variety of algorithms has been proposed for unsupervised classification if fuzzy partitions and descriptive cluster prototypes are desired. However, most of these methods are desig ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
Abstract — In many applications the objects to cluster are described by quantitative as well as qualitative features. A variety of algorithms has been proposed for unsupervised classification if fuzzy partitions and descriptive cluster prototypes are desired. However, most of these methods are designed for data sets with variables measured in the same scale type (only categorical, or only metric). We propose a new fuzzy clustering approach based on a probabilistic distance measure. Thus a major drawback of present methods can be avoided which lies in the vulnerability to favor one type of attributes. I.
Learning the daily model of network traffic
"... Abstract. Anomaly detection is based on profiles that represent normal behaviour of users, hosts or networks and detects attacks as significant deviations from these profiles. In the paper we propose a methodology based on the application of several data mining methods for the construction of the “n ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
Abstract. Anomaly detection is based on profiles that represent normal behaviour of users, hosts or networks and detects attacks as significant deviations from these profiles. In the paper we propose a methodology based on the application of several data mining methods for the construction of the “normal” model of the ingoing traffic of a departmentlevel network. The methodology returns a daily model of the network traffic as a result of four main steps: first, daily network connections are reconstructed from TCP/IP packet headers passing through the firewall and represented by means of feature vectors; second, network connections are grouped by applying a clustering method; third, clusters are described as sets of rules generated by a supervised inductive learning algorithm; fourth, rules are transformed into symbolic objects and similarities between symbolic objects are computed for each couple of days. The result is a longitudinal model of the similarity of network connections that can be used by a network administrator to identify deviations in network traffic patterns that may demand for his/her attention. The proposed methodology has been tested on log files of the firewall of our University Department.
On the Impact of Dissimilarity Measure in kModes Clustering Algorithm
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2007
"... This correspondence describes extensions to the kmodes algorithm for clustering categorical data. By modifying a simple matching dissimilarity measure for categorical objects, a heuristic approach was developed in [4, 12], which allows the use of the kmodes paradigm to obtain a cluster with stro ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
This correspondence describes extensions to the kmodes algorithm for clustering categorical data. By modifying a simple matching dissimilarity measure for categorical objects, a heuristic approach was developed in [4, 12], which allows the use of the kmodes paradigm to obtain a cluster with strong intrasimilarity, and to efficiently cluster large categorical data sets. The main aim of this paper is to derive rigorously the updating formula of the kmodes clustering algorithm with the new dissimilarity measure, and the convergence of the algorithm under the optimization framework. Index Terms – Data mining, clustering, kmodes algorithm, categorical data. 1