Results 1-10 of 79
Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
Cited by 238 (3 self)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Cited by 157 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
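The k-modes steps named in this abstract (simple matching dissimilarity, modes in place of means, frequency-based mode update) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the initial modes are supplied by the caller, and modes are recomputed once per pass over the data (a batch simplification) rather than after each individual reallocation.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Frequency-based update: most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, initial_modes, max_iter=100):
    """Batch k-modes sketch (assumed simplification of the published method)."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assign every object to its nearest mode.
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the cost cannot improve
        labels = new_labels
        # Recompute each mode from the members of its cluster.
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    cost = sum(matching_dissimilarity(row, modes[lab])
               for row, lab in zip(data, labels))
    return labels, modes, cost
```

Calling `k_modes(data, [data[0], data[3]])` on a list of equal-length tuples of category labels returns the cluster label of each object, the final modes, and the total matching cost.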
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Cited by 87 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
Clustering objects on subsets of attributes
 Journal of the Royal Statistical Society
, 2004
Cited by 40 (1 self)
Clustering large data sets with mixed numeric and categorical values
 In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining
, 1997
Cited by 39 (3 self)
Efficient partitioning of large data sets into homogenous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we present a k-prototypes algorithm which is based on the k-means paradigm but removes the numeric data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data the algorithm is identical to k-means. To assist interpretation of clusters we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can assist data miners to understand and identify interesting clusters.
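The abstract does not spell out how objects are compared with a prototype. A common way to write a mixed dissimilarity of this kind is squared Euclidean distance on the numeric attributes plus a weighted count of categorical mismatches. The sketch below assumes that form; the weight `gamma`, the index-list arguments, and both function names are illustrative assumptions rather than details taken from the listing.

```python
from collections import Counter


def mixed_dissimilarity(obj, prototype, numeric_idx, categorical_idx, gamma=1.0):
    """Assumed form: squared Euclidean part + gamma * number of mismatches."""
    numeric_part = sum((obj[i] - prototype[i]) ** 2 for i in numeric_idx)
    categorical_part = sum(obj[i] != prototype[i] for i in categorical_idx)
    return numeric_part + gamma * categorical_part


def update_prototype(members, numeric_idx, categorical_idx):
    """Prototype update: mean of numeric attributes, mode of categorical ones."""
    proto = list(members[0])
    for i in numeric_idx:
        proto[i] = sum(m[i] for m in members) / len(members)
    for i in categorical_idx:
        proto[i] = Counter(m[i] for m in members).most_common(1)[0][0]
    return tuple(proto)
```

With an empty `categorical_idx` the dissimilarity reduces to squared Euclidean distance and the prototype to the cluster mean, which is consistent with the statement that the algorithm coincides with k-means on purely numeric data.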
The Proximity of an Individual to a Population With Applications in Discriminant Analysis
, 1995
Cited by 18 (10 self)
We develop a proximity function between an individual and a population from a distance between multivariate observations. We study some properties of this construction and apply it to a distance-based discrimination rule, which contains the classic linear discriminant function as a particular case. Additionally, this rule can be used advantageously for categorical or mixed variables, or in problems where a probabilistic model is not well determined. This approach is illustrated and compared with other classic procedures using four real data sets. Keywords: Categorical and mixed data; Distances between observations; Multidimensional scaling; Discrimination; Classification rules. AMS Subject Classification: 62H30. The authors thank M. Abrahamowicz, J. C. Gower and M. Greenacre for their helpful comments, and W. J. Krzanowski for providing us with a data set and his quadratic location model program. Work supported in part by CGYCIT grant PB930784. Authors' address: Departam...
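The abstract does not give the proximity function itself. One formulation often used for distance-based discrimination, assumed here, takes the mean squared distance from the individual to the sample and subtracts the sample's geometric variability (half the mean squared distance between sample members). The individual is then assigned to the population with the smallest proximity. All names below are hypothetical, and `sq_dist` stands for any user-supplied squared distance.

```python
def geometric_variability(sample, sq_dist):
    """Half the mean squared distance over all ordered pairs of the sample."""
    n = len(sample)
    return sum(sq_dist(a, b) for a in sample for b in sample) / (2 * n * n)


def proximity(x, sample, sq_dist):
    """Assumed proximity: mean squared distance to the sample minus its variability."""
    n = len(sample)
    mean_sq = sum(sq_dist(x, s) for s in sample) / n
    return mean_sq - geometric_variability(sample, sq_dist)


def classify(x, populations, sq_dist):
    """Distance-based rule: pick the population with the smallest proximity."""
    return min(populations, key=lambda name: proximity(x, populations[name], sq_dist))


def sq_euclidean(a, b):
    """Squared Euclidean distance, used only as an example choice of sq_dist."""
    return sum((u - v) ** 2 for u, v in zip(a, b))
```

Under this assumed form and squared Euclidean distance, the proximity equals the squared distance from the individual to the sample mean, so the rule becomes nearest-centroid classification. Swapping in a distance suited to categorical or mixed variables changes only the `sq_dist` argument.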
When can history be our guide? The pitfalls of counterfactual inference
 International Studies Quarterly
, 2007
Cited by 12 (5 self)
Inferences about counterfactuals are essential for prediction, answering "what if" questions, and estimating causal effects. However, when the counterfactuals posed are too far from the data at hand, conclusions drawn from well-specified statistical analyses become based on speculation and convenient but indefensible model assumptions rather than empirical evidence. Unfortunately, standard statistical approaches assume the veracity of the model rather than revealing the degree of model dependence, so this problem can be hard to detect. We develop easy-to-apply methods to evaluate counterfactuals that do not require sensitivity testing over specified classes of models. If an analysis fails the tests we offer, then we know that substantive results are sensitive to at least some modeling choices that are not based on empirical evidence. We use these methods to evaluate the extensive scholarly literatures on the effects of changes in the degree of democracy in a country (on any dependent variable) and separate analyses of the effects of UN peacebuilding efforts. We find evidence that many scholars are inadvertently drawing conclusions based more on modeling hypotheses than on evidence in the data. For some research questions, history contains insufficient information to be our guide. Free software that accompanies this paper implements all our suggestions. Social science is about making inferences: using facts we know to learn about facts we do not know. Some inferential targets (the facts we do not know) are factual, which means that they exist even if we do not know them. In early 2003, Saddam Hussein was obviously either alive or dead, but the world did not know which it was
Rapid assessment of the adulteration of virgin olive oils by other seed oils using pyrolysis mass spectrometry and artificial neural networks
 J Sci Food Agric 63
, 1993
Cited by 10 (6 self)
Abstract: Curie-point pyrolysis mass spectra were obtained from a variety of extra-virgin olive oils, prepared from various cultivars using several mechanical treatments. Some of the oils were adulterated (according to a double-blind protocol) with different amounts of seed oils (50-500 ml of soya, sunflower, peanut, corn or rectified olive oils per litre of mixed oil). Canonical variates analysis indicated that the major source of variation between the pyrolysis mass spectra was due to differences between the cultivars, rather than whether the oils had been adulterated. However, artificial neural networks could be trained (using the backpropagation algorithm) successfully to distinguish virgin oils from those which had been adulterated. Key words: Curie-point pyrolysis mass spectrometry, artificial neural networks, chemometrics, adulteration, virgin olive oil. INTRODUCTION: Virgin olive oil is the oil extracted by purely mechanical ... enjoyed without refining (Kiritsakis and Min 1989). Olive oil therefore commands a higher price than do other vegetable oils, and these and other properties mean ...
Clustering in an object-oriented environment
 Journal of Statistical Software
, 1996
Cited by 10 (0 self)
This paper describes the incorporation of seven stand-alone clustering programs into S-PLUS, where they can now be used in a much more flexible way. The original Fortran programs carried out new cluster analysis algorithms introduced in the book of Kaufman and Rousseeuw (1990). These clustering methods were designed to be robust and to accept dissimilarity data as well as objects-by-variables data. Moreover, they each provide a graphical display and a quality index reflecting the strength of the clustering. The powerful graphics of S-PLUS made it possible to improve these graphical representations considerably. The integration of the clustering algorithms was performed according to the object-oriented principle supported by S-PLUS. The new functions have a uniform interface, and are compatible with existing S-PLUS functions. We will describe the basic idea and the use of each clustering method, together with its graphical features. Each function is briefly illustrated with an example.
An evaluation of the use of Multidimensional Scaling for understanding brain connectivity
 Philosophical Transactions of the Royal Society, Series B
, 1994
Cited by 9 (2 self)
A large amount of data is now available about the pattern of connections between brain regions. Computational methods are increasingly relevant for uncovering structure in such datasets. There has been recent interest in the use of Nonmetric Multidimensional Scaling (NMDS) for such analysis (Young, 1992, 1993; Scannell & Young, 1993). NMDS produces a spatial representation of the "dissimilarities" between a number of entities. Normally, it is applied to data matrices containing a large number of levels of dissimilarity, whereas for connectivity data there is a very small number. We address the suitability of NMDS for this case. Systematic numerical studies are presented to evaluate the ability of this method to reconstruct known geometrical configurations from dissimilarity data possessing few levels. In this case there is a strong bias for NMDS to produce annular configurations, whether or not such structure exists in the original data. Using a connectivity dataset derived from the pr...