Results 1-10 of 551
Data Clustering: A Review
 ACM COMPUTING SURVEYS
, 1999
Abstract

Cited by 1892 (14 self)
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval.
Multiobjective Evolutionary Algorithms: Analyzing the State-of-the-Art
, 2000
Abstract

Cited by 425 (7 self)
Solving optimization problems with multiple (often conflicting) objectives is, generally, a very difficult goal. Evolutionary algorithms (EAs) were initially extended and applied during the mid-eighties in an attempt to stochastically solve problems of this generic class. During the past decade, a variety of multiobjective EA (MOEA) techniques have been proposed and applied to many scientific and engineering applications. Our discussion's intent is to rigorously define multiobjective optimization problems and certain related concepts, present an MOEA classification scheme, and evaluate the variety of contemporary MOEAs. Current MOEA theoretical developments are evaluated; specific topics addressed include fitness functions, Pareto ranking, niching, fitness sharing, mating restriction, and secondary populations. Since the development and application of MOEAs is a dynamic and rapidly growing activity, we focus on key analytical insights based upon critical MOEA evaluation of c...
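Among the topics listed in this abstract, Pareto ranking is easy to make concrete: individuals are ranked by which non-dominated front they fall into. A minimal sketch for a minimization problem (the function names are my own illustration, not code from the survey; real MOEAs combine such a sort with niching, sharing, and selection):

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_ranks(points):
    """Assign each point a rank: 1 = non-dominated front, 2 = next front, etc."""
    remaining = dict(enumerate(points))
    ranks = {}
    rank = 1
    while remaining:
        # current front: points not dominated by anything still remaining
        front = [i for i in remaining
                 if not any(dominates(remaining[j], remaining[i])
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
            del remaining[i]
        rank += 1
    return [ranks[i] for i in range(len(points))]
```

For example, with objective vectors `[(1, 2), (2, 1), (3, 3), (2, 2)]`, the first two points form the non-dominated front.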
Latent Space Approaches to Social Network Analysis
 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2001
Abstract

Cited by 315 (20 self)
Network models are widely used to represent relational information among interacting units. In studies of social networks, recent emphasis has been placed on random graph models where the nodes usually represent individual social actors and the edges represent the presence of a specified relation between actors. We develop a class of models where the probability of a relation between actors depends on the positions of individuals in an unobserved "social space." Inference for the social space is developed within a maximum likelihood and Bayesian framework, and Markov chain Monte Carlo procedures are proposed for making inference on latent positions and the effects of observed covariates. We present analyses of three standard datasets from the social networks literature, and compare the method to an alternative stochastic blockmodeling approach. In addition to improving upon model fit, our method provides a visual and interpretable model-based spatial representation of social relationships, and improves upon existing methods by allowing the statistical uncertainty in the social space to be quantified and graphically represented.
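The core of the latent distance variant of this model is that the log-odds of a tie decrease with distance in the latent space. A minimal sketch of computing tie probabilities from given positions (the intercept `alpha` and function name are illustrative; the paper estimates the positions by maximum likelihood or MCMC rather than assuming them known):

```python
import numpy as np

def edge_probs(Z, alpha):
    """Latent distance model: logit P(edge i~j) = alpha - ||z_i - z_j||.
    Z is an (n, d) array of latent positions; returns an (n, n) probability matrix."""
    diff = Z[:, None, :] - Z[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))          # pairwise Euclidean distances
    P = 1.0 / (1.0 + np.exp(-(alpha - d)))    # logistic link
    np.fill_diagonal(P, 0.0)                  # no self-ties
    return P
```

Actors that coincide in the latent space get tie probability `sigmoid(alpha)`, and probabilities decay monotonically with distance.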
Data Clustering: 50 Years Beyond K-Means
, 2008
Abstract

Cited by 274 (6 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
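For reference, the K-means algorithm this survey centers on can be sketched in a few lines. This is the generic Lloyd-style alternation from textbooks, not code from the paper:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # init from random points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assignment step: nearest centroid for each point
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # update step: centroids move to cluster means (keep empty clusters in place)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```

On well-separated data the alternation converges in a handful of iterations; the ill-posedness the abstract mentions shows up as sensitivity to initialization and to the choice of k.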
Visualizing data using t-SNE
 Cost-sensitive Machine Learning for Information Retrieval
Abstract

Cited by 273 (11 self)
We present a new technique called "t-SNE" that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other nonparametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets.
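The ingredient that distinguishes t-SNE from plain SNE is the heavy-tailed Student-t kernel used for the low-dimensional similarities, which is what relieves the crowding the abstract describes. A minimal sketch of just that piece (the full method also builds Gaussian affinities in the high-dimensional space and minimizes a KL divergence between the two by gradient descent):

```python
import numpy as np

def student_t_affinities(Y):
    """Low-dimensional t-SNE similarities: q_ij proportional to
    (1 + ||y_i - y_j||^2)^(-1), normalized over all pairs i != j.
    The heavy tails let dissimilar points sit far apart in the map."""
    diff = Y[:, None, :] - Y[None, :, :]
    num = 1.0 / (1.0 + (diff ** 2).sum(-1))   # Student-t kernel, one degree of freedom
    np.fill_diagonal(num, 0.0)                # q_ii is defined as 0
    return num / num.sum()
```

Closer map points receive larger q values, and the normalization makes Q a proper joint distribution over pairs, matching the KL objective.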
Clustering of the Self-Organizing Map
, 2000
Abstract

Cited by 269 (1 self)
The self-organizing map (SOM) is an excellent tool in the exploratory phase of data mining. It projects the input space onto prototypes of a low-dimensional regular grid that can be effectively utilized to visualize and explore properties of the data. When the number of SOM units is large, similar units need to be grouped, i.e., clustered, to facilitate quantitative analysis of the map and the data. In this paper, different approaches to clustering of the SOM are considered. In particular, the use of hierarchical agglomerative clustering and partitive clustering using k-means is investigated. The two-stage procedure, in which the SOM first produces the prototypes that are then clustered in the second stage, is found to perform well when compared with direct clustering of the data, and to reduce the computation time.
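The first stage of the two-stage procedure, vector quantization with a SOM, can be sketched as follows. This is a toy online SOM with illustrative schedule constants of my own choosing, not the authors' implementation; in the second stage the returned prototypes would themselves be clustered, hierarchically or with k-means:

```python
import numpy as np

def train_som(X, grid=(3, 3), iters=500, seed=0):
    """Minimal online SOM: pull the best-matching unit (BMU) and its grid
    neighbours toward each sample, shrinking radius and learning rate over time."""
    rng = np.random.default_rng(seed)
    coords = np.array([(i, j) for i in range(grid[0])
                       for j in range(grid[1])], dtype=float)
    W = X[rng.choice(len(X), len(coords))].astype(float)  # init from data points
    for t in range(iters):
        x = X[rng.integers(len(X))]
        bmu = ((W - x) ** 2).sum(1).argmin()
        frac = t / iters
        radius = 2.0 * (1 - frac) + 0.5   # neighbourhood radius, shrinking
        lr = 0.5 * (1 - frac) + 0.01      # learning rate, shrinking
        h = np.exp(-((coords - coords[bmu]) ** 2).sum(1) / (2 * radius ** 2))
        W += lr * h[:, None] * (x - W)    # move BMU and neighbours toward x
    return W
```

After training, each row of `W` is a prototype sitting near a dense region of the data, so clustering the (few) prototypes instead of the (many) points is what saves computation.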
Probabilistic nonlinear principal component analysis with Gaussian process latent variable models
 Journal of Machine Learning Research
, 2005
Abstract

Cited by 225 (24 self)
Summarising a high-dimensional data set with a low-dimensional embedding is a standard approach for exploring its structure. In this paper we provide an overview of some existing techniques for discovering such embeddings. We then introduce a novel probabilistic interpretation of principal component analysis (PCA) that we term dual probabilistic PCA (DPPCA). The DPPCA model has the additional advantage that the linear mappings from the embedded space can easily be non-linearised through Gaussian processes. We refer to this model as a Gaussian process latent variable model (GP-LVM). Through analysis of the GP-LVM objective function, we relate the model to popular spectral techniques such as kernel PCA and multidimensional scaling. We then review a practical algorithm for GP-LVMs in the context of large data sets and develop it to also handle discrete valued data and missing attributes. We demonstrate the model on a range of real-world and artificially generated data sets.
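The linear special case the GP-LVM generalizes is ordinary PCA: the maximum-likelihood latent positions of (dual) probabilistic PCA span the principal subspace of the data. A sketch of that linear embedding via SVD (illustrative only; the GP-LVM itself optimizes latent positions under a Gaussian process prior):

```python
import numpy as np

def pca_embed(Y, q=2):
    """Project centred data onto its top-q principal directions.
    This is the linear embedding that probabilistic PCA recovers
    (up to rotation and scale) at its maximum-likelihood solution."""
    Yc = Y - Y.mean(0)                                  # centre the data
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)   # principal directions in Vt
    return Yc @ Vt[:q].T
```

Swapping this fixed linear map for a Gaussian process over the latent-to-data mapping is, per the abstract, exactly the step from DPPCA to the GP-LVM.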
Content-based Organization and Visualization of Music Archives
, 2002
Abstract

Cited by 131 (26 self)
With Islands of Music we present a system which facilitates exploration of music libraries without requiring manual genre classification. Given pieces of music in raw audio format we estimate their perceived sound similarities based on psychoacoustic models. Subsequently, the pieces are organized on a two-dimensional map so that similar pieces are located close to each other. A visualization using a metaphor of geographic maps provides an intuitive interface where islands resemble genres or styles of music. We demonstrate the approach using a collection of 359 pieces of music.
SOM-Based Data Visualization Methods
 Intelligent Data Analysis
, 1999
Abstract

Cited by 125 (4 self)
The Self-Organizing Map (SOM) is an efficient tool for visualization of multidimensional numerical data. In this paper, an overview and categorization of both old and new methods for the visualization of the SOM is presented. The purpose is to give an idea of what kind of information can be acquired from different presentations and how the SOM can best be utilized in exploratory data visualization. Most of the presented methods can also be applied in the more general case of first making a vector quantization (e.g. k-means) and then a vector projection (e.g. Sammon's mapping).
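One classic SOM presentation in this family is the U-matrix, which displays for each map unit the average distance between its weight vector and those of its grid neighbours; high ridges indicate cluster borders on the map. A minimal sketch using 4-connected neighbours (function name and conventions are my own, not tied to the paper):

```python
import numpy as np

def u_matrix(W, grid):
    """U-matrix of a SOM: mean weight-vector distance from each unit to its
    4-connected grid neighbours. W is (rows*cols, dim); returns (rows, cols)."""
    rows, cols = grid
    W = W.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            ds = []
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    ds.append(np.linalg.norm(W[i, j] - W[ni, nj]))
            U[i, j] = np.mean(ds)
    return U
```

Units inside a homogeneous cluster get low values; a wall of high values between two regions of the map marks a cluster boundary in the data.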
Data Exploration Using SelfOrganizing Maps
 ACTA POLYTECHNICA SCANDINAVICA: MATHEMATICS, COMPUTING AND MANAGEMENT IN ENGINEERING SERIES NO. 82
, 1997
Abstract

Cited by 115 (4 self)
Finding structures in vast multidimensional data sets, be they measurement data, statistics, or textual documents, is difficult and time-consuming. Interesting, novel relations between the data items may be hidden in the data. The self-organizing map (SOM) algorithm of Kohonen can be used to aid the exploration: the structures in the data sets can be illustrated on special map displays. In this work, the methodology of using SOMs for exploratory data analysis or data mining is reviewed and developed further. The properties of the maps are compared with the properties of related methods intended for visualizing high-dimensional multivariate data sets. In a set of case studies the SOM algorithm is applied to analyzing electroencephalograms, to illustrating structures of the standard of living in the world, and to organizing full-text document collections. Measures are proposed for evaluating the quality of different types of maps in representing a given data set, and for measuring the robu...