Survey of clustering algorithms
 IEEE TRANSACTIONS ON NEURAL NETWORKS
, 2005
Cited by 230 (3 self)
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.
Graph mining: Laws, generators, and algorithms
 ACM COMPUTING SURVEYS
, 2006
Cited by 70 (7 self)
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M : N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: "How can we generate synthetic but realistic graphs?" To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
Multiway distributional clustering via pairwise interactions
 In ICML
, 2005
Cited by 51 (10 self)
We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. In this scheme, multiple clustering systems are generated aiming at maximizing an objective function that measures multiple pairwise mutual information between cluster variables. To implement this idea, we propose an algorithm that interleaves top-down clustering of some variables and bottom-up clustering of the other variables, with a local optimization correction routine. Focusing on document clustering, we present an extensive empirical study of two-way, three-way and four-way applications of our scheme using six real-world datasets including the 20 Newsgroups (20NG) and the Enron email collection. Our multi-way distributional clustering (MDC) algorithms consistently and significantly outperform previous state-of-the-art information-theoretic clustering algorithms.
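The objective described above is built from mutual information between pairs of cluster variables. As a minimal sketch of one such pairwise term (not the MDC algorithm itself), the following computes the mutual information between row clusters and column clusters of a co-occurrence count matrix. The function name, the argument names, and the toy data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def clustered_mutual_information(counts, row_cluster, col_cluster):
    """Mutual information (in nats) between row clusters and column clusters.

    counts[i][j] is a co-occurrence count, e.g. how often word j appears in
    document i. row_cluster[i] and col_cluster[j] are cluster labels
    (hypothetical inputs, chosen for this sketch).
    """
    joint = defaultdict(float)
    total = 0.0
    # Aggregate raw counts into a cluster-by-cluster contingency table.
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            joint[(row_cluster[i], col_cluster[j])] += c
            total += c

    # Marginal probabilities of each row cluster and each column cluster.
    p_row = defaultdict(float)
    p_col = defaultdict(float)
    for (r, c), n in joint.items():
        p_row[r] += n / total
        p_col[c] += n / total

    # I = sum over cells of p(r, c) * log(p(r, c) / (p(r) * p(c))).
    mi = 0.0
    for (r, c), n in joint.items():
        if n > 0:
            p = n / total
            mi += p * math.log(p / (p_row[r] * p_col[c]))
    return mi


# Toy block-structured data: two document groups, two word groups.
counts = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]
good = clustered_mutual_information(counts, [0, 0, 1, 1], [0, 0, 1, 1])
bad = clustered_mutual_information(counts, [0, 1, 0, 1], [0, 1, 0, 1])
```

A clustering that follows the block structure retains more mutual information (`good`) than one that cuts across it (`bad`), which is the quantity an information-theoretic clustering scheme tries to keep high.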
Nonsmooth nonnegative matrix factorization (nsNMF)
 IEEE transactions on
, 2006
Cited by 30 (2 self)
We propose a novel nonnegative matrix factorization model that aims at finding localized, part-based representations of nonnegative multivariate data items. Unlike the classical nonnegative matrix factorization (NMF) technique, this new model, denoted “nonsmooth nonnegative matrix factorization” (nsNMF), corresponds to the optimization of an unambiguous cost function designed to explicitly represent sparseness, in the form of nonsmoothness, which is controlled by a single parameter. In general, this method produces a set of basis and encoding vectors that are not only capable of representing the original data but also extract highly localized patterns, which generally lend themselves to improved interpretability. The properties of this new method are illustrated with several data sets. Comparisons to previously published methods show that the new nsNMF method has some advantages in keeping faithfulness to the data, in achieving a high degree of sparseness for both the estimated basis and the encoding vectors, and in better interpretability of the factors. Index Terms: nonnegative matrix factorization, constrained optimization, data mining, mining methods and algorithms, pattern analysis, feature extraction or construction, sparse, structured, and very large systems.
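The abstract does not say how the single parameter enters the model. A minimal sketch, assuming the commonly cited formulation V ≈ W S H with a smoothing matrix S = (1 − θ)I + (θ/q)11ᵀ, where q is the factorization rank and θ lies in [0, 1]: the code below only builds S and applies it to a vector, and does not implement the factorization. The function names are hypothetical.

```python
def smoothing_matrix(q, theta):
    """Build the q x q matrix S = (1 - theta) * I + (theta / q) * ones.

    Assumed form of the nsNMF smoothing matrix; theta = 0 gives the identity
    (plain NMF), theta = 1 replaces every entry of a vector by its mean.
    """
    return [
        [(1.0 - theta) * (1.0 if i == j else 0.0) + theta / q for j in range(q)]
        for i in range(q)
    ]


def apply_matrix(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


sparse_vector = [3.0, 0.0, 0.0]
unchanged = apply_matrix(smoothing_matrix(3, 0.0), sparse_vector)
half_smoothed = apply_matrix(smoothing_matrix(3, 0.5), sparse_vector)
fully_smoothed = apply_matrix(smoothing_matrix(3, 1.0), sparse_vector)
```

Under this assumption, multiplying by S makes a vector smoother while preserving its sum, so a good fit to the data forces the factors on either side of S to compensate by becoming sparser.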
Model-based overlapping clustering
 In KDD
, 2005
Cited by 29 (6 self)
While the vast majority of clustering algorithms are partitional, many real-world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of the 20 Newsgroups and EachMovie datasets.
DisCo: Distributed co-clustering with MapReduce
 ICDM
, 2008
Cited by 28 (1 self)
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from preprocessing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, MapReduce has been widely embraced by both academia and industry. In database terms, MapReduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying MapReduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bioinformatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data preprocessing and co-clustering. We develop DisCo using Hadoop, an open source MapReduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
Multiway clustering on relation graphs
 In Proc. of the 7th SIAM Intl. Conf. on Data Mining
, 2006
Cited by 26 (3 self)
A number of real-world domains such as social networks and e-commerce involve heterogeneous data that describes relations between multiple classes of entities. Understanding the natural structure of this type of heterogeneous relational data is essential both for exploratory analysis and for performing various predictive modeling tasks. In this paper, we propose a principled multi-way clustering framework for relational data, wherein different types of entities are simultaneously clustered based not only on their intrinsic attribute values, but also on the multiple relations between the entities. To achieve this, we introduce a relation graph model that describes all the known relations between the different entity classes, in which each relation between a given set of entity classes is represented in the form of a multi-modal tensor over an appropriate domain. Our multi-way clustering formulation is driven by the objective of capturing the maximal “information” in the original relation graph, i.e., accurately approximating the set of tensors corresponding to the various relations. This formulation is applicable to all Bregman divergences (a broad family of loss functions that includes squared Euclidean distance and KL-divergence), and also permits analysis of mixed data types using convex combinations of appropriate Bregman loss functions. Furthermore, we present a large family of structurally different multi-way clustering schemes that preserve various linear summary statistics of the original data. We accomplish the above generalizations by extending a recently proposed key theoretical result, namely the minimum Bregman information principle [1], to the relation graph setting. We also describe an efficient multi-way clustering algorithm based on alternate minimization that generalizes a number of other recently proposed clustering methods. Empirical results on datasets obtained from real-world domains (e.g., movie recommendations, newsgroup articles) demonstrate the generality and efficacy of our framework.
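To make the remark about Bregman divergences concrete, here is a minimal sketch of the standard definition d_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩, showing that the squared Euclidean distance and the KL-divergence arise from two choices of the convex function φ. This is not taken from the paper; all function names are illustrative.

```python
import math


def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


def squared_norm(v):
    return sum(a * a for a in v)


def squared_norm_grad(v):
    return [2.0 * a for a in v]


def negative_entropy(v):
    # Requires strictly positive entries.
    return sum(a * math.log(a) for a in v)


def negative_entropy_grad(v):
    return [math.log(a) + 1.0 for a in v]


# phi(x) = ||x||^2 yields the squared Euclidean distance.
sq_dist = bregman(squared_norm, squared_norm_grad, [1.0, 2.0], [3.0, 5.0])

# phi(x) = sum x log x yields the KL-divergence when x and y are
# probability vectors (both sum to 1).
p = [0.5, 0.5]
q = [0.9, 0.1]
kl = bregman(negative_entropy, negative_entropy_grad, p, q)
kl_direct = sum(a * math.log(a / b) for a, b in zip(p, q))
```

Because both losses share this form, an algorithm written against a generic `bregman` interface can switch loss functions by swapping `phi` and its gradient.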
The discrete basis problem
, 2005
Cited by 25 (9 self)
We consider the Discrete Basis Problem, which can be described as follows: given a collection of Boolean vectors, find a collection of k Boolean basis vectors such that the original vectors can be represented using disjunctions of these basis vectors. We show that the decision version of this problem is NP-complete and that the optimization version cannot be approximated within any finite ratio. We also study two variations of this problem, where the Boolean basis vectors must be mutually orthogonal. We show that the other variation is closely related with the well-known Metric k-median Problem in Boolean space. To solve these problems, two algorithms will be presented. One is designed for the variations mentioned above, and it is solely based on solving the k-median problem, while another is a heuristic intended to solve the general Discrete Basis Problem. We will also study the results of extensive experiments made with these two algorithms with both synthetic and real-world data. The results are twofold: with the synthetic data, the algorithms did rather well, but with the real-world data the results were not as good.
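The representation by disjunctions can be illustrated with a small sketch. It does not implement either algorithm from the paper; it only shows how a candidate solution would be evaluated. The names `usage`, `boolean_product`, and `reconstruction_error` are hypothetical: `usage[i][k]` marks whether basis vector k is used for data vector i.

```python
def boolean_product(usage, basis):
    """Row i of the result is the OR of the basis vectors selected by usage[i]."""
    n_cols = len(basis[0])
    return [
        [int(any(u and b[j] for u, b in zip(row, basis))) for j in range(n_cols)]
        for row in usage
    ]


def reconstruction_error(data, usage, basis):
    """Number of entries where the Boolean reconstruction differs from the data."""
    approx = boolean_product(usage, basis)
    return sum(
        d != a
        for data_row, approx_row in zip(data, approx)
        for d, a in zip(data_row, approx_row)
    )


# Two overlapping basis vectors (k = 2) over three attributes.
basis = [
    [1, 1, 0],
    [0, 1, 1],
]
usage = [
    [1, 0],  # first basis vector only
    [0, 1],  # second basis vector only
    [1, 1],  # disjunction of both: the shared attribute stays 1, not 2
]
data = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
exact_error = reconstruction_error(data, usage, basis)

noisy_data = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],  # one flipped bit that the basis cannot express
]
noisy_error = reconstruction_error(noisy_data, usage, basis)
```

The third row shows what distinguishes the Boolean setting from ordinary matrix factorization: combining overlapping basis vectors saturates at 1 instead of adding up.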
triCluster: An Effective Algorithm for Mining Coherent Clusters in 3D Microarray Data
 In Proc. of the 2005 ACM SIGMOD international conference on Management of data
, 2005
Cited by 23 (2 self)
In this paper we introduce a novel algorithm called triCluster, for mining coherent clusters in three-dimensional (3D) gene expression datasets. triCluster can mine arbitrarily positioned and overlapping clusters, and depending on different parameter values, it can mine different types of clusters, including those with constant or similar values along each dimension, as well as scaling and shifting expression patterns. triCluster relies on a graph-based approach to mine all valid clusters. For each time slice, i.e., a gene-sample matrix, it constructs the range multigraph, a compact representation of all similar value ranges between any two sample columns. It then searches for constrained maximal cliques in this multigraph to yield the set of biclusters for this time slice. Then triCluster constructs another graph using the biclusters (as vertices) from each time slice; mining cliques from this graph yields the final set of triclusters. Optionally, triCluster merges/deletes some clusters having large overlaps. We present a useful set of metrics to evaluate the clustering quality, and we show that triCluster can find significant triclusters in the real microarray datasets.
Techniques for clustering gene expression data
 COMPUT BIOL MED
, 2007
Cited by 13 (0 self)
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and address them. As such, it provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.