Results 1 - 10 of 177
Survey of clustering data mining techniques
, 2002
Abstract

Cited by 400 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Information-Theoretic Co-Clustering
 In KDD
, 2003
Abstract

Cited by 346 (12 self)
Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory: the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters.
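As a rough illustration of the objective described in this abstract (not the authors' algorithm), the following Python sketch treats a small contingency table as a joint distribution and measures how much mutual information a given co-clustering retains. The function names, the toy table, and the label vectors are assumptions made for illustration only.

```python
import numpy as np

def mutual_information(table):
    """Mutual information (in bits) of a contingency table read as a joint distribution."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    px = p.sum(axis=1, keepdims=True)   # row marginal
    py = p.sum(axis=0, keepdims=True)   # column marginal
    mask = p > 0                        # skip empty cells, since 0 * log 0 = 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (px @ py)[mask])))

def coclustered_table(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the table's mass inside each (row cluster, column cluster) block."""
    p = np.asarray(table, dtype=float)
    rows = np.eye(n_row_clusters)[np.asarray(row_labels)]   # one-hot row memberships
    cols = np.eye(n_col_clusters)[np.asarray(col_labels)]   # one-hot column memberships
    return rows.T @ p @ cols

# Hypothetical 4x4 co-occurrence counts with two obvious blocks.
counts = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]])

good = coclustered_table(counts, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = coclustered_table(counts, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)

print(mutual_information(counts), mutual_information(good), mutual_information(bad))
```

In this toy case the block-aligned co-clustering keeps all of the table's mutual information, while the misaligned one keeps none, which is the quantity the abstract proposes to maximize under fixed numbers of row and column clusters.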
Data Clustering: 50 Years Beyond K-Means
, 2008
Abstract

Cited by 274 (6 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.
Spectral Relaxation for K-means Clustering
, 2001
Abstract

Cited by 197 (27 self)
The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descent method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses globally optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vector can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a byproduct we also derive a lower bound for the minimum of the sum-of-squares cost function.
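The lower bound mentioned in this abstract can be checked numerically. The Python sketch below is a minimal illustration and not the authors' code. It assumes the data vectors are the columns of a matrix A, uses a plain Lloyd-style iteration as the coordinate descent step, and computes the bound as the trace of the Gram matrix minus the sum of its k largest eigenvalues. All function names and the random test data are assumptions; the pivoted QR assignment step is not shown.

```python
import numpy as np

def sum_of_squares(A, labels):
    """K-means cost: squared distance of each column of A to its cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        cols = A[:, labels == c]
        total += np.sum((cols - cols.mean(axis=1, keepdims=True)) ** 2)
    return float(total)

def spectral_lower_bound(A, k):
    """trace(A^T A) minus the k largest eigenvalues of the Gram matrix A^T A."""
    gram = A.T @ A
    eigvals = np.linalg.eigvalsh(gram)          # returned in ascending order
    return float(np.trace(gram) - eigvals[-k:].sum())

def lloyd(A, k, iters=50, seed=0):
    """Alternate between assigning columns to centroids and updating centroids."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    centers = A[:, rng.choice(n, size=k, replace=False)]   # fancy indexing copies
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        # Squared distances between every column and every centroid, shape (n, k).
        d2 = ((A[:, :, None] - centers[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):             # leave an empty cluster's centroid alone
                centers[:, c] = A[:, labels == c].mean(axis=1)
    return labels

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 60))                   # 60 hypothetical data vectors in 10 dimensions
labels = lloyd(A, k=3)
print(sum_of_squares(A, labels), spectral_lower_bound(A, 3))
```

Any partition into at most k clusters should have a cost at or above the spectral bound, so the gap between the two printed numbers gives a rough sense of how far a local minimum might be from the relaxed optimum.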
Learning human action via information maximization
 In CVPR
, 2008
Abstract

Cited by 149 (13 self)
In this paper, we present a novel approach for automatically learning a compact and yet discriminative appearance-based human action model. A video sequence is represented by a bag of spatio-temporal features called video-words by quantizing the extracted 3D interest points (cuboids) from the videos. Our proposed approach is able to automatically discover the optimal number of video-word clusters by utilizing Maximization of Mutual Information (MMI). Unlike the k-means algorithm, which is typically used to cluster spatio-temporal cuboids into video-words based on their appearance similarity, MMI clustering further groups the video-words that are highly correlated to some group of actions. To capture the structural information of the learnt optimal video-word clusters, we explore the correlation of the compact video-word clusters. We use the modified correlogram, which is not only translation and rotation invariant, but also somewhat scale invariant. We extensively test our proposed approach on two publicly available challenging datasets: the KTH dataset and the IXMAS multi-view dataset. To the best of our knowledge, we are the first to try the bag of video-words related approach on the multi-view dataset. We have obtained very impressive results on both datasets.
Disambiguating Web Appearances of People in a Social Network
 Proceedings of the 2005 World Wide Web Conference
Abstract

Cited by 126 (2 self)
Say you are looking for information about a particular person. A search engine returns many pages for that person’s name, but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? This paper presents two unsupervised frameworks for solving this problem: one based on the link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering (A/CDC), an application of a recently introduced multi-way distributional clustering method. To evaluate our methods, we collected and hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on 12 personal names appearing together in someone’s email folder. On this dataset our methods outperform traditional agglomerative clustering by more than 20%, achieving over 80% F-measure.
δ-Clusters: Capturing Subspace Correlation in a Large Data Set
 Proc. of 18th IEEE Intern. Conf. on Data Engineering
, 2002
Abstract

Cited by 110 (4 self)
Clustering has been an active research area of great practical importance in recent years. Most previous clustering models have focused on grouping objects with similar values on a (sub)set of dimensions (e.g., subspace cluster) and assumed that every object has an associated value on every dimension (e.g., bicluster). These existing cluster models may not always be adequate in capturing coherence exhibited among objects. Strong coherence may still exist among a set of objects (on a subset of attributes) even if they take quite different values on each attribute and the attribute values are not fully specified. This is very common in many applications including bioinformatics analysis as well as collaborative filtering analysis, where the data may be incomplete and subject to biases. In bioinformatics, a bicluster model has recently been proposed to capture coherence among a subset of the attributes. Here, we introduce a more general model, referred to as the δ-cluster model, to capture coherence exhibited by a subset of objects on a subset of attributes, while allowing absent attribute values. A move-based algorithm (FLOC) is devised to efficiently produce a near-optimal clustering. The δ-cluster model takes the bicluster model as a special case, where the FLOC algorithm performs far better than the bicluster algorithm. We demonstrate the correctness and efficiency of the δ-cluster model and the FLOC algorithm on a number of real and synthetic data sets.
Solving Cluster Ensemble Problems by Bipartite Graph Partitioning
 In Proceedings of the International Conference on Machine Learning
, 2004
Abstract

Cited by 109 (3 self)
A critical problem in cluster ensemble research is how to combine multiple clusterings to yield a final superior clustering result. Leveraging advanced graph partitioning techniques, we solve this problem by reducing it to a graph partitioning problem. We introduce a new reduction method that constructs a bipartite graph from a given cluster ensemble. The resulting graph models both instances and clusters of the ensemble simultaneously as vertices in the graph. Our approach retains all of the information provided by a given ensemble, allowing the similarity among instances and the similarity among clusters to be considered collectively in forming the final clustering. Further, the resulting graph partitioning problem can be solved efficiently. We empirically evaluate the proposed approach against two commonly used graph formulations and show that it is more robust and achieves comparable or better performance in comparison to its competitors.
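The reduction described in this abstract can be sketched in a few lines of Python. The sketch builds a bipartite adjacency matrix whose rows are instances and whose columns are all clusters from all base clusterings, then splits the graph in two. The abstract does not say which graph partitioner is used, so a simple two-way spectral cut (sign of the second singular vector of the degree-normalized adjacency matrix) stands in here as an assumption. The function names and the toy ensemble are also hypothetical.

```python
import numpy as np

def bipartite_adjacency(clusterings):
    """Rows are instances; columns are every cluster of every base clustering."""
    blocks = []
    for labels in clusterings:
        labels = np.asarray(labels)
        ids = np.unique(labels)
        # Entry is 1 when the instance belongs to that cluster, else 0.
        blocks.append((labels[:, None] == ids[None, :]).astype(float))
    return np.hstack(blocks)

def two_way_spectral_cut(adjacency):
    """Split the instance vertices in two using the second left singular vector."""
    row_deg = adjacency.sum(axis=1)
    col_deg = adjacency.sum(axis=0)
    normalized = adjacency / np.sqrt(row_deg)[:, None] / np.sqrt(col_deg)[None, :]
    u, _, _ = np.linalg.svd(normalized, full_matrices=False)
    second = u[:, 1] / np.sqrt(row_deg)
    return (second > 0).astype(int)

# Three hypothetical base clusterings of six instances; the third disagrees on instance 2.
ensemble = [[0, 0, 0, 1, 1, 1],
            [0, 0, 0, 1, 1, 1],
            [0, 0, 1, 1, 1, 1]]

graph = bipartite_adjacency(ensemble)
consensus = two_way_spectral_cut(graph)
print(graph.shape, consensus)
```

Because each instance keeps an edge to every cluster it was assigned to, no information from the ensemble is discarded before partitioning. Which of the two groups receives label 0 is arbitrary, since a singular vector is only defined up to sign.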
Multivariate information bottleneck
, 2001
Abstract

Cited by 96 (13 self)
The Information bottleneck method is an unsupervised non-parametric data organization technique. Given a joint distribution P(A, B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative about B. The information bottleneck has already been applied to document classification, gene expression, neural code, and spectral analysis. In this paper, we introduce a general principled framework for multivariate extensions of the information bottleneck method. This allows us to consider multiple systems of data partitions that are interrelated. Our approach utilizes Bayesian networks for specifying the systems of clusters and what information each captures. We show that this construction provides insight about bottleneck variations and enables us to characterize solutions of these variations. We also present a general framework for iterative algorithms for constructing solutions, and apply it to several examples.
The power of word clusters for text classification
 In 23rd European Colloquium on Information Retrieval Research
, 2001
Abstract

Cited by 79 (7 self)
The recently introduced Information Bottleneck method [21] provides an information theoretic framework for extracting features of one variable that are relevant for the values of another variable. Several previous works already suggested applying this method for document clustering, gene expression data analysis, spectral analysis and more. In this work we present a novel implementation of this method for supervised text classification. Specifically, we apply the information bottleneck method to find word-clusters that preserve the information about document categories and use these clusters as features for classification. Previous work [1] used a similar clustering procedure to show that word-clusters can significantly reduce the feature space dimensionality, with only a minor change in classification accuracy. In this work we present similar results and go further to show that when the training sample is small, word-clusters can yield significant improvement in classification accuracy over the performance using the words directly.
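To make the feature-reduction step concrete, the Python sketch below shows how word counts might be collapsed into word-cluster counts once a word-to-cluster assignment is available. It does not implement the information bottleneck procedure that the paper uses to find the clusters. The assignment vector, the toy counts, and the function name are hypothetical inputs chosen for illustration.

```python
import numpy as np

def cluster_features(doc_term_counts, word_to_cluster, n_clusters):
    """Replace per-word counts with per-cluster counts by summing within each cluster."""
    counts = np.asarray(doc_term_counts, dtype=float)
    assign = np.asarray(word_to_cluster)
    n_words = counts.shape[1]
    membership = np.zeros((n_words, n_clusters))
    membership[np.arange(n_words), assign] = 1.0   # one-hot word-to-cluster matrix
    return counts @ membership

# Two hypothetical documents over a four-word vocabulary.
doc_term = [[2, 1, 0, 0],
            [0, 0, 3, 1]]
# Assumed assignment: words 0 and 1 share cluster 0, words 2 and 3 share cluster 1.
features = cluster_features(doc_term, [0, 0, 1, 1], n_clusters=2)
print(features)
```

The resulting matrix has one column per word-cluster instead of one per word, which is the reduced feature space a downstream classifier would be trained on.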