Results 1–10 of 388
Co-clustering documents and words using Bipartite Spectral Graph Partitioning
, 2001
Information-Theoretic Co-Clustering
 In KDD
, 2003
Abstract

Cited by 326 (10 self)
Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory: the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters.
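The objective in this abstract can be illustrated with a short sketch (my own illustration, not the authors' code; the function names are assumptions): aggregate a contingency table into row and column clusters, then measure how much mutual information the co-clustering retains.

```python
import numpy as np

def mutual_information(P):
    """Mutual information I(X; Y) of a joint distribution P (2-D array summing to 1)."""
    px = P.sum(axis=1, keepdims=True)
    py = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log(P[nz] / (px @ py)[nz])).sum())

def coclustered_mi(T, row_labels, col_labels, k, l):
    """I(X_hat; Y_hat) after aggregating contingency table T into k row clusters
    and l column clusters; the optimal co-clustering maximizes this quantity."""
    P = T / T.sum()
    Q = np.zeros((k, l))
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            Q[r, c] += P[i, j]
    return mutual_information(Q)
```

On a block-structured table, the block-respecting co-clustering preserves the full mutual information, while a scrambled one destroys it.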
A Probabilistic Framework for Semi-Supervised Clustering
, 2004
Abstract

Cited by 259 (14 self)
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints either to modify the objective function or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
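As a rough illustration of the distortion measures this abstract mentions (not taken from the paper; the helper names are mine), both squared Euclidean distance and I-divergence arise as Bregman divergences of a convex function phi:

```python
import numpy as np

def bregman(phi, grad_phi):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    def d(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)
    return d

# Squared Euclidean distance: phi(x) = ||x||^2
sq_euclidean = bregman(lambda x: np.dot(x, x), lambda y: 2 * y)

# I-divergence (generalized KL): phi(x) = sum_i x_i log x_i - x_i
# (assumes strictly positive entries, since log is taken)
i_divergence = bregman(lambda x: np.sum(x * np.log(x) - x),
                       lambda y: np.log(y))
```

Expanding the definition for phi(x) = ||x||^2 recovers ||x - y||^2 exactly, which is why k-means-style prototype updates generalize cleanly to this whole family.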
Evaluation of Hierarchical Clustering Algorithms for Document Datasets
 Data Mining and Knowledge Discovery
, 2002
Abstract

Cited by 240 (6 self)
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.
Criterion Functions for Document Clustering: Experiments and Analysis
, 2002
Abstract

Cited by 195 (12 self)
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can help users effectively navigate, summarize, and organize this information, with the ultimate goal of helping them find what they are looking for. Fast and high-quality document clustering algorithms play an important role toward this goal, as they have been shown both to provide an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters and to greatly improve retrieval performance, whether via cluster-driven dimensionality reduction, term weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications have led to the development of a number of novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms with relatively low computational requirements are those that treat the clustering problem as an optimization process seeking to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
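One concrete example of such a criterion function (my own sketch of the cosine-based "I2"-style criterion common in this literature, not code from the paper) sums, over clusters, the norm of each cluster's composite document vector; a partitional algorithm then searches for the k-way partition maximizing it:

```python
import numpy as np

def i2_criterion(X, labels, k):
    """I2 = sum_r ||D_r||, where D_r is the sum of the unit-normalized
    document vectors assigned to cluster r. Tight, coherent clusters give
    composite vectors of large norm, so maximizing I2 favors them."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
    labels = np.asarray(labels)
    return sum(np.linalg.norm(X[labels == r].sum(axis=0)) for r in range(k))
```

On two perfectly separated groups of documents, the correct partition scores strictly higher than any partition that mixes them.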
Active Semi-Supervision for Pairwise Constrained Clustering
 Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)
Abstract

Cited by 129 (9 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets and can handle very high-dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
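A minimal sketch of a pairwise-constrained assignment step in the style this abstract describes (my own PCKMeans-style illustration; the penalty weight, greedy ordering, and function names are assumptions, not the paper's exact algorithm): each point pays its squared distance to a centroid plus a penalty for every must-link or cannot-link constraint the assignment would violate.

```python
import numpy as np

def constrained_assign(X, centroids, must, cannot, labels, i, w=10.0):
    """Return the cluster index for point i that minimizes squared distance
    plus a penalty w per violated must-link / cannot-link constraint, given
    the current (possibly None) labels of the other points."""
    costs = ((centroids - X[i]) ** 2).sum(axis=1)
    for h in range(len(costs)):
        for (a, b) in must:
            if i in (a, b):
                other = b if a == i else a
                if labels[other] is not None and labels[other] != h:
                    costs[h] += w  # must-link partner sits elsewhere
        for (a, b) in cannot:
            if i in (a, b):
                other = b if a == i else a
                if labels[other] == h:
                    costs[h] += w  # cannot-link partner sits here
    return int(np.argmin(costs))
```

With a large enough penalty, a cannot-link constraint can push a point away from its geometrically nearest centroid, which is exactly how a handful of queried constraints steers the clustering.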
A Data-Clustering Algorithm On Distributed Memory Multiprocessors
 In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence
, 2000
Abstract

Cited by 127 (1 self)
To cluster the increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message-passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scale-up of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scale-up in the size of the data set and in the number of clusters desired. For a 2-gigabyte test data set, our implementation drives the 16-node SP2 at more than 1.8 gigaflops. Keywords: k-means, data mining, massive data sets, message-passing, text mining.
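The data-parallel structure this abstract describes can be sketched as follows (a single-process simulation of the message-passing pattern, not the authors' SP2 code): each "process" computes local cluster sums and counts over its partition, and a global reduction, the analogue of an MPI all-reduce, yields the new centroids.

```python
import numpy as np

def parallel_kmeans_step(chunks, centroids):
    """One k-means iteration in the message-passing style: each chunk plays
    the role of one process's data partition; summing the per-chunk partial
    results stands in for the global reduction across processes."""
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for X in chunks:
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)       # local assignment step
        for h in range(k):
            sums[h] += X[labels == h].sum(axis=0)
            counts[h] += (labels == h).sum()
    return sums / np.maximum(counts, 1)[:, None]  # new centroids after reduction
```

Because only the k sums and counts cross process boundaries, communication is independent of the number of points, which is the source of the near-linear speedups the abstract reports.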
Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models
 IEEE Trans. Pattern Anal. Machine Intell
, 2009
Abstract

Cited by 116 (16 self)
We propose a novel unsupervised learning framework to model activities and interactions in crowded and complicated scenes. Under our framework, hierarchical Bayesian models are used to connect three elements in visual surveillance: low-level visual features, simple "atomic" activities, and interactions. Atomic activities are modeled as distributions over low-level visual features, and multi-agent interactions are modeled as distributions over atomic activities. These models are learned in an unsupervised way. Given a long video sequence, moving pixels are clustered into different atomic activities and short video clips are clustered into different interactions. In this paper, we propose three hierarchical Bayesian models: the Latent Dirichlet Allocation (LDA) mixture model, the Hierarchical Dirichlet Processes (HDP) mixture model, and the Dual Hierarchical Dirichlet Processes (Dual-HDP) model. They advance existing topic models, such as LDA [1] and HDP [2]. Using existing LDA and HDP models directly under our framework, only moving pixels can be clustered into atomic activities; our models can cluster both moving pixels and video clips into atomic activities and into interactions. The LDA mixture model assumes that it is already known how many different types of atomic activities and interactions occur in the scene. The HDP mixture model automatically decides the number of categories of atomic activities. The Dual-HDP model automatically decides the numbers of categories of both atomic activities and interactions. Our data sets are challenging video sequences from crowded traffic scenes and train station scenes with many kinds of activities co-occurring. Without tracking and human labeling effort, our framework completes many challenging visual surveillance tasks of broad interest, such as: 1) discovering and providing a summary of typical atomic activities and interactions occurring in the scene, 2) segmenting long video sequences into ...
Orthogonal nonnegative matrix tri-factorizations for clustering
 In SIGKDD
, 2006
Abstract

Cited by 110 (21 self)
Currently, most research on nonnegative matrix factorization (NMF) focuses on the 2-factor X = FG^T factorization. We provide a systematic analysis of 3-factor X = FSG^T NMF. While unconstrained 3-factor NMF is equivalent to unconstrained 2-factor NMF, constrained 3-factor NMF brings new features to constrained 2-factor NMF. We study the orthogonality constraint because it leads to a rigorous clustering interpretation. We provide new rules for updating F, S, G and prove the convergence of these algorithms. Experiments on 5 datasets and a real-world case study are performed to show the capability of bi-orthogonal 3-factor NMF on simultaneously clustering rows and columns of the input data matrix. We provide a new approach to evaluating the quality of clustering on words using class aggregate distribution and multi-peak distribution. We also provide an overview of various NMF extensions and examine their relationships.
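For illustration only: these are the standard multiplicative updates for the unconstrained 3-factor model X ~ F S G^T (my own sketch; the orthogonality-preserving rules the paper derives add further normalization not shown here).

```python
import numpy as np

def trinmf_step(X, F, S, G, eps=1e-9):
    """One round of multiplicative updates for X ~ F S G^T; each factor is
    scaled by the ratio of the gradient's positive and negative parts, which
    keeps all entries nonnegative (eps guards against division by zero)."""
    F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
    S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
    return F, S, G

def reconstruction_error(X, F, S, G):
    """Frobenius norm of the residual X - F S G^T."""
    return np.linalg.norm(X - F @ S @ G.T)
```

In the clustering reading, rows of F indicate row (document) clusters, rows of G indicate column (word) clusters, and S links the two, which is what "simultaneously clustering rows and columns" refers to.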
Efficient Clustering Of Very Large Document Collections
, 2001
Abstract

Cited by 106 (10 self)
An invaluable portion of scientific data occurs naturally in text form. Given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. By using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. It is a contemporary challenge to efficiently preprocess and cluster very large document collections. In this paper we present a time- and memory-efficient technique for the entire clustering process, including the creation of the vector space model. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Detailed experimental results are presented; a highlight of our results is that we are able to effectively cluster a collection of 113,716 NSF award abstracts in 23 minutes (including disk I/O costs) on a single workstation with modest memory consumption.
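A minimal sketch of the sparse vector space model step (my own illustration; the paper's actual preprocessing pipeline is multi-threaded and far more elaborate): represent each document as a dict of nonzero tf-idf weights, so later similarity computations touch only the stored terms.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse, unit-normalized tf-idf vectors (dicts term -> weight),
    keeping only nonzero entries to exploit the sparsity of text data.
    Terms occurring in every document get idf = 0 and are dropped."""
    n = len(docs)
    df = Counter()          # document frequency of each term
    tfs = []
    for doc in docs:
        tf = Counter(doc.split())
        tfs.append(tf)
        df.update(tf.keys())
    vectors = []
    for tf in tfs:
        v = {t: c * math.log(n / df[t]) for t, c in tf.items() if df[t] < n}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})
    return vectors
```

With unit-normalized sparse vectors, cosine similarity reduces to a dot product over the shorter dict's keys, which is what makes spherical k-means style clustering linear in the collection size.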