• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

Efficient clustering of very large document collections. In Data mining for scientific and engineering applications (2001)

by I S Dhillon, J Fan, Y Guan
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 44
Next 10 →

Co-clustering documents and words using Bipartite Spectral Graph Partitioning

by Inderjit S. Dhillon , 2001
"... ..."
Abstract - Cited by 220 (6 self) - Add to MetaCart
Abstract not found

Information-Theoretic Co-Clustering

by Inderjit S. Dhillon, Subramanyam Mallela, Dharmendra S. Modha - In KDD , 2003
"... Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views ..."
Abstract - Cited by 185 (9 self) - Add to MetaCart
Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory -- the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters.

Survey of clustering data mining techniques

by Pavel Berkhin , 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract - Cited by 177 (0 self) - Add to MetaCart
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique

Semi-supervised Clustering by Seeding

by Sugato Basu, Arindam Banerjee, R. Mooney - In Proceedings of 19th International Conference on Machine Learning (ICML-2002 , 2002
"... Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. This paper explores the use of labeled data to generate initial seed clusters, as well as the use of constraints generated from labeled data to guide the clustering process. It intr ..."
Abstract - Cited by 98 (14 self) - Add to MetaCart
Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. This paper explores the use of labeled data to generate initial seed clusters, as well as the use of constraints generated from labeled data to guide the clustering process. It introduces two semi-supervised variants of KMeans clustering that can be viewed as instances of the EM algorithm, where labeled data provides prior information about the conditional distributions of hidden category labels. Experimental results demonstrate the advantages of these methods over standard random seeding and COP-KMeans, a previously developed semi-supervised clustering algorithm.

Kernel k-means, Spectral Clustering and Normalized Cuts

by Inderjit S. Dhillon, Yuqiang Guan, Brian Kulis - KDD '04 , 2004
"... ..."
Abstract - Cited by 77 (5 self) - Add to MetaCart
Abstract not found

Information retrieval on the Web

by Mei Kobayashi, Koichi Takeda - ACM Computing Surveys , 2000
"... In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical ..."
Abstract - Cited by 58 (0 self) - Add to MetaCart
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited

A Comparative Study of Generative Models for Document Clustering

by Shi Zhong, Joydeep Ghosh - In SIAM Int. Conf. Data Mining Workshop on Clustering High Dimensional Data and Its Applications , 2003
"... Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mi ..."
Abstract - Cited by 26 (4 self) - Add to MetaCart
Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Recently, the spherical k-means algorithm, which has desirable properties for text clustering, has been shown to be a special case of a generative model based on a mixture of von Mises-Fisher (vMF) distributions. This paper compares these three probabilistic models for text clustering, both theoretically and empirically, using a general model-based clustering framework. For each model, we investigate three strategies for assigning documents to models: maximum likelihood (k-means) assignment, stochastic assignment, and soft assignment. Our experimental results over a large number of datasets show that, in terms of clustering quality, (a) The Bernoulli model is the worst for text clustering; (b) The vMF model produces better clustering results than both Bernoulli and multinomial models; (c) Soft assignment leads to comparable or slightly better results than hard assignment. We also use deterministic annealing (DA) to improve the vMF-based soft clustering and compare all the model-based algorithms with the state-of-the-art discriminative approach to document clustering based on graph partitioning (CLUTO) and a spectral co-clustering method. Overall, CLUTO and DA perform the best but are also the most computationally expensive; the spectral coclustering algorithm fares worse than the vMF-based methods.

Generative model-based document clustering: a comparative study

by Shi Zhong - Knowledge and Information Systems , 2005
"... Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervis ..."
Abstract - Cited by 16 (0 self) - Add to MetaCart
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data are not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial modelbased semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete.

Frequency Sensitive Competitive Learning for Clustering on High-dimensional Hyperspheres

by Arindam Banerjee, Joydeep Ghosh , 2002
"... This paper derives three competitive learning mechanisms from first principles to obtain clusters of comparable sizes when both inputs and representatives are normalized. These mechanisms are very effective in achieving balanced grouping of inputs in high dimensional spaces, as illustrated by experi ..."
Abstract - Cited by 15 (7 self) - Add to MetaCart
This paper derives three competitive learning mechanisms from first principles to obtain clusters of comparable sizes when both inputs and representatives are normalized. These mechanisms are very effective in achieving balanced grouping of inputs in high dimensional spaces, as illustrated by experimental results on clustering two popular text data sets in 26,099 and 21,839 dimensional spaces respectively.

TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections

by Dimitrios Zeimpekis, Efstratios Gallopoulos , 2005
"... A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that is presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching too ..."
Abstract - Cited by 15 (1 self) - Add to MetaCart
A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that is presented in the form of large sparse term-document matrices (tdm). We present TMG, a research and teaching toolbox for the generation of sparse tdm’s from text collections and for the incremental modification of these tdm’s by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and prototyping of algorithms for information retrieval. Several design issues that concern the use of MATLAB sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University