Document clustering using word clusters via the information bottleneck method (2000)

by N Slonim, N Tishby
Venue: In ACM SIGIR

Results 11 - 20 of 79

Sufficient Dimensionality Reduction

by Amir Globerson, Naftali Tishby, Isabelle Guyon, André Elisseeff - Journal of Machine Learning Research, 2003
Cited by 28 (8 self)
Dimensionality reduction of empirical co-occurrence data is a fundamental problem in unsupervised learning. It is also a well studied problem in statistics known as the analysis of cross-classified data. One principled approach to this problem is to represent the data in low dimension with minimal loss of (mutual) information contained in the original data. In this paper we introduce an information theoretic nonlinear method for finding such a most informative dimension reduction. In contrast with...
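
The quantity this abstract refers to, the mutual information contained in co-occurrence data, can be illustrated with a minimal sketch. This is not the paper's method: it only computes I(X;Y) from a table of counts and shows how merging rows (a crude form of reduction) can lose information. The function names `mutual_information` and `merge_rows` and the toy table are assumptions made for illustration.

```python
import math


def mutual_information(counts):
    """Mutual information I(X;Y), in bits, of a co-occurrence count table.

    `counts` is a list of rows; counts[i][j] is how often x_i co-occurred
    with y_j. Empty cells contribute nothing to the sum.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:
                # p(x,y) / (p(x) p(y)) simplifies to n * total / (row * col).
                ratio = n * total / (row_sums[i] * col_sums[j])
                mi += (n / total) * math.log2(ratio)
    return mi


def merge_rows(counts, i, j):
    """Return a new table in which rows i and j are summed into one row."""
    merged = [a + b for a, b in zip(counts[i], counts[j])]
    kept = [row for k, row in enumerate(counts) if k not in (i, j)]
    return kept + [merged]


# Hypothetical toy table: rows 0 and 1 have identical column profiles.
table = [[4, 0], [4, 0], [0, 8]]
lossless = mutual_information(merge_rows(table, 0, 1))
lossy = mutual_information(merge_rows(table, 0, 2))
```

In this toy table, merging the two rows with identical profiles keeps the mutual information unchanged, while merging rows with different profiles reduces it.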

Iterative double clustering for unsupervised and semi-supervised learning

by Ran El-Yaniv, Oren Souroujon - In Advances in Neural Information Processing Systems (NIPS), 2001
Cited by 27 (2 self)
We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples.

Spectral Relaxation Models And Structure Analysis For K-Way Graph Clustering And Bi-Clustering

by M. Gu, H. Zha, C. Ding, X. He, H. Simon, J. Xia, 2001
Cited by 22 (3 self)
In this paper we consider k-way graph clustering and k-way bipartite graph clustering. Many data types arising from data mining applications can be modeled as graphs and bipartite graphs. Examples include webpage links on the world-wide web, terms and documents in a text corpus, customers and purchasing items in market basket analysis and reviewers and movies in a movie recommender system. In this paper, we discuss models for k-way irregular graph partitioning and k-way bipartite irregular graph partitioning, and discuss the algebraic structures in the eigenvector and singular vector matrices that correspond to the optimal partition. Our discussion contributes to the theoretical understanding of spectral methods for graph partitioning and data mining, which has been a focus of some recent research.

A Large Benchmark Dataset for Web Document Clustering

by Mark Sinka, David Corne - Soft Computing Systems: Design, Management and Applications, Volume 87 of Frontiers in Artificial Intelligence and Applications, 2002
Cited by 21 (1 self)
Targeting useful and relevant information on the WWW is a topical and highly complicated research area. A thriving research effort that feeds into this area is document clustering, which overlaps closely with areas usually known as text classification and text categorisation. A foundational aspect of such research (which has been proven over and over again in other research disciplines) is the use of standard datasets, against which different techniques can be properly benchmarked and assessed in comparison to each other. We note herein that, so far in this broad area of research, as many datasets have been used as research papers written, thus making it difficult to reason about the relative performance of different categorisation/clustering techniques used in different papers. In this paper we propose a standard dataset with a variety of properties suitable for a wide range of clustering and related experiments. We describe how the dataset was generated, provide a pointer to it, and encourage its access and use. We also illustrate the use of part of the dataset by establishing benchmark results for simple k-means clustering, comparing the relative performance of k-means on a pair of 'close' categories and a pair of 'distant' categories. We naturally find that performance is better on the pair of 'distant' categories; however, the experiments reveal that although stop-word removal is confirmed as helpful, word-stemming is (perhaps counter to intuition) not necessarily always recommended on 'distant' categories.

Information theoretic clustering of sparse co-occurrence data

by Inderjit S. Dhillon, Yuqiang Guan - In Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03), 2003
Cited by 21 (1 self)
Abstract not found

CLUSEQ: Efficient and Effective Sequence Clustering

by Jiong Yang, Wei Wang - In ICDE, 2003
Cited by 21 (2 self)
Analyzing sequence data has become increasingly important recently in the area of biological sequences, text documents, web access logs, etc. In this paper, we investigate the problem of clustering sequences based on their structural features. As a widely recognized technique, clustering has proven to be very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed extensively on sequence data (in categorical domain) is the lack of an effective yet efficient similarity measure. Therefore, we propose a novel model (CLUSEQ) for sequence cluster by exploring significant statistical properties possessed by the sequences. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence behavior and to support the similarity measure. A variation of the suffix tree, namely probabilistic suffix tree, is employed to organize (the significant portion of) the CPD in a concise way. A novel algorithm is devised to efficiently discover clusters with high quality and is able to automatically adjust the number of clusters to its optimal range via a unique combination of successive new cluster generation and cluster consolidation. The performance of CLUSEQ has been demonstrated via extensive experiments on several real and synthetic sequence databases.
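
The conditional probability distribution of the next symbol given a preceding segment can be estimated from raw counts. The sketch below is a simplified illustration, not the CLUSEQ model: it uses a single fixed context length and a plain dictionary rather than a probabilistic suffix tree, and the name `next_symbol_cpd` and its parameters are assumptions.

```python
from collections import Counter, defaultdict


def next_symbol_cpd(sequences, context_len):
    """Estimate P(next symbol | preceding segment) from example sequences.

    Each sequence is a string; the preceding segment is the `context_len`
    symbols immediately before the predicted position.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(context_len, len(seq)):
            context = seq[i - context_len:i]
            counts[context][seq[i]] += 1
    # Normalise the counts of each context into a probability distribution.
    return {
        context: {sym: n / sum(c.values()) for sym, n in c.items()}
        for context, c in counts.items()
    }


# Hypothetical toy data: after "ba" the next symbol is "b" or "c" equally often.
cpd = next_symbol_cpd(["abab", "abac"], context_len=2)
```

A suffix-tree organisation, as the abstract describes, would let contexts of different lengths share storage; the flat dictionary here trades that compactness for brevity.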

Supervised Clustering – Algorithms and Benefits

by Christoph F. Eick, Nidal Zeidat, Zhenghong Zhao - In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI04), Boca, 2004
Cited by 18 (13 self)
This paper centers on a novel data mining technique we term supervised clustering. Unlike traditional clustering, supervised clustering assumes that the examples are classified and has the goal of identifying class-uniform clusters that have high probability densities. Four representative-based algorithms for supervised clustering are introduced: a greedy algorithm with random restart, named SRIDHCR, that seeks solutions by inserting and removing single objects from the current solution; SPAM (a variation of the clustering algorithm PAM); an evolutionary computing algorithm named SCEC; and a fast medoid-based top-down splitting algorithm, named TDS. The four algorithms were evaluated using a benchmark consisting of four UCI machine learning data sets. In general, it seems that “greedy” algorithms, such as SPAM, SRIDHCR, and TDS, do not perform particularly well for supervised clustering and seem to terminate prematurely too often. We also briefly describe the applications of supervised clustering.

Scalable clustering of categorical data

by Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, Kenneth C. Sevcik - In EDBT, 2004
Cited by 17 (4 self)
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.
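
One common way to turn the IB framework into a distance between two clusters is the information lost by merging them, which in the agglomerative IB formulation equals the combined cluster weight times a weighted Jensen-Shannon divergence between the clusters' conditional distributions. The sketch below illustrates that general formulation; whether LIMBO uses exactly this form is an assumption here, and the names `kl_divergence` and `merge_cost` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; zero entries of p are skipped."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def merge_cost(w_i, p_i, w_j, p_j):
    """Information lost (in bits) when two clusters are merged.

    w_i, w_j: positive cluster probabilities p(c_i), p(c_j).
    p_i, p_j: conditional distributions p(y | c_i), p(y | c_j) as lists.
    """
    w = w_i + w_j
    pi_i, pi_j = w_i / w, w_j / w
    # Distribution of the merged cluster: weighted mixture of the two.
    mixture = [pi_i * a + pi_j * b for a, b in zip(p_i, p_j)]
    js = pi_i * kl_divergence(p_i, mixture) + pi_j * kl_divergence(p_j, mixture)
    return w * js


# Hypothetical toy clusters over two attribute values.
same = merge_cost(0.5, [0.5, 0.5], 0.5, [0.5, 0.5])
disjoint = merge_cost(0.5, [1.0, 0.0], 0.5, [0.0, 1.0])
```

Clusters with identical distributions cost nothing to merge, while clusters with disjoint support cost the most, which is the behaviour one would want from a distance on categorical tuples.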

An Evaluation on Feature Selection for Text Clustering

by Tao Liu, Shengping Liu, Zheng Chen - In ICML, 2003
Cited by 17 (2 self)
Feature selection methods have been successfully applied to text categorization but seldom applied to text clustering due to the unavailability of class label information. In this paper, we first give empirical evidence that feature selection methods can improve the efficiency and performance of text clustering algorithms. Then we propose a new feature selection method called “Term Contribution (TC)” and perform a comparative study on a variety of feature selection methods for text clustering, including Document Frequency (DF), Term Strength (TS), Entropy-based (En), Information Gain (IG) and χ² statistic (CHI). Finally, we propose an “Iterative Feature Selection (IF)” method that addresses the problem of unavailable labels by using an effective supervised feature selection method to iteratively select features and perform clustering. Detailed experimental results on Web Directory data are provided in the paper.
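
Of the methods listed, Document Frequency (DF) is the simplest unsupervised one: a term's score is the number of documents that contain it. The sketch below illustrates DF-based selection only; it assumes whitespace tokenisation, and the names `document_frequency` and `select_by_df` are hypothetical rather than taken from the paper.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for every term, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        # A set ensures each term is counted at most once per document.
        df.update(set(doc.split()))
    return df


def select_by_df(docs, k):
    """Return the k terms with the highest document frequency.

    Ties are broken alphabetically so the result is deterministic.
    """
    df = document_frequency(docs)
    ranked = sorted(df, key=lambda term: (-df[term], term))
    return ranked[:k]


# Hypothetical toy corpus.
corpus = ["a b", "a c", "a b d"]
top_terms = select_by_df(corpus, 2)
```

DF needs no class labels, which is why it carries over directly from text categorization to clustering.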

Coupled Clustering: A Method for Detecting Structural Correspondence

by Zvika Marx, Ido Dagan, Joachim M. Buhmann, Eli Shamir - Journal of Machine Learning Research, 2002
Cited by 16 (3 self)
This paper proposes a new paradigm and a computational framework for revealing equivalencies (analogies) between sub-structures of distinct composite systems that are initially represented by unstructured data sets. For this purpose, we introduce and investigate a variant of traditional data clustering, termed coupled clustering, which outputs a configuration of corresponding subsets of two such representative sets. We apply our method to synthetic as well as textual data. Its achievements in detecting topical correspondences between textual corpora are evaluated through comparison to performance of human experts.