## Information theoretic clustering of sparse co-occurrence data (2003)

Venue: | Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03) |

Citations: | 25 - 2 self |

### BibTeX

@INPROCEEDINGS{Dhillon03informationtheoretic,

author = {Inderjit S. Dhillon and Yuqiang Guan},

title = {Information theoretic clustering of sparse co-occurrence data},

booktitle = {Proceedings of the Third IEEE International Conference on Data Mining (ICDM-03)},

year = {2003},

pages = {517--521},

publisher = {IEEE Press}

}

### Citations

8548 |
Elements of Information Theory
- Cover, Thomas
- 1991
Citation Context: ...about the other. The higher its value, the smaller the uncertainty of one random variable given knowledge of the other. For example, if X and Y are independent, I(X;Y) = 0. For more details, see [4]. A novel information-theoretic approach to clustering is to seek the clustering which results in the smallest loss in mutual information [21, 6], i.e. I(X;Y) - I(X;Ŷ) = Σ_x Σ_y p(x,y) log... |
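
The loss in mutual information that this excerpt describes can be computed directly from a joint distribution and a column clustering. A minimal sketch (the toy distribution and the cluster assignment are invented for illustration, not taken from the paper):

```python
import numpy as np

# Toy 4x4 joint distribution p(x, y); rows are X, columns are Y.
p = np.array([
    [0.10, 0.05, 0.00, 0.00],
    [0.05, 0.10, 0.00, 0.00],
    [0.00, 0.00, 0.10, 0.10],
    [0.05, 0.00, 0.15, 0.30],
])

def mutual_information(p):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ), base 2."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

# Cluster the columns into k=2 clusters: ŷ1 = {0, 1}, ŷ2 = {2, 3};
# p(x, ŷ) sums p(x, y) over the y assigned to each cluster.
assignment = [0, 0, 1, 1]
p_clustered = np.zeros((p.shape[0], 2))
for y, c in enumerate(assignment):
    p_clustered[:, c] += p[:, y]

loss = mutual_information(p) - mutual_information(p_clustered)
print(loss)  # loss in mutual information; nonnegative by the data processing inequality
```

Minimizing this loss over all assignments of Y-values to clusters is exactly the objective the excerpt describes.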

2138 |
Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context: ...for brevity. The logarithmic base 2 is used throughout this paper. 2 Related work Clustering is a widely studied problem in unsupervised learning, and a good survey of existing methods can be found in [8, 13, 11]. Clustering algorithms can be categorized into agglomerative clustering algorithms and divisive clustering algorithms. An agglomerative clustering algorithm starts with each individual data item in its ... |
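
For concreteness, the agglomerative family the excerpt mentions can be sketched generically. This is a minimal single-linkage example on 1-D points, purely illustrative; the paper's own algorithm (DITC) is divisive, not agglomerative:

```python
# Minimal single-linkage agglomerative clustering on 1-D points.
# Illustrative only: each point starts as its own cluster, and the two
# closest clusters are merged until k clusters remain.

def single_linkage(points, k):
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest inter-point distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

print(single_linkage([0.0, 0.1, 0.2, 5.0, 5.1], k=2))
```

A divisive algorithm runs in the opposite direction: it starts with all items in one cluster and repeatedly splits or re-partitions.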

1915 |
Pattern Classification
- Duda, Hart, et al.
- 2001
Citation Context: ...od is highly effective in clustering document collections and outperforms previous information-theoretic clustering approaches. 1 Introduction Clustering is a central problem in unsupervised learning [8]. Presented with a set of data points, clustering algorithms group the data into clusters according to some notion of similarity between data points. However, the choice of similarity measure is a cha... |

1850 |
Data Mining: Concepts and Techniques
- Han, Kamber, et al.
- 2005
Citation Context: ...for brevity. The logarithmic base 2 is used throughout this paper. 2 Related work Clustering is a widely studied problem in unsupervised learning, and a good survey of existing methods can be found in [8, 13, 11]. Clustering algorithms can be categorized into agglomerative clustering algorithms and divisive clustering algorithms. An agglomerative clustering algorithm starts with each individual data item in its ... |

1043 |
An efficient heuristic procedure for partitioning graphs
- Kernighan, Lin
- 1970
Citation Context: ...ce of successive moves that result in the largest possible reduction in the loss of mutual information (this enhancement was inspired by the successful Kernighan-Lin graph partitioning heuristic, see [15]). Algorithm: DITC_LocalSearch(p(X,Y), k, {ŷ_j}_{j=1}^k, L). Input: p(X,Y) is the empirical joint probability distribution, k is the number of desired clusters, L is the chain length for loc... |

433 | The information bottleneck method
- Tishby, Pereira, et al.
- 1999
Citation Context: ...algorithms are based on graph partitioning [14]. For the case of non-negative co-occurrence data, our information-theoretic framework is similar to the one used in the information bottleneck method [25]. The information bottleneck method tries to minimize the quantity I(Y;Ŷ) in order to gain compression in addition to maximizing the mutual information I(X;Ŷ); the optimization problem consider... |

336 | ROCK: A Robust Clustering Algorithm for Categorical Attributes
- Guha, Rastogi, et al.
- 2000
Citation Context: ...distance measure is often defined. Similarity/distance measures between data items can be defined based on a measure such as Euclidean distance or cosine [24], or based on boolean or categorical values [10]. Alternative clustering algorithms are based on graph partitioning [14]. For the case of non-negative co-occurrence data, our information-theoretic framework is similar to the one used in the inform... |

249 | Information-theoretic co-clustering - Dhillon, Mallela, et al. |

247 | Deterministic annealing for clustering, compression, classification, regression, and related optimization problems
- Rose
- 1998
Citation Context: ...eoff between compression and preservation of mutual information. The information bottleneck algorithm yields a "soft" clustering of the data using a procedure similar to deterministic annealing [19]. A greedy agglomerative hard clustering was used in [1, 23] to cluster words in order to reduce feature size for supervised text classification. For the same task, [6] proposed a divisive hard cluster... |

246 | Scaling Clustering Algorithms to Large Databases, Microsoft Research
- Bradley, Fayyad, et al.
Citation Context: ...is superior to Algorithms DITC_prior and DITC_LocalSearch. Note that all our algorithms are deterministic since we choose initial cluster distributions that are "maximally" far apart from each other [3]; see Section 3 for details. Algorithm DITC_prior cures the problem of sparsity to some extent and its results are superior to DITC; for example, Table 3 shows the confusion matrices resulting from th... |

234 | Distributional clustering of words for text classification
- Baker, McCallum
- 1998
Citation Context: ...rmation. The information bottleneck algorithm yields a "soft" clustering of the data using a procedure similar to deterministic annealing [19]. A greedy agglomerative hard clustering was used in [1, 23] to cluster words in order to reduce feature size for supervised text classification. For the same task, [6] proposed a divisive hard clustering algorithm that directly minimizes the loss in mutual in... |

153 | Impact of similarity measures on web-page clustering
- Strehl, Ghosh, et al.
- 2000
Citation Context: ...are merged or split, some notion of similarity/distance measure is often defined. Similarity/distance measures between data items can be defined based on a measure such as Euclidean distance or cosine [24], or based on boolean or categorical values [10]. Alternative clustering algorithms are based on graph partitioning [14]. For the case of non-negative co-occurrence data, our information-theoretic fra... |

150 | Document clustering using word clusters via the information bottleneck method
- Slonim, Tishby
Citation Context: ...the slower local search procedure; hence DITC_PLS (the algorithm in Figure 6) is our method of choice. We now compare our Algorithm DITC_PLS with previously proposed information-theoretic algorithms. [22] proposed the use of an agglomerative algorithm that first clusters words, and then uses this clustered feature space to cluster documents using the same agglomerative information bottleneck method. M... |

108 | A divisive informationtheoretic feature clustering algorithm for text classification
- Dhillon, Mallela, et al.
Citation Context: ...where non-negative co-occurrence data is available. A novel formulation poses the clustering problem as one in information theory: find the clustering that minimizes the loss in (mutual) information [21, 6]. This information-theoretic formulation leads to a "natural" divisive clustering algorithm that uses relative entropy as the measure of similarity and monotonically reduces the loss in mutual informa... |

97 | Unsupervised document classification using sequential information maximization
- Slonim, Friedman, et al.
Citation Context: ...where non-negative co-occurrence data is available. A novel formulation poses the clustering problem as one in information theory: find the clustering that minimizes the loss in (mutual) information [21, 6]. This information-theoretic formulation leads to a "natural" divisive clustering algorithm that uses relative entropy as the measure of similarity and monotonically reduces the loss in mutual informa... |

68 | The power of word clusters for text classification
- Slonim, Tishby
- 2001
Citation Context: ...rmation. The information bottleneck algorithm yields a "soft" clustering of the data using a procedure similar to deterministic annealing [19]. A greedy agglomerative hard clustering was used in [1, 23] to cluster words in order to reduce feature size for supervised text classification. For the same task, [6] proposed a divisive hard clustering algorithm that directly minimizes the loss in mutual in... |

41 |
NewsWeeder: Learning to filter netnews
- Lang
- 1995
Citation Context: ...tion-theoretic algorithm to the task of clustering document collections using word-document co-occurrence data. 6.1 Data sets For our test data, we use various subsets of the 20-newsgroup data (NG20) [16] and the smart collection (ftp://ftp.cs.cornell.edu/pub/smart). [Confusion matrices (rows ŷ1-ŷ3, columns MED/CRAN/CISI): DITC results: 847 41 275 / 142 954 86 / 44 405 1099; DITC_prior results: 1016 1 2 / 1 1389 1 / 16 9 1457.] ... |

40 | Using Machine Learning to Improve Information Access - Sahami - 1998 |

13 | Learning Simple Relations: Theory and Applications
- Berkhin, Becher
- 2002
Citation Context: ...sed in [22] to cluster documents after clustering words. The work in [9] extended the above work to repetitively cluster documents and then words. Methods based on sequential optimization are used in [2, 21]. As we demonstrate in Section 6, our proposed algorithm yields better clusterings than the above approaches, while being more computationally efficient. Information-theoretic methods have been used f... |

3 |
Distributional clustering of English words
- Pereira, Tishby, et al.

- 1993
Citation Context: ...class conditional word independence and computes the most probable class for test document d as argmax_ŷ p(ŷ) Π_x p(x|ŷ)^{N(x,d)} (2), where N(x,d) is the number of occurrences of word x in document d [17]. Taking logarithms in (2), dividing throughout by the length of the document |d| and adding the entropy -Σ_x p(x|d) log p(x|d) (where p(x|d) = N(x,d)/|d|), the Naive Bayes rule (2) is transform... |

1 |
Elements of Information Theory, John Wiley
- Cover, Thomas
- 1991
Citation Context: ...i.e. to minimize I(X;Y) - I(X;Ŷ) = Σ_ŷ Σ_{y∈ŷ} p(y) KL(p(X|y), p(X|ŷ)) (1), where I(X;Y) is the mutual information between random variables X and Y and KL stands for Kullback-Leibler divergence [1]. The above expression for the loss in mutual information suggests a "natural" divisive clustering algorithm (DITC), which iteratively (i) re-partitions the distributions p(X|y) by their closeness in... |