## Discriminative Clustering by Regularized Information Maximization

Citations: 11 (1 self)

### BibTeX

@MISC{Gomes_discriminativeclustering,

author = {Ryan Gomes and Andreas Krause and Pietro Perona},

title = {Discriminative Clustering by Regularized Information Maximization},

year = {}

}

### Abstract

Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method.
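As a rough illustration of the objective described in the abstract, the sketch below evaluates a RIM-style score for a plain (non-kernelized) softmax classifier. It assumes the mutual information is estimated as the entropy of the average predicted label distribution (class balance) minus the average per-example conditional entropy (class separation), with a squared L2 penalty standing in for classifier complexity. The function names, the linear model, and the choice of penalty are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; eps guards against log(0)."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def rim_objective(W, b, X, lam):
    """Hypothetical RIM-style score: mutual information minus an L2 penalty.

    W: (D, K) weights, b: (K,) biases, X: (N, D) unlabeled inputs,
    lam: regularization strength (all names are assumptions).
    """
    P = softmax(X @ W + b)                    # p(y = k | x_i, W), shape (N, K)
    p_hat = P.mean(axis=0)                    # empirical label distribution
    class_balance = entropy(p_hat)            # high when clusters are balanced
    cond_entropy = entropy(P, axis=1).mean()  # low when predictions are confident
    mutual_info = class_balance - cond_entropy
    penalty = lam * np.sum(W ** 2)            # assumed complexity term
    return mutual_info - penalty
```

With all-zero parameters every prediction is uniform, so the two entropy terms cancel and the score is zero; confident, balanced assignments push the mutual-information term toward log K.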

### Citations

1097 | On spectral clustering: Analysis and an algorithm
- Ng, Jordan, et al.
- 2001
Citation Context ...aries or distinctions between categories. Fewer assumptions about the nature of categories are made, making these methods powerful and flexible in real world applications. Spectral graph partitioning [1] and maximum margin clustering [2] are example discriminative clustering methods. A disadvantage of existing discriminative approaches is that they lack a probabilistic foundation, making them potenti...

947 | Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories
- Lazebnik, Schmid, et al.
Citation Context ... on an image clustering task with 350 images from four Caltech-256 [14] categories (Faces-Easy, Motorbikes, Airplanes, T-Shirt) for a total of N = 1400 images. We use the Spatial Pyramid Match kernel [15] computed between every pair of images. We sweep RIM’s λ parameter across $[0.125/N,\ 4/N]$. The results are summarized in figure 3. Overall, the clusterings that best match ground truth are given by ...

521 | Comparing partitions
- Hubert, Arabie
- 1985
Citation Context ...to provide this. We evaluate unsupervised clustering performance in terms of how well the discovered clusters reflect known ground truth labels of the dataset. We report the Adjusted Rand Index (ARI) [11] between an inferred clustering and the ground truth categories. ARI has a maximum value of 1 when two clusterings are identical. We evaluated a number of other measures for comparing clusterings to g...

488 | On the limited memory BFGS method for large scale optimization
- Liu, Nocedal
- 1989
Citation Context ...e gradient requires only O(NKD) operations since the terms $\sum_c p_{ci} \log \frac{p_{ci}}{\hat{p}_c}$ may be computed once and reused in each partial derivative expression. The above gradients are used in the L-BFGS [6] quasi-Newton optimization algorithm (footnote 1: We used Mark Schmidt’s implementation at http:/...). We find empirically that the optimization usually converges within a few hundred iterations. When specialized...

434 | The information bottleneck method
- Tishby, Pereira, et al.
- 1999
Citation Context ...upervised learning. 7 Related Work Our work has connections to existing work in both unsupervised learning and semi-supervised classification. Unsupervised Learning. The information bottleneck method [19] learns a conditional model p(y|x) where the labels y form a lossy representation of the input space x, while preserving information about a third “relevance” variable z. The method maximizes I(y; z) ...

396 | Cluster ensembles — a knowledge reuse framework for combining multiple partitions
- Strehl, Ghosh
Citation Context ... a maximum value of 1 when two clusterings are identical. We evaluated a number of other measures for comparing clusterings to ground truth including mutual information, normalized mutual information [12], and cluster impurity [13]. We found that the relative rankings of the algorithms were the same as indicated by ARI. We evaluate the performance of each algorithm while varying the number of clusters...

245 | Caltech-256 object category dataset
- Griffin, Holub, et al.
Citation Context ...regularization parameter λ and allow the algorithm to discover the final number of clusters. Image Clustering. We test the algorithms on an image clustering task with 350 images from four Caltech-256 [14] categories (Faces-Easy, Motorbikes, Airplanes, T-Shirt) for a total of N = 1400 images. We use the Spatial Pyramid Match kernel [15] computed between every pair of images. We sweep RIM’s λ parameter ...

130 | Kernel Methods for Pattern Analysis
- Shawe-Taylor, Cristianini
- 2004
Citation Context ... clustered according to arg max_k p(y = k | x, W). We compare RIM against the spectral clustering (SC) algorithm of [1], the fast maximum margin clustering (MMC) algorithm of [9], and kernelized k-means [10]. MMC is a binary clustering algorithm. We use the recursive scheme outlined by [9] to extend the approach to multiple categories. The MMC algorithm requires an initial clustering estimate for initial...

122 | Maximum entropy discrimination
- Jaakkola, Meila, et al.
- 1999
Citation Context ...rm. We can capture prior beliefs about the average label distribution by substituting a reference distribution D(y; γ) in place of U (γ is a parameter that may be fixed or optimized during learning). [7] also use relative entropy as a means of enforcing prior beliefs, although not with respect to class distributions in multi-class classification problems. This construction may be used in a clustering...

121 | Semi-supervised classification by low density separation
- Chapelle, Zien
Citation Context ...irst is that the discriminative model’s decision boundaries should not be located in regions of the input space that are densely populated with datapoints. This is often termed the cluster assumption [5], and also corresponds to the idea that datapoints should be classified with large margin. Grandvalet & Bengio [3] show that a conditional entropy term $-\frac{1}{N} \sum_i H\{p(y \mid x_i, W)\}$ very effectively capture...

81 | Semi-supervised learning by entropy minimization
- Grandvalet, Bengio
- 2004
Citation Context ...incipled probabilistic approach to discriminative clustering, by formalizing the problem as unsupervised learning of a conditional probabilistic model. We generalize the work of Grandvalet and Bengio [3] and Bridle et al. [4] in order to learn probabilistic classifiers that are appropriate for multi-class discriminative clustering, as explained in Section 2. We identify two fundamental, competing qua...

79 | A hierarchical Bayesian language model based on Pitman-Yor processes
- Teh
- 2006
Citation Context ...ions in multi-class classification problems. This construction may be used in a clustering task in which we believe that the cluster sizes obey a power law distribution as, for example, considered by [8] who use the Pitman-Yor process for nonparametric language modeling. Simple manipulation yields the following objective: $F(W; X, \lambda, \gamma) = I_W\{x; y\} - H\{\hat{p}(y; W) \,\|\, D(y; \gamma)\} - R(W; \lambda)$, where H{p̂(y; W) || D(...

50 | Unsupervised and semi-supervised multi-class support vector machines
- Xu, Schuurmans
- 2005
Citation Context ...gories. Fewer assumptions about the nature of categories are made, making these methods powerful and flexible in real world applications. Spectral graph partitioning [1] and maximum margin clustering [2] are example discriminative clustering methods. A disadvantage of existing discriminative approaches is that they lack a probabilistic foundation, making them potentially unsuitable in applications th...

47 | CLUE: Cluster-Based Retrieval of Images by Unsupervised learning
- Chen, Wang, et al.
- 2005
Citation Context ...two clusterings are identical. We evaluated a number of other measures for comparing clusterings to ground truth including mutual information, normalized mutual information [12], and cluster impurity [13]. We found that the relative rankings of the algorithms were the same as indicated by ARI. We evaluate the performance of each algorithm while varying the number of clusters that are discovered, and w...

39 | Maximum margin clustering made practical
- Zhang, Tsang, et al.
- 2007
Citation Context ... Unlabeled examples are then clustered according to arg max_k p(y = k | x, W). We compare RIM against the spectral clustering (SC) algorithm of [1], the fast maximum margin clustering (MMC) algorithm of [9], and kernelized k-means [10]. MMC is a binary clustering algorithm. We use the recursive scheme outlined by [9] to extend the approach to multiple categories. The MMC algorithm requires an initial cl...

36 | On information regularization
- Corduneanu, Jaakkola
- 2003
Citation Context ...ethods required are much more expensive than our approach. Semi-supervised Classification. Our semi-supervised objective is related to [3], as discussed in section 5.1. Another semi-supervised method [23] uses mutual information as a regularizing term to be minimized, in contrast to ours which attempts to maximize mutual information. The assumption underlying [23] is that any information between the l...

31 | DIFFRAC: a discriminative and flexible framework for clustering
- Bach, Harchaoui
- 2007
Citation Context ... a measure of dependence, whereas we use Mutual Information. There is also an unsupervised variant of the Support Vector Machine, called max-margin clustering. Like our approach, the works of [2] and [21] use notions of class balance, separation, and regularization to learn unsupervised discriminative classifiers. However, they are formulated in the max-margin framework rather than our probabilistic a...

27 | Distinguishing enzyme structures from non-enzymes without alignments
- Dobson, Doig
- 2007
Citation Context ... requires 44-51 seconds per run depending on the number of clusters specified. Molecular Graph Clustering. We further test RIM’s unsupervised learning performance on two molecular graph datasets. D&D [16] contains N = 1178 protein structure graphs with binary ground truth labels indicating whether or not they function as enzymes. NCI109 [17] is composed of N = 4127 compounds labeled according to wheth...

27 | Information-based clustering
- Slonim, Atwal, et al.
- 2005
Citation Context ...a third “relevance” variable z. The method maximizes I(y; z) − λI(x; y), whereas we maximize the information between y and x while constraining complexity with a parametric regularizer. The method of [20] aims to maximize a similarity measure computed between members within the same cluster while penalizing the mutual information between the cluster label y and the input x. Again, mutual information i...

26 | Comparison of descriptor spaces for chemical compound retrieval and classification
- Wale, Watson, et al.
Citation Context ...ed learning performance on two molecular graph datasets. D&D [16] contains N = 1178 protein structure graphs with binary ground truth labels indicating whether or not they function as enzymes. NCI109 [17] is composed of N = 4127 compounds labeled according to whether or not they are active in an anti-cancer screening. We use the subtree kernel developed by [18] with subtree height of 1. For D&D, we sw...

15 | Fast subtree kernels on graphs
- Shervashidze, Borgwardt
- 2009
Citation Context ... or not they function as enzymes. NCI109 [17] is composed of N = 4127 compounds labeled according to whether or not they are active in an anti-cancer screening. We use the subtree kernel developed by [18] with subtree height of 1. For D&D, we sweep RIM’s λ parameter through the range $[0.001/N,\ 0.05/N]$ and for NCI we sweep through the interval $[0.001/N,\ 1/N]$. Results are summarized in Figure...

8 | Unsupervised classifiers, mutual information and ‘Phantom Targets’
- Bridle, MacKay
- 1992

3 | A dependence maximization view of clustering
- Song, Gretton, et al.
- 2007
Citation Context ...rs within the same cluster while penalizing the mutual information between the cluster label y and the input x. Again, mutual information is used to enforce a lossy representation of y|x. Song et al. [22] also view clustering as maximization of the dependence between the input variable and output label variable. They use the Hilbert-Schmidt Independence Criterion as a measure of dependence, whereas we...
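
Several of the citation contexts above refer to the Adjusted Rand Index (ARI) as the measure of agreement between an inferred clustering and ground truth labels. The sketch below shows one standard way to compute it from pair counts, following the Hubert and Arabie formulation. The function name and the convention of returning 1.0 for degenerate inputs are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumed convention for fewer than two items

    # Pairs of items placed together in both, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # chance-level agreement
    max_index = 0.5 * (together_a + together_b)
    if max_index == expected:
        return 1.0  # assumed convention when both labelings are trivial
    return (together_both - expected) / (max_index - expected)
```

The score is 1 when the two clusterings are identical up to a renaming of the labels, and it can drop below 0 when agreement is worse than chance.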