## On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution

Citations: 6 (5 self)

### BibTeX

@MISC{Sugiyama_oninformation-maximization,
  author = {Masashi Sugiyama and Makoto Yamada and Manabu Kimura and Hirotaka Hachiya},
  title = {On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution},
  year = {}
}


### Abstract

Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it involves only continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve nonconvex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically, in a computationally efficient way, via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.

### Citations

9231 | Elements of Information Theory - Cover, Thomas - 1990
Citation Context: ...where $p^*(x,y)$ denotes the joint density of $x$ and $y$, and $p^*(y)$ is the marginal probability of $y$. SMI is the Pearson divergence (Pearson, 1900) from $p^*(x,y)$ to $p^*(x)p^*(y)$, while the ordinary MI (Cover & Thomas, 2006) is the Kullback-Leibler divergence (Kullback & Leibler, 1951) from $p^*(x,y)$ to $p^*(x)p^*(y)$: $\mathrm{MI} := \int \sum_{y=1}^{c} p^*(x,y) \log \frac{p^*(x,y)}{p^*(x)p^*(y)} \, dx$. (2) The Pearson divergence and the Kullback-Leibl...
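As a sanity check of Eq. (2), MI for a discrete joint distribution reduces to a finite sum and can be computed directly. A minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def mutual_information(pxy):
    """Ordinary MI: KL divergence from p(x,y) to p(x)p(y), discrete case."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0                        # convention: 0 log 0 = 0
    return float(np.sum(pxy[mask] * np.log((pxy / (px * py))[mask])))

# Independent variables give MI = 0; a deterministic pairing gives log 2.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # log(2) ≈ 0.693
```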

5114 | Matrix Analysis - Horn, Johnson - 1985
Citation Context: ...$\alpha_y = (\alpha_{y,1}, \ldots, \alpha_{y,n})^\top$, and $K_{i,j} := K(x_i, x_j)$. For each cluster $y$, we maximize $\alpha_y^\top K^2 \alpha_y$ under $\|\alpha_y\| = 1$. Since this is a Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of $K$ (Horn & Johnson, 1985). To avoid all the solutions $\{\alpha_y\}_{y=1}^{c}$ being reduced to the same principal eigenvector, we impose mutual orthogonality: $\alpha_y^\top \alpha_{y'} = 0$ for $y \neq y'$. Then the solutions are given by the normalized e...
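The Rayleigh-quotient argument can be checked numerically: under unit norm and mutual orthogonality, the maximizers of $\alpha_y^\top K^2 \alpha_y$ are the top-$c$ eigenvectors of $K$ (eigenvectors of $K$ and $K^2$ coincide, with eigenvalues squared). A minimal sketch, with a function name of our own choosing:

```python
import numpy as np

def top_c_eigenvectors(K, c):
    """Return the c principal unit eigenvectors of a symmetric kernel matrix K.

    Under the orthogonality constraint these jointly maximize the Rayleigh
    quotients a^T K^2 a subject to ||a|| = 1."""
    eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :c]         # columns = principal eigenvectors

K = np.diag([3.0, 2.0, 1.0])               # toy symmetric PSD kernel matrix
A = top_c_eigenvectors(K, 2)
print(A[:, 0] @ K @ K @ A[:, 0])           # 9.0 = (largest eigenvalue)^2
```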

2640 | Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability - Silverman - 1986
Citation Context: ...shift (Fukunaga & Hostetler, 1975) is a non-parametric clustering method based on the modes of the data-generating probability density. In the blurring mean-shift algorithm, a kernel density estimator (Silverman, 1986) is used for modeling the data-generating probability density: $\widehat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K\!\left(\|x - x_i\|^2 / \sigma^2\right)$, where $K(\xi)$ is a kernel function such as the Gaussian kernel $K(\xi) = e^{-\xi/2}$. Taking the derivative o...
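The kernel density estimator in the snippet can be sketched as follows. We use the Gaussian profile $K(\xi) = e^{-\xi/2}$ and include the normalizing constant (omitted in the snippet) so that the estimate integrates to one; the helper name is ours:

```python
import numpy as np

def kde(x, X, sigma):
    """Gaussian KDE: p_hat(x) = (1/n) sum_i K(||x - x_i||^2 / sigma^2),
    K(xi) = exp(-xi/2), with the Gaussian normalizer included."""
    n, d = X.shape                                   # X: (n, d) samples
    sq = np.sum((X - x) ** 2, axis=1) / sigma ** 2   # ||x - x_i||^2 / sigma^2
    z = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)      # normalizing constant
    return float(np.exp(-sq / 2.0).sum() / (n * z))

# One sample at the origin: the estimate at 0 equals the N(0, sigma^2) peak.
print(kde(np.array([0.0]), np.array([[0.0]]), 1.0))  # 1/sqrt(2*pi) ≈ 0.3989
```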

1047 | Spectral Graph Theory - Chung - 1997
Citation Context: ...Furthermore, the above update rule can be expressed in matrix form as $X \leftarrow XP$, where $X = (x_1, \ldots, x_n)$ is the sample matrix and $P := WD^{-1}$ is the stochastic matrix of a random walk on a graph with adjacency matrix $W$ (Chung, 1997). $D$ is defined as $D_{i,i} := \sum_{j=1}^{n} W_{i,j}$ and $D_{i,j} = 0$ for $i \neq j$. If $P$ is independent of $X$, the above iterative algorithm corresponds to the power method (Golub & Lo...
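The matrix-form update $X \leftarrow XP$ with $P := WD^{-1}$ can be sketched directly. We assume a Gaussian affinity for $W$; the function name is ours:

```python
import numpy as np

def blurring_mean_shift_step(X, sigma):
    """One blurring mean-shift iteration: X <- X P with P = W D^{-1}.

    X is a d x n sample matrix (columns are samples). W is a Gaussian
    affinity matrix and D_ii = sum_j W_ij, so P is column-stochastic:
    each point moves to a weighted average of its neighbors."""
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # ||x_i - x_j||^2
    W = np.exp(-sq / (2.0 * sigma ** 2))
    P = W / W.sum(axis=0, keepdims=True)  # column j divided by D_jj (W symmetric)
    return X @ P

X = np.array([[0.0, 1.0]])                # two 1-D points
X1 = blurring_mean_shift_step(X, 1.0)
print(X1)                                  # the two points are pulled together
```

Iterating this update shrinks clusters toward single points, which is why the cited stopping-criterion issue matters in practice.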

573 | Comparing partitions - Hubert, Arabie - 1985
Citation Context: ...models (MIC) (Gomes et al., 2010) with model selection by maximum-likelihood MI (Suzuki et al., 2008), and the proposed SMIC. The clustering performance was evaluated by the adjusted Rand index (ARI) (Hubert & Arabie, 1985) between inferred cluster assignments and the ground-truth categories. Larger ARI values mean better performance, and ARI takes its maximum value 1 when two sets of cluster assignments are identical...
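The adjusted Rand index used for evaluation can be computed from the contingency table of the two assignments, following Hubert & Arabie's formula. A compact sketch (our helper, not the paper's code):

```python
import numpy as np
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two assignments; 1 iff identical up to label permutation."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # Contingency table: C[i, j] = number of samples in cluster i of `a`
    # and cluster j of `b`.
    C = np.array([[np.sum((a == i) & (b == j)) for j in np.unique(b)]
                  for i in np.unique(a)])
    s_ij = sum(comb(int(v), 2) for v in C.ravel())
    s_a = sum(comb(int(v), 2) for v in C.sum(axis=1))
    s_b = sum(comb(int(v), 2) for v in C.sum(axis=0))
    expected = s_a * s_b / comb(n, 2)          # chance-adjustment term
    return (s_ij - expected) / ((s_a + s_b) / 2 - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (labels swapped)
```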

428 | Mean shift, mode seeking, and clustering - Cheng - 1995
Citation Context: ...systematic way to determine $\sigma$, which is restrictive compared with the proposed method. Another critical drawback of the blurring mean-shift algorithm is that it eventually converges to a single point (Cheng, 1995), and therefore a sensible stopping criterion is necessary in practice. Although Carreira-Perpiñán (2006) gave a useful heuristic for stopping the iteration, it is not clear whether this heuristic al...

328 | The estimation of the gradient of a density function, with applications in pattern recognition - Fukunaga, Hostetler - 1975
Citation Context: ...kernel function. Spectral clustering (Shi & Malik, 2000) first unfolds non-linear data manifolds by a spectral embedding method, and then performs k-means in the embedded space. Blurring mean-shift (Fukunaga & Hostetler, 1975) uses a non-parametric kernel density estimator for modeling the data-generating probability density and finds clusters based on the modes of the estimated density. Discriminative clustering (Xu et a...

220 | Information-type measures of difference of probability distributions and indirect observations - Csiszár - 1967
Citation Context: ...$\frac{p^*(x,y)}{p^*(x)p^*(y)} \, dx$. (2) The Pearson divergence and the Kullback-Leibler divergence both belong to the class of Ali-Silvey-Csiszár divergences (also known as $f$-divergences; see Ali & Silvey, 1966; Csiszár, 1967), and thus they share similar properties. For example, SMI is non-negative and equals zero if and only if $x$ and $y$ are statistically independent, as is the ordinary MI. In the existing information-maximiz...

174 | A general class of coefficients of divergence of one distribution from another - Ali, Silvey - 1966
Citation Context: ...$p^*(x,y) \log \frac{p^*(x,y)}{p^*(x)p^*(y)} \, dx$. (2) The Pearson divergence and the Kullback-Leibler divergence both belong to the class of Ali-Silvey-Csiszár divergences (also known as $f$-divergences; see Ali & Silvey, 1966; Csiszár, 1967), and thus they share similar properties. For example, SMI is non-negative and equals zero if and only if $x$ and $y$ are statistically independent, as is the ordinary MI. In the existing info...

149 | On a criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling - Pearson - 1900
Citation Context: ...$\mathrm{SMI} := \frac{1}{2} \int \sum_{y=1}^{c} p^*(x) p^*(y) \left( \frac{p^*(x,y)}{p^*(x)p^*(y)} - 1 \right)^2 dx$, (1) where $p^*(x,y)$ denotes the joint density of $x$ and $y$, and $p^*(y)$ is the marginal probability of $y$. SMI is the Pearson divergence (Pearson, 1900) from $p^*(x,y)$ to $p^*(x)p^*(y)$, while the ordinary MI (Cover & Thomas, 2006) is the Kullback-Leibler divergence (Kullback & Leibler, 1951) from $p^*(x,y)$ to $p^*(x)p^*(y)$: $\mathrm{MI} := \int \sum_{y=1}^{c} p^*(x,y) \log$ ...
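For a discrete joint distribution, Eq. (1) reduces to a finite sum, which makes the Pearson-divergence interpretation easy to verify numerically. A minimal check (helper name ours):

```python
import numpy as np

def smi(pxy):
    """Squared-loss MI: Pearson divergence of p(x,y) from p(x)p(y),
    i.e. Eq. (1) specialized to a discrete joint distribution."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    return float(0.5 * np.sum(px * py * (pxy / (px * py) - 1.0) ** 2))

print(smi(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0.0 (independent)
print(smi(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 0.5 (deterministic pairing)
```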

139 | Kernel k-means, spectral clustering and normalized cuts - Dhillon, Guan, et al. - 2004
Citation Context: ...the principal eigenvectors of $D^{-1/2} W D^{-1/2}$, followed by normalization. Note that spectral clustering was shown to be equivalent to a weighted variant of kernel k-means with a specific kernel (Dhillon et al., 2004). The performance of spectral clustering depends heavily on the choice of the sample-sample similarity $W_{i,j}$. Zelnik-Manor & Perona (2005) proposed a useful unsupervised heuristic to determine the similar...
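The spectral embedding step described above can be sketched as follows: take the principal eigenvectors of $D^{-1/2} W D^{-1/2}$ and normalize the rows, then run k-means in the embedded space. A toy illustration under these assumptions, not the cited authors' code:

```python
import numpy as np

def spectral_embedding(W, c):
    """Embed samples via the c principal eigenvectors of D^{-1/2} W D^{-1/2},
    followed by row normalization; k-means would then run on the rows."""
    d = W.sum(axis=1)                     # degrees D_ii
    Dm = np.diag(1.0 / np.sqrt(d))
    L = Dm @ W @ Dm                       # normalized affinity matrix
    _, V = np.linalg.eigh(L)              # eigenvalues in ascending order
    U = V[:, ::-1][:, :c]                 # principal eigenvectors as columns
    return U / np.linalg.norm(U, axis=1, keepdims=True)

# Two disconnected blocks of samples separate cleanly in the embedding.
W = np.kron(np.eye(2), np.ones((2, 2)))   # block-diagonal affinity, 2 clusters
U = spectral_embedding(W, 2)
```

Rows belonging to the same block coincide in the embedding, while rows from different blocks are orthogonal, so k-means trivially recovers the two clusters.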

84 | Maximum margin clustering - Xu, Neufeld, et al. - 2005
Citation Context: ...(Fukunaga & Hostetler, 1975) uses a non-parametric kernel density estimator for modeling the data-generating probability density and finds clusters based on the modes of the estimated density. Discriminative clustering (Xu et al., 2005; Bach & Harchaoui, 2008) learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized. Dependence-maximization clustering (Song et al.,...

64 | Mercer kernel-based clustering in feature space - Girolami |

56 | Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis - Sugiyama - 1996 |

34 | DIFFRAC: a discriminative and flexible framework for clustering - Bach, Harchaoui - 2007
Citation Context: ...non-parametric kernel density estimator for modeling the data-generating probability density and finds clusters based on the modes of the estimated density. Discriminative clustering (Xu et al., 2005; Bach & Harchaoui, 2008) learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized. Dependence-maximization clustering (Song et al., 2007; Faivishevsky & Go...

26 | Gaussian mean-shift is an em algorithm - Carreira-Perpinan |

24 | Mutual information estimation reveals global associations between stimuli and biological processes - Suzuki, Sugiyama, et al. - 2009 |

22 | Fast nonparametric clustering with gaussian blurring mean-shift - Carreira-Perpinan - 2006 |

17 | Approximating mutual information by maximum likelihood density ratio estimation - Suzuki, Sugiyama, et al. - 2008 |

12 | Discriminative clustering by regularized information maximization - Krause, Perona, et al. - 2010
Citation Context: ...manually (although the magic number '7' was shown to work well in their experiments). Another line of clustering framework, called information-maximization clustering (Agakov & Barber, 2006; Gomes et al., 2010), has exhibited state-of-the-art performance. In this information-maximization approach, probabilistic classifiers such as a kernelized Gaussian classifier (Agakov & Barber, 2006) and a kernel logistic...

7 | A nonparametric information theoretic clustering algorithm - Faivishevsky, Goldberger - 2010
Citation Context: ...& Harchaoui, 2008) learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized. Dependence-maximization clustering (Song et al., 2007; Faivishevsky & Goldberger, 2010) determines cluster assignments so that their dependence on input data is maximized. These non-linear clustering techniques would be capable of handling highly complex real-world data. However, they...

1 | A dependence maximization view of clustering (ICML) - Song, Smola, et al. - 2007
Citation Context: ...et al., 2005; Bach & Harchaoui, 2008) learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized. Dependence-maximization clustering (Song et al., 2007; Faivishevsky & Goldberger, 2010) determines cluster assignments so that their dependence on input data is maximized. These non-linear clustering techniques would be capable of handling highly comple...