## A Probabilistic Framework for Semi-Supervised Clustering (2004)

Citations: 182 (12 self)

### BibTeX

@INPROCEEDINGS{Basu04aprobabilistic,
    author = {Sugato Basu},
    title = {A Probabilistic Framework for Semi-Supervised Clustering},
    booktitle = {},
    year = {2004},
    pages = {59--68}
}

### Abstract

Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints either to modify the objective function or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.

### Citations

8134 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...well as the cluster labels for the points are unknown in a clustering setting, maximizing Eqn. (5) is an “incomplete-data problem”, for which a popular solution method is Expectation Maximization (EM) [16]. It is well-known that K-Means is equivalent to an EM algorithm with hard clustering assignments [26, 6, 3]. Section 3.2 describes a K-Means-type hard partitional clustering algorithm, HMRF-KMEANS, th...

7067 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...odel [36]. There exist several techniques for computing cluster assignments that approximate the optimal solution in this framework, e.g., iterated conditional modes (ICM) [9, 40], belief propagation [34, 36], and linear programming relaxation [28]. We follow the ICM approach, which is a greedy strategy to sequentially update the cluster assignment of each point, keeping the assignments for the other poin...

3737 | Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images
- Geman, Geman
- 1984
Citation Context ...ticular cluster label configuration $L$ to be the joint event $L = \{l_i\}_{i=1}^{N}$. By the Hammersley-Clifford theorem [22], the probability of a label configuration can be expressed as a Gibbs distribution [21], so that $\Pr(L) = \frac{1}{Z_1} \exp\left(-V(L)\right) = \frac{1}{Z_1} \exp\left(-\sum_{N_i \in \mathcal{N}} V_{N_i}(L)\right)$, where $\mathcal{N}$ is the set of all neighborhoods, $Z_1$ is a normalizing constant, and $V(L)$ is the overall label configuration potential fun...

2368 | Modern Information Retrieval
- Baeza-Yates, Ribeiro-Neto
- 1999
Citation Context ...istances, e.g., KL divergence. In a number of applications, such as text-clustering using a vector-space model, a directional similarity measure based on the angle between vectors is more appropriate [1]. Consequently, clustering algorithms that utilize distortion measures appropriate for directional data have recently been developed [18, 2]. Our unified semi-supervised clustering framework based on ...

1866 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context ...type) so that a well-defined cost function, involving a distortion measure between the points and the cluster representatives, is minimized. A popular clustering algorithm in this category is K-Means [29]. Earlier research on semi-supervised clustering has considered supervision in the form of labeled points [6] or constraints [38, 39, 5]. In this paper, we will be considering the model where supervis...

1822 | Cluster analysis and display of genome-wide expression patterns
- Eisen, Spellman, et al.
- 1998
Citation Context ...t kinds of gene representations, for which different clustering distance measures would be appropriate, e.g., Pearson’s correlation would be an appropriate distortion measure for gene microarray data [20], I-divergence would be useful for the phylogenetic profile representation of genes [30], etc. We plan to run experiments for clustering these datasets using the HMRF-KMEANS algorithm, where the const...

1245 | Combining labeled and unlabeled data with co-training
- Blum, Mitchell
- 1998
Citation Context ...e to generate, since labeling typically requires human expertise. Consequently, semi-supervised learning, which uses both labeled and unlabeled data, has become a topic of significant recent interest [11, 24, 33]. In this paper, we focus on semi-supervised clustering, where the performance of unsupervised clustering algorithms is improved with limited amounts of supervision in the form of labels on the data o...

926 | On the statistical analysis of dirty pictures
- Besag
- 1986
Citation Context ...le in any non-trivial HMRF model [36]. There exist several techniques for computing cluster assignments that approximate the optimal solution in this framework, e.g., iterated conditional modes (ICM) [9, 40], belief propagation [34, 36], and linear programming relaxation [28]. We follow the ICM approach, which is a greedy strategy to sequentially update the cluster assignment of each point, keeping the a...
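The greedy ICM update described in this context can be illustrated with a minimal Python sketch. It is not the paper's HMRF-KMEANS implementation: the function names (`sq_dist`, `icm_assign`), the squared Euclidean distortion, and the single uniform violation weight `w` are assumptions chosen for illustration, and the centroids are held fixed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def icm_assign(points, centroids, must_link, cannot_link, w=1.0, max_sweeps=20):
    """Greedy ICM-style label updates under pairwise constraints.

    Each point is re-assigned in turn to the cluster with the lowest local
    cost while all other labels stay fixed. The uniform penalty `w` per
    violated constraint is an illustrative assumption.
    """
    k = len(centroids)
    # Start from the unconstrained nearest-centroid labels.
    labels = [min(range(k), key=lambda h: sq_dist(p, centroids[h])) for p in points]

    def cost(i, h):
        # Distortion to the candidate centroid plus constraint penalties.
        c = sq_dist(points[i], centroids[h])
        for a, b in must_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] != h:   # must-link partner in another cluster
                    c += w
        for a, b in cannot_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] == h:   # cannot-link partner in the same cluster
                    c += w
        return c

    for _ in range(max_sweeps):
        changed = False
        for i in range(len(points)):
            best = min(range(k), key=lambda h: cost(i, h))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:  # a full sweep without changes: local optimum reached
            break
    return labels
```

For example, with points `[(0.0,), (0.4,), (1.0,)]` and centroids `[(0.0,), (1.0,)]`, a must-link constraint between the second and third points pulls the second point into the cluster of the third. Because each point is updated with the others held fixed, the result is a local optimum that can depend on the visiting order.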

800 | Text classification from labeled and unlabeled documents using EM
- Nigam, McCallum, et al.
- 2000
Citation Context ...e to generate, since labeling typically requires human expertise. Consequently, semi-supervised learning, which uses both labeled and unlabeled data, has become a topic of significant recent interest [11, 24, 33]. In this paper, we focus on semi-supervised clustering, where the performance of unsupervised clustering algorithms is improved with limited amounts of supervision in the form of labels on the data o...

770 | A view of the EM algorithm that justifies incremental, sparse, and other variants
- Neal, Hinton
- 1999
Citation Context ... distortion measure $D$ is updated in the M-step to reduce the objective function simultaneously by transforming the space in which data lies. Note that this corresponds to the generalized EM algorithm [32, 16], where the objective function is reduced but not necessarily minimized in the M-step. Effectively, the E-step minimizes $J_{\text{obj}}$ over cluster assignments $L$, the M-step (A) minimizes $J_{\text{obj}}$ over cluster rep...

681 | Transductive inference for text classification using support vector machines
- Joachims
- 1999
Citation Context ...e to generate, since labeling typically requires human expertise. Consequently, semi-supervised learning, which uses both labeled and unlabeled data, has become a topic of significant recent interest [11, 24, 33]. In this paper, we focus on semi-supervised clustering, where the performance of unsupervised clustering algorithms is improved with limited amounts of supervision in the form of labels on the data o...

548 | Distributional clustering of English words
- Pereira, Tishby, et al.
- 1993
Citation Context ...erized I-Divergence In certain domains, data is described by probability distributions, e.g. text documents can be represented as probability distributions over words generated by a multinomial model [35]. KL-divergence is a widely used distance measure for such data: $D_{KL}(x_i, x_j) = \sum_{m=1}^{d} x_{im} \log \frac{x_{im}}{x_{jm}}$, where $x_i$ and $x_j$ are probability distributions over $d$ events: $\sum_{m=1}^{d} x_{im} = \sum_{m=1}^{d} x_{jm} = 1$. I...
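The KL-divergence formula quoted in this context can be evaluated with a short Python sketch. The function name `kl_divergence` and the handling of zero entries (the usual conventions that a zero term in the first argument contributes nothing and a zero in the second argument gives an infinite divergence) are assumptions for illustration; the quoted text does not specify them.

```python
import math


def kl_divergence(p, q):
    """D_KL(p, q) = sum_m p_m * log(p_m / q_m) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of events")
    total = 0.0
    for pm, qm in zip(p, q):
        if pm == 0.0:
            continue          # 0 * log(0 / q) is taken as 0 by convention
        if qm == 0.0:
            return math.inf   # p puts mass where q has none
        total += pm * math.log(pm / qm)
    return total
```

The measure is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.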

507 | Distance metric learning, with application to clustering with side information
- Xing, Ng, et al.
- 2004
Citation Context ...work on distance-based semi-supervised clustering with pairwise constraints, Cohn et al. [13] used gradient descent for weighted Jensen-Shannon divergence in the context of EM clustering. Xing et al. [39] utilized a combination of gradient descent and iterative projections to learn a Mahalanobis distance for K-Means clustering. The Redundant Component Analysis (RCA) algorithm used only must-link const...

328 | Constrained K-means clustering with background knowledge
- Wagstaff, Cardie, et al.
- 2001
Citation Context ...r, we focus on semi-supervised clustering, where the performance of unsupervised clustering algorithms is improved with limited amounts of supervision in the form of labels on the data or constraints [38, 6, 27, 39, 7]. Existing methods for semi-supervised clustering fall into two general categories which we call constraint-based and distance-based. Constraint-based methods rely on user-provided labels or constraint...

309 | Clustering with Bregman divergences
- Banerjee, Merugu, et al.
- 2004
Citation Context ...divergences each cluster representative calculated in the M-step of the EM algorithm is equivalent to the expectation value over the points in that cluster, which is essentially their arithmetic mean [3]. Additionally, it has been experimentally demonstrated that for distribution-based clustering, smoothing cluster representatives by a prior using a deterministic annealing schedule leads to considera...

302 | Concept decompositions for large sparse text data using clustering
- Dhillon, Modha
Citation Context ... measure based on the angle between vectors is more appropriate [1]. Consequently, clustering algorithms that utilize distortion measures appropriate for directional data have recently been developed [18, 2]. Our unified semi-supervised clustering framework based on HMRFs is also applicable to such directional similarity measures. To summarize, the proposed approach aids unsupervised clustering by incorp...

238 | Adaptive duplicate detection using learnable string similarity measures
- Bilenko, Mooney
- 2003
Citation Context ...bels or constraints in the supervised data. Several adaptive distance measures have been used for semi-supervised clustering, including string-edit distance trained using Expectation Maximization (EM) [10], KL divergence trained using gradient descent [13], Euclidean distance modified by a shortest-path algorithm [27], or Mahalanobis distances trained using convex optimization [39]. We propose a princip...

220 | Correlation clustering
- Bansal, Blum, et al.
Citation Context ...ustering algorithm that has a heuristically motivated objective function [38]. Our method, on the other hand, has an underlying probabilistic model based on Hidden Markov Random Fields. Bansal et al. [4] also proposed a framework for pairwise constrained clustering, but their model performs clustering using only the constraints, whereas our formulation uses both constraints and an underlying distorti...

204 | Directional Statistics
- Mardia, Jupp
- 2000
Citation Context ...s known as Bregman divergences [3]. Another popular class of distortion measures includes directional similarity functions such as normalized dot product (cosine similarity) and Pearson’s correlation [31]. Selection of the most appropriate distortion measure for a clustering task should take into account intrinsic properties of the dataset. For example, Euclidean distance is most appropriate for low-d...

189 | A best possible heuristic for the k-center problem
- Hochbaum, Shmoys
- 1985
Citation Context ...borhoods are selected as initial clusters using the clustering distortion measure. Farthest-first traversal is a good heuristic for initialization in prototype-based partitional clustering algorithms [23]. The goal in farthest-first traversal is to find K points that are maximally separated from each other in terms of a given distance function. In our case, we apply a weighted variant of farthest-firs...
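As a rough illustration of the farthest-first heuristic mentioned in this context, the following Python sketch greedily picks K mutually distant points. It shows the plain, unweighted traversal only; the weighted variant referred to in the quoted text is truncated there and is not reproduced. The names `euclidean` and `farthest_first`, the default Euclidean distance, and the choice of starting index are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(points, k, dist=euclidean, start=0):
    """Return indices of k points chosen by greedy farthest-first traversal."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i]: distance from point i to its nearest already-chosen point.
    min_dist = [dist(p, points[start]) for p in points]
    min_dist[start] = -1.0  # mark chosen points so they are never re-selected
    while len(chosen) < k:
        # Pick the point farthest from everything chosen so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], dist(p, points[nxt]))
        min_dist[nxt] = -1.0
    return chosen
```

With points `[(0, 0), (1, 0), (10, 0), (5, 0)]` and `k=3`, starting from index 0, the traversal selects indices `[0, 2, 3]`: first the point farthest from the start, then the point farthest from both selections.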

165 | Markov random fields with efficient approximations
- Boykov, Veksler, et al.
- 1998
Citation Context ...potential function $V$ in the first term of Eqn. (5). In previous work, only must-linked points were considered in the neighborhood of a Markov Random Field with the generalized Potts potential function [12, 28]. In this potential function, the must-link penalty is $f_M(x_i, x_j) = w_{ij}\,\mathbb{1}[l_i \neq l_j]$ (6), where $w_{ij}$ is the cost for violating the must-link constraint $(i, j)$, and $\mathbb{1}$ is the indicator function ($\mathbb{1}[\text{true}] =$ ...

158 | Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields
- Kleinberg, Tardos
- 1999
Citation Context ...potential function $V$ in the first term of Eqn. (5). In previous work, only must-linked points were considered in the neighborhood of a Markov Random Field with the generalized Potts potential function [12, 28]. In this potential function, the must-link penalty is $f_M(x_i, x_j) = w_{ij}\,\mathbb{1}[l_i \neq l_j]$ (6), where $w_{ij}$ is the cost for violating the must-link constraint $(i, j)$, and $\mathbb{1}$ is the indicator function ($\mathbb{1}[\text{true}] =$ ...

154 | Impact of similarity measures on web-page clustering
- Strehl, Ghosh, et al.
- 2000
Citation Context ...ality of the clustering with respect to a given underlying class labeling of the data: it measures how closely the clustering algorithm could reconstruct the underlying label distribution in the data [37, 19]. If $C$ is the random variable denoting the cluster assignments of the points and $K$ is the random variable denoting the underlying class labels on the points [2], then the NMI measure is defined as: NM...

152 | From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering
- Klein, Kamvar, et al.
- 2002
Citation Context ...r, we focus on semi-supervised clustering, where the performance of unsupervised clustering algorithms is improved with limited amounts of supervision in the form of labels on the data or constraints [38, 6, 27, 39, 7]. Existing methods for semi-supervised clustering fall into two general categories which we call constraint-based and distance-based. Constraint-based methods rely on user-provided labels or constraint...

145 | Semi-supervised clustering by seeding
- Basu, Banerjee, et al.
- 2002

133 | Learning distance functions using equivalence relations
- Bar-Hillel, Hertz, et al.
- 2003
Citation Context ... minimized. A popular clustering algorithm in this category is K-Means [29]. Earlier research on semi-supervised clustering has considered supervision in the form of labeled points [6] or constraints [38, 39, 5]. In this paper, we will be considering the model where supervision is provided in the form of must-link and cannot-link constraints, indicating respectively that a pair of points should be or should ...

109 | Discovering molecular pathways from protein interaction and gene expression data
- Segal, Wang, et al.
- 2003
Citation Context ...r the hidden variables. As a result, computing the assignment of data points to cluster representatives to minimize the objective function is computationally intractable in any non-trivial HMRF model [36]. There exist several techniques for computing cluster assignments that approximate the optimal solution in this framework, e.g., iterated conditional modes (ICM) [9, 40], belief propagation [34, 36],...

107 | Markov fields on finite graphs and lattices. Unpublished manuscript
- Hammersley, Clifford
- 1971
Citation Context ... labels of the points that are must-linked or cannot-linked to $x_i$. Let us consider a particular cluster label configuration $L$ to be the joint event $L = \{l_i\}_{i=1}^{N}$. By the Hammersley-Clifford theorem [22], the probability of a label configuration can be expressed as a Gibbs distribution [21], so that $\Pr(L) = \frac{1}{Z_1} \exp\left(-V(L)\right) = \frac{1}{Z_1} \exp\left(-\sum_{N_i \in \mathcal{N}} V_{N_i}(L)\right)$, where $\mathcal{N}$ is the set of all neighborhoods, ...

101 | Semi-supervised clustering with user feedback. Unpublished manuscript
- Cohn, Caruana, et al.
- 2000
Citation Context ... adaptive distance measures have been used for semi-supervised clustering, including string-edit distance trained using Expectation Maximization (EM) [10], KL divergence trained using gradient descent [13], Euclidean distance modified by a shortest-path algorithm [27], or Mahalanobis distances trained using convex optimization [39]. We propose a principled probabilistic framework based on Hidden Markov ...

91 | Active semi-supervision for pairwise constrained clustering
- Basu, Banerjee, et al.
- 2004
Citation Context ...r, we focus on semi-supervised clustering, where the performance of unsupervised clustering algorithms is improved with limited amounts of supervision in the form of labels on the data or constraints [38, 6, 27, 39, 7]. Existing methods for semi-supervised clustering fall into two general categories which we call constraint-based and distance-based. Constraint-based methods rely on user-provided labels or constraint...

88 | An information-theoretic analysis of hard and soft assignment methods for clustering
- Kearns, Mansour, et al.
- 1997
Citation Context ...an “incomplete-data problem”, for which a popular solution method is Expectation Maximization (EM) [16]. It is well-known that K-Means is equivalent to an EM algorithm with hard clustering assignments [26, 6, 3]. Section 3.2 describes a K-Means-type hard partitional clustering algorithm, HMRF-KMEANS, that finds a (local) maximum of the above function. The posterior probability $\Pr(L|X)$ in Eqn. (5) has 2 compo...

70 | Spectral learning
- Kamvar, Klein, et al.
- 2003
Citation Context ...to learn a Mahalanobis distance using convex optimization [5]. Spectral learning is another recent method that utilizes supervision to transform the clustering distance measure using spectral methods [25]. All these distance learning techniques for clustering train the distance measure first using only supervised data, and then perform clustering on the unsupervised data. In contrast, our method integ...

65 | Semi-supervised clustering using genetic algorithms
- Demiriz, Bennett, et al.
- 1999
Citation Context ...evaluating clusterings so that it includes satisfying constraints [15], enforcing constraints during the clustering process [38], or initializing and constraining the clustering based on labeled examples [6]. In distance-based approaches, an existing clustering algorith...

61 | An information-theoretic external cluster-validity measure
- Dom
- 2001
Citation Context ...ality of the clustering with respect to a given underlying class labeling of the data: it measures how closely the clustering algorithm could reconstruct the underlying label distribution in the data [37, 19]. If $C$ is the random variable denoting the cluster assignments of the points and $K$ is the random variable denoting the underlying class labels on the points [2], then the NMI measure is defined as: NM...

36 | Generative model-based clustering of directional data
- Banerjee, Dhillon, et al.
- 2003
Citation Context ... measure based on the angle between vectors is more appropriate [1]. Consequently, clustering algorithms that utilize distortion measures appropriate for directional data have recently been developed [18, 2]. Our unified semi-supervised clustering framework based on HMRFs is also applicable to such directional similarity measures. To summarize, the proposed approach aids unsupervised clustering by incorp...

29 | Localizing proteins in the cell from their phylogenetic profiles
- Marcotte, Xenarios, et al.
- 2000
Citation Context ...be appropriate, e.g., Pearson’s correlation would be an appropriate distortion measure for gene microarray data [20], I-divergence would be useful for the phylogenetic profile representation of genes [30], etc. We plan to run experiments for clustering these datasets using the HMRF-KMEANS algorithm, where the constraints will be inferred from protein interaction databases as well as from function path...

25 | Information theoretic clustering of sparse co-occurrence data
- Dhillon, Guan
- 2003
Citation Context ...ster conditional probability is a unit variance Gaussian [26]; • $x_i$ and $\mu_{l_i}$ are probability distributions and $D$ is the KL-divergence: the cluster conditional probability is a multinomial distribution [17]; • $x_i$ and $\mu_{l_i}$ are vectors of unit length (according to the $L_2$ norm) and $D$ is one minus the dot-product: the cluster conditional probability is a von Mises-Fisher (vMF) distribution with unit concentr...

19 | Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering
- Basu, Bilenko, et al.
- 2003
Citation Context ...is objective function. Previously, we proposed a unified approach to semi-supervised clustering that was experimentally shown to produce more accurate clusters than other methods on several data sets [8]. However, this approach is restricted to using Euclidean distance as the clustering distortion measure. In this paper, we show how to generalize that model to handle non-Euclidean measures. Our gener...

8 | Hidden Markov random field model and segmentation of brain MR images
- Zhang, Brady, et al.
Citation Context ...le in any non-trivial HMRF model [36]. There exist several techniques for computing cluster assignments that approximate the optimal solution in this framework, e.g., iterated conditional modes (ICM) [9, 40], belief propagation [34, 36], and linear programming relaxation [28]. We follow the ICM approach, which is a greedy strategy to sequentially update the cluster assignment of each point, keeping the a...