## A Bayesian model for supervised clustering with the Dirichlet process prior (2005)

Venue: Journal of Machine Learning Research

Citations: 28 (0 self)

### BibTeX

@ARTICLE{Iii05abayesian,
  author = {Daumé III, Hal and Marcu, Daniel},
  title = {A Bayesian model for supervised clustering with the Dirichlet process prior},
  journal = {Journal of Machine Learning Research},
  year = {2005}
}

### Abstract

We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these “reference types”) that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple—but general—parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics.

### Citations

787 | A Bayesian Analysis of Some Nonparametric Problems
- Ferguson
- 1973
Citation Context ...all k), then the joint distribution of random probabilities $(P^\mu(B_1), \ldots, P^\mu(B_K))$ is distributed according to $\mathrm{Dir}(\mu(B_1), \ldots, \mu(B_K))$, where Dir denotes the standard Dirichlet distribution (Ferguson, 1973, 1974). In words: $P^\mu$ is a Dirichlet process if it behaves as if it were a Dirichlet distribution on any finite partition of the original space. It is typically useful to write $\mu = \alpha G_0$, where $\alpha = \int_\Omega$... |

549 | Distance Metric Learning with Application to Clustering with Side-Information
- Xing, Ng, et al.
- 2003
Citation Context ...stering problem directly. Some researchers have posed the problem in the framework of learning a distance metric, for which, e.g., convex optimization methods can be employed (Bar-Hillel et al., 2003; Xing et al., 2003; Basu et al., 2003). Using a learned distance metric, one is able to use a standard clustering algorithm for doing the final predictions. These methods effectively solve all of the problems associate... |

515 | Factorial hidden Markov models
- Ghahramani, Jordan
- 1997
Citation Context ...Monte Carlo (MCMC) techniques, which are the most frequently used methods for inference in DP models (Antoniak, 1974; Escobar, 1994; Neal, 1998; MacEachern and Müller, 1998; Ishwaran and James, 2001; Beal et al., 2002; Xing et al., 2004). Recently, Blei and Jordan (2004) have presented a variational approach to Dirichlet process models, and Minka and Ghahramani (2004) have presented an inference procedure for DP m... |

496 | Objective criteria for the evaluation of clustering methods
- Rand
- 1971
Citation Context ...navailable. All of these metrics assume that we have a gold standard (correct) clustering $G$ and a hypothesis clustering $H$ and that the total number of data points is $N$. 6.1 Rand Index. The Rand index (Rand, 1971) is computed by viewing the clustering problem as a binary classification problem. Letting $N_{11}$ denote the number of pairs that are in the same cluster in both $G$ and in $H$, and letting $N_{00}$ denote the n... |
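
The excerpt above is cut off before the formula, so the following is only a sketch of the standard Rand index, assuming the usual definition: the fraction of point pairs on which the two clusterings agree, $(N_{11} + N_{00}) / \binom{N}{2}$. The function name `rand_index` and the label-list input format are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def rand_index(gold, hyp):
    """Fraction of point pairs on which two clusterings agree.

    gold and hyp are equal-length sequences of cluster labels, one per
    data point (an assumed input format, chosen for illustration).
    """
    if len(gold) != len(hyp):
        raise ValueError("clusterings must label the same points")
    n = len(gold)
    if n < 2:
        raise ValueError("need at least two points to form a pair")

    n11 = 0  # pairs in the same cluster in both G and H
    n00 = 0  # pairs in different clusters in both G and H
    for i, j in combinations(range(n), 2):
        same_gold = gold[i] == gold[j]
        same_hyp = hyp[i] == hyp[j]
        if same_gold and same_hyp:
            n11 += 1
        elif not same_gold and not same_hyp:
            n00 += 1

    total_pairs = n * (n - 1) // 2
    return (n11 + n00) / total_pairs
```

Only pair co-membership matters, so relabeling the clusters leaves the score unchanged: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to `1.0`.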

457 | Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems
- Antoniak
- 1974
Citation Context ...ution of the observed data. We will suggest and implement inference schemes based on Markov chain Monte Carlo (MCMC) techniques, which are the most frequently used methods for inference in DP models (Antoniak, 1974; Escobar, 1994; Neal, 1998; MacEachern and Müller, 1998; Ishwaran and James, 2001; Beal et al., 2002; Xing et al., 2004). Recently, Blei and Jordan (2004) have presented a variational approach to Dir... |

407 | Markov Chain Sampling Methods for Dirichlet Process Mixture Models
- Neal
Citation Context ... we take the limit as $K \to \infty$ and $L \to \infty$ (where $K$ is the number of publications and $L$ is the number of reference types). This limit corresponds to a choice of a Dirichlet process prior on the $p$s and $t$s (Neal, 1998). 4. Inference Scheme. Inference in infinite models differs from inference in finite models, primarily because we cannot store all possible values for infinite plates. However, as noted earlier, we on... |

275 | Efficient clustering of high-dimensional data sets with application to reference matching
- McCallum, Nigam, et al.
- 2000
Citation Context ...rticularly in the context of extracting citations from scholarly publications, the task is to identify which citations are to the same true publication. Here, the task is known as reference matching (McCallum et al., 2000). In natural language processing, the problem arises in the context of coreference resolution, wherein one wishes to identify which entities mentioned in a document are the same person (or organizati... |

237 | Gibbs sampling methods for stick-breaking priors
- Ishwaran, James
- 2001
Citation Context ...mes based on Markov chain Monte Carlo (MCMC) techniques, which are the most frequently used methods for inference in DP models (Antoniak, 1974; Escobar, 1994; Neal, 1998; MacEachern and Müller, 1998; Ishwaran and James, 2001; Beal et al., 2002; Xing et al., 2004). Recently, Blei and Jordan (2004) have presented a variational approach to Dirichlet process models, and Minka and Ghahramani (2004) have presented an inference... |

182 | An efficient domain-independent algorithm for detecting approximately duplicate database records
- Monge, Elkan
- 1997
Citation Context ... the real world (Cohen and Richman, 2002; Pasula et al., 2003). In the database community, the task arises in the context of merging databases with overlapping fields, and is known as record linkage (Monge and Elkan, 1997; Doan et al., 2004). In information extraction, particularly in the context of extracting citations from scholarly publications, the task is to identify which citations are to the same true publicati... |

174 | Prior distributions on spaces of probability measures
- Ferguson
- 1974
Citation Context ...) and (2) if $P^\mu$ is a DP with parameter $\mu$, then the conditional distribution of $P^\mu$ given a sample $X_1, \ldots, X_N$ is a DP with parameter $\mu + \sum_{n=1}^{N} \delta_{X_n}$, where $\delta_X$ is a point mass concentrated at $X$ (Ferguson, 1974). The final useful fact is a correspondence between the DP and Pólya urns, described by Blackwell and MacQueen (1973). In the Pólya urn construction, we consider the situation of an urn from which we... |

162 | Identity uncertainty and citation matching
- Pasula, Marthi, et al.
- 2003
Citation Context ...ume that there is a one-to-one correspondence between elements in a knowledge base and entities in the real world (Cohen and Richman, 2002; Pasula et al., 2003). In the database community, the task arises in the context of merging databases with overlapping fields, and is known as record linkage (Monge and Elkan, 1997; Doan et al., 2004). In information ext... |

154 | Estimating Mixture of Dirichlet Process Models
- MacEachern, Müller
- 1998
Citation Context ...lo (MCMC) techniques, which are the most frequently used methods for inference in DP models (Antoniak, 1974; Escobar, 1994; Neal, 1998; MacEachern and Müller, 1998; Ishwaran and James, 2001; Beal et al., 2002; Xing et al., 2004). Recently, Blei and Jordan (2005) have presented a variational approach to Dirichlet process models, and Minka and Ghahramani (2004) h... |

145 | Variational inference for Dirichlet process mixtures. Bayesian Analysis
- Blei, Jordan
- 2005
Citation Context ...sals suggested by Jain and Neal (2003). Nevertheless, MCMC algorithms are notoriously slow and experiments employing variational or EP methods for the conjugate models might also improve performance (Blei and Jordan, 2005; Minka and Ghahramani, 2004). Our model is also similar to a distance metric learning algorithm. Under the Gaussian assumption, the reference types become covariance matrices, which, when there is onl... |

139 | Learning Distance Functions Using Equivalence Relations
- Bar-Hillel, Hertz, et al.
- 2003
Citation Context ... solve the supervised clustering problem directly. Some researchers have posed the problem in the framework of learning a distance metric, for which, e.g., convex optimization methods can be employed (Bar-Hillel et al., 2003; Xing et al., 2003; Basu et al., 2003). Using a learned distance metric, one is able to use a standard clustering algorithm for doing the final predictions. These methods effectively solve all of the... |

136 | Learning to match and cluster large high-dimensional data sets for data integration
- Cohen, Richman
- 2002
Citation Context ...citly (or explicitly) assume that there is a one-to-one correspondence between elements in a knowledge base and entities in the real world (Cohen and Richman, 2002; Pasula et al., 2003). In the database community, the task arises in the context of merging databases with overlapping fields, and is known as record linkage (Monge and Elkan, 1997; Doan et al., 2004... |

126 | Conditional models of identity uncertainty with application to noun coreference
- McCallum, Wellner
- 2005
Citation Context ...he context of coreference resolution, wherein one wishes to identify which entities mentioned in a document are the same person (or organization) in real life (Soon et al., 2001; Ng and Cardie, 2002; McCallum and Wellner, 2004). In the machine learning community, it has additionally been referred to as learning under equivalence constraints (Bar-Hillel and Weinshall, 2003) and learning from cluster examples (Kamishima and ... |

114 | Estimating Normal means with a Dirichlet Process prior
- Escobar
- 1994
Citation Context ...erved data. We will suggest and implement inference schemes based on Markov chain Monte Carlo (MCMC) techniques, which are the most frequently used methods for inference in DP models (Antoniak, 1974; Escobar, 1994; Neal, 1998; MacEachern and Müller, 1998; Ishwaran and James, 2001; Beal et al., 2002; Xing et al., 2004). Recently, Blei and Jordan (2004) have presented a variational approach to Dirichlet process ... |

103 | A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model - Jain, Neal - 2000 |

101 | Ontology matching: A machine learning approach
- Doan, Madhavan, et al.
- 2004
Citation Context ...and Richman, 2002; Pasula et al., 2003). In the database community, the task arises in the context of merging databases with overlapping fields, and is known as record linkage (Monge and Elkan, 1997; Doan et al., 2004). In information extraction, particularly in the context of extracting citations from scholarly publications, the task is to identify which citations are to the same true publication. Here, the task ... |

98 | Some developments of the Blackwell-MacQueen urn scheme - Pitman - 1996 |

78 | Bayesian density estimation by mixtures of normal distributions
- Ferguson
- 1983
Citation Context ...en draws observations $\theta_1, \ldots$ from $G$. In such a model, one can analytically integrate out $G$ to obtain the following conditional distributions from the observations $\theta_n$ (Blackwell and MacQueen, 1973; Ferguson, 1983): $\theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \frac{\alpha}{n+\alpha} G_0 + \frac{1}{n+\alpha} \sum_{i=1}^{n} \delta_{\theta_i}$. Thus, the $(n+1)$st data point is drawn with probability proportional to $\alpha$ from the base distribution $G_0$, and is exactly equal to a previously drawn $\theta_i$ wi... |
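
The predictive rule quoted in this excerpt can be simulated directly. The following is a minimal sketch, assuming the Blackwell-MacQueen urn in its standard form: with probability $\alpha/(n+\alpha)$ the next value is a fresh draw from $G_0$, otherwise it copies one of the $n$ earlier values chosen uniformly. The names `polya_urn_draws` and `base_draw` are hypothetical, and the standard normal base distribution in the usage note is an arbitrary choice for illustration, not the paper's model.

```python
import random


def polya_urn_draws(n, alpha, base_draw, rng):
    """Draw theta_1, ..., theta_n sequentially from a Polya urn.

    alpha is the concentration parameter; base_draw(rng) returns one
    sample from the base distribution G0 (a caller-supplied stand-in).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    thetas = []
    for i in range(n):
        # i values exist so far: draw from G0 w.p. alpha / (i + alpha).
        if rng.random() < alpha / (i + alpha):
            thetas.append(base_draw(rng))
        else:
            # Otherwise copy a previous value, each with weight 1 / (i + alpha).
            thetas.append(rng.choice(thetas))
    return thetas
```

Calling `polya_urn_draws(50, 1.0, lambda r: r.gauss(0.0, 1.0), random.Random(0))` returns 50 values with repeats; the distinct values play the role of clusters, and a larger `alpha` tends to produce more of them.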

68 | Supervised clustering with support vector machines - Finley, Joachims - 2005 |

54 | Support vector machine learning for interdependent and structured output spaces
- Tsochantaridis, Hofmann, et al.
- 2004
Citation Context ...d be argued that the perceptron “approximation” is actually superior to the CRF, since it optimizes something closer to “accuracy” than the log-loss optimized by the CRF. ...nique (Tsochantaridis et al., 2004). In this model, a particular clustering method, correlation clustering, is held fixed, and weights are optimized to minimize the regularized empirical loss of the training data with respect to this ... |

52 | Comparing clusterings - Meila - 2003 |

47 | Variational methods for the Dirichlet process - Blei, Jordan - 2004 |

39 | Hyperparameter Estimation in Dirichlet Process Mixture Models - West - 1992 |

27 | Clustering by Committee - Pantel - 2003 |

21 | Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering
- Basu, Bilenko, et al.
- 2004
Citation Context ...ectly. Some researchers have posed the problem in the framework of learning a distance metric, for which, e.g., convex optimization methods can be employed (Bar-Hillel et al., 2003; Xing et al., 2003; Basu et al., 2003). Using a learned distance metric, one is able to use a standard clustering algorithm for doing the final predictions. These methods effectively solve all of the problems associated with the classifi... |

20 | Nonparametric Bayesian logic - Carbonetto, Kisynski, et al. - 2005 |

13 | A machine learning approach to coreference resolution of noun phrases - Soon, Ng, Lim |

8 | Learning with equivalence constraints and the relation to multiclass learning
- Bar-Hillel, Weinshall
- 2003
Citation Context ...) in real life (Soon et al., 2001; Ng and Cardie, 2002; McCallum and Wellner, 2004). In the machine learning community, it has additionally been referred to as learning under equivalence constraints (Bar-Hillel and Weinshall, 2003) and learning from cluster examples (Kamishima and Motoyoshi, 2003). In this paper, we propose a generative model for solving the supervised clustering problem. Our model takes advantage of the Diric... |

7 | Ferguson distributions via Pólya urn schemes
- Blackwell, MacQueen
- 1973
Citation Context ...) > 0 for all k), then the joint distribution of random probabilities $(P^\mu(B_1), \ldots, P^\mu(B_K))$ is distributed according to $\mathrm{Dir}(\mu(B_1), \ldots, \mu(B_K))$, where Dir denotes the standard Dirichlet distribution (Ferguson, 1973, 1974). In words: $P^\mu$ is a Dirichlet process if it behaves as if it were a Dirichlet distribution on any finite partition of the original space. It is typically useful to write $\mu = \alpha G_0$, where $\alpha = \int_\Omega$ ... |

1 | The Annals of Statistics, 1(2):353–355 - Jordan - 1973 |

1 | Learning from cluster examples - Kamishima, Motoyoshi - 2003 |

1 | Expectation propagation for infinite mixtures - Minka, Ghahramani - 2004 |