## A Latent Dirichlet Model for Unsupervised Entity Resolution (2006)

Venue: SIAM International Conference on Data Mining

Citations: 73 (6 self)

### BibTeX

@INPROCEEDINGS{Bhattacharya06alatent,

author = {Indrajit Bhattacharya and Lise Getoor},

title = {A Latent Dirichlet Model for Unsupervised Entity Resolution},

booktitle = {SIAM INTERNATIONAL CONFERENCE ON DATA MINING},

year = {2006},

publisher = {}

}

### Abstract

Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it (a) is generative, (b) does not make pair-wise decisions, and (c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.
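The abstract notes that the number of entities is inferred rather than fixed in advance, and the citation contexts on this page relate that idea to the Dirichlet process. As a rough illustration only, the Python sketch below computes the standard Chinese-restaurant-process prior over the entity label of the next reference: an existing label is weighted by its current count, and a previously unused label by a concentration parameter. The names `crp_label_prior`, `NEW_LABEL`, and `alpha` are hypothetical and not taken from the paper, and the paper's own sampler also conditions on reference names and group variables, which this sketch leaves out.

```python
from collections import Counter

NEW_LABEL = "<new>"  # hypothetical sentinel meaning "open a previously unused entity"


def crp_label_prior(labels, alpha):
    """Prior over the next entity label given the labels assigned so far.

    An existing label e gets n_e / (n + alpha) and a new label gets
    alpha / (n + alpha), where n = len(labels).
    Illustrative assumption: this is only the prior term, not the paper's
    full conditional distribution.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    counts = Counter(labels)
    denom = len(labels) + alpha
    prior = {label: count / denom for label, count in counts.items()}
    prior[NEW_LABEL] = alpha / denom
    return prior


# Three earlier references: two resolved to entity "e1", one to "e2".
prior = crp_label_prior(["e1", "e1", "e2"], alpha=1.0)
print(prior)
```

With three earlier labels and `alpha=1.0`, the prior is 0.5 for `e1`, 0.25 for `e2`, and 0.25 for a new entity. A larger `alpha` shifts mass toward opening new entities, which is how a model of this kind can let the number of entities grow with the data.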

### Citations

2366 | Latent Dirichlet Allocation
- Blei, Ng, et al.
Citation Context ...els relations between entities using group membership. We introduce a generative probabilistic model for entity resolution that builds on the recently proposed Latent Dirichlet Allocation model (LDA) [6]. Unlike most existing models, we do not introduce a decision variable for each potential duplicate pair of references, but instead have an entity label for each reference. To model collaborative rela...

708 | A Bayesian analysis of some nonparametric problems
- Ferguson
- 1973
Citation Context ...label or alternatively a hitherto unused one. For a new entity label, its observed occurrence count C AT (−i)ati 9.2 Relation to the Dirichlet Process The Dirichlet process was introduced by Ferguson [14] and Antoniak [2] as a non-parametric statistical approach that allows the complexity of the model to grow with increasing size of the data. In the context of our application, we would like the number...

625 | Finding scientific topics
- Griffiths, Steyvers
- 2004
Citation Context ...ing Gibbs Sampling In general, the integral in Eq. (5.1) is intractable due to coupling between θ and φ. Different approximations have been proposed, including variational methods [6], Gibbs sampling [16] and Expectation Propagation [25]. We extend the approach proposed by Griffiths et al. [16] for our model. Now θ and φ are not directly estimated as parameters. Instead, we first construct the posteri...

531 | Probabilistic latent semantic analysis
- Hofmann
- 1999
Citation Context ...ch so that the graph has nodes for all potential duplicate pairs and all pairs of similar attributes. We model collaborative groups using LDA [6] which improves Probabilistic Latent Semantic Indexing [18] as a generative topic model for documents. The related author-topic model [31] recognizes the problem of duplicate authors; here we propose a solution for it. Kubica et al. [21] have proposed generati...

416 | Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics
- Antoniak
- 1974
Citation Context ...vely a hitherto unused one. For a new entity label, its observed occurrence count C AT (−i)ati 9.2 Relation to the Dirichlet Process The Dirichlet process was introduced by Ferguson [14] and Antoniak [2] as a non-parametric statistical approach that allows the complexity of the model to grow with increasing size of the data. In the context of our application, we would like the number of entities to b...

397 | A Theory for Record Linkage
- Fellegi, Sunter
- 1969
Citation Context ...rning [32]. In addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing record linkage as a probabilistic classification problem was done by Fellegi and Sunter [13]. Winkler [34] builds upon this work by introducing a latent match variable estimated using Expectation Maximization. More recently, hierarchical graphical models have been proposed [30]. Probabilisti...

372 | Markov chain sampling methods for Dirichlet process mixture models
- Neal
- 1998
Citation Context ...−1+α α n−1+α where n_i is the number of times η*_i has occurred in η_{1:n−1}. Exact inference is intractable in the Dirichlet process mixture model but approximate inference techniques have been proposed [27, 5]. Of particular interest is the Gibbs sampling strategy proposed by Neal [27]. This algorithm iteratively samples the component label a_i for the i-th data object r_i from the conditional distribution gi...

333 | A comparison of string distance metrics for name-matching tasks
- Cohen, Ravikumar, et al.
- 2003
Citation Context ... linkage, and co-reference detection. The traditional approach to entity resolution considers similarity of textual attributes. There has been extensive work on approximate string matching algorithms [26, 8] and adaptive algorithms that learn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integratio...

308 | Expectation Propagation for Approximate Bayesian Inference
- Minka
- 2001
Citation Context ...e integral in Eq. (5.1) is intractable due to coupling between θ and φ. Different approximations have been proposed, including variational methods [6], Gibbs sampling [16] and Expectation Propagation [25]. We extend the approach proposed by Griffiths et al. [16] for our model. Now θ and φ are not directly estimated as parameters. Instead, we first construct the posterior distribution P(z, a | r) and ...

300 | The Merge/Purge Problem for Large Databases
- Hernandez, Stolfo
- 1995
Citation Context ...arn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing record linkage as a probabilistic classification problem was done by Fellegi and Sunter [13]. Winkler [34] builds upon this work by introducing a latent match variable esti...

258 | Efficient Clustering of High Dimensional Data Sets with Application to Reference
- McCallum, Nigam, Ungar
- 2000
Citation Context ...arn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing record linkage as a probabilistic classification problem was done by Fellegi and Sunter [13]. Winkler [34] builds upon this work by introducing a latent match variable esti...

237 | Adaptive Duplicate Detection Using Learnable String Similarity Measures
- Bilenko, Mooney
- 2003
Citation Context ...ty resolution considers similarity of textual attributes. There has been extensive work on approximate string matching algorithms [26, 8] and adaptive algorithms that learn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing r...

233 | The author-topic model for authors and documents
- Rosen-Zvi, Griffiths, et al.
- 2004
Citation Context ... of similar attributes. We model collaborative groups using LDA [6] which improves Probabilistic Latent Semantic Indexing [18] as a generative topic model for documents. The related author-topic model [31] recognizes the problem of duplicate authors; here we propose a solution for it. Kubica et al. [21] have proposed generative models for links using underlying groups, but they do not handle identity u...

227 | CiteSeer: An Automatic Citation Indexing System
- Giles, Bollacker, et al.
- 1998
Citation Context ...m experimental evaluations on two citation datasets. The first is the CiteSeer dataset containing citations to papers from four different areas in machine learning, originally created by Giles et al. [15]. This has 2,892 references to 1,165 authors, contained in 1,504 documents. The second dataset is significantly larger; arXiv (HEP) contains papers from high energy physics used in KDD Cup 2003. Th...

197 | Interactive Deduplication Using Active Learning
- Sarawagi, Bhamidipaty
- 2002
Citation Context ...string matching algorithms [26, 8] and adaptive algorithms that learn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing record linkage as a probabilistic classification problem was done by Fellegi and Sunter [13]. Wink...

155 | Robust and efficient fuzzy match for online data cleaning
- Chaudhuri, Ganjam, et al.
- 2006
Citation Context ... more difficult problem where neither the entities nor the number of entities is known. Non-probabilistic approaches that take relational features into account for data integration have been proposed [11, 7, 1, 3, 20, 12]. Chaudhuri et al. [7] make use of join information for deduplication but assume the secondary tables themselves to be clean. The notion of co-occurrence in dimensional hierarchies has also been propo...

153 | Identity uncertainty and citation matching
- Pasula, Marthi, et al.
- 2003
Citation Context ...ow that collective entity resolution improves performance over independent pair-wise resolution. There is a long history of work in both general and relational entity resolution. Recently, generative [22, 29] and discriminative [24, 28] probabilistic approaches have been proposed as well as non-probabilistic algorithms [20, 12]. Our model differs from most of the above in that it is unsupervised, does not...

152 | The Field Matching Problems: Algorithms and Applications
- Monge, Elkan
- 1996
Citation Context ... linkage, and co-reference detection. The traditional approach to entity resolution considers similarity of textual attributes. There has been extensive work on approximate string matching algorithms [26, 8] and adaptive algorithms that learn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integratio...

128 | Learning to match and cluster large high-dimensional data sets for data integration
- Cohen, Richman
- 2002
Citation Context ...ty resolution considers similarity of textual attributes. There has been extensive work on approximate string matching algorithms [26, 8] and adaptive algorithms that learn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing r...

121 | Conditional models of identity uncertainty with application to noun coreference
- McCallum, Wellner
- 2004
Citation Context ...solution improves performance over independent pair-wise resolution. There is a long history of work in both general and relational entity resolution. Recently, generative [22, 29] and discriminative [24, 28] probabilistic approaches have been proposed as well as non-probabilistic algorithms [20, 12]. Our model differs from most of the above in that it is unsupervised, does not assume the underlying entit...

119 | Reference reconciliation in complex information spaces
- Dong, Halevy, et al.
- 2005
Citation Context ...of work in both general and relational entity resolution. Recently, generative [22, 29] and discriminative [24, 28] probabilistic approaches have been proposed as well as non-probabilistic algorithms [20, 12]. Our model differs from most of the above in that it is unsupervised, does not assume the underlying entities to be known, does not make pairwise decisions and explicitly models relations between ent...

111 | Eliminating fuzzy duplicates in data warehouses
- Ananthakrishna, Chaudhuri, et al.
- 2002
Citation Context ... more difficult problem where neither the entities nor the number of entities is known. Non-probabilistic approaches that take relational features into account for data integration have been proposed [11, 7, 1, 3, 20, 12]. Chaudhuri et al. [7] make use of join information for deduplication but assume the secondary tables themselves to be clean. The notion of co-occurrence in dimensional hierarchies has also been propo...

91 | Learning object identification rules for information integration
- Tejada, Knoblock, et al.
Citation Context ...ty resolution considers similarity of textual attributes. There has been extensive work on approximate string matching algorithms [26, 8] and adaptive algorithms that learn string similarity measures [4, 9, 33]. Beyond applying standard machine learning techniques, other approaches use active learning [32]. In addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing r...

75 | Exploiting relationships for domain-independent data cleaning
- Kalashnikov, Mehrotra, et al.
- 2005
Citation Context ...of work in both general and relational entity resolution. Recently, generative [22, 29] and discriminative [24, 28] probabilistic approaches have been proposed as well as non-probabilistic algorithms [20, 12]. Our model differs from most of the above in that it is unsupervised, does not assume the underlying entities to be known, does not make pairwise decisions and explicitly models relations between ent...

63 | Iterative Record Linkage for Cleaning and Integration
- Bhattacharya, Getoor
- 2004
Citation Context ... more difficult problem where neither the entities nor the number of entities is known. Non-probabilistic approaches that take relational features into account for data integration have been proposed [11, 7, 1, 3, 20, 12]. Chaudhuri et al. [7] make use of join information for deduplication but assume the secondary tables themselves to be clean. The notion of co-occurrence in dimensional hierarchies has also been propo...

54 | Stochastic link and group detection
- Kubica, Moore, et al.
- 2002
Citation Context ...tent Semantic Indexing [18] as a generative topic model for documents. The related author-topic model [31] recognizes the problem of duplicate authors; here we propose a solution for it. Kubica et al. [21] have proposed generative models for links using underlying groups, but they do not handle identity uncertainty. 4 LDA Model for Authors In this section, we show how the LDA model for topics and words...

51 | Multi-relational record linkage
- Domingos
- 2004
Citation Context ...solution improves performance over independent pair-wise resolution. There is a long history of work in both general and relational entity resolution. Recently, generative [22, 29] and discriminative [24, 28] probabilistic approaches have been proposed as well as non-probabilistic algorithms [20, 12]. Our model differs from most of the above in that it is unsupervised, does not assume the underlying entit...

42 | Variational methods for the Dirichlet process
- Blei, Jordan
- 2004
Citation Context ...−1+α α n−1+α where n_i is the number of times η*_i has occurred in η_{1:n−1}. Exact inference is intractable in the Dirichlet process mixture model but approximate inference techniques have been proposed [27, 5]. Of particular interest is the Gibbs sampling strategy proposed by Neal [27]. This algorithm iteratively samples the component label a_i for the i-th data object r_i from the conditional distribution gi...

33 | Methods for record linkage and Bayesian networks
- Winkler
- 2002
Citation Context ... addition, data integration is an area of active research [17, 26, 23]. The groundwork for posing record linkage as a probabilistic classification problem was done by Fellegi and Sunter [13]. Winkler [34] builds upon this work by introducing a latent match variable estimated using Expectation Maximization. More recently, hierarchical graphical models have been proposed [30]. Probabilistic models that ...

32 | A hierarchical graphical model for record linkage
- Ravikumar, Cohen
- 2004
Citation Context ...gi and Sunter [13]. Winkler [34] builds upon this work by introducing a latent match variable estimated using Expectation Maximization. More recently, hierarchical graphical models have been proposed [30]. Probabilistic models that take into account interaction between different entity resolution decisions have been proposed for named entity recognition in natural language processing and for citation ...

26 | A Bayesian model for supervised clustering with the Dirichlet process prior
- Daumé, Marcu
- 2005
Citation Context ...or the citation matching problem. This captures dependence between identities of co-authors of the same paper, but does not model collaborative probabilities between authors directly. Daumé and Marcu [19] have recently proposed an extension to Pasula et al.’s model, where the number of clusters or entities is directly modeled by a Dirichlet Process and is similar in spirit to ours. However, we propose...

18 | Robust reading: Identification and tracing of ambiguous names
- Li, Morie, et al.
- 2004
Citation Context ...ow that collective entity resolution improves performance over independent pair-wise resolution. There is a long history of work in both general and relational entity resolution. Recently, generative [22, 29] and discriminative [24, 28] probabilistic approaches have been proposed as well as non-probabilistic algorithms [20, 12]. Our model differs from most of the above in that it is unsupervised, does not...

13 | Object matching for data integration: A profile-based approach
- Doan, Lu, et al.
- 2003

3 | A conditional model of deduplication for multi-type relational data
- Culotta, McCallum
- 2005
Citation Context ... parameters where the decision for one pair affects another through their overlap. Parag et al. [28] extend the CRF model to merge evidence across multiple fields. More recently, Culotta and McCallum [10] have considered relations between multiple types to deduplicate them jointly. However, all of these models consider pairwise decisions between potential duplicates and are supervised in that their pa...