## Entity Resolution with Markov Logic (2006)

Venue: ICDM

Citations: 78 (9 self)

### BibTeX

@INPROCEEDINGS{Singla06entityresolution,
    author = {Parag Singla and Pedro Domingos},
    title = {Entity Resolution with Markov Logic},
    booktitle = {ICDM},
    year = {2006},
    pages = {572--582},
    publisher = {IEEE Computer Society Press}
}

### Abstract

Entity resolution is the problem of determining which records in a database refer to the same entities, and is a crucial and expensive step in the data mining process. Interest in it has grown rapidly in recent years, and many approaches have been proposed. However, they tend to address only isolated aspects of the problem, and are often ad hoc. This paper proposes a well-founded, integrated solution to the entity resolution problem based on Markov logic. Markov logic combines first-order logic and probabilistic graphical models by attaching weights to first-order formulas, and viewing them as templates for features of Markov networks. We show how a number of previous approaches can be formulated and seamlessly combined in Markov logic, and how the resulting learning and inference problems can be solved efficiently. Experiments on two citation databases show the utility of this approach, and evaluate the contribution of the different components.
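To make the idea of weighted first-order formulas concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy domain of three records, a hypothetical evidence predicate `SimilarTitle`, a hypothetical query predicate `SameEntity`, and hand-picked weights. Each world's unnormalized probability is the exponential of the weighted count of satisfied formula groundings, and marginals are obtained by brute-force enumeration, which is only feasible at this toy scale.

```python
import itertools
import math

# Hypothetical toy domain: three records and their unordered pairs.
records = ["r1", "r2", "r3"]
pairs = list(itertools.combinations(records, 2))

# Assumed evidence: SimilarTitle holds for (r1, r2) and (r2, r3) only.
similar = {("r1", "r2"): True, ("r2", "r3"): True, ("r1", "r3"): False}

# Hand-picked illustrative weights (not learned, not from the paper).
W_PRIOR = 1.0    # !SameEntity(x, y)
W_SIMILAR = 2.0  # SimilarTitle(x, y) => SameEntity(x, y)
W_TRANS = 3.0    # SameEntity(x, y) ^ SameEntity(y, z) => SameEntity(x, z)


def same(world, a, b):
    """Truth value of SameEntity(a, b) in a world keyed by sorted pairs."""
    return world[tuple(sorted((a, b)))]


def log_score(world):
    """Sum of weights of satisfied groundings (log of unnormalized probability)."""
    total = 0.0
    for p in pairs:
        if not world[p]:
            total += W_PRIOR
        if (not similar[p]) or world[p]:
            total += W_SIMILAR
    for x, y, z in itertools.permutations(records, 3):
        if not (same(world, x, y) and same(world, y, z)) or same(world, x, z):
            total += W_TRANS
    return total


def marginals():
    """Exact marginal P(SameEntity(pair)) by enumerating all possible worlds."""
    worlds = [dict(zip(pairs, vals))
              for vals in itertools.product([False, True], repeat=len(pairs))]
    scores = [math.exp(log_score(w)) for w in worlds]
    z = sum(scores)
    return {p: sum(s for w, s in zip(worlds, scores) if w[p]) / z for p in pairs}


if __name__ == "__main__":
    for pair, prob in marginals().items():
        print(pair, round(prob, 3))
```

In this sketch the pair (r1, r3) has no direct title evidence, yet the soft transitivity formula raises its match probability above what the prior alone would give, which illustrates why match decisions benefit from being made jointly rather than pair by pair.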

### Citations

7319 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...riments, and outline directions for future work. 2 Markov Networks A Markov network (also known as Markov random field) is a model for the joint distribution of a set of variables $X = (X_1, X_2, \ldots, X_n) \in \mathcal{X}$ [35]. It is composed of an undirected graph G and a set of potential functions $\phi_k$. The graph has a node for each variable, and the model has a potential function for each clique in the graph. A potential ...

2036 | Numerical Optimization
- Nocedal, Wright
- 1999
Citation Context ...nnot be computed in closed form, but, because the log-likelihood is a concave function of the weights, they can be found efficiently using standard gradient-based or quasi-Newton optimization methods [32]. Another alternative is iterative scaling [11]. Features can also be learned from data, for example by greedily constructing conjunctions of atomic features [11]. 3 First-Order Logic A first-order kn...

1360 | An Introduction to Categorical Data Analysis
- Agresti
- 1996
Citation Context ...fy it as “Match” or “Non-match.” A separate match decision is made for each candidate pair, followed by transitive closure to eliminate inconsistencies. Typically, a logistic regression model is used [1]. One line of research has focused on scaling entity resolution to large databases by avoiding the quadratic number of comparisons between all pairs of entities (e.g., [20, 30, 26, 7]). Another has fo...

593 | Markov Logic Networks
- Richardson, Domingos
- 2006
Citation Context ...ovides rich representations and efficient inference and learning algorithms for non-i.i.d. data [15, 13]. In particular, we use Markov logic, which combines first-order logic and Markov random fields [36], with weighted satisfiability testing for efficient inference and a voted perceptron algorithm for discriminative learning [41]. Our formulation makes it practical to combine many different component...

565 | Inducing Features of Random Fields
- Della Pietra, Della Pietra, et al.
- 1995
Citation Context ...he log-likelihood is a concave function of the weights, they can be found efficiently using standard gradient-based or quasi-Newton optimization methods [32]. Another alternative is iterative scaling [11]. Features can also be learned from data, for example by greedily constructing conjunctions of atomic features [11]. 3 First-Order Logic A first-order knowledge base (KB) is a set of sentences or form...

510 | Discriminative training methods for hidden Markov models: Theory and experiments with the perceptron algorithm
- Collins
- 2002
Citation Context ....e., the most likely state of y given x). This is a good approximation if most of the probability mass of $P_w(y|x)$ is concentrated around $y^*_w(x)$, and is the essence of the voted perceptron algorithm [8], which initializes all weights to zero, performs T steps of gradient descent, and returns the weights averaged over all iterations ($w_i = \sum_{t=1}^{T} w_{i,t}/T$). While it was originally developed for the sp...

418 | A Theory for Record Linkage
- Fellegi, Sunter
- 1969
Citation Context ...DD-2003 [27] and a related task as part of the 2003 KDD Cup [16]. The entity resolution problem was first identified by Newcombe et al. [31], and given a statistical formulation by Fellegi and Sunter [14]. Most current approaches are variants of the Fellegi-Sunter model, in which entity resolution is viewed as a classification problem: given a vector of similarity scores between the attributes of two ...

310 | The merge/purge problem for large databases
- Hernández, Stolfo
- 1995
Citation Context ...tic regression model is used [1]. One line of research has focused on scaling entity resolution to large databases by avoiding the quadratic number of comparisons between all pairs of entities (e.g., [20, 30, 26, 7]). Another has focused on the use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for us...

281 | Local search strategies for satisfiability testing
- Selman, Kautz, et al.
- 1993
Citation Context ...orm) is satisfiable, i.e., if there is an assignment of truth values to ground atoms that makes the KB true. One approach to this problem is stochastic local search, exemplified by the WalkSAT solver [39]. Beginning with a random truth assignment, WalkSAT repeatedly flips the truth value of either (a) an atom that maximizes the number of satisfied clauses, or (b) a random atom in an unsatisfied clause...

272 | Efficient clustering of high-dimensional data sets with application to reference matching
- McCallum, Nigam, et al.
- 2000
Citation Context ...tic regression model is used [1]. One line of research has focused on scaling entity resolution to large databases by avoiding the quadratic number of comparisons between all pairs of entities (e.g., [20, 30, 26, 7]). Another has focused on the use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for us...

252 | Adaptive duplicate detection using learnable string similarity measures
- Bilenko, Mooney
- 2003
Citation Context ...use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including te...

233 | The state of record linkage and current research problems
- Winkler
- 1999
Citation Context ...een proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including text (e.g., [25]) and images (e.g., [21]). Winkler [46] surveys research in traditional record linkage. Most recently, several authors have pointed out that match decisions should not be made independently for each candidate pair. While the Fellegi-Sunter...

230 | On the hardness of approximate reasoning
- Roth
- 1996

203 | Interactive deduplication using active learning
- Sarawagi, Bhamidipaty
- 2002
Citation Context ...ng the quadratic number of comparisons between all pairs of entities (e.g., [20, 30, 26, 7]). Another has focused on the use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entit...

177 | An efficient domain-independent algorithm for detecting approximately duplicate database records
- Monge, Elkan
- 1997
Citation Context ...tic regression model is used [1]. One line of research has focused on scaling entity resolution to large databases by avoiding the quadratic number of comparisons between all pairs of entities (e.g., [20, 30, 26, 7]). Another has focused on the use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for us...

160 | Identity uncertainty and citation matching
- Pasula, Marthi, et al.
- 2003
Citation Context ...n turn is evidence that other pairs of papers by the same authors should be matched, etc.). McCallum and Wellner [28] incorporate the transitive closure step into the statistical model. Pasula et al. [34] incorporate parsing of entities from citation lists into a citation matching model. Bhattacharya and Getoor [2] use coauthorship relations to help match authors in citation databases. Milch et al. [29...

135 | Probabilistic models with unknown objects
- Milch
- 2006
Citation Context ...34] incorporate parsing of entities from citation lists into a citation matching model. Bhattacharya and Getoor [2] use coauthorship relations to help match authors in citation databases. Milch et al. [29] propose a language for reasoning about entity resolution. Shen et al. [40] exploit various types of constraints to improve matching accuracy. Davis et al. [10] use inductive logic programming techniq...

134 | Automatic linkage of vital records
- Newcombe, Kennedy, et al.
- 1959
Citation Context ...ntion in the data mining community, with a related workshop at KDD-2003 [27] and a related task as part of the 2003 KDD Cup [16]. The entity resolution problem was first identified by Newcombe et al. [31], and given a statistical formulation by Fellegi and Sunter [14]. Most current approaches are variants of the Fellegi-Sunter model, in which entity resolution is viewed as a classification problem: gi...

133 | Learning to match and cluster large high-dimensional data sets for data integration
- Cohen, Richman
- 2002

127 | Reference reconciliation in complex information spaces
- Dong, Halevy, et al.
- 2005
Citation Context ...learning and inference, it also offers the opportunity to improve entity resolution, by taking into account information that was previously ignored. For example, Singla and Domingos [42], Dong et al. [12] and Culotta and McCallum [9] allow the resolution of entities of one type to be helped by resolution of entities of related types (e.g., if two papers are the same, their authors are the same, which ...

123 | Conditional models of identity uncertainty with application to noun coreference
- McCallum, Wellner
- 2004
Citation Context ...of related types (e.g., if two papers are the same, their authors are the same, which in turn is evidence that other pairs of papers by the same authors should be matched, etc.). McCallum and Wellner [28] incorporate the transitive closure step into the statistical model. Pasula et al. [34] incorporate parsing of entities from citation lists into a citation matching model. Bhattacharya and Getoor [2] ...

112 | Learning domain-independent string transformation weights for high accuracy object identification
- Tejada, Knoblock, et al.
- 2002
Citation Context ...use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including te...

106 | The Alchemy system for statistical relational AI
- Kok, Singla, et al.
- 2005
Citation Context ...erence and voted perceptron learning are efficient enough to be practical for domains of realistic size. These algorithms are publicly available in the Alchemy system, which we use in our experiments [24]. 1 If the closed world assumption is not made, the truth values of some atoms are unknown, and weights can be learned using a form of the EM algorithm. 4 5 Entity Resolution 5.1 Equality in Markov Lo...

93 | Learning object identification rules for information integration
- Tejada, Knoblock, et al.
- 2001
Citation Context ...ng the quadratic number of comparisons between all pairs of entities (e.g., [20, 30, 26, 7]). Another has focused on the use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entit...

92 | Learning the structure of Markov logic networks
- Kok, Domingos
- 2005
Citation Context ...nd the MAP state, Singla and Domingos [41] generalized it to MLNs by replacing Viterbi with MaxWalkSAT. It is also possible to learn the structure of MLNs using inductive logic programming techniques [23]. Learning can start from an empty network, or from an initial knowledge base. Markov logic affords us the expressiveness of first-order logic while avoiding its brittleness, and makes it easy to inco...

87 | A Comparison of String Metrics for Matching Names and Records
- Cohen, Ravikumar, et al.
- 2003
Citation Context ...use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including te...

80 | Discriminative training of Markov logic networks
- Singla, Domingos
- 2005
Citation Context ...Markov logic, which combines first-order logic and Markov random fields [36], with weighted satisfiability testing for efficient inference and a voted perceptron algorithm for discriminative learning [41]. Our formulation makes it practical to combine many different components into a comprehensive solution to the entity resolution problem. We illustrate this in this paper by combining a few salient on...

70 | Iterative record linkage for cleaning and integration
- Bhattacharya, Getoor
- 2004
Citation Context ...[28] incorporate the transitive closure step into the statistical model. Pasula et al. [34] incorporate parsing of entities from citation lists into a citation matching model. Bhattacharya and Getoor [2] use coauthorship relations to help match authors in citation databases. Milch et al. [29] propose a language for reasoning about entity resolution. Shen et al. [40] exploit various types of constraint...

56 | Hardening soft information sources
- Cohen, Kautz, et al.
- 2000
Citation Context ...[44, 38, 4]). Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including text (e.g., [25]) and images (e.g., [21]). Winkler [46] surveys research i...

49 | A general stochastic approach to solving problems with hard and soft constraints
- Kautz, Selman, et al.
- 1996
Citation Context ...iant of satisfiability where each clause has an associated weight, and the goal is to maximize the sum of the weights of satisfied clauses. MaxWalkSAT is a direct extension of WalkSAT to this problem [22]. 4 Markov Logic A first-order KB can be seen as a set of hard constraints on the set of possible worlds: if a world violates even one formula, it has zero probability. The basic idea in Markov logic ...

44 | Semantic integration in text: from ambiguous names to identifiable entities
- Li, Morie, et al.
Citation Context ...r of alternate formulations have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including text (e.g., [25]) and images (e.g., [21]). Winkler [46] surveys research in traditional record linkage. Most recently, several authors have pointed out that match decisions should not be made independently for each c...

37 | Object identification: A Bayesian analysis with application to traffic surveillance
- Huang, Russell
- 1998
Citation Context ...ons have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including text (e.g., [25]) and images (e.g., [21]). Winkler [46] surveys research in traditional record linkage. Most recently, several authors have pointed out that match decisions should not be made independently for each candidate pair. While the...

37 | Object identification with attribute-mediated dependences
- Singla, Domingos
- 2005
Citation Context ...dency complicates learning and inference, it also offers the opportunity to improve entity resolution, by taking into account information that was previously ignored. For example, Singla and Domingos [42], Dong et al. [12] and Culotta and McCallum [9] allow the resolution of entities of one type to be helped by resolution of entities of related types (e.g., if two papers are the same, their authors ar...

33 | On evaluation and training-set construction for duplicate detection
- Bilenko, Mooney
- 2003
Citation Context ...ng the quadratic number of comparisons between all pairs of entities (e.g., [20, 30, 26, 7]). Another has focused on the use of active learning techniques to minimize the need for labeled data (e.g., [44, 38, 4]). Several authors have devised, compared and learned similarity measures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entit...

33 | Constraint-Based Entity Matching
- Shen, Li, et al.
Citation Context ...ching model. Bhattacharya and Getoor [2] use coauthorship relations to help match authors in citation databases. Milch et al. [29] propose a language for reasoning about entity resolution. Shen et al. [40] exploit various types of constraints to improve matching accuracy. Davis et al. [10] use inductive logic programming techniques to discover relational rules for entity resolution, which they then com...

32 | Memory-efficient inference in relational domains
- Singla, Domingos
- 2006
Citation Context ...t lazily grounds predicates and clauses, effectively allowing predicates that are initially assumed false to be revisited, without incurring the computational cost of completely grounding the network [43]. Incorporating this into our entity resolution system is an item for future work. 6 Experiments 6.1 Datasets We used two publicly available citation databases in our experiments: Cora and BibServ. 6....

30 | Using q-grams in a DBMS for approximate string processing
- Gravano, Ipeirotis, et al.
Citation Context ...words. Since these are often a significant issue in entity resolution, it would be desirable to account for them. One way to do this efficiently is to compare word strings by the engrams they contain [19]. This can be done in our framework by defining the predicate HasEngram(word, engram), which is true iff engram is a substring of word. (This predicate can be computed on the fly from its arguments, o...

21 | Joint deduplication of multiple record types in relational data
- Culotta, McCallum
- 2005
Citation Context ...o offers the opportunity to improve entity resolution, by taking into account information that was previously ignored. For example, Singla and Domingos [42], Dong et al. [12] and Culotta and McCallum [9] allow the resolution of entities of one type to be helped by resolution of entities of related types (e.g., if two papers are the same, their authors are the same, which in turn is evidence that othe...

10 | Establishing identity equivalence in multi-relational domains
- Davis, Dutra, et al.
- 2005
Citation Context ...easures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including text (e.g., [25]) and images (e.g., [21]). Winkler [46] surveys research in traditional record linkage. Most recently, several authors have pointed out that...

6 | A hit-miss model for duplicate detection in the WHO drug safety database
- Norén, Orre, et al.
- 2005
Citation Context ...easures for use in entity resolution (e.g., [6, 45, 3]). A number of alternate formulations have also been proposed (e.g., [5]). Entity resolution has been applied in a wide variety of domains (e.g., [33, 10]) and to different types of data, including text (e.g., [25]) and images (e.g., [21]). Winkler [46] surveys research in traditional record linkage. Most recently, several authors have pointed out that...
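
The WalkSAT procedure quoted in the citation context for Selman, Kautz, et al. (start from a random truth assignment, then repeatedly flip either a greedy atom or a random atom in an unsatisfied clause) can be sketched roughly as follows. This is an illustrative sketch, not the solver used in the paper: the function name `walksat`, the DIMACS-style integer literals, the `p_random` mixing probability, and the choice to restrict the greedy step to atoms of the selected unsatisfied clause are all assumptions made here.

```python
import random


def walksat(clauses, n_vars, max_flips=10000, p_random=0.5, seed=0):
    """Stochastic local search for satisfiability (illustrative sketch).

    clauses: list of clauses, each a list of nonzero ints; literal k means
    variable k is true, -k means variable k is false (assumed encoding).
    Returns a list of booleans for variables 1..n_vars, or None on failure.
    """
    rng = random.Random(seed)
    # Index 0 is unused so that variable k lives at assign[k].
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]

    def satisfied(clause):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    def satisfied_count_if_flipped(var):
        # Temporarily flip var, count satisfied clauses, then restore.
        assign[var] = not assign[var]
        count = sum(satisfied(c) for c in clauses)
        assign[var] = not assign[var]
        return count

    for _ in range(max_flips):
        unsatisfied = [c for c in clauses if not satisfied(c)]
        if not unsatisfied:
            return assign[1:]
        clause = rng.choice(unsatisfied)
        if rng.random() < p_random:
            # (b) random atom in an unsatisfied clause
            var = abs(rng.choice(clause))
        else:
            # (a) atom whose flip maximizes the number of satisfied clauses
            var = max((abs(lit) for lit in clause), key=satisfied_count_if_flipped)
        assign[var] = not assign[var]

    return assign[1:] if all(satisfied(c) for c in clauses) else None


if __name__ == "__main__":
    # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    print(walksat([[1, 2], [-1, 3], [-2, -3]], n_vars=3))
```

MaxWalkSAT, as described in the context for Kautz, Selman, et al., would replace the count of satisfied clauses with the sum of their weights; that extension is not shown here.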