## Methods for Record Linkage and Bayesian Networks (2002)

Venue: Series RRS2002/05, U.S. Bureau of the Census

Citations: 34 (3 self)

### BibTeX

@TECHREPORT{Winkler02methodsfor,
    author = {William E. Winkler},
    title = {Methods for record linkage and bayesian networks},
    institution = {Series RRS2002/05, U.S. Bureau of the Census},
    year = {2002}
}

### Abstract

Although terminology differs, there is considerable overlap between record linkage methods based on the Fellegi-Sunter model (JASA 1969) and Bayesian networks used in machine learning (Mitchell 1997). Both are based on formal probabilistic models that can be shown to be equivalent in many situations (Winkler 2000). When no missing data are present in identifying fields and training data are available, both can efficiently estimate parameters of interest. When missing data are present, the EM algorithm can be used for parameter estimation in Bayesian networks when there are training data (Friedman 1997) and in record linkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some record linkage situations (Belin and Rubin 1995, Larsen and Rubin 2001).
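To make the unsupervised case in the abstract concrete, the following is a minimal sketch of EM estimation for a two-class Fellegi-Sunter mixture over binary agreement patterns. It assumes conditional independence of the identifying fields given the class, which is only one of the settings the report discusses. The function names (`simulate_patterns`, `em_fellegi_sunter`, `match_weight`), the starting values, and the synthetic data are illustrative assumptions, not code or parameters from the report.

```python
import math
import random


def simulate_patterns(n, p, m, u, seed=0):
    """Draw n binary agreement patterns from a two-class mixture (synthetic data)."""
    rng = random.Random(seed)
    patterns = []
    for _ in range(n):
        probs = m if rng.random() < p else u  # matches use m, non-matches use u
        patterns.append(tuple(int(rng.random() < q) for q in probs))
    return patterns


def pattern_likelihood(gamma, probs):
    """P(gamma | class), assuming fields agree independently given the class."""
    value = 1.0
    for agree, q in zip(gamma, probs):
        value *= q if agree else 1.0 - q
    return value


def em_fellegi_sunter(patterns, p=0.05, m_init=0.9, u_init=0.1, iters=100, eps=1e-6):
    """Estimate P(M), m_k = P(agree k | M), u_k = P(agree k | U) without labels."""
    k = len(patterns[0])
    m = [m_init] * k
    u = [u_init] * k
    history = []  # log-likelihood at the parameters used in each E-step
    for _ in range(iters):
        # E-step: posterior probability that each pair is a match.
        weights = []
        loglik = 0.0
        for gamma in patterns:
            lm = p * pattern_likelihood(gamma, m)
            lu = (1.0 - p) * pattern_likelihood(gamma, u)
            weights.append(lm / (lm + lu))
            loglik += math.log(lm + lu)
        history.append(loglik)
        # M-step: re-estimate the class proportion and agreement probabilities.
        total_m = sum(weights)
        total_u = len(patterns) - total_m
        p = total_m / len(patterns)
        for j in range(k):
            agree_m = sum(w for w, g in zip(weights, patterns) if g[j])
            agree_u = sum(1.0 - w for w, g in zip(weights, patterns) if g[j])
            # Clamp away from 0 and 1 so later logarithms stay finite.
            m[j] = min(max(agree_m / total_m, eps), 1.0 - eps)
            u[j] = min(max(agree_u / total_u, eps), 1.0 - eps)
    return p, m, u, history


def match_weight(gamma, m, u):
    """Log likelihood ratio log2[P(gamma | M) / P(gamma | U)] for one pattern."""
    return math.log2(pattern_likelihood(gamma, m) / pattern_likelihood(gamma, u))


if __name__ == "__main__":
    pairs = simulate_patterns(3000, 0.1, [0.95, 0.9, 0.85, 0.9], [0.05, 0.1, 0.2, 0.1])
    p_hat, m_hat, u_hat, _ = em_fellegi_sunter(pairs)
    print("estimated P(M):", round(p_hat, 3))
    print("weight for full agreement:", round(match_weight((1, 1, 1, 1), m_hat, u_hat), 2))
```

Pairs with a high `match_weight` would be designated matches and pairs with a low one non-matches; where to place the cutoffs is outside the scope of this sketch.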

### Citations

1336 | Additive logistic regression: A statistical view of boosting - Friedman, Hastie, et al. - 2000 |

1174 | Graphical Models - Lauritzen - 1996 |

1092 | The EM Algorithm and Extensions - McLachlan - 1997 |

1017 | The Elements of - Hastie, Tibshirani, et al. - 2009 |

867 | Text classification from labeled and unlabeled documents using
- Nigam, McCallum, et al.
- 2000
Citation Context ...which representative training data are available. Naïve Bayes methods have been extended to situations in which a mixture of labeled training data and unlabeled data are used for text classification (Nigam et al. 2000). Parameter estimation was done using a version of the EM algorithm that is effectively identical to that used by Winkler (2000) and Larsen and Rubin (2001) when training data are not available. In t... |

639 | Greedy function approximation: a gradient boosting machine, Annals of Statistics 29(5):1189–1232
- Friedman
- 2001
Citation Context ...e these two tails of distributions, then we can accurately estimate error rates at differing levels. This is known to be an exceptionally difficult problem (e.g. Vapnik 1999, Hastie, Tibshirani, and Friedman 2001). Our comparisons consist of a set of figures in which we compare a plot of the cumulative distribution of estimates of matches versus the true cumulative distribution with the truth represented by t... |

562 | Inductive Learning Algorithms and Representations for Text Categorization
- Dumais, Platt, et al.
- 1998
Citation Context ...yes Nets, the large dimensionality (from 1,000 to 200,000) often rules out using methods that account for dependencies between identifying variables. Accounting for two-way dependencies (Sahami 1996, Dumais et al. 1998) did not yield improved classification rules for Bayesian Networks. Accounting for selected interactions involving two or more interactions did improve classification rules (Winkler 2000). Record lin... |

238 | The state of record linkage and current research problems - Winkler - 1999 |

223 | The Bayesian structural EM algorithm - Friedman - 1998 |

183 | Advances in Record-Linkage Methodology as Applied to the 1985 - Jaro - 1989 |

133 | Learning belief networks in the presence of missing values and hidden variables
- Friedman
- 1997
Citation Context ...ble, then both can efficiently estimate parameters of interest. When missing data are present, the EM algorithm can be used for parameter estimation in Bayesian Networks when there are training data (Friedman 1997) and in record linkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (Beli... |

114 | Learning limited dependence Bayesian classifiers
- Sahami
- 1996
Citation Context ...cations of Bayes Nets, the large dimensionality (from 1,000 to 200,000) often rules out using methods that account for dependencies between identifying variables. Accounting for two-way dependencies (Sahami 1996, Dumais et al. 1998) did not yield improved classification rules for Bayesian Networks. Accounting for selected interactions involving two or more interactions did improve classification rules (Winkl... |

108 | The State of Record Linkage and - Winkler - 1999 |

69 | Advanced Methods for Record Linkage
- Winkler
- 1994
Citation Context ...s are concentrated. Multiple blocking passes are needed to find duplicates in a subsequent blocking pass that are not found on a prior pass. Due to high typographical error rates in most files (e.g., Winkler 1994, 1995), it is quite unusual to find all matches in just one blocking pass. Unlike general text classification, in record linkage it is quite feasible to use an initial guess of parameters associated ... |

51 | String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage
- Winkler
- 1990
Citation Context ...ind additional relationships that have not been previously conceived and modeled. Generally, accounting for partial agreement with string comparators makes dramatic improvements in matching efficacy (Winkler 1990b, 1995). From one pair of files to the next, typographical error rates can dramatically affect the probabilities P(agree field | M). For instance, in an urban area or a rural area, the P(agree first ... |

49 | Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage
- Winkler
- 1988
Citation Context ...ameter estimation algorithms. Unsupervised learning methods have typically performed very poorly for general machine learning classification rules. The unsupervised learning methods of record linkage (Winkler 1988, 1993) performed relatively well because they were applied in a few situations that were extremely favorable. Five conditions are favorable application of the unsupervised EM methods. The first is th... |

37 | Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage
- Winkler
- 1993
Citation Context ...e not available. In the latter situations, assumption CI was not needed. In record linkage, it is known that dropping assumption CI can yield better classification rules and estimates of error rates (Winkler 1993, Larsen and Rubin 2001). In text classification and other general applications of Bayes Nets, the large dimensionality (from 1,000 to 200,000) often rules out using methods that account for dependenc... |

30 | An Application of the Fellegi-Sunter Model of Record Linkage to the - Winkler, Thibaudeau - 1990 |

29 | Record linkage techniques - Alvey, Jamerson - 1997 |

26 | Re-identification methods for evaluating the confidentiality of analytically valid microdata - Winkler - 1998 |

22 | Iterative automated record linkage using mixture models
- Larsen, Rubin
- 2001
Citation Context ...nkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (Belin and Rubin 1995, Larsen and Rubin 2001). Keywords: likelihood ratio, Bayesian Nets, EM Algorithm 1. INTRODUCTION Record linkage is the science of finding matches or duplicates within or across files. Matches are typically delineated using... |

19 | Maximum likelihood via the ECM algorithm: A general framework. Biometrika
- Meng, Rubin
- 1993
Citation Context .../ Mtl; and, if k≠j, pt+1(i,k) = pt(i,k) Ek / Mk. 3. Repeat 1 and 2 for all classes Cj and all patterns i in Pj. Then each Ft is one cycle of iterative proportional fitting (e.g., Winkler 1989, 1993, Meng and Rubin 1993) and increases the likelihood. The last equation in step 2 assures that the new estimates add to a proper probability. If necessary, the procedure can be extended to general I-Projections that also in... |

19 | Near Automatic Weight Computation in the Fellegi-Sunter Model of Record Linkage
- Winkler
- 1989
Citation Context ... P(γ | M) and P(γ | U) using the EM algorithm. The EM algorithm is useful because it provides a means of optimally separating M and U. Better separation between M and U is possible when a general EM (Winkler 1989, 1993, Larsen 1996) that does not use assumption CI. The advantage of assumption CI is that it yields computational speed-ups on orders between 100 and 10,000 in contrast to methods that use dependen... |

17 | A method for calibrating false-match rates in record linkage
- Belin, Rubin
- 1995
Citation Context ...1997) and in record linkage when there are no training data (unsupervised learning). EM and MCMC methods can be used for automatically estimating error rates in some of the record linkage situations (Belin and Rubin 1995, Larsen and Rubin 2001). Keywords: likelihood ratio, Bayesian Nets, EM Algorithm 1. INTRODUCTION Record linkage is the science of finding matches or duplicates within or across files. Matches are typ... |

13 | Evaluation of Sources of Variation in Record Linkage through a Factorial Experiment - Belin - 1993 |

8 | On Dykstra’s Iterative Fitting
- Winkler
- 1990
Citation Context ...ind additional relationships that have not been previously conceived and modeled. Generally, accounting for partial agreement with string comparators makes dramatic improvements in matching efficacy (Winkler 1990b, 1995). From one pair of files to the next, typographical error rates can dramatically affect the probabilities P(agree field | M). For instance, in an urban area or a rural area, the P(agree first ... |

7 | The Discrimination Power of Dependency Structures in Record Linkage - Thibaudeau - 1993 |

5 | A Theory for Record Linkage - Fellegi, Sunter - 1969 |

5 | Improving EM Algorithm Estimates for Record Linkage Parameters
- Yancey
- 2002
Citation Context ...ble. Five conditions are favorable application of the unsupervised EM methods. The first is that the EM must be applied to sets of pairs in which the proportion of matches M is greater than 0.05 (see Yancey 2002 for related work). The second is that one class (matches) must be relatively well-separated from the other classes. The third is that typographical error must be relatively low. For instance, if twen... |

4 | Bayesian Approaches to Finite Mixture Models
- Larsen
- 1996
Citation Context ...U) using the EM algorithm. The EM algorithm is useful because it provides a means of optimally separating M and U. Better separation between M and U is possible when a general EM (Winkler 1989, 1993, Larsen 1996) that does not use assumption CI. The advantage of assumption CI is that it yields computational speed-ups on orders between 100 and 10,000 in contrast to methods that use dependencies between variab... |

1 | Local Computation with Probabilities on Graphical Structures and Their Application to Expert Systems (with discussion) - Lauritzen - 1989 |

1 | On Discriminative vs. Generative classifiers: A comparison of logistic regression and naïve Bayes, Neural Information Processing Systems 14
- Mitchell
- 1997
Citation Context ... DC 20233-9100 Although terminology differs, there is considerable overlap between record linkage methods based on the Fellegi-Sunter model (JASA 1969) and Bayesian networks used in machine learning (Mitchell 1997). Both are based on formal probabilistic models that can be shown to be equivalent in many situations (Winkler 2000). When no missing data are present in identifying fields and training data are avai... |

1 | Frequency Dependent Probability Measures for Record Linkage
- Yancey
- 2000
Citation Context ...ta may help in finding better estimates of error rates. If high quality, current geographic identifiers are associated with records, then accounting for frequency may not help matching (Winkler 1989, Yancey 2000). Across larger geographic regions (e.g., an entire ZIP code or County or State), accounting for frequency may improve matching efficacy. 3. METHODS AND DATA Our main theoretical method is to use the... |