## The State of Record Linkage and Current Research Problems (1999)

Venue: Statistical Research Division, U.S. Census Bureau

Citations: 218 (7 self)

### BibTeX

@TECHREPORT{Winkler99thestate,

author = {William E. Winkler},

title = {The State of Record Linkage and Current Research Problems},

institution = {Statistical Research Division, U.S. Census Bureau},

year = {1999}

}

### Abstract

This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage today. Record linkage research is characterized by its synergism of statistics, computer science, and operations research. Many difficult algorithms have been developed and put in software systems. Record linkage practice is still very limited. Some limits are due to existing software. Other limits are due to the difficulty in automatically estimating matching parameters and error rates, with current research highlighted by the work of Larsen and Rubin.

Keywords: computer matching, modeling, iterative fitting, string comparison, optimization

RÉSUMÉ: This article gives an overview of the ...

### Citations

8090 | Maximum likelihood from incomplete data via the EM algorithm - Dempster, Laird, et al. - 1977 |

1138 | Spatial interaction and the statistical analysis of lattice systems - Besag - 1974 |

854 | A tutorial on learning with Bayesian networks - Heckerman - 1995 |

804 | Text classification from labeled and unlabeled documents using - Nigam, McCallum, et al. - 2000 |
Citation context: ...ent groups. Bayesian networks are one of the standard tools in data mining. They are also used for information retrieval methods such as used in some of the web search engines. The latest algorithms (Nigam et al., 1999) utilize EM-based methods that are closely related to methods used by Winkler (1988, 1989, 1993b) and Larsen and Rubin (1999). The EM-based algorithms for finding maximum likelihood estimates in the ...

625 | Statistical Analysis of Finite Mixture Distributions - Titterington, Smith, et al. - 1985 |

186 | Discrete Multivariate Analysis - Bishop, Fienberg, et al. - 1975 |

176 | On the statistical analysis of dirty pictures (with discussion) - Besag - 1986 |

142 | Advances in record linkage methodology as applied to matching the 1985 census of Tampa - Jaro - 1984 |

131 | Automatic linkage of vital records - Newcombe, Kennedy, et al. - 1959 |
Citation context: ...demiological and survey applications. Very recent work is in the related areas of information retrieval and data mining. The ideas of modern record linkage originated with geneticist Howard Newcombe (Newcombe et al. 1959, 1962) who introduced odds ratios of frequencies and the decision rules for delineating [Footnote 1: William E. Winkler, Statistical Research Division, Room 3000-4, Bureau of the Census, Washington, DC, 20233-9...]

92 | Matching and Record Linkage - Winkler - 1997 |

64 | Advanced methods for record linkage - Winkler - 1994 |

63 | Using EM to obtain asymptotic variance-covariance matrices: the SEM algorithm - Meng, Rubin - 1991 |

62 | Automatic Spelling Correction in Scientific and Scholarly Text - Pollock, Zamora - 1984 |

60 | A method for limiting disclosure of microdata based on random noise and transformation - Kim - 1986 |
Citation context: ...n straightforward to apply. The associated research problems relate to how seriously analytic properties are compromised. Additive noise is known to preserve some of the analytic properties of files (Kim 1986, 1989; Fuller 1993). Research problems are whether general software can be developed and whether files are free of disclosures. Combinations of additive noise and limited swapping have been used by K...

53 | Masking procedures for microdata disclosure limitation - Fuller - 1993 |
Citation context: ...d to apply. The associated research problems relate to how seriously analytic properties are compromised. Additive noise is known to preserve some of the analytic properties of files (Kim 1986, 1989; Fuller 1993). Research problems are whether general software can be developed and whether files are free of disclosures. Combinations of additive noise and limited swapping have been used by Kim and Winkler (199...

45 | Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage - Winkler - 1988 |
Citation context: ...ctly in the 3-variable, conditional independence case. More generally, in the conditional independence situation, the parameters can be computed via a straightforward application of the EM algorithm (Winkler 1988). If the conditional independence assumption does not hold, then the parameters can be computed by generalized EM methods (Winkler 1988, 1989a, 1993b, Armstrong and Mayda 1993, see also Meng and Rubi...

45 | String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage - Winkler - 1990 |
Citation context: ...ting parameters under conditional independence when non-1-1 (or 1-1) matching is done? Parameter estimates obtained under the conditional independence EM can be superior to other parameter estimates (Winkler, 1990b) and can be obtained more easily. The conventional methods estimate the marginal probabilities P(agree field | M) and P(agree field | U) directly using samples for which truth has been obtained via ...

42 | Handbook of Record Linkage: Methods for Health and - Newcombe - 1988 |

40 | On a method of estimating birth and death rates and the extent of registration - Deming, E - 1949 |

38 | A modified random perturbation method for database security - Tendick, Matloff - 1994 |
Citation context: ...can be developed and whether files are free of disclosures. Combinations of additive noise and limited swapping have been used by Kim and Winkler (1995) and Winkler (1998). Data perturbation methods (Tendick and Matloff 1994) are closely related to additive noise. The methods are good at preserving confidentiality and yielding totals on a number of subdomains that are consistent with unreleased confidential data. The bas...

35 | A Bayesian approach to data disclosure: optimal intruder behavior for continuous data - Fienberg, Makov, et al. - 1997 |

33 | Improved decision rules in the Fellegi-Sunter model of record linkage - Winkler - 1993 |
Citation context: ...s due to Belin and Rubin (1995). Although the method of Belin and Rubin requires calibration data, it is known to work well in a narrow range of situations (Winkler and Thibaudeau, 1991; Scheuren and Winkler, 1993). The situations are those in which there is substantial separation of the curves of log frequency versus matching weight for matches and nonmatches. Generally, good separation of curves occurs with ...

25 | Approximate string comparison and its effect on an advanced record linkage system - Porter, Winkler - 1997 |

23 | Re-identification methods for evaluating the confidentiality of analytically valid microdata - Winkler - 1998 |
Citation context: ...s that agencies not release individually identifiable data. If a public-use file is created, then agencies must determine if the file meets analytic needs and is confidential. Record linkage methods (Winkler 1998) that employ new metrics for comparing somewhat related quantitative data provide a useful enhancement and yield higher re-identification rates than less sophisticated methods. If an agency can effec...

22 | Regression Analysis of Data Files That Are Computer Matched - Part I - Scheuren, Winkler - 1999 |
Citation context: ...al to these is due to Belin and Rubin (1995). Although the method of Belin and Rubin requires calibration data, it is known to work well in a narrow range of situations (Winkler and Thibaudeau, 1991; Scheuren and Winkler, 1993). The situations are those in which there is substantial separation of the curves of log frequency versus matching weight for matches and nonmatches. Generally, good separation of curves occurs with ...

19 | OX-LINK: The Oxford Medical Record Linkage System - Gill - 1997 |
Citation context: ...tches a large universe file with itself, then the second ratio is a good approximation of the first ratio. Newcombe’s ideas have been extended in a variety of ways (e.g., Newcombe et al., 1988, 1992, Gill 1999). Fellegi and Sunter (1969) introduced a formal mathematical foundation for record linkage. To begin, notation is needed. Two files A and B are matched. The idea is to classify pairs in a product spac...

19 | Maximum likelihood via the ECM algorithm: A general framework. Biometrika - Meng, Rubin - 1993 |
Citation context: ...(Winkler 1988). If the conditional independence assumption does not hold, then the parameters can be computed by generalized EM methods (Winkler 1988, 1989a, 1993b, Armstrong and Mayda 1993, see also Meng and Rubin 1993), by scoring (Thibaudeau 1993), and by Gibbs sampling (Larsen 1996, Larsen and Rubin 1999). The methods of Larsen and Rubin (1999) are the most general. These methods can yield more accurate matching...

19 | Near Automatic Weight Computation in the Fellegi-Sunter Model of Record Linkage - Winkler - 1989 |

18 | A method for data-oriented multivariate microaggregation - Mateo-Sanz, Domingo-Ferrer - 1998 |
Citation context: ...sses of functions on arbitrary subdomains. A basic research question is whether these methods can produce the types of information that users of the public-use files need. Microaggregation (see e.g., Mateo-Sanz and Domingo-Ferrer 1998) is a method of replacing values of individual variables in ranges with means. The algorithms can be quite sophisticated. The research questions are: “Do these methods compromise analytic validity se...

17 | Iterative Automated Record Linkage Using Mixture Models - Larsen, Rubin - 2001 |
Citation context: ...eters can be computed by generalized EM methods (Winkler 1988, 1989a, 1993b, Armstrong and Mayda 1993, see also Meng and Rubin 1993), by scoring (Thibaudeau 1993), and by Gibbs sampling (Larsen 1996, Larsen and Rubin 1999). The methods of Larsen and Rubin (1999) are the most general. These methods can yield more accurate matching parameters and better decision rules. These parameter-estimation methods do not always yi...

16 | A method for calibrating false-match rates in record linkage - Belin, Rubin - 1995 |

14 | Testing in latent class models using a posterior predictive check distribution - Rubin, Stern - 1994 |
Citation context: ...inkler (1994) and Larsen and Rubin (1999). The estimation methods and the means of evaluating the fits of the latent class models are quite difficult because the usual Chi-square methods do not work (Rubin and Stern, 1993). The basic research question is “How does one automatically estimate error rates?” 4. ADVANCED RESEARCH PROBLEMS Three areas use methods and underlying models that are closely related to the basic i...

14 | Computational disclosure control for medical microdata: the Datafly system - Sweeney - 1997 |
Citation context: ...ods for masking data are intended to make re-identification more difficult. Existing masking methods cover a variety of areas. Global recoding and local suppression (DeWaal and Willenborg 1996, 1998; Sweeney, 1999) have been successfully used to create public-use files and other security procedures. The advantage of the methods is that available general software is often straightforward to apply. The associat...

13 | Evaluation of Sources of Variation in Record Linkage through a Factorial Experiment - Belin - 1993 |

11 | Frequency-Based Matching in the Fellegi-Sunter Model of Record Linkage - Winkler - 1989 |

10 | Record linkage: statistical models for matching computer records - Copas, Hilton - 1990 |

8 | Subdomain Estimation for the Masked Data - Kim - 1990 |

8 | Fitting Log-Linear Models When Some Dichotomous Variables are Unobservable - Thibaudeau - 1989 |

8 | Methods for Adjusting for Lack of Independence in an Application of the Fellegi-Sunter Model of Record Linkage - Winkler - 1989 |

7 | The effect of mismatching on the measurement of response errors - Neter, Maynes, et al. - 1965 |

7 | The Discrimination Power of Dependency Structures in Record Linkage - Thibaudeau - 1993 |
Citation context: ... independence assumption does not hold, then the parameters can be computed by generalized EM methods (Winkler 1988, 1989a, 1993b, Armstrong and Mayda 1993, see also Meng and Rubin 1993), by scoring (Thibaudeau 1993), and by Gibbs sampling (Larsen 1996, Larsen and Rubin 1999). The methods of Larsen and Rubin (1999) are the most general. These methods can yield more accurate matching parameters and better decisio...

6 | Lenstra (Eds - Aarts, Lenstra - 1997 |

6 | A Theory for Record Linkage - Fellegi, Sunter - 1969 |

6 | Approximate String Comparison - Hall, Dowling - 1980 |
Citation context: ...ngs exactly (character-by-character) because of typographical error. Dealing with typographical error via approximate string comparison has been a major research project in computer science (see e.g., Hall and Dowling, 1980). In record linkage, one needs to have a function that represents approximate agreement, with agreement being represented by 1 and degrees of partial agreement being represented by numbers between 0 ...

6 | Probabilistic methods in matching Census samples to the National Death Index - Rogot, Sorlie, et al. - 1986 |

6 | Recursive Analysis of Linked Data Files - Winkler, Scheuren - 1996 |

5 | Record Linkage and Public Policy: A Dynamic Evolution - Fellegi - 1997 |

5 | Disclosure Limitation and Related Methods for Categorical Data - Fienberg, Makov, et al. - 1998 |

5 | Additive Logistic Regression: a Statistical - Friedman, Hastie, et al. |

5 | Methods of Computer Linkage of Hospital Admission-Separation Records into Cumulative Health Histories - Smith, Newcombe - 1975 |