The State of Record Linkage and Current Research Problems
- Statistical Research Division, U.S. Census Bureau, 1999
Cited by 172 (7 self)
This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage today. Record linkage research is characterized by its synergism of statistics, computer science, and operations research. Many difficult algorithms have been developed and put in software systems. Record linkage practice is still very limited. Some limits are due to existing software. Other limits are due to the difficulty in automatically estimating matching parameters and error rates, with current research highlighted by the work of Larsen and Rubin.
Keywords: computer matching, modeling, iterative fitting, string comparison, optimization
Résumé (translated from French): This article gives an overview of the ...
Matching and Record Linkage
- Business Survey Methods, 1995
Cited by 77 (14 self)
INTRODUCTION Matching has a long history of uses in statistical surveys and administrative data development. A business register consisting of names, addresses, and other identifying information such as total financial receipts might be constructed from tax and employment data bases (see chapters by Colledge, Nijhowne, and Archer). A survey of retail establishments or agricultural establishments might combine results from an area frame and a list frame. To produce a combined estimator, units from the area frame would need to be identified in the list frame (see Vogel-Kott chapter). To estimate the size of a (sub)population via capture-recapture techniques, one needs to accurately determine units common to two or more independent listings (Sekar and Deming 1949; Scheuren 1983; Winkler 1989b). Samples must be drawn appropriately to estimate overlap (Deming and Gleser 1959). Rather than develop a special survey to collect data for policy decisions, it might be more appropriate t
Overview of record linkage and current research directions
- BUREAU OF THE CENSUS, 2006
Cited by 55 (1 self)
This paper provides background on record linkage methods that can be used in combining data from a variety of sources such as person lists and business lists. It also gives some areas of current research.
Improved Decision Rules In The Fellegi-Sunter Model Of Record Linkage
- Proceedings of the Section on Survey Research Methods, American Statistical Association, 1993
Cited by 29 (12 self)
Many applications of the Fellegi-Sunter model use simplifying assumptions and ad hoc modifications to improve matching efficacy. Because of model misspecification, distinctive approaches developed in one application typically cannot be used in other applications and do not always make use of advances in statistical and computational theory. An Expectation-Maximization (EMH) algorithm that constrains the estimates to a convex subregion of the parameter space is given. The EMH algorithm provides probability estimates that yield better decision rules than unconstrained estimates. The algorithm is related to results of Meng and Rubin (1993) on Multi-Cycle Expectation-Conditional Maximization algorithms and makes use of results of Haberman (1977) that hold for large classes of loglinear models.
Key Words: MCECM Algorithm, Latent Class, Computer Matching, Error Rate
This paper provides a theory for obtaining constrained maximum likelihood estimates for latent-class, loglinear models on finite ...
Frequency-based Matching in the Fellegi-Sunter Model of Record Linkage
- Proceedings of the Section on Survey Research Methods, American Statistical Association, 1989
Cited by 11 (7 self)
Bureau of the Census
This paper extends techniques for frequency-based matching (see e.g., Fellegi and Sunter 1969). The extended techniques allow table-building under weaker assumptions than those typically used in practice. Although CPU requirements can increase, human intervention can be reduced in some situations.
Record Linkage Software and Methods for Merging Administrative Lists
- Statistical Research Report Series No. RR/2001/03, Washington DC, US Bureau of the Census, 2001
Cited by 9 (0 self)
National Statistical Institutes often have the need to merge administrative files from a variety of sources for which unique identifiers are not available to facilitate matching. Agencies such as Eurostat have the need to connect data sources from different countries and sources and to verify the confidentiality of microdata. To do this merging of administrative lists, agencies need fast software for cleaning up and standardizing lists and for merging the lists. The U.S. Bureau of the Census has software for name standardization, address standardization, and matching that are considered state-of-the-art. The standardization software breaks names and addresses into components that are easily compared. The matching software accounts for typographical error, automatically estimates matching parameters, and optimizes sets of assignments over large groups of pairs of records.
Approximate string comparator search strategies for very large administrative lists
- STATISTICAL RESEARCH DIVISION, U.S. CENSUS BUREAU, 2005
Cited by 9 (3 self)
Rather than collect data from a variety of surveys, it is often more efficient to merge information from administrative lists. Matching of person files might be done using name and date-of-birth as the primary identifying information. There are obvious difficulties with entities having a commonly occurring name such as John Smith that may occur 30,000+ times (1.5 for each date-of-birth). If there is 5% typographical error in each field, then using fast character-by-character searches can miss 20% of true matches among noncommonly occurring records where name plus date-of-birth might be unique. This paper describes some existing solutions and current research directions.
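One way to see how a small per-field error rate can produce a miss rate of this size is a back-of-the-envelope calculation. The sketch below is an illustration only: it assumes that an exact search succeeds only when every compared field is error-free and that errors are independent across fields. The function name miss_rate and the choice of four fields are assumptions made for the example, not details taken from the paper.

```python
def miss_rate(error_rate: float, n_fields: int) -> float:
    """Share of true matches missed by exact comparison on n_fields fields.

    Assumes (hypothetically) that each field is corrupted independently with
    probability error_rate, and that one corrupted field defeats the search.
    """
    return 1.0 - (1.0 - error_rate) ** n_fields


if __name__ == "__main__":
    # Hypothetical setup: four compared fields, 5% typographical error in each.
    print(f"{miss_rate(0.05, 4):.3f}")
```

Under these assumptions, four fields at 5% error each give a miss rate of roughly 19%, which is close to the 20% figure quoted in the abstract.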
Chapter Modeling Issues and the Use of Experience in Record Linkage
The goal of record linkage is to link quickly and accurately records corresponding to the same person or entity. Fellegi and Sunter (1969) proposed a statistical model for record linkage that assumes pairs of entries, one from each of two files, either are matches corresponding to a single person or nonmatches arising from two different people. Certain patterns of agreements and disagreements on variables in the two files are more likely among matches than among nonmatches. The observed patterns can be viewed as arising from a mixture distribution. Mixture models, which for discrete data are generalizations of latent-class models, can be fit to comparison patterns in order to find matching and nonmatching pairs of records. Mixture models, when used with data from the U.S. Decennial Census — Post Enumeration Survey, quickly give accurate results. A critical issue in new record-linkage problems is determining when the mixture models consistently identify matches and nonmatches, rather than some other division of the pairs of records. A method that uses information based on experience, identifies records to review, and incorporates clerically-reviewed data is proposed.
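The mixture-model idea described in this abstract can be sketched with a small EM fit. This is a minimal illustration, not the method of the chapter: it assumes a two-class latent-class model with binary agree/disagree fields that are conditionally independent within each class, and it runs on simulated data. All names (simulate_patterns, fit_mixture, the m and u probabilities, the starting values) are hypothetical choices for the example.

```python
import random
from collections import Counter


def simulate_patterns(n_match, n_nonmatch, m_true, u_true, seed=0):
    """Simulate binary agreement patterns for matched and nonmatched pairs."""
    rng = random.Random(seed)
    rows = []
    for n, probs in ((n_match, m_true), (n_nonmatch, u_true)):
        for _ in range(n):
            rows.append(tuple(int(rng.random() < q) for q in probs))
    return rows


def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid zero likelihoods."""
    return min(max(x, eps), 1.0 - eps)


def fit_mixture(patterns, n_iter=200, p=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-class mixture over agreement patterns.

    p is the match proportion, m[j] = P(agree on field j | match),
    u[j] = P(agree on field j | nonmatch).
    """
    counts = Counter(patterns)           # work with distinct patterns only
    k = len(next(iter(counts)))
    n = sum(counts.values())
    m, u = [m_init] * k, [u_init] * k
    post = {}
    for _ in range(n_iter):
        # E-step: posterior probability that each pattern comes from a match.
        for g in counts:
            like_m, like_u = p, 1.0 - p
            for j, agree in enumerate(g):
                like_m *= m[j] if agree else 1.0 - m[j]
                like_u *= u[j] if agree else 1.0 - u[j]
            post[g] = like_m / (like_m + like_u)
        # M-step: re-estimate parameters from the weighted pattern counts.
        w_match = sum(counts[g] * post[g] for g in counts)
        p = clamp(w_match / n)
        for j in range(k):
            agree_m = sum(counts[g] * post[g] * g[j] for g in counts)
            agree_u = sum(counts[g] * (1.0 - post[g]) * g[j] for g in counts)
            m[j] = clamp(agree_m / w_match)
            u[j] = clamp(agree_u / (n - w_match))
    return p, m, u, post


if __name__ == "__main__":
    rows = simulate_patterns(300, 2700, [0.95, 0.9, 0.85, 0.9],
                             [0.1, 0.05, 0.2, 0.1])
    p_hat, m_hat, u_hat, post = fit_mixture(rows)
    print("estimated match proportion:", round(p_hat, 3))
```

The starting values place the first class near high agreement, which is one simple way to encourage the fitted classes to correspond to matches and nonmatches rather than to some other division of the pairs. On real files that correspondence is not guaranteed, which is the issue the abstract raises.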

