A comparison of string distance metrics for name-matching tasks, 2003
Cited by 243 (9 self)
Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.
A Comparison of String Metrics for Matching Names and Records - KDD WORKSHOP ON DATA CLEANING AND OBJECT CONSOLIDATION, 2003
Cited by 64 (4 self)
We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We then describe an extension to the toolkit which allows records to be compared. We discuss
Advanced Methods For Record Linkage, 1994
Cited by 59 (14 self)
s Service. The study showed that the fewest errors typically occur at the beginning of a string and the error rates by character position increase monotonically as the position moves to the right. The enhancement basically consisted of adjusting the string comparator value upward by a fixed amount if the first four characters agreed; by lesser amounts if the first three, two, or one characters agreed. The string comparator examined by Budzinsky (1991) consisted of the Jaro comparator with only the Winkler enhancement. The final enhancement due to Lynch and Winkler (1994) adjusts the string comparator value if the strings are longer than six characters and more than half the characters beyond the first four agree. The final enhancement was based on detailed comparisons between versions of the comparator. The comparisons involved tens of thousands of pairs of last names, first names, and street names that did not agree on a character-by-character basis but were associated with truly...
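The prefix adjustment described in this abstract can be illustrated with a minimal Python sketch. It assumes the commonly cited form of the Jaro comparator and of the Winkler prefix boost, with a scaling factor of 0.1 and a prefix cap of four characters. The function names, the scaling factor, and the omission of the later long-string adjustment are assumptions of this sketch, not details taken from the paper.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1] (commonly cited formulation)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the number of matched characters that are out of order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scale=0.1, max_prefix=4):
    """Jaro score raised according to the length of the agreeing prefix.

    scale=0.1 and max_prefix=4 are assumed defaults for this sketch.
    """
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)
```

Under these assumptions, a pair that agrees on its first four characters receives the largest boost, and pairs agreeing on three, two, or one leading characters receive progressively smaller ones. The additional adjustment for strings longer than six characters is not implemented here.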
Improved Decision Rules In The Fellegi-Sunter Model Of Record Linkage - Proceedings of the Section on Survey Research Methods, American Statistical Association, 1993
Cited by 29 (12 self)
Many applications of the Fellegi-Sunter model use simplifying assumptions and ad hoc modifications to improve matching efficacy. Because of model misspecification, distinctive approaches developed in one application typically cannot be used in other applications and do not always make use of advances in statistical and computational theory. An Expectation-Maximization (EMH) algorithm that constrains the estimates to a convex subregion of the parameter space is given. The EMH algorithm provides probability estimates that yield better decision rules than unconstrained estimates. The algorithm is related to results of Meng and Rubin (1993) on Multi-Cycle Expectation-Conditional Maximization algorithms and makes use of results of Haberman (1977) that hold for large classes of loglinear models. Key Words: MCECM Algorithm, Latent Class, Computer Matching, Error Rate. This paper provides a theory for obtaining constrained maximum likelihood estimates for latent-class, loglinear models on finite ...
Record Linkage: Current Practice and Future Directions - CSIRO Mathematical and Information Sciences, 2003
Cited by 27 (0 self)
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the "standard" probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.
On Evaluation and Training-Set Construction for Duplicate Detection - PROCEEDINGS OF THE KDD-2003 WORKSHOP ON DATA CLEANING, RECORD LINKAGE, AND OBJECT CONSOLIDATION, WASHINGTON DC, 2003
Cited by 24 (1 self)
A variety of experimental methodologies have been used to evaluate the accuracy of duplicate-detection systems. We advocate presenting precision-recall curves as the most informative evaluation methodology. We also discuss a number of issues that arise when evaluating and assembling training data for adaptive systems that use machine learning to tune themselves to specific applications. We consider several different application scenarios and experimentally examine the effectiveness of alternative methods of collecting training data under each scenario. We propose two new approaches to collecting training data, called static-active learning and weakly-labeled non-duplicates, and present experimental results on their effectiveness.
Masking Microdata Files - Proceedings of the Survey Research Methods Section, American Statistical Association, 1995
Cited by 20 (1 self)
Government agencies collect many types of data, but due to confidentiality restrictions, use of the microdata is often limited to sworn agents working on secure computer systems at those agencies. These restrictions can severely affect public policy decisions made at one agency that has access to nonconfidential summary statistics only. This necessitates creation of microdata which not only meets the confidentiality requirements but also has sufficient utility. This paper describes a general methodology for producing public-use data files that preserves confidentiality and allows many analytical uses. The methodology masks quantitative data using an additive-noise approach and then, when necessary, employs a reidentification/swapping methodology to assure confidentiality. One of the major advantages of this masking scheme is that it also allows obtaining precise subpopulation estimates, which is not possible with other known masking schemes. In addition, if controlled distortion is app...
Methods for evaluating and creating data quality - Information Systems, 2003
Cited by 19 (2 self)
This paper provides a survey of two classes of methods that can be used in determining and improving the quality of individual files or groups of files. The first are edit/imputation methods for maintaining business rules and for imputing for missing data. The second are methods of data cleaning for finding duplicates within files or across files. Published by Elsevier Ltd.
Estimating the probability of events that have never occurred: when is your vote decisive? - Journal of the American Statistical Association, 1998
Cited by 16 (11 self)
Researchers sometimes argue that statisticians have little to contribute when few realizations of the process being estimated are observed. We show that this argument is incorrect even in the extreme situation of estimating the probabilities of events so rare that they have never occurred. We show how statistical forecasting models allow us to use empirical data to improve inferences about the probabilities of these events. Our application is estimating the probability that your vote will be decisive in a U.S. presidential election, a problem that has been studied by political scientists for more than two decades. The exact value of this probability is of only minor interest, but the number has important implications for understanding the optimal allocation of campaign resources, whether states and voter groups receive their fair share of attention from prospective presidents, and how formal "rational choice" models of voter behavior might be able to explain why people vote at all. We show how the probability of a decisive vote can be estimated empirically from state-level forecasts of the presidential election and illustrate with the example of 1992. Based on generalizations of standard political science forecasting models, we estimate the (prospective) probability of a single vote being decisive as about 1 in 10 million for close national elections such as 1992, varying by about a factor of 10 among states. Our results support the argument that subjective probabilities of many types are best obtained through empirically based statistical prediction models rather than solely through mathematical reasoning. We discuss the implications of our findings for the types of decision analyses used in public choice studies.
Recursive Analysis Of Linked Data Files - Proceedings of the 1996 Census Bureau Annual Research Conference, 1996
Cited by 5 (1 self)
This paper demonstrates a methodology for analyzing two or more files when the only common information is name and address that is subject to significant error. Such a situation might arise with lists of businesses. We assume that a small proportion of records can be accurately matched. With the matched pairs we build an edit/imputation model and add predicted quantitative values, via a regression analysis, to each file. Matching is then repeated with the common quantitative data and with name and address information. If necessary, the edit/impute, regression, and matching steps can be repeated in a recursive fashion. In large measure the ideas of Neter, Maynes, and Ramanathan (1965) are revised but with new tools. KEYWORDS: Edit, Imputation, Record Linkage, Regression Analysis, Recursive Processes. 1. INTRODUCTION: To make the best decisions, researchers and policymakers often need more information than is available in a single data base or in summary statistics from multiple files. S...

