Results 11 - 20 of 22
Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
, 2004
Cited by 6 (0 self)
rship (Xing et al. 2003), and relative comparisons (Schultz & Joachims 2004). These approaches have shown improvements over traditional similarity functions for different data types such as vectors in Euclidean space, strings, and database records composed of multiple text fields. While these initial results are encouraging, there still remains a large number of similarity functions that are currently unable to adapt to a particular domain. In our research, we attempt to bridge this gap by developing both new learnable similarity functions and methods for their application to particular problems in machine learning and data mining. In preliminary work, we proposed two learnable similarity functions for strings that adapt distance computations given training pairs of equivalent and non-equivalent strings (Bilenko & Mooney 2003a). The first function is based on a probabilistic model of edit distance with affine gaps (Gus- Copyright © 2004, American Association for Artificial Intelli
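To make the phrase "edit distance with affine gaps" concrete, here is a minimal Python sketch of the classic fixed-cost affine-gap distance (a Gotoh-style dynamic program). It is not the probabilistic, learnable model the abstract describes. The function name and the cost parameters `sub_cost`, `gap_open`, and `gap_extend` are assumptions chosen for illustration; a learnable variant would, roughly speaking, fit such costs from training pairs instead of fixing them by hand.

```python
import math


def affine_gap_distance(s, t, sub_cost=1.0, gap_open=1.0, gap_extend=0.5):
    """Minimum alignment cost of s and t under affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All costs are illustrative defaults, not values from the paper.
    """
    inf = math.inf
    n, m = len(s), len(t)
    # M[i][j]: best cost with s[i-1] aligned to t[j-1]
    # D[i][j]: best cost ending with s[i-1] aligned to a gap (deletion)
    # I[i][j]: best cost ending with t[j-1] aligned to a gap (insertion)
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    I = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            M[i][j] = c + min(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1])
            # Opening a new gap costs gap_open; continuing one costs gap_extend.
            D[i][j] = min(M[i - 1][j] + gap_open,
                          I[i - 1][j] + gap_open,
                          D[i - 1][j] + gap_extend)
            I[i][j] = min(M[i][j - 1] + gap_open,
                          D[i][j - 1] + gap_open,
                          I[i][j - 1] + gap_extend)
    return min(M[n][m], D[n][m], I[n][m])
```

Because extending a gap is cheaper than opening one, a single long insertion (for example a missing middle name) is penalized less than several scattered edits, which is the usual motivation for affine gaps in string matching.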
Regression Analysis with Linked Data
, 2004
Cited by 5 (0 self)
Record linkage, or exact matching, can be used to join together two files that contain information on the same individuals, but lack unique personal identification codes. The possibility of errors in linkage causes problems for estimating the relationships between variables on the two files. The effect is analogous to the impact of measurement error. A model of a linear regression relationship between variables in linked files is proposed. Assuming the probabilities that pairs of records are links are known, an unbiased estimator of the regression coefficients is derived. Methods for estimating the linkage probabilities by using mixture models are discussed. A consistent estimator of the covariance matrix of the proposed estimator is proposed. A bootstrap estimator is used to reflect the impact of the uncertainty in record linkage model parameters on the estimators of the regression parameters. A simulation study compares the performance of the proposed estimator and alternatives.
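The measurement-error analogy can be illustrated with a small, noise-free Python sketch. It assumes a one-covariate regression through the origin and a known matrix `q` whose entry `q[i][j]` is the probability that response record i is truly linked to covariate record j. All names and numbers here are hypothetical, and the adjusted estimator shown is only one simple way to use known linkage probabilities, not necessarily the estimator derived in the paper.

```python
def expected_covariate(q_row, x):
    """Expected covariate value for one response record under link probabilities."""
    return sum(q * xj for q, xj in zip(q_row, x))


def slope_through_origin(w, z):
    """Least-squares slope of z on w with no intercept."""
    return sum(wi * zi for wi, zi in zip(w, z)) / sum(wi * wi for wi in w)


# Hypothetical setup: true slope 2.0, three records, 80% chance each link is right.
beta = 2.0
x = [1.0, 2.0, 3.0]
q = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]

# Under linkage error, the expected response for record i is beta * sum_j q[i][j] * x[j].
w = [expected_covariate(row, x) for row in q]
z = [beta * wi for wi in w]  # expected responses, no random noise added

naive = slope_through_origin(x, z)      # ignores linkage error
adjusted = slope_through_origin(w, z)   # regresses on the expected covariate
```

In this toy setting the naive slope is pulled toward zero, as with classical measurement error, while regressing on the expected covariate recovers the slope used to generate the data.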
Automatic Identity Recognition in the Semantic Web
Cited by 5 (0 self)
The OKKAM initiative has recently highlighted the need to move from the traditional web towards a “web of entities”, where descriptions of real-world objects could be retrieved, univocally identified, and shared over the web. In this paper, we propose our vision of the entity recognition problem and, in particular, we propose methods and techniques to capture the “identity” of a real entity in the Semantic Web. We claim that automatic techniques are needed to compare different RDF descriptions of a domain with the goal of automatically detecting heterogeneous descriptions of the same real-world objects. Problems and techniques to solve them are discussed together with some experimental results on a real case study on web data.
On Bayesian Record Linkage
- In Sixth International World Meeting on Bayesian Analysis
, 2000
Cited by 4 (1 self)
Record linkage refers to the use of an algorithmic technique to match records from different data sets that correspond to the same statistical unit (Belin and Rubin, 1995). In this paper we propose a fully Bayesian approach to record linkage. We use standard Metropolis-Hastings and Simulated Annealing algorithms to derive the marginal posterior distribution of a matrix-valued parameter which indicates the "configuration" of matches between the two lists. We suggest using different inferential summaries of the posterior: in particular, we discuss the use of the posterior mode. Alternatively, we sketch the possibility of using a formal Bayesian decision theory approach. Keywords: FALSE MATCH RATE; INTEGRATION OF DATA SOURCES; BAYESIAN DECISION RULES; MCMC.
A conditional model of deduplication for multi-type relational data
, 2005
Cited by 3 (0 self)
Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the merge decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. We evaluate the system on two citation matching datasets, for which we deduplicate both papers and venues. We show that by collectively deduplicating paper and venue records, we obtain up to a 30% error reduction in venue deduplication, and up to a 20% error reduction in paper deduplication over competing methods.
A Contingency-Table Model for Imputing Data Satisfying Analytic Constraints
, 2003
Cited by 1 (0 self)
This paper describes a method for imputation in general contingency tables when the imputations are subject to both analytic (edit) constraints and probabilistic distributional constraints. The model extends edit ideas in Fellegi and Holt (1976) and Winkler and Chen (2002). The model extends missing-at-random imputation ideas in Little and Rubin (1987). Some of the ideas are related to Friedman (2001) and Thibaudeau and Winkler (2002). Keywords: hot-deck, loglinear models, set-covering.
Automatically Estimating Record Linkage False Match Rates
, 2007
Cited by 1 (0 self)
This paper provides a mechanism for automatically estimating record linkage false match rates in situations where the subset of the true matches is reasonably well separated from other pairs and there is no training data. The method provides an alternative to the method of Belin and Rubin (JASA 1995) and is applicable in more situations. We provide examples demonstrating why the general problem of error rate estimation (both false match and false nonmatch rates) is likely impossible in situations without training data and exceptionally difficult even in the extremely rare situations when training data are available.
FEDERAL HOUSING ADMINISTRATION PREPARED FOR: SOCIETY OF ACTUARIES ANNUAL MEETING
, 2004
There are a number of reasons why data quality is important to business and government: 1. High-quality data can be a major business asset, a unique source of competitive advantage. 2. Poor quality data can lower customer satisfaction. 3. Poor quality data can lower employee job satisfaction. 4. Poor quality data can breed organizational mistrust. The August 2003 issue of The Newsmonthly of the American Academy of Actuaries reports that the National Association of Insurance Commissioners (NAIC) suggests that actuaries audit “controls related to the completeness, accuracy, and classification of loss data”. There is little published work on data quality in the actuarial literature. There are, however, several texts and a large number of published papers on data quality in related disciplines, especially statistics and computer science. In Section 2 of this work, we discuss some data quality issues as they relate directly to practical
IFSA-EUSFLAT 2009 Semantical evaluators
In the context of a possibilistic framework for detection of object co-reference, evaluators have been defined as operators that compare two values and express the belief that such values are co-referent. Hereby, co-reference of two values means that these values describe the same entity in the real world. In this paper, a class of evaluators is investigated that determines the belief of co-reference based on semantical connections between values of the universe. These semantical connections are modeled by means of binary relations. In case these binary relations are not given a priori, they can be (partially) learned in an iterative co-reference detection schema.
Chapter Modeling Issues and the Use of Experience in Record Linkage
The goal of record linkage is to link quickly and accurately records corresponding to the same person or entity. Fellegi and Sunter (1969) proposed a statistical model for record linkage that assumes pairs of entries, one from each of two files, either are matches corresponding to a single person or nonmatches arising from two different people. Certain patterns of agreements and disagreements on variables in the two files are more likely among matches than among nonmatches. The observed patterns can be viewed as arising from a mixture distribution. Mixture models, which for discrete data are generalizations of latent-class models, can be fit to comparison patterns in order to find matching and nonmatching pairs of records. Mixture models, when used with data from the U.S. Decennial Census — Post Enumeration Survey, quickly give accurate results. A critical issue in new record-linkage problems is determining when the mixture models consistently identify matches and nonmatches, rather than some other division of the pairs of records. A method that uses information based on experience, identifies records to review, and incorporates clerically-reviewed data is proposed.
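The mixture-model idea described above can be sketched in Python as a two-class latent-class model fitted with EM. This is a minimal illustration that assumes binary agree/disagree comparisons and conditional independence of fields within each class; the function names, starting values, and the clipping constant `EPS` are assumptions made for the example and are not taken from the chapter.

```python
import math

EPS = 1e-6  # keeps estimated probabilities away from exactly 0 or 1


def _clip(x):
    return min(max(x, EPS), 1.0 - EPS)


def fit_mixture(patterns, n_iter=200, p=0.1, m_init=0.9, u_init=0.1):
    """EM for a two-class mixture over binary agreement patterns.

    p    : proportion of record pairs that are matches
    m[i] : P(field i agrees | match)
    u[i] : P(field i agrees | nonmatch)
    """
    n, k = len(patterns), len(patterns[0])
    m = [m_init] * k
    u = [u_init] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for a, mi, ui in zip(gamma, m, u):
                pm *= mi if a else 1.0 - mi
                pu *= ui if a else 1.0 - ui
            post.append(pm / (pm + pu))
        # M-step: re-estimate parameters from the soft class assignments.
        total = sum(post)
        p = _clip(total / n)
        for i in range(k):
            agree_m = sum(g * gam[i] for g, gam in zip(post, patterns))
            agree_u = sum((1.0 - g) * gam[i] for g, gam in zip(post, patterns))
            m[i] = _clip(agree_m / total)
            u[i] = _clip(agree_u / (n - total))
    return p, m, u


def match_weight(gamma, m, u):
    """Log2 likelihood ratio of a pattern under the match vs nonmatch class."""
    return sum(
        math.log2(mi / ui) if a else math.log2((1.0 - mi) / (1.0 - ui))
        for a, mi, ui in zip(gamma, m, u)
    )
```

A sketch like this also shows why the critical issue raised in the abstract matters: EM only finds two latent classes, and nothing guarantees that they correspond to matches and nonmatches rather than some other division of the pairs. The starting values above merely nudge the first class toward "mostly agreeing" patterns.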

