Overview of record linkage and current research directions
- BUREAU OF THE CENSUS
, 2006
Cited by 55 (1 self)
This paper provides background on record linkage methods that can be used to combine data from a variety of sources, such as person lists and business lists. It also describes some areas of current research.
Robust Identification of Fuzzy Duplicates
- In ICDE
, 2005
Cited by 43 (0 self)
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
ERACER: A Database Approach for Statistical Inference and Data Cleaning
Cited by 3 (0 self)
Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures incorporated into modern DBMSs. We present ERACER, an iterative statistical framework for inferring missing information and correcting such errors automatically. Our approach is based on belief propagation and relational dependency networks, and includes an efficient approximate inference algorithm that is easily implemented in standard DBMSs using SQL and user defined functions. The system performs the inference and cleansing tasks in an integrated manner, using shrinkage techniques to infer correct values accurately even in the presence of dirty data. We evaluate the proposed methods empirically on multiple synthetic and real data sets. The results show that our framework achieves accuracy comparable to a baseline statistical method using Bayesian networks with exact inference. However, our framework has wider applicability than the Bayesian network baseline, due to its ability to reason with complex, cyclic relational dependencies.
“Optimizing the Use of Micro-data: An Overview of the Issues.” Paper presented at the 2006 Joint Statistical Meetings as part of a session organized to honor Pat Doyle. Available at http://client.norc.org/jole/SOLEweb/Accesstomicrodata%5B1%5D.pdf
, 2005
Cited by 1 (0 self)
Doyle and Laura Zayatz. Thanks also to Nick Greenia for extensive discussions on data quality and harm issues, Bill Winkler for alerting me to additional data quality and confidentiality literature, Miriam Heller and Sang Kim for their ideas about the relationship between cyberinfrastructure and confidentiality, Nancy Lutz for her help in developing the model, and Guy Almes, Fredrik Andersson, Matt Freedman, Cheryl Eavey, Nancy Gordon and Dan Weinberg for their suggestions.
“It is becoming clear that advances in technology and increased use of administrative records may, at some point in the future, render our current disclosure avoidance procedures inadequate. At the same time the … federal statistical system face[s] increasing demands for more, better and more recent data to meet critically important public policy and research needs.”
“The extraordinary growth of electronic infrastructure, capacity, and use in the past decade has posed a profound new set of questions about the control, dissemination, power and use of information. On the one hand the high speed internet and the World Wide Web, email, electronic shopping, and cell phone use have opened up extraordinary new worlds of communication and are changing the way we work, play, and learn. On the other, as the electronic world enters our daily lives, the private space untouched by the intrusions of cyberspace and information seekers shrinks, for individuals, firms, and organizations. … There is also another challenge. The need to build more efficient surveillance networks to combat potential terrorist attack argues for less privacy for the individual person or firm to guarantee the security of the society in general. It is in this environment that citizens, business and technology leaders, and policy makers have to figure out how to understand, manage, and regulate the new cyberworld.”
The Effects of Location Access Behavior on Re-identification Risk in a Distributed Environment
Cited by 1 (0 self)
In this paper, we investigate how location access patterns influence the re-identification of seemingly anonymous data. In the real world, individuals visit different locations that gather similar information. For instance, multiple hospitals collect health information on the same patient. To protect anonymity for research purposes, hospitals share sensitive data, such as DNA sequences, stripped of explicit identifiers. Separately, for administrative functions, identified data, stripped of DNA, is made available. On a hospital-by-hospital basis, each pair of DNA and identified databases appears unlinkable; however, links can be established when multiple locations' databases are studied. This problem, known as trail re-identification, is a generalized phenomenon and occurs because an individual's location access pattern can be matched across the shared databases. Data holders cannot exchange data to find and suppress trails that would be re-identified. Thus, it is important to assess the re-identification risk in a system in order to develop techniques to mitigate it. In this research, we evaluate several real-world datasets and observe that trail re-identification is related to the number of people to places. To study this phenomenon in more detail, we develop a generative model for location access patterns that simulates observed behavior. We evaluate trail re-identification risk in a range of simulated patterns, and our findings suggest that the skew of the distribution of people to places is one of the main factors that drives trail re-identification.
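The matching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each location releases one set of de-identified DNA records and one set of identified records, and it links a DNA record to a name only when both have the same trail (the set of locations where they appear) and no other record shares that trail. The function names and the toy hospital data are hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each record to the frozenset of locations whose release contains it."""
    trails = defaultdict(set)
    for location, records in releases.items():
        for record in records:
            trails[record].add(location)
    return {record: frozenset(locs) for record, locs in trails.items()}


def group_by_trail(trails):
    """Invert a record -> trail mapping into trail -> list of records."""
    groups = defaultdict(list)
    for record, trail in trails.items():
        groups[trail].append(record)
    return groups


def link_unique_trails(dna_releases, identified_releases):
    """Link a DNA record to a name when their shared trail is unique on both sides."""
    dna_groups = group_by_trail(build_trails(dna_releases))
    id_groups = group_by_trail(build_trails(identified_releases))
    links = {}
    for trail, dna_records in dna_groups.items():
        names = id_groups.get(trail, [])
        if len(dna_records) == 1 and len(names) == 1:
            links[dna_records[0]] = names[0]
    return links


# Hypothetical toy data: within any single hospital the DNA records and the
# names cannot be paired, but the trails across hospitals can be matched.
dna_releases = {
    "H1": {"d1", "d2", "d4", "d5"},
    "H2": {"d1", "d3"},
    "H3": {"d2", "d3"},
}
identified_releases = {
    "H1": {"alice", "bob", "dave", "erin"},
    "H2": {"alice", "carol"},
    "H3": {"bob", "carol"},
}

links = link_unique_trails(dna_releases, identified_releases)
```

In this toy data, `d4` and `d5` (and `dave` and `erin`) visit only `H1`, so they share a trail and stay unlinked, while the three records with distinctive trails are linked.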
Towards the use of OWA operators for record linkage
- In EUSFLAT-LFA
, 2005
Record linkage is used to establish links between records that, while belonging to two different files, correspond to the same individual. Classical approaches assume that the two files contain some common variables, which are the ones used to link the records. Recently, we introduced a new approach to link records among files when such common variables are not available. In this approach, re-identification is based on so-called structural information. In this paper we study the use of OWA operators for extracting such structural information and thus allowing re-identification.
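For readers unfamiliar with the term, an OWA (ordered weighted averaging) operator in its standard definition sorts its inputs in descending order and takes a weighted sum, so each weight attaches to a rank rather than to a particular input. The sketch below shows only that standard definition; how the paper applies it to extract structural information is not described in the abstract, and the function name `owa` is an assumption made for illustration.

```python
from typing import Sequence


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average: weights apply to the values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


scores = [0.2, 0.9, 0.5]
highest = owa(scores, [1.0, 0.0, 0.0])        # all weight on the largest value
lowest = owa(scores, [0.0, 0.0, 1.0])         # all weight on the smallest value
average = owa(scores, [1 / 3, 1 / 3, 1 / 3])  # equal weights give the mean
```

Choosing the weight vector moves the operator between the maximum, the minimum, and the arithmetic mean of its inputs.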
A Statistical Method for Integrated Data Cleaning and Imputation
Real-world databases often contain both syntactic and semantic errors, in spite of integrity constraints and other safety measures incorporated into standard DBMSs. This is primarily due to the broad scope of incorrect data values that are difficult to fully express using the general types of constraints available. As a result, many errors are subtle and laborious to detect with manually specified rules. However, combining statistical methods with extensions to conventional integrity constraints makes it possible to develop automated data cleaning methods for a variety of relational dependencies. In this work, we focus on exploiting the statistical dependencies among tuples in relational domains such as sensor networks, supply chain systems, and fraud detection. We identify potential statistical dependencies among the data values of related tuples and develop algorithms to automatically estimate these dependencies, utilizing them to jointly fill in missing values at the same time as identifying and correcting errors. The key features of our method are that (1) it uses an efficient approximate inference algorithm that is easily implemented in standard DBMSs and scales well to large database sizes, and (2) it uses shrinkage and joint inference to accurately infer correct values even in the presence of both missing and corrupt values. We evaluate the method empirically on both synthetic and real-world genealogy data and compare it to a baseline statistical method that uses Bayesian networks with exact inference. The results show that our algorithm achieves accuracy comparable to the baseline with respect to inferring missing values. However, our algorithm scales linearly rather than exponentially and can also simultaneously identify and correct corrupted values with high accuracy.
Object Oriented Intelligent Multi-Agent System Data Cleaning Architecture to clean Preference based Text Data
Agents are software programs that perform tasks on behalf of others, and here they are used to clean text data. Agents are task oriented, can learn by themselves, and react to their situation. An agent learns by checking its previous experience in its knowledge base. The agent concept is a complementary approach to the Object Oriented paradigm with respect to the design and implementation of autonomous entities driven by beliefs, goals and plans. Preference based text data cleaning is based on selection: preferences are given by the user in the form of letters, numbers and special characters. The preference based text data cleaning process transforms the given text data into a structured database and extracts the required information using the given keyword. The agents incorporated in the architectural design of the text data cleaning process combine the features of the Multi-Agent System (MAS) framework and the MAS with Learning (MAS-L) framework. The MAS framework reduces the development time and the complexity of implementing the software agents. The MAS-L framework incorporates the intelligence and learning properties of the agents present in the system; it uses Decision Tree learning and an evaluation function to choose the next best decision. This paper proposes the design of a Multi-Agent based Data Cleaning Architecture that incorporates the structural design of agents into an object model. The architectural model for Intelligent Multi-Agent based Data Cleaning inherits the features of the Multi-Agent System (MAS) and uses the MAS-L framework to design the intelligence and learning characteristics.

