
Methods for evaluating and creating data quality (2004)

by W E Winkler
Venue: Inf. Syst
Results 1 - 10 of 12

Overview of record linkage and current research directions

by William E Winkler - BUREAU OF THE CENSUS, 2006
Cited by 55 (1 self)
This paper provides background on record linkage methods that can be used in combining data from a variety of sources such as person lists and business lists. It also gives some areas of current research.

Reasoning about Record Matching Rules

by Wenfei Fan
Cited by 20 (1 self)
To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.

Data Quality at a Glance

by Monica Scannapieco, Paolo Missier, Carlo Batini - Datenbank-Spektrum, 2005
Cited by 11 (2 self)
The consequences of poor quality of data are often experienced in everyday life, but without making the necessary connections to its causes. For example, the late or missed delivery of a letter is often blamed on a dysfunctional postal service, although a closer look often

Data Quality in Genome Databases

by Heiko Müller, Felix Naumann, Johann-Christoph Freytag, 2003
Cited by 10 (1 self)
Genome databases store data about molecular biological entities such as genes, proteins, diseases, etc. The main purpose of creating and maintaining such databases in commercial organizations is their importance in the process of drug discovery. Genome data is analyzed and interpreted to gain so-called leads, i.e., promising structures for new drugs. Following a lead through the process of drug development, testing, and finally several stages of clinical trials is extremely expensive. Thus, an underlying high quality database is of utmost importance. Due to the exploratory nature of genome databases, commercial and public, they are inaccurate, incomplete, outdated and in an overall poor state.

Beyond k-anonymity: A decision theoretic framework for assessing privacy risk

by Guy Lebanon, Monica Scannapieco, Mohamed R. Fouad, Elisa Bertino - In Privacy in statistical databases, Springer Lecture Notes in Computer Science
Cited by 7 (2 self)
An important issue any organization or individual has to face when managing data containing sensitive information is the risk that can be incurred when releasing such data. Even though data may be sanitized before being released, it is still possible for an adversary to reconstruct the original data using additional information, thus resulting in privacy violations. To date, however, a systematic approach to quantify such risks is not available. In this paper we develop a framework, based on statistical decision theory, that assesses the relationship between the disclosed data and the resulting privacy risk. We model the problem of deciding which data to disclose in terms of deciding which disclosure rule to apply to a database. We assess the privacy risk by taking into account both the entity identification and the sensitivity of the disclosed information. Furthermore, we prove that, under some conditions, the estimated privacy risk is an upper bound on the true privacy risk. Finally, we relate our framework to the k-anonymity disclosure method. The proposed framework makes the assumptions behind k-anonymity explicit, quantifies them, and extends them in several natural directions.

Automatic Training Example Selection for Scalable Unsupervised Record Linkage

by Peter Christen
Cited by 4 (3 self)
Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and time-consuming collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. With traditional techniques, classification thresholds have to be set either manually or using an EM-based approach. Many modern classification techniques, on the other hand, are based on supervised machine learning and thus require training data, which is often not available in real world situations. A novel two-step approach to unsupervised record pair classification is presented in this paper. In the first step, training examples are selected automatically, and in the second step these examples are used to train a binary classifier. An experimental evaluation shows that this approach can outperform k-means clustering and can also be much faster than other classification techniques.
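The two-step idea in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the thresholds (`high`, `low`), the toy similarity vectors, and the nearest-centroid classifier used as the stand-in binary classifier are all assumptions made for illustration. Each vector is assumed to hold per-attribute similarity scores in [0, 1] for one record pair.

```python
from statistics import mean

def select_seeds(vectors, high=0.9, low=0.1):
    """Step 1 (assumed rule): take pairs whose similarities are all near 1
    as likely matches and pairs whose similarities are all near 0 as
    likely non-matches."""
    matches = [v for v in vectors if min(v) >= high]
    non_matches = [v for v in vectors if max(v) <= low]
    return matches, non_matches

def centroid(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [mean(col) for col in zip(*vectors)]

def train(matches, non_matches):
    """Step 2: fit a nearest-centroid classifier (a stand-in for any
    binary classifier) on the automatically selected seeds."""
    c_match, c_non = centroid(matches), centroid(non_matches)

    def classify(v):
        d_match = sum((a - b) ** 2 for a, b in zip(v, c_match))
        d_non = sum((a - b) ** 2 for a, b in zip(v, c_non))
        return d_match < d_non  # True means "classified as a match"

    return classify

# Hypothetical similarity vectors for six record pairs.
vectors = [
    [1.0, 0.95, 1.0],
    [0.9, 1.0, 0.92],
    [0.0, 0.05, 0.1],
    [0.1, 0.0, 0.0],
    [0.8, 0.6, 0.9],
    [0.2, 0.4, 0.1],
]

matches, non_matches = select_seeds(vectors)
classify = train(matches, non_matches)
labels = [classify(v) for v in vectors]
```

Here the first four pairs serve as seeds, and the two ambiguous pairs are labelled by whichever seed centroid is closer. A real system would need to handle the case where no seeds are found on one side.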

Towards an Open Source Toolkit for Building Record Linkage Workflows

by Marco Fortini, Monica Scannapieco, Laura Tosco, Tiziana Tuoto - In Proc. of the SIGMOD Workshop on Information Quality in Information Systems (IQIS’06), 2006
Cited by 2 (1 self)
Record linkage has been a subject of research for several decades, and a huge number of record linkage solutions have been proposed, based on probabilistic and empirical paradigms. However, record linkage is a complex process, for the execution of which one single technique is often not enough; it can be seen as composed of distinct phases, each requiring a specific technique and depending on given application and data requirements. Due to such complexity and application dependency, in this paper we propose a toolkit for record linkage, called RELAIS. The toolkit is based on the idea of choosing the most appropriate technique for each phase, and of combining such techniques in a dynamically built record linkage workflow. A real case study validates the RELAIS idea and provides a methodological pattern for driving the design of a record linkage workflow on the basis of the requirements of a real application.

Social network analysis and mining for business applications

by Francesco Bonchi, Carlos Castillo, Aristides Gionis, Alejandro Jaimes - ACM Trans. Intell. Syst. Technol
Cited by 2 (0 self)
Social network analysis has gained significant attention in recent years, largely due to the success of online social networking and media-sharing sites, and the consequent availability of a wealth of social network data. In spite of the growing interest, however, there is little understanding of the potential business applications of mining social networks. While there is a large body of research on different problems and methods for social network mining, there is a gap between the techniques developed by the research community and their deployment in real-world applications. Therefore the potential business impact of these techniques is still largely unexplored. In this article we use a business process classification framework to put the research topics in a business context and provide an overview of what we consider key problems and techniques in social network analysis and mining from the perspective of business applications. In particular, we discuss data acquisition and preparation, trust, expertise, community structure, network dynamics, and information propagation. In each case we present a brief overview of the problem, describe state-of-the-art approaches, discuss business application examples, and map each of the topics to a business process classification framework. In addition, we provide insights on prospective business applications, challenges, and future research directions. The main contribution of this article is to provide a state-of-the-art overview of current techniques while providing a critical perspective on business applications of social network analysis and mining.

Optimizing the Use of Micro-data: An Overview of the Issues. Paper presented at the 2006 Joint Statistical Meetings as part of a session organized to honor Pat Doyle. Available at http://client.norc.org/jole/SOLEweb/Accesstomicrodata%5B1%5D.pdf

by Julia Lane, 2005
Cited by 1 (0 self)
Doyle and Laura Zayatz. Thanks also to Nick Greenia for extensive discussions on data quality and harm issues, Bill Winkler for alerting me to additional data quality and confidentiality literature, Miriam Heller and Sang Kim for their ideas about the relationship between cyberinfrastructure and confidentiality, Nancy Lutz for her help in developing the model and Guy Almes, Fredrik Andersson, Matt Freedman, Cheryl Eavey, Nancy Gordon and Dan Weinberg for their suggestions. “It is becoming clear that advances in technology and increased use of administrative records may, at some point in the future, render our current disclosure avoidance procedures inadequate. At the same time the … federal statistical system face[s] increasing demands for more, better and more recent data to meet critically important public policy and research needs.” “The extraordinary growth of electronic infrastructure, capacity, and use in the past decade has posed a profound new set of questions about the control, dissemination, power and use of information. On the one hand the high speed internet and the World Wide Web, email, electronic shopping, and cell phone use have opened up extraordinary new worlds of communication and are changing the way we work, play, and learn. On the other, as the electronic world enters our daily lives, the private space untouched by the intrusions of cyberspace and information seekers shrinks - for individuals, firms, and organizations. … There is also another challenge. The need to build more efficient surveillance networks to combat potential terrorist attack argues for less privacy for the individual person or firm to guarantee the security of the society in general. It is in this environment that citizens, business and technology leaders, and policy makers have to figure out how to understand, manage, and regulate the new cyberworld.”

Development and User Experiences of an Open Source Data Cleaning, Deduplication and Record Linkage System

by Peter Christen
Record linkage, also known as database matching or entity resolution, is now recognised as a core step in the KDD process. Data mining projects increasingly require that information from several sources is combined before the actual mining can be conducted. Also of increasing interest is the deduplication of a single database. The objectives of record linkage and deduplication are to identify, match and merge all records that relate to the same real-world entities. Because real-world data is commonly ‘dirty’, data cleaning is an important first step in many deduplication, record linkage, and data mining projects. In this paper, an overview of the Febrl (Freely Extensible Biomedical Record Linkage) system is provided, and the results of a recent survey of Febrl users are discussed. Febrl includes a variety of functionalities required for data cleaning, deduplication and record linkage, and it provides a graphical user interface that facilitates its application for users who do not have programming experience.