Record Linkage: Current Practice and Future Directions
- CSIRO Mathematical and Information Sciences
, 2003
Cited by 27 (0 self)
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the merge/purge problem. This paper presents the "standard" probabilistic record linkage model and the associated algorithm. Recent work in information retrieval, federated database systems and data mining has proposed alternatives to key components of the standard algorithm. The impact of these alternatives on the standard approach is assessed. The key question is whether and how these new alternatives are better in terms of time, accuracy and degree of automation for a particular record linkage application.
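As an illustration of what a probabilistic linkage model of this kind typically computes, here is a minimal sketch of Fellegi-Sunter style match weights with a two-threshold decision rule. It is not taken from the paper: the field names, the m- and u-probabilities, and the thresholds are all hypothetical values chosen for the example.

```python
import math

# Hypothetical per-field probabilities (illustrative only, not from the paper):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}


def match_weight(rec_a, rec_b, field_probs=FIELD_PROBS):
    """Sum log2 agreement or disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Two-threshold rule: link, non-link, or possible link (clerical review)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


a = {"surname": "smith", "given_name": "john", "postcode": "2600"}
b = {"surname": "smith", "given_name": "jon", "postcode": "2601"}
c = {"surname": "jones", "given_name": "mary", "postcode": "3000"}

for other in (a, b, c):
    w = match_weight(a, other)
    print(round(w, 2), classify(w))
```

Exact equality is used as the field comparison to keep the sketch short; practical systems usually substitute approximate string comparators and estimate m and u from the data.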
A Parallel Open Source Data Linkage System
, 2004
Cited by 23 (11 self)
In many data mining projects information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques.
Preparation of name and address data for record linkage using hidden Markov models
- Tim Churches
, 2002
Cited by 19 (11 self)
event in one or more data collections. In the absence of a shared, unique key, record linkage involves the comparison of ensembles of partially-identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalised in order to validly carry out these comparisons. Traditionally, deterministic rule-based data processing systems have been used to carry out this pre-processing, which is commonly referred to as "standardisation". This paper describes an alternative approach to standardisation, using a combination of lexicon-based tokenisation and probabilistic hidden Markov models (HMMs).
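The two-stage idea described in this abstract (lexicon lookup to tag tokens, then an HMM to assign each token to an output field) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the lexicon, the tag set (TI, GN, SN, UN), the hidden states, and every probability below are hypothetical, and a standard Viterbi search stands in for whatever decoding the paper uses.

```python
import math

# All names and numbers below are hypothetical, for illustration only.
STATES = ["title", "givenname", "surname"]
START = {"title": 0.3, "givenname": 0.6, "surname": 0.1}
TRANS = {  # P(next state | current state)
    "title": {"title": 0.05, "givenname": 0.75, "surname": 0.20},
    "givenname": {"title": 0.01, "givenname": 0.19, "surname": 0.80},
    "surname": {"title": 0.01, "givenname": 0.09, "surname": 0.90},
}
EMIT = {  # P(lexicon tag | state); UN marks a token missing from the lexicon
    "title": {"TI": 0.96, "GN": 0.01, "SN": 0.01, "UN": 0.02},
    "givenname": {"TI": 0.01, "GN": 0.80, "SN": 0.10, "UN": 0.09},
    "surname": {"TI": 0.01, "GN": 0.15, "SN": 0.70, "UN": 0.14},
}
LEXICON = {"dr": "TI", "mr": "TI", "peter": "GN", "paul": "GN", "miller": "SN"}


def tokenise(text):
    """Lower-case, strip simple punctuation, and tag each token via the lexicon."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return tokens, [LEXICON.get(tok, "UN") for tok in tokens]


def viterbi(tags):
    """Return the most likely hidden-state sequence for a tag sequence."""
    if not tags:
        return []
    # best[s] = (log-probability of the best path ending in state s, that path)
    best = {
        s: (math.log(START[s]) + math.log(EMIT[s][tags[0]]), [s]) for s in STATES
    }
    for tag in tags[1:]:
        new_best = {}
        for s in STATES:
            score, path = max(
                (best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES
            )
            new_best[s] = (score + math.log(EMIT[s][tag]), path + [s])
        best = new_best
    return max(best.values())[1]


tokens, tags = tokenise("Dr. Peter Paul Miller")
print(list(zip(tokens, viterbi(tags))))
```

Working on tags rather than raw tokens keeps the emission table small, which is the main attraction of combining a lexicon with an HMM in this kind of standardisation step.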
Probabilistic Name and Address Cleaning and Standardisation
, 2002
Cited by 10 (1 self)
In the absence of a shared unique key, an ensemble of nonunique personal attributes such as names and addresses is often used to link data from disparate sources. Such data matching is widely used when assembling data warehouses and business mailing lists, and is a foundation of many longitudinal epidemiological and other health related studies. Unfortunately,
Record Linkage Software and Methods for Merging Administrative Lists
- Statistical Research Report Series No. RR/2001/03, Washington DC, US Bureau of the Census 2001
, 2001
Cited by 9 (0 self)
National Statistical Institutes often have the need to merge administrative files from a variety of sources for which unique identifiers are not available to facilitate matching. Agencies such as Eurostat have the need to connect data sources from different countries and sources and to verify the confidentiality of microdata. To do this merging of administrative lists, agencies need fast software for cleaning up and standardizing lists and for merging the lists. The U.S. Bureau of the Census has software for name standardization, address standardization, and matching that is considered state-of-the-art. The standardization software breaks names and addresses into components that are easily compared. The matching software accounts for typographical error, automatically estimates matching parameters, and optimizes sets of assignments over large groups of pairs of records.
Quality of Very Large Databases
, 2001
Cited by 9 (2 self)
Analyses and data mining of large computer files are affected by the quality of the information in the files. For large population registers and for files that are created by merging two or more files, duplicate entries must be identified. Duplicate identification can depend on record linkage software that can deal with name, address, and date-of-birth data containing many typographical errors. Quantitative and qualitative data must be edited to assure that mutually contradictory or missing items are changed automatically and quickly. This paper describes computational methods and software that are suitable for groups of files where individual files contain between 1 million and 4 billion records. Keywords: record linkage, editing, imputation, data mining
Disclosure Risk Assessment In Statistical Microdata Protection Via Advanced Record Linkage
, 2003
Cited by 6 (1 self)
This paper reviews conventional record linkage, which assumes shared variables between the external and the protected data sets, and then shows that record linkage, and thus disclosure, is still possible without shared variables.
Automated probabilistic address standardisation and verification
- In ‘Australasian Data Mining Conference’ (AusDM’05)
, 2005
Cited by 3 (3 self)
Abstract. Addresses are a key part of many records containing information about people and organisations, and it is therefore important that accurate address information is available before such data is mined or stored in data warehouses. Unfortunately, addresses are often captured in non-standard and free-text formats, usually with some degree of spelling and typographical errors. Additionally, addresses change over time, for example when people move, when streets are renamed, or when new suburbs are built. Cleaning and standardising addresses, as well as verifying if they really exist, are therefore important steps in data mining pre-processing. In this paper we present an automated probabilistic approach based on a hidden Markov model (HMM), which uses national address guidelines and a comprehensive national address database to clean, standardise and verify raw input addresses. Initial experiments show that our system can correctly standardise even complex and unusual addresses.
Assessing Deduplication and Data Linkage Quality: What to Measure
- In Proc. of the 2005 Australian Conf. on Data Mining
, 2005
Cited by 1 (0 self)
Abstract. Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when deduplicating or linking very large data sets. Different measures have been used to characterise the quality of data linkage algorithms. This paper presents an overview of the issues involved in measuring deduplication and data linkage quality, and it is shown that measures in the space of record pair comparisons can produce deceptive accuracy results. Various measures are discussed and recommendations are given on how to assess deduplication and data linkage quality.
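One way to see why measures taken over the space of record pair comparisons can mislead is a small worked example. The sketch below is not from the paper, and its function name and input counts are hypothetical. It assumes a deduplication of n records, where the number of candidate pairs is n(n-1)/2 and almost all of them are true non-matches.

```python
def pair_quality(n_records, true_matches, found_matches, correct_found):
    """Compute pair-space quality measures for a deduplication run.

    n_records:     number of records being deduplicated
    true_matches:  number of record pairs that truly match
    found_matches: number of pairs the linkage classified as matches
    correct_found: how many of the found pairs are true matches
    """
    total_pairs = n_records * (n_records - 1) // 2
    tp = correct_found
    fp = found_matches - correct_found
    fn = true_matches - correct_found
    tn = total_pairs - tp - fp - fn  # dominates when n_records is large
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / total_pairs,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical run: 10,000 records, 1,000 true match pairs,
# 800 pairs declared matches, of which only 400 are correct.
result = pair_quality(10_000, 1_000, 800, 400)
for name, value in result.items():
    print(f"{name}: {value:.5f}")
```

With these counts, accuracy stays above 0.9999 even though half of the declared matches are wrong and most true matches are missed, because the true negatives swamp every other term. Precision, recall and the F-measure ignore true negatives and so expose the poor result.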
Data Integration and Record Matching: An Austrian Contribution to Research in Official Statistics. Austrian Journal of Statistics
, 2003
Cited by 1 (1 self)
Abstract: Data integration techniques are one of the core elements of

