An Adaptive and Efficient Algorithm for Detecting Approximately Duplicate Database Records (2000)
Citations: 2 (1 self)
BibTeX
@MISC{Monge00anadaptive,
author = {Alvaro E. Monge},
title = {An Adaptive and Efficient Algorithm for Detecting Approximately Duplicate Database Records},
year = {2000}
}
Abstract
The integration of information is an important area of research in databases. By combining multiple information sources, a more complete and more accurate view of the world is attained, and additional knowledge is gained. This is a non-trivial task, however. Often there are many sources that contain information about a certain kind of entity, and some will contain records concerning the same real-world entity. Furthermore, one source may not have the exact information that another source contains. Some of the information may differ, for example because of data entry errors, or may be missing altogether. Thus, one problem in integrating information sources is to identify possibly different designators of the same entity. Data cleansing is the process of purging databases of inaccurate or inconsistent data. The data is typically manipulated into a form that is useful for other tasks, such as data mining. This paper addresses the data cleansing problem of detecting database records that are approximate duplicates, but not exact duplicates. An efficient algorithm is presented which combines three key ideas. First, the Smith-Waterman algorithm for computing the minimum edit-distance is used as a domain-independent method to recognize pairs of approximately duplicate records.
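To make the first idea concrete, the following is a minimal Python sketch of Smith-Waterman-style local alignment scoring applied to two record strings. It is an illustration only, not the paper's implementation: the function names `smith_waterman_score` and `similarity`, the scoring values (match +2, mismatch -1, linear gap -1), the lowercasing step, and the normalization by the shorter string's length are all assumptions chosen for the example.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local-alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table. Scoring values are illustrative
    assumptions, not parameters taken from the paper.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def similarity(a: str, b: str, match: int = 2) -> float:
    """Scale the score to [0, 1] using the best score the shorter string could reach.

    This normalization is a hypothetical choice made for the sketch.
    """
    if not a or not b:
        return 0.0
    score = smith_waterman_score(a.lower(), b.lower(), match=match)
    return score / (match * min(len(a), len(b)))
```

Under these assumptions, two records could be flagged as candidate approximate duplicates when `similarity` exceeds some threshold. The threshold would have to be tuned on the data at hand; the abstract does not specify one.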







