Interactive Deduplication using Active Learning
, 2002
Cited by 242 (5 self)
Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists.
We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number of instances needed to achieve high accuracy. We investigate various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
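As a rough illustration of how active learning can surface "challenging" training pairs, the sketch below uses plain uncertainty sampling: it asks a human to label the candidate pairs whose duplicate score is closest to 0.5. This is an assumption made for illustration, not the selection method of the system described above; the `similarity` scorer and the function names are hypothetical stand-ins for a trained classifier.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in duplicate score in [0, 1].

    A real system would use the probability output of a trained classifier.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def most_uncertain(pairs, score=similarity, k=2):
    """Return the k pairs whose score is closest to 0.5 (uncertainty sampling).

    These are the pairs the current model is least sure about, so a human
    label for them is expected to be the most informative.
    """
    return sorted(pairs, key=lambda p: abs(score(p[0], p[1]) - 0.5))[:k]
```

In an interactive loop, the selected pairs would be shown to a user, labeled as duplicate or non-duplicate, added to the training set, and the classifier retrained before the next round of selection.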
Identifying and Merging Related Bibliographic Records
- MIT LCS Masters Thesis
, 1996
Cited by 43 (0 self)
Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on aut...
PUB TYPE Reports Descriptive (141)-- Speeches/Meeting Papers (150)
This paper is based on the results of the study of the Work Group of Bibliographic Standards for the Greek union catalog, the first stage of Greek academic library union catalog development. The first section lists the objectives of the union catalog. The state of the art of Greek academic libraries is discussed in the second section. The lack of uniformity is identified as the main difficulty in setting up the union catalog. The next section addresses implementation models, and the fourth section describes two implementation phases (i.e., formation/homogeneity of the primary database and function/updating of the union catalog). Specifications required for the union catalog system are summarized in the fifth section, including records format, quality control of records, multiple records identification, and the data model. The sixth section considers standardization, including bibliographic standards, authorization of names and subjects, holdings information, and interlibrary loan. The importance of education and training
Identifying and Merging Related Bibliographic Records
, 1996
Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet.
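To make the idea of n-gram-based approximate matching concrete, here is a minimal sketch that compares two strings by the overlap of their character trigrams (Dice coefficient) and treats two records as the same work only when both author and title match. The trigram size, the 0.7 threshold, the padding scheme, and the names `ngrams`, `dice`, and `same_work` are all assumptions chosen for illustration; the abstract does not specify the thesis's actual matching function, and the full-text search step is not modeled here.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of the normalized, space-padded text."""
    normalized = " ".join(text.lower().split())
    pad = " " * (n - 1)
    padded = pad + normalized + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient of the two n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def same_work(r1: dict, r2: dict, threshold: float = 0.7) -> bool:
    """Assumed rule: both author and title must match approximately."""
    return (dice(r1["author"], r2["author"]) >= threshold
            and dice(r1["title"], r2["title"]) >= threshold)
```

Because a single typo changes only the few n-grams that overlap it, two spellings of the same title still share most of their n-grams, which is what lets this kind of matching tolerate errors and cataloging variations.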
Duplicate
, 2007
The current issue and full text archive of this journal is available at www.emeraldinsight.com/0737-8831.htm