Identifying and Merging Related Bibliographic Records (1996)
| Venue: | MIT LCS Masters Thesis |
| Citations: | 35 - 0 self |
BibTeX
@TECHREPORT{Hylton96identifyingand,
author = {Jeremy A. Hylton},
title = {Identifying and Merging Related Bibliographic Records},
institution = {MIT LCS Masters Thesis},
year = {1996}
}
Abstract
Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on aut...
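
The clustering idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the trigram size, the Dice coefficient, the 0.8 threshold, the greedy pairwise grouping, and the names `ngrams`, `similarity`, `same_work`, and `cluster` are all assumptions chosen for illustration. The thesis uses a full-text search engine to find candidate matches, whereas this sketch simply compares each record against the first record of every existing cluster.

```python
import re


def ngrams(text, n=3):
    """Return the set of character n-grams of a normalized string."""
    # Lowercase and collapse punctuation/whitespace so that minor
    # cataloging variations have less effect on the comparison.
    norm = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    padded = f" {norm} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over n-gram sets, a value in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def same_work(r1, r2, threshold=0.8):
    """Treat two records as the same work if author AND title both match."""
    return (similarity(r1["title"], r2["title"]) >= threshold
            and similarity(r1["author"], r2["author"]) >= threshold)


def cluster(records, threshold=0.8):
    """Greedy clustering: add each record to the first matching cluster."""
    clusters = []
    for rec in records:
        for group in clusters:
            # Compare against the cluster's first record (an assumed shortcut).
            if same_work(rec, group[0], threshold):
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Hypothetical sample records; the second one has a typo in the title.
records = [
    {"author": "Jeremy A. Hylton",
     "title": "Identifying and Merging Related Bibliographic Records"},
    {"author": "Jeremy A. Hylton",
     "title": "Identifying and merging related bibliographic recrods"},
    {"author": "Donald E. Knuth",
     "title": "The Art of Computer Programming"},
]
clusters = cluster(records)
```

Requiring both fields to clear the threshold mirrors the abstract's rule that records are clustered only when author and title both match. The threshold trades recall against the risk of merging unrelated records and would need tuning on real data.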







