Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching
2000
Cited by 200 (10 self)
Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.
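The canopy idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Jaccard-based `cheap_distance`, the two thresholds `t_loose` and `t_tight`, and the function names are all assumptions chosen for illustration. Points are grouped into overlapping canopies using only the cheap distance, and the expensive exact distance would then be computed only for the pairs returned by `candidate_pairs`.

```python
from itertools import combinations


def cheap_distance(a, b):
    """Cheap, approximate distance: 1 minus the Jaccard overlap of two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def make_canopies(points, t_loose, t_tight, distance=cheap_distance):
    """Group point indices into overlapping canopies.

    t_loose and t_tight are hypothetical thresholds (t_loose >= t_tight):
    points within t_loose of a center join its canopy, and points within
    t_tight are no longer eligible to start a canopy of their own.
    """
    if t_tight > t_loose:
        raise ValueError("t_tight must not exceed t_loose")
    remaining = set(range(len(points)))
    canopies = []
    while remaining:
        center = min(remaining)  # deterministic choice of the next center
        dists = [distance(points[center], p) for p in points]
        canopies.append([i for i, d in enumerate(dists) if d <= t_loose])
        remaining -= {i for i, d in enumerate(dists) if d <= t_tight}
        remaining.discard(center)  # guarantees termination
    return canopies


def candidate_pairs(canopies):
    """Pairs of point indices that share at least one canopy."""
    pairs = set()
    for members in canopies:
        pairs.update(combinations(sorted(members), 2))
    return pairs


if __name__ == "__main__":
    citations = [
        {"canopy", "clustering", "mccallum"},
        {"canopy", "clustering", "nigam"},
        {"fuzzy", "duplicates", "warehouse"},
        {"fuzzy", "duplicates", "icde"},
    ]
    canopies = make_canopies(citations, t_loose=0.6, t_tight=0.5)
    print(canopies)
    print(sorted(candidate_pairs(canopies)))
```

In this toy run only two of the six possible pairs share a canopy, so only those two would need an exact distance computation.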
Eliminating Fuzzy Duplicates in Data Warehouses
In VLDB, 2002
Cited by 97 (2 self)
The duplicate elimination problem of detecting multiple tuples that describe the same real-world entity is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high-quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration
2002
Cited by 96 (6 self)
Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
Data Integration Using Similarity Joins and a Word-Based Information Representation Language
ACM Transactions on Information Systems, 2000
Hardening Soft Information Sources
2000
Cited by 50 (0 self)
The web contains a large quantity of unstructured information. In many cases, it is possible to heuristically extract structured information, but the resulting databases are "soft": they contain inconsistencies and duplication, and lack unique, consistently-used object identifiers. Examples include large bibliographic databases harvested from raw scientific papers or databases constructed by merging heterogeneous "hard" databases. Here we formally model a soft database as a noisy version of some unknown hard database. We then consider the hardening problem, i.e., the problem of inferring the most likely underlying hard database given a particular soft database. A key feature of our approach is that hardening is global: many sources of evidence for a given hard fact are taken into account. We formulate hardening as an optimization problem and give a nontrivial nearly linear time algorithm for finding a local optimum.
Robust Identification of Fuzzy Duplicates
In ICDE, 2005
Cited by 43 (0 self)
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
WHIRL: A Word-based Information Representation Language
Artificial Intelligence, 1999
Cited by 24 (1 self)
We describe WHIRL, an "information representation language" that synergistically combines properties of logic-based and text-based representation systems. WHIRL is a subset of Datalog that has been extended by introducing an atomic type for textual entities, an atomic operation for computing textual similarity, and a "soft" semantics; that is, inferences in WHIRL are associated with numeric scores, and presented to the user in decreasing order by score. This paper briefly describes WHIRL, and then surveys a number of applications. We show that WHIRL strictly generalizes both ranked retrieval of documents, and logical deduction; that non-trivial queries about large databases can be answered efficiently; that WHIRL can be used to accurately integrate data from heterogeneous information sources, such as those found on the Web; that WHIRL can be used effectively for inductive classification of text; and finally, that WHIRL can be used to semi-automatically generate extraction programs for structured documents.
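The "soft" semantics described above, where matches carry numeric scores and are returned in decreasing order by score, can be illustrated with a small similarity join in Python. This is only a sketch and not WHIRL itself: the TF-IDF weighting, the cosine measure, the whitespace tokenizer, and the names `tfidf_vectors` and `soft_join` are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Unit-length TF-IDF vectors (dicts) for whitespace-tokenized strings."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            t: (1.0 + math.log(c)) * math.log(1.0 + n / doc_freq[t])
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join(left, right):
    """Return (left, right, score) triples with score > 0, best matches first."""
    vectors = tfidf_vectors(list(left) + list(right))
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    scored = [
        (a, b, cosine(u, v))
        for a, u in zip(left, left_vecs)
        for b, v in zip(right, right_vecs)
    ]
    scored = [row for row in scored if row[2] > 0.0]
    return sorted(scored, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    left = ["acme widgets inc", "globex corporation"]
    right = ["globex corp", "acme widgets"]
    for a, b, score in soft_join(left, right):
        print(f"{score:.3f}  {a!r} ~ {b!r}")
```

Comparing every pair is quadratic in the number of names; the sketch ignores the efficiency techniques the abstract refers to.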
String Edit Analysis for Merging Databases
In KDD 2000 Workshop on Text Mining, 2000
Cited by 13 (0 self)
The first step prior to data mining is often to merge databases from different sources. Entries in these databases, or descriptions retrieved using information extraction, may use significantly different vocabularies, so one often needs to determine whether similar descriptions refer to the same item or to different items (e.g., people or goods). String edit distance is an elegant way of defining the degree of similarity between entries and can be efficiently computed using dynamic programming (Ristad and Yianilos, 1977). However, in order to achieve reasonable accuracy, most real problems require the use of extended sets of edit rules with associated costs that are tuned specifically to each data set. We present a flexible approach to string edit distance, which can be automatically tuned to different data sets and can use synonym dictionaries. Dynamic programming is used to calculate the edit distance between a pair of strings based on a set of string edit rules, including a new edit rule that allows words and phrases to be deleted or substituted. A genetic algorithm is used to learn costs corresponding to each edit rule based on a small set of labeled training data. Deleting contentless words like "method" and substituting synonyms such as "ibuprofen" for "Motrin" significantly increases the algorithm’s accuracy (from 80% to 90% on a difficult sample medical data set) when costs are correctly tuned. This string edit-based matching tool is easily adapted for a variety of different cases when one needs to recognize which text strings from different information sources refer to the same item, such as a person, address, medical procedure or product.
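The dynamic-programming computation of edit distance mentioned above can be sketched in Python as follows. This is the plain textbook recurrence with three fixed per-operation costs, not the paper's extended rule set: the learned costs, synonym dictionaries, and phrase-level rules are not modeled, and the function name and cost parameters are assumptions for illustration.

```python
def edit_distance(source, target, insert_cost=1.0, delete_cost=1.0,
                  substitute_cost=1.0):
    """Minimum total cost of edits turning `source` into `target`.

    Works on any two sequences, so it can compare strings character by
    character or token lists word by word.
    """
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + delete_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + insert_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,      # drop source[i-1]
                dist[i][j - 1] + insert_cost,      # add target[j-1]
                dist[i - 1][j - 1] + (0.0 if same else substitute_cost),
            )
    return dist[-1][-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    # Word-level comparison: one token substitution separates these lists.
    print(edit_distance("ibuprofen tablets".split(), "Motrin tablets".split()))
```

Tuning would amount to choosing the cost arguments per data set; the sketch leaves them as plain parameters.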

