Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching, 2000
Cited by 200 (10 self)
Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea involves using a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Then clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical. Under reasonable assumptions about the cheap distance metric, this reduction in computational cost comes without any loss in clustering accuracy. Canopies can be applied to many domains and used with a variety of clustering approaches, including Greedy Agglomerative Clustering, K-means and Expectation-Maximization. We present experimental results on grouping bibliographic citations from the reference sections of research papers. Here the canopy approach reduces computation time over a traditional clustering approach by more than an order of magnitude and decreases error in comparison to a previously used algorithm by 25%.
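The canopy idea in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `build_canopies` and `candidate_pairs`, the two thresholds `loose` and `tight`, and the deterministic choice of centers are all assumptions of this sketch. It assumes `cheap_dist(x, x) == 0` and `loose >= tight >= 0`.

```python
from itertools import combinations


def build_canopies(points, cheap_dist, loose, tight):
    """Group point indices into overlapping canopies using a cheap distance.

    Every point within `loose` of a center joins that center's canopy.
    Every point within `tight` of a center is dropped from the list of
    future candidate centers (hypothetical two-threshold scheme).
    """
    assert loose >= tight >= 0
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]  # deterministic pick, for reproducibility
        canopy = [i for i in range(len(points))
                  if cheap_dist(points[center], points[i]) <= loose]
        canopies.append(canopy)
        # The center itself is at distance 0, so the list always shrinks.
        remaining = [i for i in remaining
                     if cheap_dist(points[center], points[i]) > tight]
    return canopies


def candidate_pairs(canopies):
    """Return the index pairs that share at least one canopy.

    Only these pairs would need the expensive, exact distance.
    """
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs


# Toy data: one-dimensional points, absolute difference as the cheap distance.
points = [0, 1, 2, 20, 21]
canopies = build_canopies(points, lambda a, b: abs(a - b), loose=3, tight=1)
pairs = candidate_pairs(canopies)
print(len(pairs))  # 4 candidate pairs instead of all 10
```

In this toy run the exact distance would be computed for 4 pairs rather than 10. Whether any true cluster is split depends on the cheap distance and the thresholds, which is the "reasonable assumptions" caveat in the abstract.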
Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, 1998
Cited by 193 (13 self)
Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIR...
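To make the vector-space notion of name similarity concrete, here is a small sketch of TF-IDF cosine similarity between name strings. It is not WHIRL itself: the whitespace tokenizer, the smoothed weighting `(1 + log tf) * log(1 + N/df)`, and the function names are assumptions chosen for illustration, and the example names are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per string."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many strings each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vec = {t: (1 + math.log(c)) * math.log(1 + n_docs / doc_freq[t])
               for t, c in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Normalize so that cosine similarity reduces to a dot product.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)


names = ["Carnegie Mellon University",
         "Carnegie Mellon Univ",
         "University of Pittsburgh"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Rare terms get higher weight than common ones, so two names that share distinctive tokens score higher than two that share only a frequent word.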
Efficient top-k query evaluation on probabilistic data - In ICDE, 2007
Cited by 106 (26 self)
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, computed approximate probabilities, or failed to scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run several Monte-Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to correctly compute the top-k answers. The algorithm is in a certain sense provably optimal and scales to large databases: we have measured running times of 5 to 50 seconds for complex SQL queries over a large database (10M tuples, of which 6M are probabilistic). Additional contributions of the paper include several optimization techniques, and a simple data model for probabilistic data that achieves completeness by using SQL views.
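The idea of refining each candidate's estimate only as far as needed can be illustrated with a simplified sketch. This is not the paper's algorithm: it uses Hoeffding confidence intervals, a fixed batch size, and a round limit, all of which are assumptions made here, and it only separates the top-k set without ranking inside it. The name `top_k_by_simulation` and the `samplers` interface (one zero-argument function per candidate, returning 0 or 1 per trial) are hypothetical.

```python
import math
import random


def top_k_by_simulation(samplers, k, batch=200, delta=0.05, max_rounds=500):
    """Estimate which k candidates have the highest success probability.

    Requires 1 <= k. Candidates whose confidence interval is already
    clearly inside or outside the top k are not simulated further.
    """
    names = list(samplers)
    if k >= len(names):
        return names
    hits = {a: 0 for a in names}
    trials = {a: 0 for a in names}

    def interval(a):
        if trials[a] == 0:
            return 0.0, 1.0
        mean = hits[a] / trials[a]
        # Hoeffding half-width, per candidate (no union bound in this sketch).
        half = math.sqrt(math.log(2 / delta) / (2 * trials[a]))
        return max(0.0, mean - half), min(1.0, mean + half)

    for _ in range(max_rounds):
        bounds = {a: interval(a) for a in names}
        ranked = sorted(names, key=lambda a: bounds[a][0], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        lowest_top = min(bounds[a][0] for a in top)
        # Outsiders whose upper bound still reaches into the top k.
        unsettled = [a for a in rest if bounds[a][1] > lowest_top]
        if not unsettled:
            return top
        highest_outside = max(bounds[a][1] for a in unsettled)
        # Insiders that an unsettled outsider could still overtake.
        unsettled += [a for a in top if bounds[a][0] < highest_outside]
        for a in unsettled:
            for _ in range(batch):
                hits[a] += samplers[a]()
                trials[a] += 1
    # Round limit reached: fall back to the current point estimates.
    return sorted(names, key=lambda a: hits[a] / max(trials[a], 1),
                  reverse=True)[:k]


rng = random.Random(0)
true_probs = {"a": 0.9, "b": 0.8, "c": 0.2, "d": 0.1}  # made-up values
samplers = {name: (lambda p=p: 1 if rng.random() < p else 0)
            for name, p in true_probs.items()}
print(sorted(top_k_by_simulation(samplers, k=2)))
```

When the probabilities are well separated, as in the made-up values here, the loop stops after very few batches; candidates near the boundary would receive most of the simulation effort.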
Eliminating Fuzzy Duplicates in Data Warehouses - In VLDB, 2002
Cited by 97 (2 self)
The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration, 2002
Cited by 96 (6 self)
Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
Data Integration Using Similarity Joins and a Word-Based Information Representation Language - ACM Transactions on Information Systems, 2000
Hardening Soft Information Sources, 2000
Cited by 50 (0 self)
The web contains a large quantity of unstructured information. In many cases, it is possible to heuristically extract structured information, but the resulting databases are "soft": they contain inconsistencies and duplication, and lack unique, consistently-used object identifiers. Examples include large bibliographic databases harvested from raw scientific papers or databases constructed by merging heterogeneous "hard" databases. Here we formally model a soft database as a noisy version of some unknown hard database. We then consider the hardening problem, i.e., the problem of inferring the most likely underlying hard database given a particular soft database. A key feature of our approach is that hardening is global: many sources of evidence for a given hard fact are taken into account. We formulate hardening as an optimization problem and give a nontrivial nearly linear time algorithm for finding a local optimum.
Robust Identification of Fuzzy Duplicates - In ICDE, 2005
Cited by 43 (0 self)
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
Learning to Match and Cluster Entity Names - In ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001
Cited by 28 (1 self)
Information retrieval is, in large part, the study of methods for assessing the similarity of pairs of documents. Document similarity metrics have been used for many tasks including ad hoc document retrieval, text classification [YC1994], and summarization [GC1998,SSMB1997]. Another problem area in which similarity metrics are central is record linkage (e.g., [KA1985]), where one wishes to determine if two database records taken from different source databases refer to the same entity. For instance, one might wish to determine if two database records from two different hospitals, each containing a patient's name, address, and insurance information, refer to the same person; as another example, one might wish to determine if two bibliography records, each containing a paper title, list of authors, and journal name, refer to the same publication. In both of these examples (and in many other practical cases) most of the record fields...
WHIRL: A Word-based Information Representation Language - Artificial Intelligence, 1999
Cited by 24 (1 self)
We describe WHIRL, an "information representation language" that synergistically combines properties of logic-based and text-based representation systems. WHIRL is a subset of Datalog that has been extended by introducing an atomic type for textual entities, an atomic operation for computing textual similarity, and a "soft" semantics; that is, inferences in WHIRL are associated with numeric scores, and presented to the user in decreasing order by score. This paper briefly describes WHIRL, and then surveys a number of applications. We show that WHIRL strictly generalizes both ranked retrieval of documents, and logical deduction; that non-trivial queries about large databases can be answered efficiently; that WHIRL can be used to accurately integrate data from heterogeneous information sources, such as those found on the Web; that WHIRL can be used effectively for inductive classification of text; and finally, that WHIRL can be used to semi-automatically generate extraction programs for structured documents.

