Trio: a system for integrated management of data, accuracy, and lineage
- Presented at CIDR 2005
Cited by 174 (11 self)
Trio is a new database system that manages not only data, but also the accuracy and lineage of the data. Inexact (uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise!) databases have been proposed in the past, and the lineage problem has also been studied. The goals of the Trio project are to combine and distill previous work into a simple and usable model, design a query language as an understandable extension to SQL, and most importantly build a working system: one that augments conventional data management with both accuracy and lineage as an integral part of the data. This paper provides numerous motivating applications for Trio and lays out preliminary plans for the data model, query language, and prototype system.
Efficient top-k query evaluation on probabilistic data
- In ICDE, 2007
Cited by 106 (26 self)
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, computed approximate probabilities, or failed to scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run several Monte-Carlo simulations in parallel, one for each candidate answer, and approximate each probability only to the extent needed to compute the top-k answers correctly. The algorithm is in a certain sense provably optimal and scales to large databases: we have measured running times of 5 to 50 seconds for complex SQL queries over a large database (10M tuples, of which 6M are probabilistic). Additional contributions of the paper include several optimization techniques, and a simple data model for probabilistic data that achieves completeness by using SQL views.
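The idea of refining each candidate's probability only as far as the top-k decision requires can be illustrated with a small sketch. This is not the paper's algorithm: it assumes each candidate answer comes with a hypothetical zero-argument sampler that returns True with the answer's (unknown) probability, it uses plain Hoeffding confidence intervals, and the names `make_sampler` and `top_k_by_simulation` are invented for illustration.

```python
import math
import random


def make_sampler(p, rng):
    """Hypothetical stand-in for one Monte-Carlo trial of a candidate answer."""
    return lambda: rng.random() < p


def top_k_by_simulation(samplers, k, delta=0.01, batch=50, max_rounds=1000):
    """Return the names of the k most probable candidates, most probable first.

    Only candidates whose confidence interval still straddles the boundary
    between the current top k and the rest receive further trials.
    """
    names = list(samplers)
    hits = {n: 0 for n in names}
    trials = {n: 0 for n in names}
    active = list(names)

    def interval(n):
        # Hoeffding interval around the empirical frequency.
        p = hits[n] / trials[n]
        w = math.sqrt(math.log(2 / delta) / (2 * trials[n]))
        return max(0.0, p - w), min(1.0, p + w)

    top = names[:k]
    for _ in range(max_rounds):
        for n in active:
            for _ in range(batch):
                hits[n] += int(samplers[n]())
            trials[n] += batch
        bounds = {n: interval(n) for n in names}
        ranked = sorted(names, key=lambda n: bounds[n][0], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if not rest:
            break
        low = min(bounds[n][0] for n in top)    # weakest lower bound in the top k
        high = max(bounds[n][1] for n in rest)  # strongest upper bound outside it
        if low >= high:
            break  # intervals are separated: the top-k set is settled
        # Keep simulating only the candidates that overlap the contested region.
        active = [n for n in names if bounds[n][1] > low and bounds[n][0] < high]
    return sorted(top, key=lambda n: hits[n] / trials[n], reverse=True)


rng = random.Random(0)
candidates = {name: make_sampler(p, rng)
              for name, p in [("a", 0.9), ("b", 0.5), ("c", 0.1)]}
print(top_k_by_simulation(candidates, k=2))
```

Candidates that are clearly inside or clearly outside the top k stop consuming trials early, which is the saving the abstract describes; the fixed seed only makes the demonstration repeatable.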
Exploiting relationships for domain-independent data cleaning
- 2005
Cited by 59 (15 self)
In this paper we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and the traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real datasets and over synthetic datasets show that analysis of relationships significantly improves the quality of the result.
Collective entity resolution in relational data
- ACM Transactions on Knowledge Discovery from Data (TKDD), 2006
Cited by 56 (7 self)
Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may co-occur. In these cases, collective entity resolution, in which entities for co-occurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
A Latent Dirichlet Model for Unsupervised Entity Resolution
- SIAM International Conference on Data Mining, 2006
Cited by 53 (5 self)
Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it a) is generative, b) does not make pairwise decisions, and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.
A Platform for Personal Information Management and Integration
Cited by 46 (4 self)
The explosion of the amount of information available in digital form has made search a hot research topic for the Information Management Community. While most of the research on search is focused on the WWW, individual computer users have developed their own vast collections of data on their desktops, and these collections are in critical need of good search tools. We describe the Semex System, which offers users a flexible platform for personal information management. Semex has two main goals. The first goal is to enable browsing personal information by semantically meaningful associations. The challenge is to automatically create such associations between data items on one’s desktop, and to create enough of them so Semex becomes an indispensable tool. Our second goal is to leverage the personal information space we created to increase users’ productivity. As our first target, Semex leverages the personal information to enable lightweight information integration tasks that are discouragingly difficult to perform with today’s tools.
10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information
- 2006
Cited by 46 (6 self)
Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this paper, we present a decomposition-based approach to addressing this problem. We introduce world-set decompositions (WSDs), a space-efficient formalism for representing any finite set of possible worlds over relational databases. WSDs are therefore a strong representation system for any relational query language. We study the problem of efficiently evaluating relational algebra queries on sets of worlds represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.
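The space saving behind a decomposition-based representation can be shown with a toy example. The sketch below is an assumption-laden illustration, not the paper's data structure: a world set is stored as a list of independent components, each component lists alternative partial assignments to some fields, and a world is obtained by picking one alternative per component. The field names and the functions `count_worlds` and `enumerate_worlds` are hypothetical.

```python
from itertools import product
from math import prod


def count_worlds(components):
    """Number of represented worlds: the product of the component sizes."""
    return prod(len(alternatives) for alternatives in components)


def enumerate_worlds(components):
    """Yield each world as a dict by choosing one alternative per component."""
    for choice in product(*components):
        world = {}
        for partial_assignment in choice:
            world.update(partial_assignment)
        yield world


# Two census-style records with uncertain fields (made-up values).
components = [
    # The two ssn fields are correlated, so they share one component.
    [{"t1.ssn": 185, "t2.ssn": 186}, {"t1.ssn": 785, "t2.ssn": 185}],
    # Each remaining uncertain field varies independently of the others.
    [{"t1.status": "single"}, {"t1.status": "married"}],
    [{"t2.status": "single"}, {"t2.status": "widowed"}, {"t2.status": "married"}],
    # A certain field is a component with a single alternative.
    [{"t1.name": "Smith"}],
]

print(count_worlds(components))  # 2 * 2 * 3 * 1 worlds from 8 stored alternatives
for world in enumerate_worlds(components):
    print(world)
```

The stored size grows with the sum of the component sizes while the number of worlds grows with their product, which is how a compact structure can stand for an enormous world set when most fields are independent.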
Robust Identification of Fuzzy Duplicates
- In ICDE, 2005
Cited by 43 (0 self)
Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach
- 2004
Cited by 41 (12 self)
To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this "deep Web," query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. Unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm considers both positive and negative correlations, especially the subtlety of negative correlations, due to their special importance in schema matching. This leads to the introduction of a new correlation measure, H-measure, distinct from those proposed in previous work. We evaluate our approach extensively and the results show good accuracy for discovering complex matchings.
Efficient similarity joins for near duplicate detection
- In WWW, 2008
Cited by 32 (5 self)
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
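The prefix filtering principle that this abstract builds on can be sketched as follows. The code covers only the baseline principle, not the new filtering techniques the paper proposes, and it assumes Jaccard similarity over token sets; the function names and sample records are made up. Tokens are sorted by a global order (rarest first), and two records can reach the threshold only if their short prefixes share a token, so only such pairs are verified.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return sorted index pairs (i, j), i < j, with Jaccard >= threshold."""
    token_sets = [set(r) for r in records]

    # Global token order: ascending document frequency, ties broken by token.
    freq = defaultdict(int)
    for tokens in token_sets:
        for tok in tokens:
            freq[tok] += 1
    ordered = [sorted(tokens, key=lambda t: (freq[t], t)) for tokens in token_sets]

    index = defaultdict(list)  # prefix token -> ids of records indexed so far
    results = []
    for i, tokens in enumerate(ordered):
        if not tokens:
            continue
        # A qualifying partner must share at least ceil(threshold * |x|) tokens,
        # so one shared token must fall within the first |x| - that + 1 tokens.
        # The small epsilon guards against floating-point overshoot in ceil.
        needed = math.ceil(threshold * len(tokens) - 1e-9)
        prefix = tokens[: len(tokens) - needed + 1]
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])
            index[tok].append(i)
        # Verification step: compute the exact similarity for candidates only.
        for j in candidates:
            if jaccard(token_sets[i], token_sets[j]) >= threshold:
                results.append((j, i))
    return sorted(results)


records = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["x", "y"],
    ["x", "y", "z"],
    ["q"],
]
print(prefix_filter_join(records, 0.6))
```

Ordering tokens by ascending frequency keeps the inverted lists probed by the prefixes short, so fewer candidate pairs reach the verification step.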

