Efficient top-k query evaluation on probabilistic data
- in ICDE, 2007
Cited by 106 (26 self)
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, or computed approximate probabilities, or did not scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run several Monte-Carlo simulations in parallel, one for each candidate answer, and approximate each probability only to the extent needed to correctly compute the top-k answers. The algorithm is in a certain sense provably optimal and scales to large databases: we have measured running times of 5 to 50 seconds for complex SQL queries over a large database (10M tuples, of which 6M are probabilistic). Additional contributions of the paper include several optimization techniques, and a simple data model for probabilistic data that achieves completeness by using SQL views.
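The idea of simulating each candidate answer only as far as the top-k decision requires can be illustrated with a small Python sketch. This is not the paper's algorithm: it is a simplified stand-in that assumes each candidate exposes a hypothetical sampler returning True when the answer holds in one randomly drawn possible world, and it uses plain Hoeffding confidence intervals (per candidate, with no union-bound correction). The names `bernoulli`, `hoeffding_radius` and `top_k_by_simulation` are invented for illustration.

```python
import math
import random


def bernoulli(p):
    """Hypothetical sampler: True with probability p (stands in for one simulated world)."""
    return lambda rng: rng.random() < p


def hoeffding_radius(trials, delta):
    """Half-width of a Hoeffding confidence interval after `trials` samples."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * trials))


def top_k_by_simulation(samplers, k, delta=0.05, batch=200, max_rounds=500, seed=0):
    """Return the k candidates with the highest estimated probability.

    Candidates whose interval is already clear of the top-k boundary
    stop being simulated; only the undecided ones receive more samples.
    """
    rng = random.Random(seed)
    names = list(samplers)
    hits = {a: 0 for a in names}
    trials = {a: 0 for a in names}
    active = list(names)  # every candidate is sampled in the first round

    def estimate(a):
        return hits[a] / trials[a]

    def radius(a):
        return hoeffding_radius(trials[a], delta)

    for _ in range(max_rounds):
        for a in active:
            hits[a] += sum(1 for _ in range(batch) if samplers[a](rng))
            trials[a] += batch
        ranked = sorted(names, key=estimate, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if not top or not rest:
            break  # nothing to separate
        lowest_top = min(estimate(a) - radius(a) for a in top)
        highest_rest = max(estimate(a) + radius(a) for a in rest)
        if lowest_top >= highest_rest:
            break  # intervals of top-k and the rest no longer overlap
        # Keep simulating only candidates whose interval crosses the boundary.
        active = [a for a in top if estimate(a) - radius(a) < highest_rest]
        active += [a for a in rest if estimate(a) + radius(a) > lowest_top]

    return sorted(names, key=estimate, reverse=True)[:k]
```

Calling `top_k_by_simulation({"a": bernoulli(0.9), "b": bernoulli(0.5), "c": bernoulli(0.1)}, k=1)` separates the leading candidate after very few samples, because its interval quickly clears the others; candidates with close probabilities would keep receiving samples until `max_rounds` is reached.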
Reference reconciliation in complex information spaces
- in SIGMOD, 2005
Cited by 88 (1 self)
Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values. A prime example of such a space is Personal Information Management, where the goal is to provide a coherent view of all the information on one’s desktop. Our reconciliation algorithm has three principal features. First, we exploit the associations between references to design new methods for reference comparison. Second, we propagate information between reconciliation decisions to accumulate positive and negative evidence. Third, we gradually enrich references by merging attribute values. Our experiments show that (1) we considerably improve precision and recall over standard methods on a diverse set of personal information datasets, and (2) there are advantages to using our algorithm even on a standard citation dataset benchmark.
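The third feature, gradually enriching references by merging attribute values, can be illustrated with a toy Python sketch. It is not the paper's algorithm: the matching rule (at least `min_shared` identical attribute values) and the function name `reconcile` are assumptions made for illustration, and the first two features (association-based comparison and evidence propagation) are not modelled.

```python
def reconcile(references, min_shared=2):
    """Greedily merge references that share at least `min_shared` attribute values.

    `references` maps a reference id to a set of attribute-value strings.
    Each merge pools the values of both sides, so a merged reference can
    match a third one that neither original matched on its own.
    """
    clusters = [({rid}, set(values)) for rid, values in references.items()]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ids_i, vals_i = clusters[i]
                ids_j, vals_j = clusters[j]
                if len(vals_i & vals_j) >= min_shared:
                    # Enrich: the merged reference carries both value sets.
                    combined = (ids_i | ids_j, vals_i | vals_j)
                    clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
                    clusters.append(combined)
                    merged = True
                    break
            if merged:
                break  # restart the scan over the updated cluster list
    return sorted(sorted(ids) for ids, _ in clusters)
```

For example, given A = {name:Smith, email:js@x.org, addr:1 Main St}, B = {name:Smith, email:js@x.org, phone:555} and C = {addr:1 Main St, phone:555}, C shares only one value with A and one with B, so it matches neither directly. Once A and B are merged, the enriched reference shares two values with C and all three end up in one cluster.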
A fast linkage detection scheme for multi-source information integration
- in ‘Web Information Retrieval and Integration’ (WIRI’05), 2005
Cited by 9 (0 self)
Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered to be one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Against this background, this paper proposes a fast and efficient method for linkage detection. The proposed approach has two features: first, it exploits a suffix array structure that enables linkage detection using variable-length n-grams. Second, it dynamically generates blocks of possibly associated records using ‘blocking keys’ extracted from already known reliable linkages. The paper also reports results from preliminary experiments in which the proposed method was applied to the integration of four bibliographic databases, which scale up to more than 10 million records.
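The general idea of blocking, comparing only records that share a key instead of enumerating all pairs, can be sketched in Python as follows. This is a simplified illustration, not the paper's method: it uses fixed-length character n-grams in an inverted index rather than a suffix array with variable-length n-grams, and it does not derive blocking keys from known linkages. The function names and the Jaccard threshold are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(text, n=3):
    """Set of character n-grams of the lowercased text with whitespace removed."""
    s = "".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def candidate_pairs(records, n=3, max_block=50):
    """Pairs of record ids that share at least one n-gram (one block per n-gram)."""
    index = defaultdict(list)
    for rid, text in records.items():
        for gram in ngrams(text, n):
            index[gram].append(rid)
    pairs = set()
    for ids in index.values():
        if len(ids) > max_block:
            continue  # very common n-grams make poor blocking keys
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link(records, threshold=0.6, n=3):
    """Score only the blocked candidate pairs and keep those above the threshold."""
    grams = {rid: ngrams(text, n) for rid, text in records.items()}
    return sorted(
        (a, b)
        for a, b in candidate_pairs(records, n)
        if jaccard(grams[a], grams[b]) >= threshold
    )
```

With records such as `{1: "Efficient top-k query evaluation", 2: "Efficient top-k query evalution", 3: "Personal name matching"}`, only records 1 and 2 fall into a common block, so record 3 is never compared against them.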
Towards a Benchmark for Instance Matching
Cited by 4 (0 self)
In the general field of knowledge interoperability and ontology matching, instance matching is a crucial task for several applications, from identity recognition to data integration. The aim of instance matching is to detect instances that refer to the same real-world object despite the differences among their descriptions. Algorithms and techniques for instance matching have been proposed in the literature; however, the problem of their evaluation is still open. Furthermore, a widely recognized problem in the Semantic Web in general is the lack of evaluation data. While OAEI (Ontology Alignment Evaluation Initiative) has provided a reference benchmark for concept matching, evaluation data for instance matching are still scarce. In this paper, we provide a benchmark for instance matching, with the goal of taking into account the main requirements that instance matching algorithms should address.
Personal Name Matching: New Test Collections and a Social Network based Approach
Cited by 3 (0 self)
This paper gives an overview of Personal Name Matching. Personal name matching is of great importance for all applications that deal with personal names. The problem with personal names is that they are not unique, and sometimes many variations exist even for a single name. As a result, databases may on the one hand have several entries for one and the same person, and on the other hand have one entry for many different persons. Test collections are of great importance for the evaluation of Personal Name Matching algorithms. Therefore existing test collections are outlined and three new test collections, based on real-world bibliographic data, are presented. Additionally, state-of-the-art techniques as well as a new approach based on semantics are described.
Scalable Web Data Extraction for Online Market Intelligence
Cited by 2 (0 self)
Online market intelligence (OMI), in particular competitive intelligence for product pricing, is a very important application area for Web data extraction. However, OMI presents non-trivial challenges to data extraction technology. Sophisticated and highly parameterized navigation and extraction tasks are required. On-the-fly data cleansing is necessary in order to identify identical products from different suppliers. It must be possible to smoothly define data flow scenarios that merge and filter streams of extracted data stemming from several Web sites and store the resulting data in a data warehouse, where the data is subjected to market intelligence analytics. Finally, the system must be highly scalable, in order to be able to extract and process massive amounts of data in a short time. Lixto (www.lixto.com), a company offering data extraction tools and services, has been providing OMI solutions for several customers. In this paper we show how Lixto has tackled each of the above challenges by improving and extending its original data extraction software. Most importantly, we show how high scalability is achieved through cloud computing. This paper also features a case study from the computers and electronics market.
Reasoning About Approximate Match Query Results
Cited by 1 (1 self)
Join techniques deploying approximate match predicates are fundamental data cleaning operations. A variety of predicates have been utilized to quantify approximate match in such operations and some have been embedded in a declarative data cleaning framework. These techniques return pairs of tuples from both relations, tagged with a score signifying the degree of similarity between the tuples in the pair according to the specific approximate match predicate. In this paper we consider the problem of estimating various parameters on the output of declarative approximate join algorithms for planning purposes. Such algorithms are highly time consuming, so precise knowledge of the result size as well as its score distribution is a pressing concern. This knowledge aids decisions as to which operations are more promising for identifying highly similar tuples, which is a key operation for data cleaning. We propose solution strategies that fully comply with a declarative framework and analytically reason about the quality of the estimates we obtain as well as the performance of our strategies. We present the results of a detailed performance evaluation of all strategies proposed. Our experimental results validate our analytical expectations and shed additional light on the quality and performance of our estimation framework. Our study offers a set of simple, fully declarative techniques for this problem, which are readily deployed in the SPIDER declarative data cleaning system.
The HMatch 2.0 Suite for Ontology
In this paper, we present the HMatch 2.0 suite for flexible and tailored ontology matchmaking, focusing on its architectural features and on the evaluation results. Applications of HMatch 2.0 are also discussed, with special regard to ontology evolution issues in the context of the BOEMIE research project.

