The Merge/Purge Problem for Large Databases
In Proceedings of the 1995 ACM SIGMOD, 1995
Cited by 254 (3 self)
Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate this approach may work well in practice but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the Transitive Clos...
Trust Management for the Semantic Web
In Proceedings of the Second International Semantic Web Conference, 2003
Cited by 152 (3 self)
Though research on the Semantic Web has progressed at a steady pace, its promise has yet to be realized. One major difficulty is that, by its very nature, the Semantic Web is a large, uncensored system to which anyone may contribute. This raises
Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem
Data Mining and Knowledge Discovery, 1998
Cited by 151 (0 self)
The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining results of individual passes using transitive c...
Hierarchical encoded path views for path query processing: An optimal model and its performance evaluation
IEEE Transactions on Knowledge and Data Engineering, 1998
Cited by 53 (1 self)
Efficient path computation is essential for applications such as intelligent transportation systems (ITS) and network routing. In ITS navigation systems, many path requests can be submitted over the same, typically huge, transportation network within a small time window. While path precomputation (path view) would provide an efficient path query response, it raises three problems which must be addressed: 1) precomputed paths exceed the current computer main memory capacity for large networks; 2) disk-based solutions are too inefficient to meet the stringent requirements of these target applications; and 3) path views become too costly to update for large graphs (resulting in out-of-date query results). We propose a hierarchical encoded path view (HEPV) model that addresses all three problems. By hierarchically encoding partial paths, HEPV reduces the view encoding time, updating time and storage requirements beyond previously known path precomputation techniques, while significantly minimizing path retrieval time. We prove that paths retrieved over HEPV are optimal. We present complete solutions for all phases of the HEPV approach, including graph partitioning, hierarchy generation, path view encoding and updating, and path retrieval. In this paper, we also present an in-depth experimental evaluation of HEPV based on both synthetic and real GIS networks. Our results confirm that HEPV offers advantages over alternative path finding approaches in terms of performance and space efficiency. Index Terms: path queries, path view materialization, hierarchical path search, GIS databases, graph partitioning.
Diagnosis of Asynchronous Discrete Event Systems: Datalog to the Rescue!
In ACM PODS, 2005
Cited by 23 (6 self)
We consider query optimization techniques for data intensive P2P applications. We show how to adapt an old technique from deductive databases, namely Query-Sub-Query (QSQ), to a setting where autonomous and distributed peers share large volumes of interrelated data. We illustrate the technique with an important telecommunication problem, the diagnosis of distributed telecom systems. We show that (i) the problem can be modeled using Datalog programs, and (ii) it can benefit from the large battery of optimization techniques developed for Datalog. In particular, we show that a simple generic use of the extension of QSQ achieves an optimization as good as that previously provided by dedicated diagnosis algorithms. Furthermore, we show that it allows solving efficiently a much larger class of system analysis problems.
Hierarchical Optimization of Optimal Path Finding for Transportation Applications
In Proc. of ACM Conference on Information and Knowledge Management, 1996
Cited by 22 (3 self)
Efficient path query processing is a key requirement for advanced database applications including GIS (Geographic Information Systems) and ITS (Intelligent Transportation Systems). We study the problem in the context of automobile navigation systems where a large number of path requests can be submitted over the transportation network within a short period of time. To guarantee efficient response for path queries, we employ a path view materialization strategy for precomputing the best paths. We tackle the following three issues: (1) memory-resident solutions quickly exceed current computer storage capacity for networks of thousands of nodes, (2) disk-based solutions have been found inefficient to meet the stringent performance requirements, and (3) path views become too costly to update for large graphs. We propose the HEPV (Hierarchical Encoded Path View) approach that addresses these problems while guaranteeing the optimality of path retrieval. Our experimental results reveal that HEPV...
A survey of parallel execution strategies for transitive closure and logic programs
Distributed and Parallel Databases, 1993
Cited by 20 (5 self)
An important feature of database technology of the nineties is the use of parallelism for speeding up the execution of complex queries. This technology is being tested in several experimental database architectures and a few commercial systems for conventional select-project-join queries. In particular, hash-based fragmentation is used to distribute data to disks under the control of different processors in order to perform selections and joins in parallel. With the development of new query languages, and in particular with the definition of transitive closure queries and of more general logic programming queries, the new dimension of recursion has been added to query processing. Recursive queries are complex; at the same time, their regular structure is particularly suited for parallel execution, and parallelism may give a high efficiency gain. We survey the approaches to parallel execution of recursive queries that have been presented in the recent literature. We observe that research on parallel execution of recursive queries is separated into two distinct subareas, one focused on the transitive closure of Relational Algebra expressions, the other one focused on optimization of more general Datalog queries. Though the subareas seem radically different because of the approach and formalism used, they have many common features. This is not surprising, because most typical Datalog queries can be solved by means of the transitive closure of simple
Distributed Transitive Closure Computations: The Disconnection Set Approach
In Proc. 16th Int'l Conf. on VLDB, 1990
Cited by 17 (8 self)
This paper deals with one of the most common and important types of recursion: transitive closure. Since many real world problems reduce to generalized transitive closure computations, efficient computation is essential. To gain a significant speedup in processing, we consider distributed (i.e. parallel) computation. By fragmenting the data beforehand according to rules stemming from the application domain, queries can be split into several independent subqueries. These subqueries are computed in parallel on only a part of the data and are more specialized in the sense that extra selections are applied on each fragment. The disconnection set approach introduced in this paper...
Footnotes: Partial support from NFI, a Dutch research fund, and from the LOGIDATA+ project of C.N.R., Italy. Affiliations: Department of Applied Mathematics, University of Twente, P.O. Box 217, 7500 AE Enschede, the Netherlands; Computer Science Department, University of Twente; Dipartimento di Matematica, Università di Modena.
Implementation and performance evaluation of a parallel transitive closure algorithm on PRISMA/DB
1993
Cited by 8 (4 self)
This paper describes an experimental performance study of the parallel computation of transitive closure operations on a parallel database system. This work brings two research efforts together. The first is the development of an efficient execution strategy for the parallel computation of path problems, called the Disconnection Set Approach. The second is the development and implementation of a parallel, main-memory DBMS, called PRISMA/DB. Here, we report on the implementation of the disconnection set approach on PRISMA/DB, showing how the latter's design allowed us to easily extend the functionality of the system. It is shown that the parallel implementation of the disconnection set approach yields good performance characteristics, and that linear speedup with respect to a special-purpose single-processor algorithm is achieved. Finally, we describe a number of experiments that show to what extent data fragmentation issues influence the performance of the disconnection set approach.
Data Fragmentation for Parallel Transitive Closure Strategies
In Proceedings of the IEEE 9th International Conference on Data Engineering, 1993
Cited by 8 (1 self)
A topic that is currently inspiring a lot of research is parallel (distributed) computation of transitive closure queries. In [10] the disconnection set approach was introduced as an effective strategy for such a computation. It involves reformulating a transitive closure query on a relation into a number of transitive closure queries on smaller fragments; these queries can then execute independently on the fragments, without need for communication and without computing the same tuples at more than one processor. Now that such effective strategies have been developed, the next problem is that of developing adequate data fragmentation strategies for these approaches. This is a difficult problem, but of paramount importance to the success of these approaches. We discuss the issues that influence data fragmentation. We present a number of algorithms, each focusing on one of the important issues. We discuss the pros and cons of the algorithms, and we give some results of ...

