Optimizing equijoin queries in distributed databases where relations are hash partitioned
- ACM TODS
, 1991
Cited by 12 (0 self)
Consider the class of distributed database systems consisting of a set of nodes connected by a high-bandwidth network. Each node consists of a processor, a random access memory, and a slower but much larger memory such as a disk. There is no shared memory among the nodes. The data are horizontally partitioned, often using a hash function. Such a description characterizes many parallel or distributed database systems that have recently been proposed, both commercial and academic. We study the optimization problem that arises when the query processor must repartition the relations and intermediate results participating in a multijoin query. Using estimates of the sizes of intermediate relations, we show (1) optimum solutions for closed chain queries; (2) the NP-completeness of the optimization problem for star, tree, and general graph queries; and (3) effective heuristics for these hard cases. Our general approach and many of our results extend to other attribute partitioning schemes, for example, sort-partitioning on attributes, and to partitioned object databases.
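The repartitioning step this abstract refers to can be illustrated with a small sketch. It is not the paper's optimizer: it only shows why a relation that is hash partitioned on the wrong attribute must be shipped between nodes before a local equijoin can run. The function names, the modulo partitioning function, and the dictionary row format are all assumptions made for illustration.

```python
from collections import defaultdict

def node_of(value, num_nodes):
    # Hypothetical partitioning function: integer key modulo the node count.
    return value % num_nodes

def partition(rows, attr, num_nodes):
    """Place each row (a dict) on the node selected by hashing row[attr]."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_of(row[attr], num_nodes)].append(row)
    return nodes

def repartition(nodes, attr, num_nodes):
    """Re-hash every row on attr; return the new placement and rows shipped."""
    new_nodes = defaultdict(list)
    moved = 0
    for src, rows in nodes.items():
        for row in rows:
            dst = node_of(row[attr], num_nodes)
            if dst != src:
                moved += 1  # this row crosses the network
            new_nodes[dst].append(row)
    return new_nodes, moved

def local_join(r_nodes, s_nodes, attr, num_nodes):
    """Equijoin co-partitioned fragments node by node using a hash table."""
    out = []
    for n in range(num_nodes):
        index = defaultdict(list)
        for r in r_nodes.get(n, []):
            index[r[attr]].append(r)
        for s in s_nodes.get(n, []):
            for r in index.get(s[attr], []):
                out.append({**r, **s})
    return out
```

In this sketch, a relation already partitioned on the join attribute costs nothing to join locally, while one partitioned on another attribute pays `moved` row transfers first. Choosing which relations and intermediate results to repartition, and on which attributes, is the optimization problem the abstract describes.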
Design and Evaluation of Smart Disk Architecture for DSS Commercial Workloads
- in Proceedings of the 2000 International Conference on Parallel Processing
, 2000
Cited by 11 (1 self)
The requirements for storage space and computational power of large-scale applications are increasing rapidly. Clusters seem to be the most attractive architecture for such applications, due to their low costs and high scalability. On the other hand, smart disk systems, with their large storage capacities and growing computational power, are becoming increasingly popular. In this work, we compare the performance of these architectures with a single host-based system using representative queries from Decision Support System (DSS) databases. We show how to implement individual database operations in the smart disk system and also show how to optimize the execution of the whole query by bundling frequently occurring operations together and executing the bundle in a single invocation. Besides decreasing the overall execution time, operation bundling also offers an easy-to-program and easy-to-use interface to access the data on smart disks. We also present a protocol for minimizing the communication time in the smart disk based system. To measure the response times, we have developed DBsim, an accurate simulator which can simulate the database operations for the single host-based, cluster-based and smart disk based systems. Using this simulator, we illustrate that the smart disk architecture offers substantial benefits in terms of overall query execution times of the TPC-D benchmark suite. In particular, the average response time of the smart disk architecture for the representative queries from the TPC-D benchmark in our base configuration is 71% smaller than the response time on the single host-based system and 4.2% smaller than the response time on the fastest cluster architecture. We also demonstrate the effectiveness of the operation bundling.
Duplicate detection in click streams
- In WWW ’05: Proceedings of the 14th international conference on World Wide Web
, 2005
Cited by 10 (1 self)
We consider the problem of finding duplicates in data streams. Duplicate detection in data streams is utilized in various applications including fraud detection. We develop a solution based on Bloom Filters [9], and discuss the space and time requirements for running the proposed algorithm in the contexts of both sliding and landmark stream windows. We run a comprehensive set of experiments, using both real and synthetic click streams, to evaluate the performance of the proposed solution. The results demonstrate that the proposed solution yields extremely low error rates.
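As a rough illustration of the landmark-window case, the sketch below flags repeated clicks with a Bloom filter that is discarded and rebuilt at every landmark. It is not the paper's algorithm, and the sliding-window case is not covered. The class and function names, the filter size, the number of hash functions, and the use of seeded SHA-256 digests as hash functions are assumptions chosen for the example.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def check_and_add(self, item):
        """Return True if item was probably seen before, then record it."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen

def flag_duplicates(clicks, window, num_bits=1 << 16, num_hashes=4):
    """Yield (click, is_duplicate); the filter is reset every `window` clicks."""
    bloom = BloomFilter(num_bits, num_hashes)
    for i, click in enumerate(clicks):
        if i > 0 and i % window == 0:
            bloom = BloomFilter(num_bits, num_hashes)  # new landmark
        yield click, bloom.check_and_add(click)
```

A click that really is a repeat within the current window is always flagged. A first-time click can occasionally be flagged by mistake (a false positive), and the likelihood of that grows as the filter fills up.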
Nomenclator Descriptive Query Optimization for Large X.500 Environments
- ACM SIGCOMM Symposium on Communications Architectures and Protocols
, 1991
Cited by 9 (3 self)
Nomenclator is an architecture for providing efficient descriptive (attribute-based) naming in a large internet environment. As a test of the basic design, we have built a Nomenclator prototype that uses X.500 as its underlying data repository. X.500 SEARCH queries that previously took several minutes can, in many cases, be answered in a matter of seconds. Our system improves descriptive query performance by trimming branches of the X.500 directory tree from the search. These tree-trimming techniques are part of an active catalog that constrains the search space as needed during query processing. The active catalog provides information about the data distribution (meta-data) to constrain query processing on demand. Nomenclator caches both data (responses to queries) and meta-data (data distribution information, tree-trimming techniques, data access techniques) to speed future queries. Nomenclator relieves users of the need to understand the structure of the name space to locate objec...
Generalized hash teams for join and group-by
- In Proc. of the 25th VLDB Conference
, 1999
Cited by 8 (2 self)
We propose a new class of algorithms that can be used to speed up the execution of multi-way join queries or of queries that involve one or more joins and a group-by. These new evaluation techniques make it possible to perform several hash-based operations (join and grouping) in one pass without repartitioning intermediate results. These techniques work particularly well for joining hierarchical structures, e.g., for evaluating functional join chains along key/foreign-key relationships. The idea is to generalize the concept of hash teams as proposed by Graefe et al. [GBC98] by indirectly partitioning the input data. Indirect partitioning means partitioning the input data on an attribute that is not directly needed for the next hash-based operation, and it involves the construction of bitmaps to approximate the partitioning for the attribute that is needed in the next hash-based operation. Our performance experiments show that such generalized hash teams perform significantly better than conventional strategies for many common classes of decision support queries.
Descriptive Name Services For Large Internets
, 1993
Cited by 8 (2 self)
This thesis addresses the challenge of locating people, resources, and other objects in the global Internet. As the Internet grows beyond a million hosts in tens of thousands of organizations, it is increasingly difficult to locate any particular object. Hierarchical name services are frustrating, because users must guess the unique names for objects or navigate the name space to find information. Descriptive (i.e. relational) name services offer the promise of simple resource location through a non-procedural query language. Users locate resources by describing resource attributes. This thesis makes the promise of descriptive name services real by providing fast query processing in large internets. The key to speed in descriptive query processing is constraining the search space using two new techniques, called an active catalog and meta-data caching. The active catalog constrains the search space for a query by returning a list of data repositories where the answer to the query is li...
PERF Join: An Alternative To Two-way Semijoin And Bloomjoin
- In Proceedings of the International Conference on Information and Knowledge Management (CIKM)
, 1995
Cited by 7 (1 self)
This paper presents "Positionally Encoded Record Filters" (PERFs) and describes their use in a distributed query processing technique called PERF join. A PERF is a novel two-way join reduction implementation primitive. While having the same storage and transmission efficiency as a hash filter (e.g., Bloom Filter), a PERF is based on the relation tuple scan order instead of hashing. Hence it does not suffer the loss of join information that hash collisions cause. Using the query response time measured in terms of network cost as a comparison criterion, we demonstrate through analytical studies that PERF join performs significantly better than two-way Bloomjoin and two-way semijoin variants under a wide range of relevant cost parameter values. For the large number of distributed query processing algorithms relying on Bloomjoin or semijoin variants to reduce their network cost, we can sometimes gain an instant improvement in their response time by switching to PERF join instead. Other sa...
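The abstract gives only the outline of a PERF, so the following is a sketch under stated assumptions rather than the paper's procedure. It assumes that site R ships its join-attribute values in scan order, that site S answers with one bit per received position saying whether that value has a match, and that R then keeps the tuples whose bit is set. The function names and the dictionary row format are hypothetical.

```python
def build_perf(received_keys, s_rows, attr):
    """At site S: one bit per received key, in the order R sent them."""
    s_keys = {row[attr] for row in s_rows}
    return [key in s_keys for key in received_keys]

def perf_reduce(r_rows, s_rows, attr):
    """Two-way reduction of R and S on attr using a positional bit vector."""
    # Step 1 (R to S): join-attribute values in R's scan order.
    sent_keys = [row[attr] for row in r_rows]

    # Step 2 (at S): reduce S, then encode matches by position, not by hash.
    sent_set = set(sent_keys)
    s_reduced = [row for row in s_rows if row[attr] in sent_set]
    perf = build_perf(sent_keys, s_rows, attr)

    # Step 3 (S to R): bit i refers to the i-th tuple R scanned,
    # so the reduction of R is exact, with no false positives.
    r_reduced = [row for row, bit in zip(r_rows, perf) if bit]
    return r_reduced, s_reduced, perf
```

Because each bit is tied to a tuple position instead of a hash bucket, two different join values can never share a bit, which matches the abstract's point about avoiding information loss from hash collisions.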
Autonomous Disks for Advanced Database Applications
- In Proc. of International Symposium on Database Applications in Non-Traditional Environments (DANTE’99)
, 1999
Cited by 6 (3 self)
The scalability and reliability of secondary storage systems are their most significant aspects for advanced database applications. Research on high-function disks has recently attracted a great deal of attention because technological progress now allows disk-resident data processing. This capability is not only useful for executing application programs on the disk, but is also suited for controlling distributed disks so they are scalable and reliable. In this paper, we propose autonomous disks in the network environment by using the disk-resident data processing facility. A set of autonomous disks is configured as a cluster in a network, and data is distributed within the cluster, to be accessed uniformly by using a distributed directory. The disks accept simultaneous accesses from multiple hosts via a network, and handle data distribution and load skews. They are also able to tolerate disk failures and some software errors of disk controllers, and can reconfigure the cluster after th...
Improving distributed join efficiency with extended bloom filter operations
- In 21st International Conference on Advanced Information Networking and Applications (AINA-07). IEEE Computer Society
, 2007
Cited by 6 (4 self)
Bloom filter based algorithms have proven successful as a very efficient technique to reduce the communication costs of database joins in a distributed setting. However, the full potential of Bloom filters has not yet been exploited. Especially in the case of multi-joins, where the data is distributed among several sites, additional optimization opportunities arise, which require new Bloom filter operations and computations. In this paper, we present these extensions and point out how they improve the performance of such distributed joins. While the paper focuses on efficient join computation, the described extensions are applicable to a wide range of uses where Bloom filters are employed for compressed set representation.
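The abstract does not say which operations the paper adds, so the sketch below shows only the two standard set-like operations on Bloom filters that share a size and hash functions: bitwise OR for union and bitwise AND for an approximate intersection. Whether these correspond to the paper's extensions is an assumption. The filter size, hash count, and function names are illustrative choices.

```python
import hashlib

NUM_BITS = 1024   # assumed filter size, identical at every site
NUM_HASHES = 3    # assumed number of hash functions

def positions(item):
    """Bit positions for an item, derived from seeded SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big")
        % NUM_BITS
        for seed in range(NUM_HASHES)
    ]

def bloom(items):
    """Build a filter, using a Python int as the bit vector."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits

def might_contain(bits, item):
    return all((bits >> pos) & 1 for pos in positions(item))

def union(a, b):
    # Exactly the filter that would be built from the union of both sets.
    return a | b

def intersection(a, b):
    # Never misses a common element, but may admit extra false positives.
    return a & b
```

In a three-site join on a common key, the sites holding S and T could each send a filter, and the site holding R could probe `intersection(bloom(s_keys), bloom(t_keys))` to drop tuples that cannot appear in the final result before shipping anything.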
On Line Processing of Compacted Relations
- In 8th Int. Conference on Very Large Data Bases (VLDB)
, 1982
Cited by 6 (1 self)
Most data base machines use some kind of "filter" that performs unary relational operators (selection and projection) on relations [1 to 7]. These filters operate "on the fly", that is, at the speed of the disk, while the relation is being transferred into main memory. Processing time being proportional to relation size, it is therefore important to represent data in the most compacted way. In this paper we address the problem of satisfying the two seemingly contradictory requirements: i) finding an "optimal" compaction scheme ii) processing optimally compacted relations on

