CMD: A Multidimensional Declustering Method for Parallel Database Systems
- In Proceedings of the Int. Conf. on Very Large Data Bases
, 1992
Cited by 38 (4 self)
I/O parallelism appears to be a promising approach to achieving high performance in parallel database systems. In such systems, it is essential to decluster database files into fragments and spread them across multiple disks so that the DBMS software can exploit the I/O bandwidth reading and writing the disks in parallel. In this paper, we consider the problem of declustering multidimensional data on a parallel disk system. Since the multidimensional range query is the main work-horse for applications accessing such data, our aim is to provide efficient support for it. A new declustering method for parallel disk systems, called coordinate modulo distribution (CMD), is proposed. Our analysis shows that the method achieves optimum parallelism for a very high percentage of range queries on multidimensional data, if the distribution of data on each dimension is stationary. We have derived the exact conditions under which optimality is achieved. Also provided are the worst and average case bounds ...
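The abstract above does not spell out the CMD mapping itself. As a minimal sketch, assume the commonly cited formulation in which a grid cell is assigned to the disk given by the sum of its coordinates modulo the number of disks. The names `cmd_disk`, `range_query_cost`, and `optimal_cost` are hypothetical helpers for illustration, not the paper's notation.

```python
from itertools import product
from math import ceil, prod


def cmd_disk(cell, num_disks):
    """Assumed CMD rule: disk = (sum of the cell's grid coordinates) mod num_disks."""
    return sum(cell) % num_disks


def range_query_cost(lo, hi, num_disks):
    """Largest number of cells any single disk serves for the inclusive box [lo, hi]."""
    load = [0] * num_disks
    for cell in product(*(range(a, b + 1) for a, b in zip(lo, hi))):
        load[cmd_disk(cell, num_disks)] += 1
    return max(load)


def optimal_cost(lo, hi, num_disks):
    """Lower bound on the cost: cells spread perfectly evenly over the disks."""
    cells = prod(b - a + 1 for a, b in zip(lo, hi))
    return ceil(cells / num_disks)
```

Under this assumption and with 4 disks, a 4 x 2 query such as `(0, 0)`..`(3, 1)` costs 2 parallel accesses, which matches the optimum, while a 2 x 2 query such as `(0, 0)`..`(1, 1)` also costs 2 although the optimum is 1. This is consistent with the abstract's claim of optimality for many, but not all, range queries.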
Declustering Spatial Databases on a Multi-Computer Architecture
, 1996
Cited by 27 (2 self)
We present a technique to decluster a spatial access method on a shared-nothing multi-computer architecture [DGS+90]. We propose a software architecture with the R-tree as the underlying spatial access method, with its non-leaf levels on the 'master-server' and its leaf nodes distributed across the servers. The major contribution of our work is the study of the optimal capacity of leaf nodes, or 'chunk size' (or 'striping unit'): we express the response time on range queries as a function of the 'chunk size', and we show how to optimize it. We implemented our method on a network of workstations, using a real dataset, and we compared the experimental and the theoretical results. The conclusion is that our formula for the response time is very accurate (the maximum relative error was 29%; the typical error was in the vicinity of 10-15%). We illustrate one of the possible ways to exploit such an accurate formula, by examining several 'what-if' scenarios. One major, practical conclus...
Applications of Combinatorial Designs to Communications, Cryptography, and Networking
, 1999
Cited by 23 (2 self)
... In this paper, we focus on another collection of recent applications in the general area of communications, including cryptography and networking. Applications have been chosen to represent those in which design theory plays a useful, and sometimes central, role. Moreover, applications have been chosen to reflect, in addition, the genesis of new and interesting problems in design theory in order to treat the practical concerns. Of many candidates, thirteen application areas have been included. They are as follows:
Efficient Retrieval of Multidimensional Datasets Through Parallel I/O
, 1998
Cited by 9 (1 self)
Many scientific and engineering applications process large multidimensional datasets. An important access pattern for these applications is the retrieval of data corresponding to ranges of values in multiple dimensions. Performance is limited by disks largely due to high disk latencies. Tiling and distributing the data across multiple disks is an effective technique for improving performance through parallel I/O. The distribution of tiles across the disks is an important factor in achieving gains. Several schemes for declustering multidimensional data to improve the performance of range queries have been proposed in the literature. We extend the class of Cyclic schemes which have been developed earlier for two-dimensional data to multiple dimensions. We establish important properties of Cyclic schemes, based upon which we reduce the search space for determining good declustering schemes within the class of Cyclic schemes. Through experimental evaluation, we establish that the Cyclic sc...
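To make the idea of searching within the class of Cyclic schemes concrete, here is a small sketch. It assumes the usual form of a cyclic allocation, where a cell goes to disk (a1*x1 + ... + ad*xd) mod M for a fixed coefficient vector, and it scores a scheme by its worst additive error (actual cost minus the ideal cost) over small query shapes. The function names and the brute-force search are illustrative assumptions, not the paper's procedure.

```python
from itertools import product
from math import ceil


def cyclic_disk(cell, coeffs, num_disks):
    """Assumed cyclic rule: disk = (sum of coeff * coordinate) mod num_disks."""
    return sum(a * x for a, x in zip(coeffs, cell)) % num_disks


def worst_additive_error(coeffs, num_disks, max_side):
    """Worst (actual - ideal) cost over all query shapes with sides 1..max_side.

    Shifting a query only rotates the per-disk loads of a cyclic scheme,
    so checking queries anchored at the origin is sufficient.
    """
    worst = 0
    for shape in product(range(1, max_side + 1), repeat=len(coeffs)):
        load = [0] * num_disks
        for cell in product(*(range(s) for s in shape)):
            load[cyclic_disk(cell, coeffs, num_disks)] += 1
        ideal = ceil(sum(load) / num_disks)
        worst = max(worst, max(load) - ideal)
    return worst


def best_skip_2d(num_disks, max_side):
    """Try every 2-D scheme (1, h) and return the first h with the smallest worst error."""
    return min(
        range(1, num_disks),
        key=lambda h: worst_additive_error((1, h), num_disks, max_side),
    )
```

With 5 disks and query sides up to 5, the coefficients `(1, 1)` leave an additive error of at least 1 (a 2 x 2 query hits one disk twice), whereas `(1, 2)` reaches an error of 0 on every shape tested, so `best_skip_2d(5, 5)` returns 2.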
Physical Database Design Decision Algorithms and Concurrent Reorganization for Parallel Database Systems
, 1997
Cited by 9 (1 self)
Stringent performance requirements in DB applications have led to the use of parallelism for database processing. To allow the database system to take advantage of the performance of parallel shared-nothing systems, the physical DB design must be appropriate for the DB structure and the workload. We develop decision algorithms that will select a good physical DB design both when the DB is first loaded into the system (static decision) and while the DB is being used by the workload (dynamic decision). Our decision algorithms take the database structure, workload, and system characteristics as inputs. The static (or initial) physical DB design decision algorithm involves:
• selecting a partitioning attribute for each relation that determines how the relation is fragmented across the nodes (allowing for high I/O bandwidth);
• selecting indexes on the relation attributes to allow faster accesses compared to sequential file scans;
• selecting the attributes by which to cluster a relation in order to take advantage of the prefetching and caching involved in I/O access;
• grouping of relations to allow DB operations (joins) on relation pairs to be executed locally
Partitioning Key Selection for a Shared-Nothing Parallel Database System
- IBM Research Report RC
, 1994
Cited by 9 (0 self)
A shared-nothing database system which tries to leverage the knowledge of partitioning attributes of relations can outperform a system where such knowledge is either not available or not used. The performance improvements are typically obtained by function shipping more database operations (joins, aggregates, etc.), thus minimizing the communication overhead. In such a system, it is critical that the correct partitioning keys are selected so that the query workload is optimized. Previous research has ignored the importance of selecting the partitioning keys and has mostly focused on the degree of declustering. In this study we show that by following a systematic methodology, especially for the partitioning key selection and associated relation grouping issues, the entire data placement strategy for a given database schema and workload can be determined in a very efficient manner. We describe different flavors of this methodology and demonstrate the performance improvements resulting fr...
Efficient parallel processing of range queries through replicated declustering
- JOURNAL OF DISTRIBUTED AND PARALLEL DATABASES
Cited by 3 (3 self)
A common technique used to minimize I/O in data intensive applications is data declustering over parallel servers. This technique involves distributing data among several disks so as to parallelize query retrieval and thus improve performance. We focus on optimizing access to large spatial data, and the most common type of queries on such data, i.e., range queries. An optimal declustering scheme is one in which the processing for all range queries is balanced uniformly among the available disks. It has been shown that single-copy declustering schemes are non-optimal for range queries. In this paper, we integrate replication with parallel disk declustering for efficient processing of range queries. We note that replication is widely used in database applications for purposes such as load balancing, fault tolerance, and availability of data. We propose theoretical foundations for replicated declustering and a class of replicated declustering schemes, periodic allocations, which are shown to be strictly optimal for a number of disks. We propose a framework for replicated declustering that uses a limited amount of replication, and provide extensions to apply it on real data, which include arbitrary grids and a large number of disks. Our framework also provides an effective indexing scheme that enables fast identification of data of interest in parallel servers. In addition to optimal processing of single queries, we show that this framework is effective for parallel processing of multiple queries. We present experimental results comparing the proposed replication scheme to other techniques for both single queries and multiple queries, on synthetic and real data sets.
Efficient disk allocation for fast similarity searching
- IN PROC. OF THE 10TH INT. SYM. ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1998
Cited by 3 (0 self)
As databases increasingly integrate non-textual information, it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for improving the performance of similarity searches through parallel I/O. In this paper, we propose a new scheme which provides good declustering for similarity searching. In particular, it does global declustering as opposed to local declustering, exploits the availability of extra disks, and does not limit the partitioning of the data space. Our technique is based upon the cyclic declustering schemes which were developed for range and partial match queries. We establish, in general, that cyclic declustering techniques outperform previously proposed techniques.
Efficient retrieval of replicated data
, 2006
Cited by 2 (2 self)
Declustering is a common technique used to reduce query response times. Data is declustered over multiple disks and query retrieval can be parallelized. Most of the research on declustering is targeted at spatial range queries and investigates schemes with low additive error. Recently, declustering using replication has been proposed to reduce the additive overhead. Replication significantly reduces the retrieval cost of arbitrary queries. In this paper, we propose a disk allocation and retrieval mechanism for arbitrary queries based on design theory. Using the proposed c-copy replicated declustering scheme, (c − 1)k² + ck buckets can be retrieved using at most k disk accesses. The retrieval algorithm is very efficient and is asymptotically optimal, with Θ(|Q|) complexity for a query Q. In addition to the deterministic worst-case bound and efficient retrieval, the proposed algorithm handles nonuniform data and high dimensions, supports incremental declustering, and has good fault tolerance. Experimental results show the feasibility of the algorithm.
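The worst-case bound quoted in this abstract is easy to evaluate numerically. The sketch below only does the arithmetic for the stated formula, (c − 1)k² + ck buckets in at most k accesses; the helper names are hypothetical, and any conditions the paper places on the number of disks or on c and k are not visible in the abstract and are not modeled here.

```python
def max_buckets(copies, accesses):
    """Buckets covered by the stated bound: (c - 1) * k**2 + c * k."""
    return (copies - 1) * accesses * accesses + copies * accesses


def accesses_needed(copies, num_buckets):
    """Smallest k for which the stated bound covers a query of num_buckets buckets."""
    k = 0
    while max_buckets(copies, k) < num_buckets:
        k += 1
    return k
```

For example, with two copies (c = 2) the bound covers 15 buckets in 3 accesses, so a 16-bucket query would need k = 4 under this guarantee; with three copies, 2 accesses already cover 14 buckets.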
Analysis and comparison of replicated declustering schemes
- IEEE Transactions on Parallel and Distributed Systems
, 2007
Cited by 1 (1 self)
Declustering distributes data among parallel disks to reduce the retrieval cost using I/O parallelism. Many schemes were proposed for the single-copy declustering of spatial data. Recently, declustering using replication gained a lot of interest and several schemes with different properties were proposed. An in-depth comparison of major schemes is necessary to understand replicated declustering better. In this paper, we analyze the proposed schemes, tune some of the parameters, and compare them for different query types and under different loads. We propose a three-step retrieval algorithm for the compared schemes. For arbitrary queries, the dependent and partitioned allocation schemes perform poorly; others perform close to each other. For range queries, they perform similarly with the exception of smaller queries in which random duplicate allocation (RDA) performs poorly and dependent allocation performs well. For connected queries, partitioned allocation performs poorly and dependent allocation performs well under a light load.
Index Terms: Declustering, parallel I/O, spatial range query, Latin square.

