Parallel database systems: the future of high performance database systems
- Communications of the ACM
, 1992
Cited by 466 (8 self)
Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries. This paper reviews the techniques used by such systems, and surveys current commercial and research systems.
Bigtable: A distributed storage system for structured data
- In Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation - Volume 7
, 2006
Cited by 285 (3 self)
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
RAID: High-Performance, Reliable Secondary Storage
- ACM Computing Surveys
, 1994
Cited by 282 (6 self)
Disk arrays were proposed in the 1980s as a way to use parallelism between multiple disks to improve aggregate I/O performance. Today they appear in the product lines of most major computer manufacturers. This paper gives a comprehensive overview of disk arrays and provides a framework in which to organize current and future work. The paper first introduces disk technology and reviews the driving forces that have popularized disk arrays: performance and reliability. It then discusses the two architectural techniques used in disk arrays: striping across multiple disks to improve performance and redundancy to improve reliability. Next, the paper describes seven disk array architectures, called RAID (Redundant Arrays of Inexpensive Disks) levels 0-6 and compares their performance, cost, and reliability. It goes on to discuss advanced research and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency. Last, the paper describes six disk array prototypes or products and discusses future opportunities for research. The paper includes an annotated bibliography of disk array-related literature.
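The two techniques named in this abstract, striping for performance and redundancy for reliability, can be illustrated with a small sketch. It is not taken from the paper: the function names `stripe` and `xor_parity` are hypothetical, and single XOR parity is only one of the redundancy schemes the RAID levels cover.

```python
from functools import reduce


def stripe(data: bytes, n_disks: int, unit: int) -> list[list[bytes]]:
    """Cut data into fixed-size units and deal them round-robin onto n_disks."""
    disks: list[list[bytes]] = [[] for _ in range(n_disks)]
    for offset in range(0, len(data), unit):
        block = data[offset:offset + unit].ljust(unit, b"\x00")  # zero-pad last unit
        disks[(offset // unit) % n_disks].append(block)
    return disks


def xor_parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks (the parity unit of one stripe)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe of three data units plus its parity unit.
units = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(units)

# If the disk holding units[1] fails, XOR of the survivors and the parity rebuilds it.
rebuilt = xor_parity([units[0], units[2], parity])
assert rebuilt == units[1]
```

Striping lets consecutive units be read from different disks at once, while the parity unit costs one extra disk per stripe and tolerates the loss of any single unit in that stripe.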
The Gamma database machine project
- IEEE Transactions on Knowledge and Data Engineering
, 1990
Cited by 203 (27 self)
This paper describes the design of the Gamma database machine and the techniques employed in its implementation. Gamma is a relational database machine currently operating on an Intel iPSC/2 hypercube with 32 processors and 32 disk drives. Gamma employs three key technical ideas which enable the architecture to be scaled to 100s of processors. First, all relations are horizontally partitioned across multiple disk drives enabling relations to be scanned in parallel. Second, novel parallel algorithms based on hashing are used to implement the complex relational operators such as join and aggregate functions. Third, dataflow scheduling techniques are used to coordinate multioperator queries. By using these techniques it is possible to control the execution of very complex queries with minimal coordination, a necessity for configurations involving a very large number of processors. In addition to describing the design of the Gamma software, a thorough performance evaluation of the iPSC/2 hypercube version of Gamma is also presented. In addition to measuring the effect of relation size and indices on the response time for selection, join, aggregation, and update queries, we also analyze the performance of Gamma relative to the number of processors employed when the sizes of the input relations are kept constant (speedup) and when the sizes of the input relations are increased proportionally to the number of processors (scaleup). The speedup results obtained for both selection and join queries are linear; thus, doubling the number of processors
The state of the art in distributed query processing
- ACM Computing Surveys
, 2000
Cited by 182 (2 self)
Distributed data processing is fast becoming a reality. Businesses want to have it for many reasons, and they often must have it in order to stay competitive. While much of the infrastructure for distributed data processing is already in place (e.g., modern network technology), there are a number of issues which still make distributed data processing a complex undertaking: (1) distributed systems can become very large involving thousands of heterogeneous sites including PCs and mainframe server machines; (2) the state of a distributed system changes rapidly because the load of sites varies over time and new sites are added to the system; (3) legacy systems need to be integrated; such legacy systems usually have not been designed for distributed data processing and now need to interact with other (modern) systems in a distributed environment. This paper presents the state of the art of query processing for distributed database and information systems. The paper presents the "textbook" architecture for distributed query processing and a series of techniques that are particularly useful for distributed database systems. These techniques include special join techniques, techniques to exploit intra-query parallelism, techniques to reduce communication costs, and techniques to exploit caching and replication of data. Furthermore, the paper discusses different kinds of distributed systems such as client-server, middleware (multi-tier), and heterogeneous database systems and shows how query processing works in these systems.
Categories and subject descriptors: E.5 [Data]: Files; H.2.4 [Database Management Systems]: distributed databases, query processing; H.2.5 [Heterogeneous Databases]: data translation. General terms: algorithms; performance. Additional key words and phrases: query optimization; query execution; client-server databases; middleware; multi-tier architectures; database application systems; wrappers; replication; caching; economic models for query processing; dissemination-based information systems.
A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment
, 1989
Cited by 147 (14 self)
The join operator has been a cornerstone of relational database systems since their inception. As such, much time and effort has gone into making joins efficient. With the obvious trend towards multiprocessors, attention has focused on efficiently parallelizing the join operation. In this paper we analyze and compare four parallel join algorithms. Grace and Hybrid hash represent the class of hash-based join methods, Simple hash represents a looping algorithm with hashing, and our last algorithm is the more traditional sort-merge. The Gamma database machine serves as the host for the performance comparison. Gamma’s shared-nothing architecture with commercially available components is becoming increasingly common, both in research and in industry.
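As a rough illustration of the hash-based class of join methods mentioned in this abstract, the sketch below partitions both inputs on the join key and then joins matching partitions independently, which is the general shape of a Grace-style join. It is a single-process simplification, not the paper's implementation: the partitions stand in for processors, and the names `partition`, `hash_join`, and `partitioned_join` are hypothetical.

```python
from collections import defaultdict


def partition(rows, key, n_parts):
    """Route each row to a partition chosen by hashing its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts


def hash_join(build, probe, key):
    """Build an in-memory hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for b in build:
        table[b[key]].append(b)
    return [(b, p) for p in probe for b in table.get(p[key], [])]


def partitioned_join(r, s, key, n_parts):
    """Join r and s partition by partition; each pair could run on its own node."""
    out = []
    for r_part, s_part in zip(partition(r, key, n_parts), partition(s, key, n_parts)):
        out.extend(hash_join(r_part, s_part, key))
    return out
```

Because rows with equal keys always hash to the same partition number, no matches are lost by joining partition `i` of one input only with partition `i` of the other.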
Chained Declustering: A New Availability Strategy for Multiprocessor Database
- In Proceedings of 6th International Data Engineering Conference
, 1990
Cited by 112 (6 self)
This paper presents a new strategy for increasing the availability of data in multi-processor, shared-nothing database machines. This technique, termed chained declustering, is demonstrated to provide superior performance in the event of failures while maintaining a very high degree of data availability. Furthermore, unlike most earlier replication strategies, the implementation of chained declustering requires no special hardware and only minimal modifications to existing software.
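The abstract does not spell out the placement rule, so the sketch below rests on an assumption about how the scheme is usually described: the primary copy of fragment i lives on node i and its backup on node (i + 1) mod n, forming a chain. The names `chained_placement` and `available_fragments` are hypothetical.

```python
def chained_placement(n_nodes):
    """Assumed rule: fragment f has its primary on node f and backup on node f + 1 (mod n)."""
    return {f: (f % n_nodes, (f + 1) % n_nodes) for f in range(n_nodes)}


def available_fragments(placement, failed):
    """A fragment stays readable while at least one of its two copies is on a live node."""
    return {f for f, (primary, backup) in placement.items()
            if primary not in failed or backup not in failed}


placement = chained_placement(4)

# Any single node failure leaves every fragment readable.
assert all(available_fragments(placement, {node}) == set(placement) for node in range(4))
```

Under this assumed rule, data becomes unavailable only when two nodes that are adjacent in the chain fail together.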
Declustering Using Fractals
- In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems
, 1993
Cited by 80 (2 self)
We propose a method to achieve declustering for cartesian product files on M units. The focus is on range queries, as opposed to partial match queries that older declustering methods have examined. Our method uses a distance-preserving mapping, namely, the Hilbert curve, to impose a linear ordering on the multidimensional points (buckets); then, it traverses the buckets according to this ordering, assigning buckets to disks in a round-robin fashion. Thanks to the good distance-preserving properties of the Hilbert curve, the end result is that each disk contains buckets that are far away in the linear ordering, and, most probably, far away in the k-d address space. This is exactly the goal of declustering. Experiments show that these intuitive arguments lead indeed to good performance: the proposed method performs at least as well or better than older declustering schemes. Categories and Subject Descriptors: E.1 [Data Structures]; E.5 [Files]; H.2.2 [Data Base Management]: Physical Des...
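The method described here (order the buckets along a Hilbert curve, then assign them to disks round-robin) can be sketched for the two-dimensional case. This is a simplified illustration, not the authors' code: the abstract works in a k-d address space, while the sketch assumes a square 2-D grid, and the names `hilbert_index` and `decluster` are hypothetical. The index computation follows the widely used iterative bit-manipulation formulation.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along the Hilbert curve on a 2**order by 2**order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def decluster(order, n_disks):
    """Visit buckets in Hilbert order and deal them to disks round-robin."""
    n = 1 << order
    cells = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda c: hilbert_index(order, *c))
    return {cell: rank % n_disks for rank, cell in enumerate(cells)}
```

Buckets that are consecutive along the curve are also neighbours in the grid, so round-robin assignment tends to put nearby buckets on different disks, which is what a range query needs in order to read them in parallel.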
Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems
- In VLDB
, 2004
Cited by 77 (3 self)
We consider the problem of horizontally partitioning a dynamic relation across a large number of disks/nodes by the use of range partitioning. Such partitioning is often desirable in large-scale parallel databases, as well as in peer-to-peer (P2P) systems. As tuples are inserted and deleted...
Continuous Retrieval of Multimedia Data Using Parallelism
, 1993
Cited by 66 (12 self)
Multimedia information systems have emerged as an essential component of many application domains ranging from library information systems to entertainment technology. This is because these systems utilize a variety of human senses to provide an effective means of communicating information. However, most implementations of these systems (based on a workstation) cannot support a continuous display of high resolution audio and video data and suffer from frequent disruptions and delays termed hiccups. This is due to the low I/O bandwidth of the current disk technology, the high bandwidth requirement of multimedia objects, and the large size of these objects which requires them to be almost always disk resident. In this paper, we describe a parallel multimedia information system and the key technical ideas that enable it to support a real-time display of multimedia objects. These techniques are as follows. First, we decluster a multimedia object across several disk drives, enabling the system to utilize the aggregate bandwidth of multiple disks to retrieve an object in real-time. Second, the workload of an application is distributed evenly across the disk drives in order to maximize the processing capability of the system. To support simultaneous display of several multimedia objects for different users, we describe two alternative approaches. The first approach multitasks a disk drive among several requests while the second replicates the data and dedicates resources to each individual request. We investigate the trade-offs associated with each approach using a simulation model. Our results demonstrate the superiority of the replication approach.

