Results 1 - 10
of
88
Query evaluation techniques for large databases
- ACM COMPUTING SURVEYS
, 1993
"... Database management systems will continue to manage large data volumes. Thus, efficient algorithms for accessing and manipulating large sets and sequences will be required to provide acceptable performance. The advent of object-oriented and extensible database systems will not solve this problem. On ..."
Abstract
-
Cited by 592 (7 self)
- Add to MetaCart
Database management systems will continue to manage large data volumes. Thus, efficient algorithms for accessing and manipulating large sets and sequences will be required to provide acceptable performance. The advent of object-oriented and extensible database systems will not solve this problem. On the contrary, modern data models exacerbate it: In order to manipulate large sets of complex objects as efficiently as today’s database systems manipulate simple records, query processing algorithms and software will become more complex, and a solid understanding of algorithm and architectural issues is essential for the designer of database management software. This survey provides a foundation for the design and implementation of query execution facilities in new database management systems. It describes a wide array of practical query evaluation techniques for both relational and post-relational database systems, including iterative execution of complex query evaluation plans, the duality of sort- and hash-based set matching algorithms, types of parallel query execution and their implementation, and special operators for emerging database application domains.
Parallel database systems: the future of high performance database systems
- Communications of the ACM
, 1992
"... Abstract: Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries. This paper ..."
Abstract
-
Cited by 466 (8 self)
- Add to MetaCart
Abstract: Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries. This paper reviews the techniques used by such systems, and surveys current commercial and research systems. 1.
Staggered Striping in Multimedia Information Systems
, 1994
"... Multimedia information systems have emerged as an essential component of many application domains ranging from library information systems to entertainment technology. However, most implementations of these systems cannot support the continuous display of multimedia objects and suffer from freque ..."
Abstract
-
Cited by 163 (31 self)
- Add to MetaCart
Multimedia information systems have emerged as an essential component of many application domains ranging from library information systems to entertainment technology. However, most implementations of these systems cannot support the continuous display of multimedia objects and suffer from frequent disruptions and delays termed hiccups. This is due to the low I/O bandwidth of the current disk technology, the high bandwidth requirement of multimedia objects, and the large size of these objects that almost always requires them to be disk resident. One approach to resolve this limitation is to decluster a multimedia object across multiple disk drives in order to employ the aggregate bandwidth of several disks to support the continuous retrieval (and display) of objects. This paper describes staggered striping as a novel technique to provide effective support for multiple users accessing the different objects in the database. Detailed simulations confirm the superiority of stagge...
Dataflow Query Execution in a Parallel Main-Memory Environment
- Distributed and Parallel Databases
, 1991
"... In this paper, the performance and characteristics of the execution of various join-trees on a parallel DBMS are studied. The results of this study, are a step into the direction of the design of a query optimization strategy that is fit for parallel execution of complex queries. Among others, synch ..."
Abstract
-
Cited by 159 (4 self)
- Add to MetaCart
In this paper, the performance and characteristics of the execution of various join-trees on a parallel DBMS are studied. The results of this study, are a step into the direction of the design of a query optimization strategy that is fit for parallel execution of complex queries. Among others, synchronization issues are identified to limit the performance gain from parallelism. A new hashjoin algorithm, called Pipelining hash-join is introduced that has fewer synchronization constraints than the known hash-join algorithms. Also, the behavior of individual join operations in a join-tree is studied in a simulation experiment. The results show that the Pipelining hash-join algorithm yields a better performance for multi-join queries. Also, the format of the optimal join-tree appears to depend on the size of the operands of the join: A multi-join between small operands performs best with a bushy schedule; larger operands are better off with a linear schedule. The results from the simulatio...
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
"... DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed.NET objects; and by s ..."
Abstract
-
Cited by 89 (18 self)
- Add to MetaCart
DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed.NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language. A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary sideeffect-free transformations on datasets, and can be written and debugged using standard.NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan. We describe the implementation of the DryadLINQ compiler and runtime. We evaluate DryadLINQ on a varied set of programs drawn from domains such as web-graph analysis, large-scale log mining, and machine learning. We show that excellent absolute performance can be attained—a general-purpose sort of 1012 Bytes of data executes in 319 seconds on a 240-computer, 960disk cluster—as well as demonstrating near-linear scaling of execution time on representative applications as we vary the number of computers used for a job. 1
Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison
, 1995
"... The field of knowledge discovery in databases, or "Data Mining", has received increasing attention during recent years as large organizations have begun to realize the potential value of the information that is stored implicitly in their databases. One specific data mining task is the mining of Asso ..."
Abstract
-
Cited by 61 (0 self)
- Add to MetaCart
The field of knowledge discovery in databases, or "Data Mining", has received increasing attention during recent years as large organizations have begun to realize the potential value of the information that is stored implicitly in their databases. One specific data mining task is the mining of Association Rules, particularly from retail data. The task is to determine patterns (or rules) that characterize the shopping behavior of customers from a large database of previous consumer transactions. The rules can then be used to focus marketing efforts such as product placement and sales promotions. Because early algorithms required an unpredictably large number of IO operations, reducing IO cost has been the primary target of the algorithms presented in the literature. One of the most recent proposed algorithms, called PARTITION, uses a new TID-list data representation and a new partitioning technique. The partitioning technique reduces IO cost to a constant amount by processing one datab...
Parallel Database Systems: The Future of Database Processing or a Passing Fad?
- SIGMOD RECORD
, 1991
"... Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries. This paper reviews th ..."
Abstract
-
Cited by 46 (6 self)
- Add to MetaCart
Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries. This paper reviews the techniques used by such systems, and surveys current commercial and research systems.
Conflict-aware scheduling for dynamic content applications
- In Proceedings of the Fifth USENIX Symposium on Internet Technologies and Systems
, 2003
"... We present a new lazy replication technique, suitable for scaling the back-end database of a dynamic content site using a cluster of commodity computers. Our technique, called conflict-aware scheduling, provides both throughput scaling and 1-copy serializability. It has generally been believed that ..."
Abstract
-
Cited by 43 (8 self)
- Add to MetaCart
We present a new lazy replication technique, suitable for scaling the back-end database of a dynamic content site using a cluster of commodity computers. Our technique, called conflict-aware scheduling, provides both throughput scaling and 1-copy serializability. It has generally been believed that this combination is hard to achieve through replication because of the growth of the number of conflicts. We take advantage of the presence in a database cluster of a scheduler through which all incoming requests pass. We require that transactions specify the tables that they access at the beginning of the transaction. Using that information, a conflictaware scheduler relies on a sequence-numbering scheme to implement 1-copy serializability, and directs incoming queries in such a way that the number of conflicts is reduced. We evaluate conflict-aware scheduling using the TPC-W e-commerce benchmark. For small clusters of up to eight database replicas, our evaluation is performed through measurements of a web site implementing the TPC-W specification. We use simulation to extend our measurement results to larger clusters, faster database engines, and lower conflict rates. Our results show that conflict-awareness brings considerable benefits compared to both eager and conflictoblivious lazy replication for a large range of cluster sizes, database speeds, and conflict rates. Conflict-aware scheduling provides near-linear throughput scaling up to a large number of database replicas for the browsing and shopping workloads of TPC-W. For the write-heavy ordering workload, throughput scales, but only to a smaller number of replicas. 1
A Performance Analysis of Alternative Multi-Attribute Declustering Strategies
- In Proc. ACM SIGMOD Int. Conf. on Management of Data
, 1992
"... During the past decade, parallel database systems have gained increased popularity due to their high performance, scalability and availability characteristics. With the predicted future database sizes and the complexity of queries, the scalability of these systems to hundreds and thousands of pro ..."
Abstract
-
Cited by 41 (4 self)
- Add to MetaCart
During the past decade, parallel database systems have gained increased popularity due to their high performance, scalability and availability characteristics. With the predicted future database sizes and the complexity of queries, the scalability of these systems to hundreds and thousands of processors is essential for satisfying the projected demand. Several studies have repeatedly demonstrated that both the performance and scalability of a parallel database system is contingent on the physical layout of data across the processors of the system. If the data is not declustered properly, the execution of an operator might waste resources, reducing the overall processing capability of the system.
Data Placement in Shared-Nothing Parallel Database Systems
, 1994
"... . Data placement in shared-nothing database systems has been studied extensively in the past and various placement algorithms have been proposed. However, there is no consensus on the most efficient data placement algorithm and placement is still performed manually by a database administ ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
.<F3.733e+05> Data placement in shared-nothing database systems has been studied extensively in the past and various placement algorithms have been proposed. However, there is no consensus on the most efficient data placement algorithm and placement is still performed manually by a database administrator with periodic reorganization to correct mistakes. This paper presents the first comprehensive simulation study of data placement issues in a shared-nothing system. The results show that current hardware technology trends have significantly changed the performance tradeoffs considered in past studies. A simplistic data placement strategy based on the new results is developed and shown to perform well for a variety of workloads.<F7.947e+05> Key words:<F3.733e+05> Declustering -- Disk allocation -- Resource allocation -- Resource scheduling<F7.947e+05> 1 Introduction<F3.733e+05> The last decade has seen a significant change in the characteristics of database applications. The demands of...

