Query evaluation techniques for large databases
- ACM COMPUTING SURVEYS
, 1993
Cited by 592 (7 self)
Database management systems will continue to manage large data volumes. Thus, efficient algorithms for accessing and manipulating large sets and sequences will be required to provide acceptable performance. The advent of object-oriented and extensible database systems will not solve this problem. On the contrary, modern data models exacerbate it: In order to manipulate large sets of complex objects as efficiently as today’s database systems manipulate simple records, query processing algorithms and software will become more complex, and a solid understanding of algorithm and architectural issues is essential for the designer of database management software. This survey provides a foundation for the design and implementation of query execution facilities in new database management systems. It describes a wide array of practical query evaluation techniques for both relational and post-relational database systems, including iterative execution of complex query evaluation plans, the duality of sort- and hash-based set matching algorithms, types of parallel query execution and their implementation, and special operators for emerging database application domains.
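The duality of sort- and hash-based set matching mentioned in the abstract can be illustrated with a minimal sketch. The function names, the tuple-based rows, and the key-extractor parameters below are assumptions made for illustration and are not taken from the survey; both routines compute the same equi-join and differ only in how matching keys are brought together.

```python
from collections import defaultdict

def hash_join(left, right, key_left, key_right):
    """Equi-join by building a hash table on `left` and probing it with `right`."""
    table = defaultdict(list)
    for row in left:
        table[key_left(row)].append(row)
    # .get() does not create missing entries in the defaultdict.
    return [(l, r) for r in right for l in table.get(key_right(r), [])]

def merge_join(left, right, key_left, key_right):
    """Equi-join by sorting both inputs and merging runs of equal keys."""
    L = sorted(left, key=key_left)
    R = sorted(right, key=key_right)
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        kl, kr = key_left(L[i]), key_right(R[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(L) and key_left(L[i_end]) == kl:
                i_end += 1
            j_end = j
            while j_end < len(R) and key_right(R[j_end]) == kl:
                j_end += 1
            # Emit the cross product of the two runs.
            out.extend((a, b) for a in L[i:i_end] for b in R[j:j_end])
            i, j = i_end, j_end
    return out

# Hypothetical sample data: (key, payload) rows.
EMP = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
DEPT = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]
first = lambda row: row[0]
```

Calling `hash_join(EMP, DEPT, first, first)` and `merge_join(EMP, DEPT, first, first)` yields the same set of matched pairs, possibly in a different order.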
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors
- ACM Transactions on Computer Systems
, 1991
Cited by 433 (29 self)
Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. We present a new scalable algorithm for spin locks that generates O(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than
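The idea of each waiter spinning on its own flag, with the releasing processor ending the spin by a single write, can be sketched as follows. This is not the algorithm the abstract announces: it is a simpler array-based queue lock, swapped in for brevity, and it needs one flag per potential waiter rather than constant space per lock. The class name, the `run_counter` helper, and the use of a `threading.Lock` to emulate an atomic fetch-and-increment are all assumptions of this sketch.

```python
import threading
import time

class ArrayQueueLock:
    """Queue lock in which every waiter spins on its own slot (illustrative only)."""

    def __init__(self, max_threads):
        self.size = max_threads
        self.flags = [False] * max_threads
        self.flags[0] = True            # the first ticket holder may enter at once
        self.next_ticket = 0
        self._atomic = threading.Lock() # stands in for hardware fetch-and-increment

    def _fetch_and_increment(self):
        with self._atomic:
            ticket = self.next_ticket
            self.next_ticket += 1
            return ticket

    def acquire(self):
        slot = self._fetch_and_increment() % self.size
        while not self.flags[slot]:     # spin only on this thread's own flag
            time.sleep(0)               # yield so other Python threads can run
        self.flags[slot] = False        # reset the slot for later reuse
        return slot

    def release(self, slot):
        # A single write to the successor's flag ends its spin.
        self.flags[(slot + 1) % self.size] = True

def run_counter(n_threads=4, iterations=50):
    """Increment a shared counter under the lock and return the final value."""
    lock = ArrayQueueLock(n_threads)
    total = [0]

    def worker():
        for _ in range(iterations):
            slot = lock.acquire()
            total[0] += 1
            lock.release(slot)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The sketch is only correct while at most `max_threads` threads contend for the lock at once, since slots are reused modulo the array size.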
Midway: Shared Memory Parallel Programming with Entry Consistency for Distributed Memory Multiprocessors
, 1991
Cited by 170 (0 self)
Distributed memory multiprocessing offers a cost-effective and scalable solution for a large class of scientific and numeric applications. Unfortunately, the performance of current distributed memory programming environments suffers because the frequency of communication between processors can exceed that required to ensure a correctly functioning program. Midway is a shared memory parallel programming system which addresses the problem of excessive communication in a distributed memory multiprocessor. Midway programs are written using a conventional MIMD-style programming model executing within a single globally shared memory. Local memories on each processor cache recently used data to counter the effects of network latency. Midway is based on a new model of memory consistency called entry consistency. Entry consistency exploits the relationship between synchronization objects and the data which they protect. Updates to shared data are communicated between processors only when not ...
DDM - A Cache-Only Memory Architecture
- IEEE Computer
, 1992
Cited by 137 (8 self)
The long latencies introduced by remote accesses in a large multiprocessor can be hidden by caching. Caching also decreases the network load. We introduce a new class of architectures called Cache Only Memory Architectures (COMA). These architectures provide the programming paradigm of the shared-memory architectures, but have no physically shared memory; instead, the caches attached to the processors contain all the memory in the system, and their size is therefore large. A datum is allowed to be in any or many of the caches, and will automatically be moved to where it is needed by a cache-coherence protocol, which also ensures that the last copy of a datum is never lost. The location of a datum in the machine is completely decoupled from its address. We also introduce one example of COMA: the Data Diffusion Machine (DDM), and its simulated performance for large applications. The DDM is based on a hierarchical network structure, with processor/memory pairs at its tips. Remote accesses...
Scope Consistency: A Bridge between Release Consistency and Entry Consistency
- In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1996
Cited by 135 (12 self)
The large granularity of communication and coherence in shared virtual memory systems causes problems with false sharing and extra communication. Relaxed memory consistency models have been used to alleviate these problems, but at a cost in programming complexity. Release Consistency (RC) and Lazy Release Consistency (LRC) are accepted to offer a reasonable tradeoff between performance and programming complexity. Entry Consistency (EC) offers a more relaxed consistency model, but it requires explicit association of shared data objects with synchronization variables. The programming burden of providing such associations can be substantial. This paper proposes a new consistency model for shared virtual memory, called Scope Consistency (ScC), which offers most of the potential performance advantages of the EC model without requiring explicit bindings between data and synchronization variables. Instead, ScC dynamically detects the bindings implied by the programmer allowing a programming i...
False Sharing and Spatial Locality in Multiprocessor Caches
- IEEE Transactions on Computers
, 1992
Cited by 114 (4 self)
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in this paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We...
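The definition of false sharing given above, and the effect of changing the data layout, can be made concrete with a toy model. The sketch below is an assumption-laden simplification, not the paper's measurement method: it tracks only the last writer of each cache block and counts a coherence transfer whenever a different processor writes to that block. The function names and the trace format are hypothetical.

```python
def count_coherence_transfers(trace, words_per_block):
    """Count writes that hit a block last written by a different processor.

    trace: list of (processor_id, word_address) write events.
    """
    last_writer = {}
    transfers = 0
    for proc, word in trace:
        block = word // words_per_block
        previous = last_writer.get(block)
        if previous is not None and previous != proc:
            transfers += 1
        last_writer[block] = proc
    return transfers

def interleaved_trace(word_p0, word_p1, rounds):
    """Processors 0 and 1 alternately write their own private word."""
    trace = []
    for _ in range(rounds):
        trace.append((0, word_p0))
        trace.append((1, word_p1))
    return trace

# Packed layout: both words fall in block 0 of a 4-word block.
packed = count_coherence_transfers(interleaved_trace(0, 1, 4), 4)
# Padded layout: the second word is moved to the next block.
padded = count_coherence_transfers(interleaved_trace(0, 4, 4), 4)
```

In this model the packed layout incurs a transfer on every write after the first, although the two processors never touch the same word, while the padded layout incurs none.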
The Multiscalar Architecture
, 1993
Cited by 113 (8 self)
The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) dependent tasks in parallel using multiple processing elements. Splitting the instruction stream at statically determined boundaries allows the compiler to pass substantial information about the tasks to the hardware. The processing paradigm can be viewed as extensions of the superscalar and multiprocessing paradigms, and shares a number of properties of the sequential processing model and the dataflow processing model. The multiscalar paradigm is easily realizable, and we describe an implementation of the multiscalar paradigm, called the multiscalar processor. The central idea here is to connect multiple sequential processors, in a decoupled and decentralized manner, to achieve overall multiple issue. The multiscalar processor supports speculative execution, allows arbitrary dynamic code motion (facilitated by an efficient hardware memory disambiguation mechanism), exploits communication localities, and does all of these with hardware that is fairly straightforward to build. Other desirable aspects of the
The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism
- In Proceedings of the 19th Annual International Symposium on Computer Architecture
, 1992
Cited by 110 (10 self)
We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting fine-grain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain parallelism by overlapping the execution of multiple windows. The basic idea is to connect multiple sequential processors, in a decoupled and decentralized manner, to achieve overall multiple issue. This processing paradigm shares a number of properties of the restricted dataflow machines, but was derived from the sequential von Neumann architecture. We also present an implementation of the Expandable Split Window execution model, and preliminary performance results.
1. INTRODUCTION
The execution of a program, in an abstract form, can be considered to be a dynamic dataflow graph that encapsulates the data dependencies in the program. The nodes of the graph represent computation operations, and the arcs of the graph represent communication o...
Lazy Caching
- ACM Transactions on Programming Languages and Systems
, 1993
Cited by 77 (0 self)
This paper examines cache consistency conditions for multiprocessor shared memory systems. It states and motivates a weaker condition than is normally implemented. An algorithm is presented that exploits the weaker condition to achieve greater concurrency. The algorithm is shown to satisfy the weak consistency condition. Other properties of the algorithm and possible extensions are discussed.
MULTIPROCESSOR SCHEDULING TO ACCOUNT FOR INTERPROCESSOR COMMUNICATION
, 1991
Cited by 64 (11 self)
Interprocessor communication (IPC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essential for attaining efficient hardware utilization. This thesis introduces two new compile-time heuristics for scheduling precedence graphs onto multiprocessor architectures, which account for interprocessor communication overheads and interconnection constraints in the architecture. These algorithms perform scheduling and routing simultaneously to account for irregular interprocessor interconnections, and schedule all communications as well as all computations to eliminate shared resource contention. The first technique, called dynamic-level scheduling, modifies the classical HLFET list scheduling strategy to account for IPC and synchronization overheads. By using dynamically changing priorities to match nodes and processors at each step, this technique attains an equitable tradeoff between load balancing and interprocessor communication cost. This method is fast, flexible, widely targetable, and displays promising performance. The second technique, called declustering, establishes a parallelism hierarchy upon the precedence graph using graph-analysis techniques which explicitly address the tradeoff between exploiting parallelism and incurring communication cost. By systematically decomposing this hierarchy, the declustering process exposes parallelism instances in order of importance, assuring efficient use of the available processing resources. In contrast with traditional clustering schemes, this technique can adjust the level of cluster granularity to suit the characteristics of the specified architecture, leading to a more effective solution.
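A list scheduler with dynamically changing priorities, in the spirit of the first technique, might look like the sketch below. The abstract does not give the priority formula, so the one used here is an assumption: a node's static level (longest execution-time path to an exit node) minus the earliest time it could start on a given processor. Routing, interconnection topology, and contention, which the thesis handles, are ignored, and every processor pair is assumed to be directly connected. All names are hypothetical.

```python
def static_levels(exec_time, succs):
    """Longest execution-time path from each node to an exit node (no IPC cost)."""
    memo = {}

    def level(n):
        if n not in memo:
            memo[n] = exec_time[n] + max(
                (level(s) for s in succs.get(n, [])), default=0)
        return memo[n]

    return {n: level(n) for n in exec_time}

def dynamic_level_schedule(exec_time, edges, n_procs):
    """Greedy scheduling of a precedence graph.

    exec_time: dict node -> execution time.
    edges: dict (u, v) -> communication cost paid only if u and v run on
           different processors.
    Returns dict node -> (processor, start, finish).
    """
    succs, preds = {}, {}
    for (u, v) in edges:
        succs.setdefault(u, []).append(v)
        preds.setdefault(v, []).append(u)
    sl = static_levels(exec_time, succs)
    proc_free = [0] * n_procs
    placed = {}
    while len(placed) < len(exec_time):
        ready = [n for n in exec_time if n not in placed
                 and all(p in placed for p in preds.get(n, []))]
        best = None
        for n in ready:
            for p in range(n_procs):
                # Time at which all inputs of n are available on processor p.
                data_ready = max(
                    (placed[u][2] + (0 if placed[u][0] == p else edges[(u, n)])
                     for u in preds.get(n, [])), default=0)
                start = max(data_ready, proc_free[p])
                dl = sl[n] - start          # assumed dynamic-level priority
                if best is None or dl > best[0]:
                    best = (dl, n, p, start)
        _, n, p, start = best
        placed[n] = (p, start, start + exec_time[n])
        proc_free[p] = start + exec_time[n]
    return placed

# Hypothetical diamond-shaped graph: a feeds b and c, which both feed d.
EXEC = {"a": 2, "b": 3, "c": 3, "d": 1}
EDGES = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
SCHEDULE = dynamic_level_schedule(EXEC, EDGES, 2)
```

Because the priority is recomputed for every ready node and processor pair at each step, a node with a lower static level can be chosen first when a high-level node would have to wait for data or for a busy processor.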

