LogP: Towards a Realistic Model of Parallel Computation
, 1993
Cited by 471 (14 self)
A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration, in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
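As a rough illustration of how four such parameters can enter an analysis, the sketch below uses the parameter names usually associated with LogP (L for latency, o for overhead, g for gap, P for processor count). The abstract above does not name them, so both the names and the two cost formulas are assumptions taken from the common textbook reading of the model, and the numeric values are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """Hypothetical container for the four model parameters (names assumed)."""
    L: float  # communication delay (latency) of one small message
    o: float  # processor overhead to send or to receive a message
    g: float  # minimum gap between consecutive sends at one processor
    P: int    # number of processor/memory modules


def point_to_point(m: LogP) -> float:
    """Time for one small message: send overhead + latency + receive overhead."""
    return m.o + m.L + m.o


def pipelined_sends(m: LogP, k: int) -> float:
    """Time until the k-th of k back-to-back messages has been received.

    Successive sends from one processor are spaced by max(g, o); the last
    message then needs one full point-to-point time.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1) * max(m.g, m.o) + point_to_point(m)


if __name__ == "__main__":
    machine = LogP(L=6.0, o=2.0, g=4.0, P=8)  # illustrative values only
    print(point_to_point(machine))
    print(pipelined_sends(machine, 3))
```

An algorithm written against these functions adapts to a machine by changing only the parameter values, which is the sense in which the abstract describes portable algorithms as adapting to the configuration.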
ACE: And/Or-parallel Copying-based Execution of Logic Programs
, 1991
Cited by 62 (38 self)
In this paper we present a novel execution model for parallel implementation of logic programs which is capable of exploiting both independent and-parallelism and or-parallelism in an efficient way. This model extends the stack copying approach, which has been successfully applied in the Muse system to implement or-parallelism, by integrating it with proven techniques used to support independent and-parallelism. We show how all solutions to non-deterministic and-parallel goals are found without repetitions. This is done through recomputation as in Prolog (and in various and-parallel systems, like &-Prolog and DDAS), i.e., solutions of and-parallel goals are not shared. We propose a scheme for the efficient management of the address space in a way that is compatible with the apparently incompatible requirements of both and- and or-parallelism. We also show how the full Prolog language, with all its extra-logical features, can be supported in our and-or parallel system so that its sequent...
Horizons of Parallel Computation
- JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1993
Cited by 36 (3 self)
This paper considers the ultimate impact of fundamental physical limitations---notably, speed of light and device size---on parallel computing machines. Although we fully expect an innovative and very gradual evolution to the limiting situation, we take here the provocative view of exploring the consequences of the accomplished attainment of the physical bounds. The main result is that scalability holds only for neighborly interconnections, such as the square mesh, of bounded-size synchronous modules, presumably of the area-universal type. We also discuss the ultimate infeasibility of latency hiding, the violation of intuitive maximal speedups, and the emerging novel processor-time tradeoffs.
Block Data Decomposition for Data-Parallel Programming on a Heterogeneous Workstation Network
, 1993
Cited by 36 (10 self)
We present a block data decomposition algorithm for two-dimensional grid problems. Our method includes load balancing to accommodate heterogeneous processors, and we characterize the conditions that must be met for our partitioning strategy to be of value. While we concentrate on the workstation network model of parallel processing because of its high communication costs and inherent heterogeneity, our method is applicable to other parallel architectures. 1 Introduction The concept of the hypercomputer, a virtual parallel machine formed from a network of workstations [4], has made parallel processing available in a wide range of settings. Workstation networks have become commonplace in scientific, academic, and business environments due mainly to their relatively low cost and general-purpose applicability. The current performance capabilities of workstations make them attractive alternatives to expensive specialized machines for many parallel processing applications. Parallel processi...
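To make the idea of load-balanced block decomposition concrete, here is a minimal sketch that splits the rows of a two-dimensional grid into contiguous strips sized in proportion to each processor's relative speed. It is not the algorithm from the paper, whose details are not given in the abstract; the function names, the one-dimensional strip layout, and the largest-remainder rounding are all assumptions made for illustration.

```python
def proportional_rows(n_rows, speeds):
    """Return one row count per processor, proportional to its speed.

    Uses largest-remainder rounding so the counts always sum to n_rows.
    `speeds` holds hypothetical relative speeds (any positive numbers).
    """
    if n_rows < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need n_rows >= 0 and positive speeds")
    total = sum(speeds)
    exact = [n_rows * s / total for s in speeds]
    counts = [int(x) for x in exact]  # floor of each ideal share
    leftover = n_rows - sum(counts)
    # Hand the remaining rows to the largest fractional parts first.
    order = sorted(range(len(speeds)),
                   key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def row_blocks(n_rows, speeds):
    """Return half-open (start, stop) row ranges, one block per processor."""
    blocks, start = [], 0
    for count in proportional_rows(n_rows, speeds):
        blocks.append((start, start + count))
        start += count
    return blocks


if __name__ == "__main__":
    # Three workstations, the third assumed twice as fast as the others.
    print(row_blocks(100, [1, 1, 2]))
```

Contiguous strips keep each processor's boundary exchange limited to its neighbours, which matters when communication is expensive, as the abstract notes for workstation networks.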
The KSR1: Experimentation and Modeling of Poststore
, 1993
Cited by 29 (3 self)
Kendall Square Research introduced the KSR1 system in 1991. The architecture is based on a ring of rings of 64-bit microprocessors. It is a distributed, shared memory system and is scalable. The memory structure is unique and is the key to understanding the system. Different levels of caching eliminate physical memory addressing and lead to the ALLCACHE(TM) scheme. Since requested data may be found in any of several caches, the initial access time is variable. However, once pulled into the local (sub)cache, subsequent access times are fixed and minimal. Thus, the KSR1 is a Cache-Only Memory Architecture (COMA) system. This paper describes experimentation and an analytic model of the KSR1. The focus is on the poststore programmer option. With the poststore option, the programmer can elect to broadcast the updated value of a variable to all processors that might have a copy. This may save time for threads on other processors, but delays the broadcasting thread and places additional t...
Parallel Logic Programming Systems
- Computing Surveys
, 1994
Cited by 29 (0 self)
Parallelizing logic programming has attracted much interest in the research community, because of the intrinsic OR- and AND-parallelisms of logic programs. One research stream aims at transparent exploitation of parallelism in existing logic programming languages such as Prolog, while the family of concurrent logic languages develops language constructs allowing programmers to express the concurrency (that is, the communication and synchronization between parallel processes) within their algorithms. This article concentrates mainly on transparent exploitation of parallelism and surveys the most mature solutions to the problems to be solved in order to obtain efficient implementations. These solutions have been implemented, and the most efficient parallel logic programming systems reach effective speedups over state-of-the-art sequential Prolog implementations. The article also addresses current and prospective research issues in extending the applicability and the efficiency of existing systems, such as models merging the transparent parallelism and the concurrent logic languages approaches, combination of constraint logic programming with parallelism, and use of highly parallel architectures.
An Empirical Comparison of the Kendall Square Research KSR-1 and Stanford DASH Multiprocessors
, 1993
Cited by 26 (3 self)
Two interesting variants of large-scale shared-address-space parallel architectures are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). Both have distributed main memory and use directory-based cache coherence. While both architectures migrate and replicate data at the cache level automatically under hardware control, COMA machines do this at the main memory level as well. Previous work had discussed the general advantages and disadvantages of the two types of architectures, and presented results comparing the performance of small problems on simulated architectures of the two types. In this paper, we compare the parallel performance of a recent realization of each type of architecture---the Stanford DASH multiprocessor (CC-NUMA) and the Kendall Square Research KSR-1 (COMA). Using a suite of important computational kernels and complete scientific applications, we examine performance differences resulting both from the CC-NUMA/COMA ...
Reform Prolog: The Language and its Implementation
- In Proc. of the 10th Int'l Conference on Logic Programming
, 1993
Cited by 26 (3 self)
Reform Prolog is a (dependent) AND-parallel system based on recursion parallelism and Reform compilation. The system supports selective, user-declared parallelization of binding-deterministic Prolog programs (nondeterminism local to each parallel process is allowed). The implementation extends a conventional Prolog machine with support for data sharing and process management. Extensive global dataflow analysis is employed to facilitate parallelization. Promising performance figures, showing high parallel efficiency and low overhead for parallelization, have been obtained on a 24-processor shared-memory multiprocessor. The high performance is due to efficient process management and scheduling, made possible by the execution model. 1 INTRODUCTION Most systems for AND-parallel logic programming define the procedural meaning of conjunction to be inherently parallel. These designs are based on an ambition to maximize the amount of parallelism in computations. We present and evaluate an approa...
The Design of the Caltech Mosaic C Multicomputer
- In Research on Integrated Systems Symposium Proceedings
, 1993
Cited by 23 (0 self)
The Caltech Mosaic C is an experimental, fine-grain multicomputer that employs single-chip nodes and advanced packaging technology to demonstrate the performance/cost advantages of the fine-grain-multicomputer architecture. Each Mosaic node includes 64KB of single-clock-cycle dynamic RAM; 2KB of self-test and bootstrap ROM; an 11MIPS processor; a packet interface; and a 60MB/s, two-dimensional, self-timed router. The node is a single, 9.25 mm × 10.00 mm, 1.2 µm-feature-size, CMOS chip that, at Vdd = 5V, operates at 30MHz and dissipates 0.5W. These chips are packaged by tape-automated bonding (TAB) in 8×8 arrays on circuit boards that can, in turn, be composed in two dimensions to construct arbitrarily large arrays of nodes. In addition to the 8×8 boards, complete Mosaic systems require host-interface boards that allow workstations to send and receive packets on Mosaic channels, and high-bandwidth cables. The Mosaic host-interface boards are built using memor...
Locality and Loop Scheduling on NUMA Multiprocessors
- in Proceedings of the 1993 International Conference on Parallel Processing
, 1993
Cited by 23 (1 self)
An important issue in the parallel execution of loops is how to partition and schedule the loops onto the available processors. While most existing dynamic scheduling algorithms manage load imbalances well, they fail to take locality into account and therefore perform poorly on parallel systems with non-uniform memory access times. In this paper, we propose a new loop scheduling algorithm, Locality-based Dynamic Scheduling (LDS), that exploits locality, and dynamically balances the load. Key Words: Locality, Loop Scheduling, NUMA Multiprocessors, Data Partitioning, Locality-based Dynamic Scheduling. 1 Introduction Loops are a major source of parallelism for today's parallelizing compilers. An important issue in the parallel execution of loops is how to partition and schedule the loops onto the available processors. A number of algorithms have been proposed for this purpose. For example, static scheduling algorithms such as block, cyclic, and block-cyclic scheduling, partition the loo...
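The three static schedules named in the abstract above can be sketched as follows. These are the usual textbook definitions rather than anything specific to this paper, and they do not implement LDS, whose details are not given here; the function names and the block-size parameter b are assumptions made for illustration.

```python
def block_schedule(n, p):
    """Give each of p processors one contiguous chunk of about n/p iterations."""
    chunk = -(-n // p)  # ceiling division
    return [list(range(k * chunk, min((k + 1) * chunk, n))) for k in range(p)]


def cyclic_schedule(n, p):
    """Deal iterations out one at a time: iteration i goes to processor i % p."""
    return [list(range(k, n, p)) for k in range(p)]


def block_cyclic_schedule(n, p, b):
    """Deal out blocks of b consecutive iterations in round-robin order."""
    if b < 1:
        raise ValueError("block size b must be at least 1")
    assignment = [[] for _ in range(p)]
    for i in range(n):
        assignment[(i // b) % p].append(i)
    return assignment


if __name__ == "__main__":
    print(block_schedule(10, 3))
    print(cyclic_schedule(10, 3))
    print(block_cyclic_schedule(10, 3, 2))
```

All three assignments are fixed before the loop runs, so they preserve whatever locality the iteration-to-data mapping has but cannot react to load imbalance at run time, which is the trade-off against dynamic schedulers that the abstract describes.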

