Results 1 - 10 of 16
Global Communication Analysis and Optimization
- In Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation
, 1996
"... Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm is distinct from existing approaches in that rather than han ..."
Abstract
-
Cited by 55 (2 self)
- Add to MetaCart
Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm is distinct from existing approaches in that, rather than handling loop nests and array references one by one, it considers all the communications in a procedure and their interactions under different placements before making a final decision on the placement of any communication. It exploits the flexibility resulting from this advanced analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buffers, all in a unified framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive dataflow analysis on array sections can generate too many messages or cache and buffer contention. The algorithm has been implemented in the IBM pHPF compiler for High Performance Fortran.
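The per-nest versus global placement contrast can be made concrete. The following minimal C sketch is not the pHPF algorithm itself; the array sizes, the loop bodies, and the exchange_halo helper are invented for illustration. Per-loop-nest analysis conservatively refetches the halo before each nest, while global analysis proves one exchange sufficient for both:

    #include <stdio.h>

    #define N 8
    static double A[N + 2];                 /* interior plus one ghost cell per side */
    static double B[N + 2], C[N + 2];

    /* Stand-in for a halo exchange; real generated code would communicate here. */
    static void exchange_halo(double *a) { (void)a; puts("halo exchange"); }

    /* Two read-only stencils over A; neither invalidates A's ghost cells. */
    static void loop1(const double *a, double *b) {
        for (int i = 1; i <= N; i++) b[i] = a[i - 1] + a[i];
    }
    static void loop2(const double *a, double *c) {
        for (int i = 1; i <= N; i++) c[i] = a[i] + a[i + 1];
    }

    int main(void) {
        /* Per-loop-nest placement: each nest conservatively refetches the halo. */
        exchange_halo(A); loop1(A, B);
        exchange_halo(A); loop2(A, C);      /* redundant: A is unchanged in between */

        /* Global placement: one exchange is proven sufficient for both nests. */
        exchange_halo(A); loop1(A, B); loop2(A, C);
        return 0;
    }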
A bi-criteria scheduling heuristics for distributed embedded systems under reliability and real-time constraints
- In Int. Conf. on Dependable Systems and Networks, DSN’04
, 2004
"... Multi-criteria scheduling problems, involving optimiza-tion of more than one criterion, are subject to a growing interest. In this paper, we present a new bi-criteria schedul-ing heuristic for scheduling data-flow graphs of operations onto parallel heterogeneous architectures according to two criter ..."
Abstract
-
Cited by 40 (6 self)
- Add to MetaCart
(Show Context)
Multi-criteria scheduling problems, involving the optimization of more than one criterion, are attracting growing interest. In this paper, we present a new bi-criteria scheduling heuristic for scheduling data-flow graphs of operations onto parallel heterogeneous architectures according to two criteria: first, the minimization of the schedule length; second, the maximization of the system reliability. Reliability is defined as the probability that none of the system components will fail while processing. The proposed algorithm is a list scheduling heuristic based on a bi-criteria compromise function that introduces priority between the operations to be scheduled and chooses the subset of processors on which they should be scheduled. It uses active replication of operations to improve reliability. If the system reliability or the schedule length requirement is not met, a parameter of the compromise function can be changed and the algorithm re-executed; this process is iterated until both requirements are met.
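As a rough illustration of the two criteria, here is a hedged C sketch. The linear form of the compromise function and the independence assumption on processor failures are assumptions of this sketch, not necessarily the paper's exact definitions:

    #include <stdio.h>

    /* Reliability of an operation actively replicated on k processors: it fails
       only if every replica fails (failure probabilities assumed independent). */
    static double replica_reliability(const double *pfail, int k) {
        double all_fail = 1.0;
        for (int i = 0; i < k; i++) all_fail *= pfail[i];
        return 1.0 - all_fail;
    }

    /* Hypothetical compromise function: theta in [0,1] trades a normalized
       schedule-length increase against a normalized reliability loss. */
    static double compromise(double dlen, double drel, double theta) {
        return theta * dlen + (1.0 - theta) * drel;
    }

    int main(void) {
        double pfail[] = {0.01, 0.02};
        printf("reliability with 2 replicas: %f\n", replica_reliability(pfail, 2));
        printf("priority at theta = 0.5:     %f\n", compromise(0.3, 0.1, 0.5));
        return 0;
    }

Re-running the heuristic with a different theta shifts the trade-off toward shorter schedules or toward higher reliability, which is the iteration the abstract describes.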
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
- INTERN. J. HIGH PERF. COMP. APPLICATIONS
, 2005
"... This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to th ..."
Abstract
-
Cited by 36 (11 self)
- Add to MetaCart
(Show Context)
This paper describes the capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining a global index space and a programming syntax similar to what is available when programming on a single processor. The goal of GA is to free the programmer from the low-level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, the compatibility of GA with MPI enables the programmer to take advantage of existing MPI software and libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the …
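A minimal sketch of the programming model, based on the Global Arrays C interface (GA_Initialize, NGA_Create, NGA_Put, NGA_Get, GA_Sync); the array shape, the values, and the MA memory limits are invented for illustration, and a real code may need different initialization:

    #include <mpi.h>
    #include <ga.h>
    #include <macdecls.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        GA_Initialize();
        MA_init(C_DBL, 1000000, 1000000);   /* assumed memory-allocator limits */

        int dims[2] = {100, 100}, chunk[2] = {-1, -1};   /* let GA pick the layout */
        int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);

        /* Every process addresses the array through global indices, regardless
           of which process physically owns the patch being touched. */
        int lo[2] = {0, 0}, hi[2] = {9, 9}, ld[1] = {10};
        double buf[100];
        for (int i = 0; i < 100; i++) buf[i] = (double)i;
        if (GA_Nodeid() == 0) NGA_Put(g_a, lo, hi, buf, ld);
        GA_Sync();
        NGA_Get(g_a, lo, hi, buf, ld);      /* any process may read the same patch */

        GA_Destroy(g_a);
        GA_Terminate();
        MPI_Finalize();
        return 0;
    }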
An Algorithm for Automatically Obtaining Distributed and Fault-Tolerant Static Schedules
- In International Conference on Dependable Systems and Networks, DSN’03
, 2003
"... Our goal is to automatically obtain a distributed and fault-tolerant embedded system: distributed because the system must run on a distributed architecture; fault-tolerant because the system is critical. Our starting point is a source algorithm, a target distributed architecture, some distribution c ..."
Abstract
-
Cited by 30 (7 self)
- Add to MetaCart
(Show Context)
Our goal is to automatically obtain a distributed and fault-tolerant embedded system: distributed because the system must run on a distributed architecture; fault-tolerant because the system is critical. Our starting point is a source algorithm, a target distributed architecture, some distribution constraints, some indications on the execution times of the algorithm's operations on the processors of the target architecture, some indications on the communication times of the data-dependencies on the communication links of the target architecture, a number N of fail-silent processor failures that the obtained system must tolerate, and finally some real-time constraints that the obtained system must satisfy. In this article, we present a scheduling heuristic which, given all these inputs, produces a fault-tolerant, distributed, and static scheduling of the algorithm on the architecture, with an indication of whether or not the real-time constraints are satisfied. The algorithm we propose consists of a list scheduling heuristic based on an active replication strategy, which allows at least N+1 replicas of each operation to be scheduled on different processors; these replicas run in parallel, so at most N failures are tolerated. Simulation results show that, thanks to the strategy used to schedule operations, the proposed heuristic performs well both in the absence and in the presence of failures.
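The replica-placement step can be sketched as follows. This is a simplified reading of the heuristic that ignores data-dependency communications and real-time checks; the execution times are invented, and N is written as the macro NPF:

    #include <stdio.h>

    #define P   4          /* processors in the target architecture */
    #define NPF 1          /* fail-silent processor failures to tolerate */
    #define OPS 3          /* operations, assumed already sorted by priority */

    int main(void) {
        /* exec[op][p]: invented execution times of operation op on processor p */
        double exec[OPS][P] = {{2, 3, 2, 4}, {1, 2, 2, 1}, {3, 1, 2, 2}};
        double ready[P] = {0, 0, 0, 0};    /* time at which each processor is free */

        for (int op = 0; op < OPS; op++) {
            int used[P] = {0};
            /* schedule NPF+1 active replicas, each on a distinct processor,
               greedily picking the earliest finish time for each replica */
            for (int r = 0; r <= NPF; r++) {
                int best = -1;
                for (int p = 0; p < P; p++) {
                    if (used[p]) continue;
                    if (best < 0 ||
                        ready[p] + exec[op][p] < ready[best] + exec[op][best])
                        best = p;
                }
                used[best] = 1;
                ready[best] += exec[op][best];
                printf("op %d replica %d -> processor %d (ends %.1f)\n",
                       op, r, best, ready[best]);
            }
        }
        return 0;
    }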
Partial-evaluation techniques for concurrent programs
- In Proceedings of the ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation (PEPM-97). ACM SIGPLAN Notices
, 1997
"... ..."
(Show Context)
A scheduling heuristics for distributed real-time embedded systems tolerant to processor . . .
, 2004
"... ..."
Minimizing Data and Synchronization Costs in One-Way Communication
, 1998
"... In contrast to the conventional send/receive model, the one-way communication model—using Put and Synch—allows the decoupling of message transmission from synchronization. This opens up new opportunities not only to further optimize communication but also to reduce synchronization overhead. In this ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
In contrast to the conventional send/receive model, the one-way communication model (using Put and Synch) allows the decoupling of message transmission from synchronization. This opens up new opportunities not only to further optimize communication but also to reduce synchronization overhead. In this paper, we present a general technique which uses a global dataflow framework to optimize communication and synchronization in the context of the one-way communication model. Our approach works with the most general data alignments and distributions in languages like HPF, and is more powerful than other current solutions for eliminating redundant synchronization messages. Preliminary results on several scientific benchmarks demonstrate that our approach is successful in minimizing the number of data and synchronization messages.
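The paper's Put/Synch primitives are its own model, but MPI-2's one-sided operations illustrate the same decoupling of data transfer from synchronization; this ring example is invented for illustration. The MPI_Put moves the data, and only the fences synchronize, so an optimizer can move or merge the synchronization points independently of the transfers they guard:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local = (double)rank, remote = -1.0;
        MPI_Win win;
        MPI_Win_create(&remote, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                              /* synchronization */
        MPI_Put(&local, 1, MPI_DOUBLE, (rank + 1) % size,   /* data transfer   */
                0, 1, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                              /* synchronization */

        printf("rank %d received %.1f\n", rank, remote);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }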
Generation of Fault-Tolerant Static Scheduling for Real-Time Distributed Embedded Systems with Multi-Point Links
- In 21st International Conference on Distributed Computing Systems, ICDCS’01
, 2001
"... We describe a solution to automatically produce distributed and fault-tolerant code for real-time distributed embedded systems. The failures supported are processor failures, with fail-stop behavior. Our solution is grafted on the "Algorithm Architecture Adequation" method (AAA), used to o ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
We describe a solution to automatically produce distributed and fault-tolerant code for real-time distributed embedded systems. The failures supported are processor failures, with fail-stop behavior. Our solution is grafted onto the "Algorithm Architecture Adequation" (AAA) method, used to obtain distributed code automatically. The heart of AAA is a scheduling heuristic that automatically produces a static distributed schedule of a given algorithm onto a given distributed architecture. We design a new heuristic in order to obtain a static, distributed, and fault-tolerant schedule. The new heuristic schedules N supplementary replicas for each computation operation of the algorithm to be distributed, along with the corresponding communications, where N is the number of processor failures intended to be supported. At the same time, the heuristic statically computes which replica becomes the main one after each failure, such that the execution time is minimized. The analysis of this heuristic shows that it gives better results for distributed architectures using multi-point, reliable links. This solution corresponds to software-implemented fault tolerance, by means of software redundancy of the algorithm's operations and timing redundancy of communications.
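The main-replica computation can be sketched as a simple selection among the live replicas of one operation; the processors, finish times, and failure set below are invented for illustration:

    #include <stdio.h>

    int main(void) {
        /* replicas of one operation: hosting processor and static finish time */
        int    proc[]   = {0, 2, 3};
        double finish[] = {4.0, 5.5, 6.0};
        int    failed[] = {1, 0, 0, 0};   /* failed[p] != 0: processor p is down */
        int    n = 3, main_r = -1;

        /* the main replica is the earliest-finishing replica on a live processor */
        for (int r = 0; r < n; r++)
            if (!failed[proc[r]] && (main_r < 0 || finish[r] < finish[main_r]))
                main_r = r;

        printf("main replica: %d on processor %d (finish %.1f)\n",
               main_r, proc[main_r], finish[main_r]);
        return 0;
    }

Because this selection is computed statically for every possible failure set, no run-time election protocol is needed when a processor fails.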
Synchronization elimination in the deposit model
- In Proc. 1996 International Conference on Parallel Processing
, 1996
"... ..."
(Show Context)
An advanced compiler framework for non-cache-coherent multiprocessors
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 2002
"... The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit a very stable and scalable performance for a variety of application programs. Considerable evidence suggests that they are more stable and scalable than many other shared-memory multipr ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit very stable and scalable performance for a variety of application programs, and considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. However, the principal drawback of these machines is a lack of programmability, caused by the absence of the global cache coherence that is necessary to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that a remedy for this problem is advanced compiler technology. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance on this type of machine. Our experiments show that the compiler performs well for a variety of applications on the T3D and T3E, and we identified a few sophisticated techniques that could improve performance even more once they are fully implemented in the compiler.
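The hand-coding burden the abstract describes can be sketched with OpenSHMEM, a descendant of the Cray SHMEM library used on the T3D/T3E; the symmetric array and the neighbor pattern here are invented for illustration. Note how the programmer must name the owning PE explicitly in every transfer, which is exactly the bookkeeping such a compiler aims to automate:

    #include <shmem.h>
    #include <stdio.h>

    /* x lives at the same symmetric address on every PE; the programmer (or
       the compiler) must track which PE owns which piece of data. */
    static double x[4];

    int main(void) {
        shmem_init();
        int me = shmem_my_pe(), npes = shmem_n_pes();

        double local[4] = {me, me + 0.1, me + 0.2, me + 0.3};
        /* one-sided put into the right neighbor's copy of x */
        shmem_double_put(x, local, 4, (me + 1) % npes);
        shmem_barrier_all();

        printf("PE %d: x[0] = %.1f\n", me, x[0]);
        shmem_finalize();
        return 0;
    }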