Compiler Optimizations for Eliminating Barrier Synchronization
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995
Cited by 91 (13 self)
Abstract: This paper presents novel compiler optimizations for reducing synchronization overhead in compiler-parallelized scientific codes. A hybrid programming model is employed to combine the flexibility of the fork-join model with the precision and power of the single-program, multiple-data (SPMD) model. By exploiting compile-time computation partitions, communication analysis can eliminate barrier synchronization or replace it with less expensive forms of synchronization. We show computation partitions and data communication can be represented as systems of symbolic linear inequalities for high flexibility and precision. These optimizations have been implemented in the Stanford SUIF compiler. We extensively evaluate their performance using standard benchmark suites. Experimental results show barrier synchronization is reduced by 29% on average, and by several orders of magnitude for certain programs.
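The core idea of this paper (replacing a global barrier with a cheaper, more precise form of synchronization when analysis proves only one thread depends on another's writes) can be sketched in a few lines. This is a hypothetical illustration, not the SUIF implementation; the thread bodies and the `ready` event are made up for the example:

```python
import threading

# When compile-time analysis proves that only the consumer reads what the
# producer wrote, a global barrier across all threads can be replaced by a
# point-to-point event between just those two threads.
data = [0] * 4
ready = threading.Event()          # cheaper point-to-point sync, not a barrier
result = []

def producer():
    data[2] = 42                   # the only element the consumer reads
    ready.set()                    # signal only the dependent thread

def consumer():
    ready.wait()                   # wait only for the producer, not for everyone
    result.append(data[2])

t0 = threading.Thread(target=producer)
t1 = threading.Thread(target=consumer)
t1.start(); t0.start()
t0.join(); t1.join()
print(result)                      # [42]
```

The uninvolved threads never block, which is where the overhead reduction comes from.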
Customized Dynamic Load Balancing for a Network of Workstations, 1997
Cited by 82 (1 self)
Abstract: In this paper we show that different load balancing schemes are best for different applications under varying program and system parameters. Therefore, application-driven customized dynamic load balancing becomes essential for good performance. We present a hybrid compile-time and run-time modeling and decision process which selects (customizes) the best scheme, along with automatic generation of parallel code with calls to a run-time library for load balancing.
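The choice this paper customizes per application can be made concrete with a toy comparison of the two classic schemes: static block partitioning versus dynamic self-scheduling from a shared work queue. The workload, chunk size, and worker count below are invented for illustration:

```python
import queue
import threading

# An irregular per-iteration "cost" -- the situation where static and dynamic
# schemes diverge most.
work = [i % 7 for i in range(100)]

def static_partition(nworkers):
    # Static: each worker gets a fixed contiguous block of iterations.
    n = len(work)
    return [sum(work[w * n // nworkers:(w + 1) * n // nworkers])
            for w in range(nworkers)]

def dynamic_partition(nworkers, chunk=5):
    # Dynamic: workers repeatedly pull the next chunk from a shared queue,
    # so faster (or luckier) workers absorb more of the load.
    q = queue.Queue()
    for i in range(0, len(work), chunk):
        q.put(work[i:i + chunk])
    loads = [0] * nworkers

    def worker(w):
        while True:
            try:
                loads[w] += sum(q.get_nowait())
            except queue.Empty:
                return

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(nworkers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return loads

print(static_partition(4))   # fixed shares, possibly imbalanced
print(dynamic_partition(4))  # shares adapt at run time
```

Either way all 295 cost units get assigned exactly once; which scheme finishes sooner depends on the program and system parameters, which is the paper's point.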
Scheduling Multiprocessor Tasks -- An Overview
- European Journal of Operational Research, 1996
Cited by 49 (3 self)
Abstract: Multiprocessor tasks require more than one processor at the same moment of time. This relatively new concept in scheduling theory emerged with the advent of parallel computing systems. In this work we present the state of the art in multiprocessor task scheduling. We show the rationale behind the concept of multiprocessor tasks. The standard three-field notation is extended to accommodate multiprocessor tasks. The main part of the work is a presentation of the results in multiprocessor task scheduling, both for parallel and for dedicated processors.
Measuring Synchronisation and Scheduling Overheads in OpenMP, 1999
Cited by 47 (4 self)
Abstract: Overheads due to synchronisation and loop scheduling are an important factor in determining the performance of shared memory parallel programs. We present a set of benchmarks to measure these classes of overhead for language constructs in OpenMP. Results are presented for three different hardware platforms, each with its own implementation of OpenMP. Significant differences are observed, which suggest possible means of improving performance.
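The measurement technique behind benchmarks of this kind is subtraction: time a loop containing the synchronisation construct, time the same loop without it, and attribute the difference to the construct. A rough sketch of that idea (not the paper's benchmark suite; the thread count, repetition count, and delay routine are arbitrary):

```python
import threading
import time

NTHREADS, REPS = 4, 200
barrier = threading.Barrier(NTHREADS)

def delay():
    # A small fixed amount of per-iteration "work".
    x = 0
    for _ in range(50):
        x += 1
    return x

def timed(body):
    start = time.perf_counter()
    for _ in range(REPS):
        body()
    return time.perf_counter() - start

ref_time = [0.0] * NTHREADS
bar_time = [0.0] * NTHREADS

def worker(tid):
    ref_time[tid] = timed(delay)                              # reference loop
    bar_time[tid] = timed(lambda: (delay(), barrier.wait()))  # loop + barrier

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NTHREADS)]
for t in threads: t.start()
for t in threads: t.join()

# Estimated cost attributable to the barrier itself.
overhead = max(bar_time) - max(ref_time)
print(f"barrier overhead ~ {overhead / REPS * 1e6:.1f} us per barrier")
```

Timing noise means a single run is only indicative; real overhead benchmarks repeat the measurement and report statistics across runs.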
Enhancing Software DSM for Compiler-Parallelized Applications
- In Proceedings of the 11th International Parallel Processing Symposium, 1997
Cited by 46 (15 self)
Abstract: Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed-shared-memory (DSM) systems. We demonstrate such a system by combining the SUIF parallelizing compiler and the CVM software DSM. Innovations of the system include compiler-directed techniques that: 1) combine synchronization and parallelism information communication on parallel task invocation, 2) employ customized routines for evaluating reduction operations, and 3) select a hybrid update protocol that pre-sends data by flushing updates at barriers. For applications with sufficient granularity of parallelism, these optimizations yield very good speedups on eight processors on an IBM SP-2 and a DEC Alpha cluster, usually matching or exceeding the speedup of equivalent HPF and message-passing versions of each program. Based on our experimental ...
Impact of Sharing-Based Thread Placement on Multithreaded Architectures
- In Proceedings of the 21st Annual International Symposium on Computer Architecture, 1994
Cited by 44 (2 self)
Abstract: Multithreaded architectures context switch to another instruction stream to hide the latency of memory operations. Although the technique improves processor utilization, it can increase cache interference and degrade overall performance. One technique to reduce the interconnect traffic is to co-locate on the same processor threads that share data. The multi-thread sharing in the cache should reduce compulsory and invalidation misses, benefiting execution time. To test this hypothesis, we compared a variety of thread placement algorithms via trace-driven simulation of fourteen coarse- and medium-grain parallel applications on several multithreaded architectures. Our results contradict the hypothesis. Rather than decreasing, compulsory and invalidation misses remained fairly constant across all placement algorithms, for all processor configurations, even with an infinite cache. That is, sharing-based placement had no (positive) effect on execution time. Instead, load balancing was the cr...
The Effectiveness of Multiple Hardware Contexts
- In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 1994
Cited by 44 (1 self)
Abstract: Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working set can have a negative effect on cache conflict misses. In this paper we evaluate the two phenomena together, examining their combined effect on execution time. The usefulness of multiple hardware contexts depends on: program data locality, cache organization, and degree of multiprocessing. Multiple hardware contexts are most effective on programs that have been optimized for data locality. For these programs, execution time dropped with increasing contexts, over widely varying architectures. With unoptimized applications, multiple contexts had limited value. The best performance was seen with only two contexts, and only on uniprocessors and small multiprocessors. The behavior of the unoptimized applications changed more noticeably with...
Balancing Processor Loads and Exploiting Data Locality in N-Body Simulations
- In Proceedings of Supercomputing '95 (CD-ROM), 1995
Cited by 36 (14 self)
Abstract: Although N-body simulation algorithms are amenable to parallelization, performance gains from execution on parallel machines are difficult to obtain due to load imbalances caused by irregular distributions of bodies. In general, there is a tension between balancing processor loads and maintaining locality, as the dynamic re-assignment of work necessitates access to remote data. Fractiling is a dynamic scheduling scheme that simultaneously balances processor loads and maintains locality by exploiting the self-similarity properties of fractals. Fractiling is based on a probabilistic analysis, and thus accommodates load imbalances caused by predictable phenomena, such as irregular data, and unpredictable phenomena, such as data-access latencies. In experiments on a KSR1, the performance of N-body simulation codes was improved by as much as 53% by fractiling. Performance improvements were obtained on uniform and nonuniform distributions of bodies, underscoring the need for a scheduling schem...
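Fractiling belongs to a family of schemes that trade locality against balance by shrinking the unit of work over time: large chunks early (good locality, low scheduling overhead), small chunks late (so stragglers can be evened out). The sketch below shows that general decreasing-chunk pattern, not the paper's fractal tiling itself; the halving rule and parameters are invented for illustration:

```python
def decreasing_chunks(n_iters, n_procs):
    """Yield chunk sizes: each batch gives every processor an equal chunk,
    and successive batches cover half the remaining iterations."""
    remaining = n_iters
    while remaining > 0:
        chunk = max(1, remaining // (2 * n_procs))
        for _ in range(n_procs):
            if remaining == 0:
                break
            size = min(chunk, remaining)
            yield size
            remaining -= size

chunks = list(decreasing_chunks(1000, 4))
assert sum(chunks) == 1000      # every iteration is assigned exactly once
assert chunks[0] >= chunks[-1]  # chunk sizes shrink toward the end
print(chunks[:4], "...", chunks[-4:])
```

Fractiling's contribution is doing this subdivision over spatial tiles with self-similar shape, so each smaller chunk stays geometrically close to the data its processor already touched.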
Compile-time Scheduling Algorithms for Heterogeneous Network of Workstations
- The Computer Journal, 1997
Cited by 36 (2 self)
Abstract: In this paper, we study the problem of scheduling parallel loops at compile-time for a heterogeneous network of workstations. We consider heterogeneity in various aspects of parallel programming: program, processor, memory, and network. A heterogeneous program has parallel loops with a different amount of work in each iteration; heterogeneous processors have different speeds; heterogeneous memory refers to the different amount of user-available memory on the machines; and a heterogeneous network has different costs of communication between processors. We propose a simple yet comprehensive model for use in compiling for a network of processors, and develop compiler algorithms for generating optimal and ...
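The simplest instance of compile-time scheduling for heterogeneous processors is to assign each machine a share of the loop iterations proportional to its relative speed, so that all machines finish at roughly the same time. This is a minimal sketch of that idea under assumed speeds, not the paper's algorithm or model:

```python
def proportional_partition(n_iters, speeds):
    """Split n_iters among processors in proportion to their relative speeds."""
    total = sum(speeds)
    shares = [n_iters * s // total for s in speeds]
    # Distribute the leftover iterations lost to integer truncation.
    for i in range(n_iters - sum(shares)):
        shares[i % len(shares)] += 1
    return shares

speeds = [4, 2, 1, 1]            # assumed relative speeds of four workstations
shares = proportional_partition(100, speeds)
assert sum(shares) == 100        # every iteration is assigned exactly once
assert shares[0] == max(shares)  # the fastest machine gets the largest share
print(shares)                    # [51, 25, 12, 12]
```

Each machine's finish time is roughly share/speed (here about 12-13 time units for all four), which is the balance condition; the paper's model additionally accounts for memory capacity and communication cost, which this sketch ignores.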