Results 1 - 10 of 27
Efficient Support for Irregular Applications on Distributed-Memory Machines
, 1995
"... Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and a ..."
Abstract - Cited by 90 (13 self)
Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance. This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running in three alternative systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, Transparent Shared Memory (TSM), and eXtensible Shared Memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were somewhat (13%) to significantly (991%) faster than TSM.
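The kernels these systems target are dominated by data-dependent indirection. The following minimal C sketch (array names and sizes are illustrative, not taken from the paper) shows why: which element of y each iteration updates is only known once the edge list is read at run time, so partitioning, mutual exclusion for the updates, and data transfer cannot be planned statically.

    /* Minimal sketch of the kind of irregular kernel these systems target.
     * The access pattern through the index array `edge` is unknown until run
     * time, so a compiler cannot statically partition data or schedule
     * communication; names and sizes here are illustrative. */
    #include <stdio.h>

    #define NNODES 8
    #define NEDGES 10

    int main(void) {
        double x[NNODES] = {1, 2, 3, 4, 5, 6, 7, 8};
        double y[NNODES] = {0};
        /* edge[i][0] -> edge[i][1]: irregular connectivity, read from input in practice */
        int edge[NEDGES][2] = {{0,1},{1,2},{2,3},{3,4},{4,5},{5,6},{6,7},{7,0},{0,4},{2,6}};

        for (int i = 0; i < NEDGES; i++) {
            /* indirect access: which y element is written depends on run-time data */
            y[edge[i][1]] += x[edge[i][0]];
        }

        for (int i = 0; i < NNODES; i++)
            printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }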
Interprocedural array regions analyses
, 1995
"... In order to perform powerful program optimizations, an exact interprocedural analysis of array data ow is needed. For that purpose, two new types of array region are introduced. IN and OUT regions represent the sets of array elements, the values of which are imported to or exported from the current ..."
Abstract - Cited by 73 (9 self)
In order to perform powerful program optimizations, an exact interprocedural analysis of array data flow is needed. For that purpose, two new types of array region are introduced. IN and OUT regions represent the sets of array elements whose values are imported into or exported from the current statement or procedure. Among the various applications are the compilation of communications for message-passing machines, array privatization, and compile-time optimization of local memory or cache behavior in hierarchical memory machines.
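The notion of IN and OUT regions can be made concrete with a small C example; the code and the region summaries in its comments are an illustrative sketch, not taken from the paper.

    /* Illustrative sketch of IN and OUT array regions for one call site.
     * An interprocedural region analysis would summarize:
     *   IN(update)  = { a[i] | 2 <= i <= 5 }  -- values read by update before being written
     *   OUT(update) = { a[i] | 2 <= i <= 5 }  -- values written by update and used afterwards
     * Such summaries drive array privatization and communication generation. */
    #include <stdio.h>

    static void update(double a[], int lo, int hi) {
        for (int i = lo; i <= hi; i++)
            a[i] = a[i] * 2.0;           /* reads, then overwrites, a[lo..hi] */
    }

    int main(void) {
        double a[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        update(a, 2, 5);                 /* call site: lo = 2, hi = 5 */
        for (int i = 2; i <= 5; i++)     /* later uses make a[2..5] the OUT region */
            printf("a[%d] = %g\n", i, a[i]);
        return 0;
    }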
Enhancing Software DSM for Compiler-Parallelized Applications
- In Proceedings of the 11th International Parallel Processing Symposium
, 1997
"... Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed-shared-memory (DSM) systems. We demonstrate such ..."
Abstract - Cited by 46 (15 self)
Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed-shared-memory (DSM) systems. We demonstrate such a system by combining the SUIF parallelizing compiler and the CVM software DSM. Innovations of the system include compiler-directed techniques that: 1) combine synchronization and parallelism information communication on parallel task invocation, 2) employ customized routines for evaluating reduction operations, and 3) select a hybrid update protocol that pre-sends data by flushing updates at barriers. For applications with sufficient granularity of parallelism, these optimizations yield very good speedups on eight processors of an IBM SP-2 and a DEC Alpha cluster, usually matching or exceeding the speedup of equivalent HPF and message-passing versions of each program. Based on our experimental ...
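The reduction optimization can be illustrated with a hedged sketch: each processor accumulates into a private partial result and the partials are combined once at the synchronization point, instead of having every processor repeatedly update one shared location through the DSM's coherence protocol. The sketch below uses plain POSIX threads and invented names; it does not show CVM's actual runtime interface.

    /* Hedged sketch of a customized reduction: per-thread partials, combined once. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000

    static double data[N];
    static double partial[NTHREADS];

    static void *worker(void *arg) {
        int id = *(int *)arg;
        double local = 0.0;
        for (int i = id; i < N; i += NTHREADS)   /* cyclic partition of the iterations */
            local += data[i];
        partial[id] = local;                      /* one write per thread, no contention */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < N; i++) data[i] = 1.0;

        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, worker, &ids[i]);
        }
        double sum = 0.0;
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);             /* acts as the combine point */
            sum += partial[i];
        }
        printf("sum = %g\n", sum);                /* expect 1000 */
        return 0;
    }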
Strings: A High-Performance Distributed Shared Memory for Symmetrical Multiprocessor Clusters
- in Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing
, 1998
"... This paper describes Strings, a multi-threaded DSM developed by us. The distinguishing feature of Strings is that it incorporates Posix1.c threads multiplexed on kernel light-weight processes for better performance. The kernel can schedule multiple threads across multiple processors using these ligh ..."
Abstract - Cited by 21 (11 self)
This paper describes Strings, a multi-threaded DSM developed by us. The distinguishing feature of Strings is that it incorporates POSIX.1c threads multiplexed on kernel light-weight processes for better performance. The kernel can schedule multiple threads across multiple processors using these lightweight processes. Thus, Strings is designed to exploit data parallelism at the application level and task parallelism at the DSM system level. We show how using multiple kernel threads can improve performance even in the presence of false sharing, using matrix multiplication as a case study. We also show performance results with benchmark programs from the SPLASH-2 suite [17]. Though similar work has been demonstrated with SoftFLASH [18], our implementation is completely in user space and thus more portable. Other research has studied the effect of clustering in SMPs using simulations [19]. We have shown results from runs on an actual network of SMPs.
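The following is a minimal sketch, not the Strings API, of the application-level data parallelism described above: each POSIX thread computes a block of rows of C = A * B, and on a system like Strings such threads would be multiplexed onto kernel lightweight processes so they can run on different processors of an SMP node. Matrix size and thread count are illustrative.

    /* Row-blocked matrix multiply across POSIX threads (illustrative sizes). */
    #include <pthread.h>
    #include <stdio.h>

    #define N 64
    #define NTHREADS 4

    static double A[N][N], B[N][N], C[N][N];

    static void *mm_rows(void *arg) {
        int id = *(int *)arg;
        int lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        for (int i = lo; i < hi; i++)            /* each thread owns a block of rows */
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }

        for (int p = 0; p < NTHREADS; p++) {
            ids[p] = p;
            pthread_create(&t[p], NULL, mm_rows, &ids[p]);
        }
        for (int p = 0; p < NTHREADS; p++)
            pthread_join(t[p], NULL);

        printf("C[0][0] = %g (expect %d)\n", C[0][0], N);
        return 0;
    }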
An integrated compiler/run-time system for global data distribution in distributed shared memory systems
- In Second Workshop on Software Distributed Shared Memory
, 2000
"... A software distributed shared memory (DSM) provides the illusion of shared memory on a distributed-memory machine; communication occurs implicitly via page faults. For efficient execution of DSM programs, the threads and their implicitly associated data must be distributed to the nodes to balance th ..."
Abstract - Cited by 12 (4 self)
A software distributed shared memory (DSM) provides the illusion of shared memory on a distributed-memory machine; communication occurs implicitly via page faults. For efficient execution of DSM programs, the threads and their implicitly associated data must be distributed to the nodes to balance the computational workload and minimize communication due to page faults. The focus of this paper is on finding effective data distributions in DSM systems both within and across all computational phases. Our model takes into account data redistribution between phases. We have designed and implemented an integrated compiler/run-time system called SUIF-Adapt. The compiler, which is an extended version of SUIF, divides the program into phases, analyzes each, and communicates important information to the run-time system. We use an extended version of Adapt, a run-time data distribution system, to take measurements on an iteration of a loop consisting of one or more phases. It then finds the global data distribution for the loop (over a reasonable set of distributions) that leads to the best completion time. Performance results indicate that programs that use SUIF-Adapt can outperform programs with predetermined data distributions when phase behavior depends on run-time values of input data; in such cases, statically determining an effective data distribution requires (generally unavailable) prior knowledge of the run-time behavior of an application.
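The run-time selection idea can be sketched as follows, assuming a small candidate set of distributions and placeholder apply/measure hooks (none of these names are SUIF-Adapt's actual interface): time one outer-loop iteration under each candidate, then commit to the fastest for the remaining iterations.

    /* Hedged sketch of run-time distribution selection; hooks are placeholders. */
    #include <stdio.h>
    #include <time.h>

    enum dist { BLOCK, CYCLIC, BLOCK_CYCLIC, NDIST };

    /* Placeholder: redistribute the shared arrays for the given distribution. */
    static void apply_distribution(enum dist d) { (void)d; }

    /* Placeholder: one iteration of the phase sequence inside the outer loop. */
    static void run_one_iteration(void) {
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; i++) x += i;
    }

    int main(void) {
        double best_time = 1e30;
        enum dist best = BLOCK;

        for (int d = 0; d < NDIST; d++) {          /* probing iterations */
            apply_distribution((enum dist)d);
            clock_t t0 = clock();
            run_one_iteration();
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            if (secs < best_time) { best_time = secs; best = (enum dist)d; }
        }

        apply_distribution(best);                   /* commit for remaining iterations */
        printf("chose distribution %d (%.4f s per iteration)\n", best, best_time);
        return 0;
    }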
Compile-time Synchronization Optimizations for Software DSMs
, 1998
"... Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for elimi ..."
Abstract - Cited by 10 (4 self)
Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating synchronization overhead in software DSMs, developing new algorithms to handle situations found in practice. We evaluate the contributions of synchronization elimination algorithms based on 1) dependence analysis, 2) communication analysis, 3) exploiting coherence protocols in software DSMs, and 4) aggressive expansion of parallel SPMD regions. We also found suppressing expensive parallelism to be useful for one application. Experiments indicate these techniques eliminate almost all parallel task invocations, and reduce the number of barriers executed by 66% on average. On a 16 processor IBM SP-2, speedups are improved on average by 35%, and are tripled for some applications.
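A hedged sketch of the dependence-analysis case, with invented array names and a simulated processor loop: when two consecutive parallel loops use the same chunk partition, each processor reads in the second loop only the elements it wrote in the first, so the intervening barrier guards no cross-processor communication and can be eliminated.

    /* Sketch of barrier elimination between two chunk-partitioned loops. */
    #include <stdio.h>

    #define N 16
    #define NPROCS 4

    static double a[N], b[N];

    /* Body each (simulated) processor p would execute. */
    static void phase(int p) {
        int chunk = N / NPROCS, lo = p * chunk, hi = lo + chunk;

        for (int i = lo; i < hi; i++)      /* loop 1: p writes only a[lo..hi-1] */
            a[i] = i * 2.0;

        /* barrier();  <- removable: with the same chunk partition, loop 2 reads
           only elements this processor wrote itself, so no value crosses
           processors and the barrier protects nothing. */

        for (int i = lo; i < hi; i++)      /* loop 2: p reads only a[lo..hi-1] */
            b[i] = a[i] + 1.0;
    }

    int main(void) {
        for (int p = 0; p < NPROCS; p++)   /* processors simulated sequentially */
            phase(p);
        printf("b[5] = %g\n", b[5]);       /* expect 11 */
        return 0;
    }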
Eliminating Barrier Synchronization for Compiler-Parallelized Codes on Software DSMs
- International Journal of Parallel Programming
, 1998
"... Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imb ..."
Abstract - Cited by 8 (0 self)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance ...
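The nearest-neighbor replacement can be sketched for a 1-D stencil: thread p needs only the boundary values of threads p-1 and p+1, so point-to-point signals between neighbors replace the global barrier. The sketch below uses POSIX semaphores as a stand-in for the customized synchronization; it is not the paper's runtime interface, and sizes are illustrative.

    /* Nearest-neighbor synchronization replacing a barrier for a 3-point stencil. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define N 16
    #define NTHREADS 4

    static double a[N + 2], b[N + 2];          /* one halo cell at each end */
    static sem_t ready[NTHREADS];              /* "my chunk is written" signals */

    static void *worker(void *arg) {
        int p = *(int *)arg;
        int chunk = N / NTHREADS, lo = 1 + p * chunk, hi = lo + chunk;

        for (int i = lo; i < hi; i++)          /* produce this thread's chunk */
            a[i] = i;
        sem_post(&ready[p]);                    /* one post per potential consumer */
        sem_post(&ready[p]);

        if (p > 0)            sem_wait(&ready[p - 1]);   /* wait only on neighbors */
        if (p < NTHREADS - 1) sem_wait(&ready[p + 1]);

        for (int i = lo; i < hi; i++)          /* 3-point stencil over a[] */
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int p = 0; p < NTHREADS; p++) sem_init(&ready[p], 0, 0);
        for (int p = 0; p < NTHREADS; p++) {
            ids[p] = p;
            pthread_create(&t[p], NULL, worker, &ids[p]);
        }
        for (int p = 0; p < NTHREADS; p++) pthread_join(t[p], NULL);
        printf("b[8] = %g\n", b[8]);           /* expect 8 */
        return 0;
    }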
Improving the Compiler/Software DSM Interface: Preliminary Results
- in Proceedings of the First SUIF Compiler Workshop
, 1996
"... Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed-shared-memory (DSM) systems. Preliminary results ..."
Abstract - Cited by 7 (3 self)
Current parallelizing compilers for message-passing machines only support a limited class of data-parallel applications. One method for eliminating this restriction is to combine powerful shared-memory parallelizing compilers with software distributed-shared-memory (DSM) systems. Preliminary results show simply combining the parallelizer and software DSM yields very poor performance. The compiler/software DSM interface can be improved based on relatively little compiler input by: 1) combining synchronization and parallelism information communication on parallel task invocation, 2) employing customized routines for evaluating reduction operations, and 3) selecting a hybrid update protocol to presend data by flushing updates at barriers. These optimizations yield decent speedups for program kernels, but are not sufficient for entire programs. Based on our experimental results, we point out areas where additional compiler analysis and software DSM improvements are necessary to achieve goo...
SVMview: a Performance Tuning Tool for DSM-based Parallel Computers
- IRISA, Campus de Beaulieu, 35042 Rennes Cedex
, 1996
"... This paper describes a performance tuning tool, named SVMview, for DSM-based parallel computers. SVMview is a tool for doing a post-mortem analysis of page movements generated during the execution of Fortran-S programs. This tool is able to analyze some particular phenomena such as false-sharing whi ..."
Abstract - Cited by 7 (0 self)
This paper describes a performance tuning tool, named SVMview, for DSM-based parallel computers. SVMview is a tool for post-mortem analysis of the page movements generated during the execution of Fortran-S programs. The tool is able to analyze particular phenomena such as false sharing, which occurs when several processors write to the same page simultaneously. Such behavior badly degrades the performance of DSM systems and is therefore particularly important to detect.
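Page-level false sharing is easy to see with a little arithmetic. The sketch below (illustrative numbers, not SVMview output) maps a block distribution of 1000 doubles over 4 processors onto 4 KB pages; assuming the array starts on a page boundary, processors 0-2 all write page 0 and processors 2-3 both write page 1, so those pages ping-pong between nodes even though no element is shared.

    /* Which pages does each processor's block of a double array touch? */
    #include <stdio.h>

    #define N 1000          /* array of doubles, distributed block-wise */
    #define NPROCS 4
    #define PAGE_SIZE 4096

    int main(void) {
        int chunk = N / NPROCS;
        for (int p = 0; p < NPROCS; p++) {
            long first_byte = (long)p * chunk * sizeof(double);
            long last_byte  = (long)(p + 1) * chunk * sizeof(double) - 1;
            printf("proc %d: elements %d..%d, pages %ld..%ld\n",
                   p, p * chunk, (p + 1) * chunk - 1,
                   first_byte / PAGE_SIZE, last_byte / PAGE_SIZE);
        }
        /* Assuming the array is page-aligned, the output shows procs 0, 1, and 2
         * all touching page 0 and procs 2 and 3 both touching page 1: writes from
         * different processors land on the same page, which is exactly the false
         * sharing a tool like SVMview makes visible. */
        return 0;
    }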
Reducing synchronization overhead for compiler-parallelized codes on software DSMs
- Languages and Compilers for Parallel Computing, Tenth International Workshop, LCPC'97, volume 1366 of Lecture Notes in Computer Science
, 1997
"... Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronizationand load imba ..."
Abstract - Cited by 7 (6 self)
Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to message-passing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: 1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops, 2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences, 3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 20% on average and by up to 60% for some applications.