Results 11 -
16 of
16
Configurable Collective Communication in LAM-MPI
"... Abstract. In an earlier paper, we observed that PastSet (our experimental tuple space system) was 1.83 times faster on global reductions than LAM-MPI. Our hypothesis was that this was due to the better resource usage of the PATHS framework (an extension to PastSet that supports orchestration and con ..."
Abstract
- Add to MetaCart
Abstract. In an earlier paper, we observed that PastSet (our experimental tuple space system) was 1.83 times faster on global reductions than LAM-MPI. Our hypothesis was that this was due to the better resource usage of the PATHS framework (an extension to PastSet that supports orchestration and configuration) due to a mapping of the communication and operations which matched the computing resources and cluster topology better. This paper reports on an experiment to verify this and represents ongoing work to add some of the same configurability of PastSet and PATHS to MPI. We show that by adding run-time configurable collective communication, we can reduce the latencies without recompiling the application source code. For the same cluster where we experienced the faster PastSet, we show that Allreduce with our configuration mechanism is 1.79 times faster than the original LAM-MPI Allreduce. We also experiment with the configuration mechanism on 3 different cluster platforms with 2-, 4-, and 8-way nodes. For the cluster of 8-way nodes, we show an improvement by a factor of 1.98 for Allreduce. 1
Evaluating the Performance of the Allreduce Collective Operation on Clusters: Approach and Results
"... The performance of the collective operations provided by a communication library is important for many applications run on clusters. The communication structure of collective operations can be organized as a tree. Performance can be improved by configuring and mapping the tree to the clusters in use ..."
Abstract
- Add to MetaCart
The performance of the collective operations provided by a communication library is important for many applications run on clusters. The communication structure of collective operations can be organized as a tree. Performance can be improved by configuring and mapping the tree to the clusters in use. We describe and demonstrate an approach for evaluating the performance of different configurations and mappings of allreduce run on clusters of different size, consisting of single-CPU hosts, and SMPs with a different number of CPUs. A breakdown of the cost of allreduce using the best configuration on different clusters is provided. For all, the broadcast part is more expensive than the reduce part. Inter-host communication contributes more to the time per allreduce than the synchronization in the allreduce components. For the small messages sizes used (4 and 256 bytes), the time spent computing the partial reductions is insignificant. Reconfiguring hierarchy aware trees improved performance up to a factor of 1.49, by avoiding scalability problems of the components on SMPs, and by finding the right balance between available concurrency, load on ’root ’ hosts and the number of network links in a tree. Extending a tree by adding more threads, or by combining two trees does not have a negative influence on the performance of a configuration, but increasing message size does. 1
Featherweight Threads for Communication
"... Message-passing is an attractive thread coordination mechanism because it cleanly delineates points in an execution when threads communicate, and unifies synchronization and communication. To enable greater performance, asynchronous or non-blocking extensions are usually provided, that allow senders ..."
Abstract
- Add to MetaCart
Message-passing is an attractive thread coordination mechanism because it cleanly delineates points in an execution when threads communicate, and unifies synchronization and communication. To enable greater performance, asynchronous or non-blocking extensions are usually provided, that allow senders and receivers to proceed even if a matching partner is unavailable. However, realizing the potential of these techniques in practice has remained a difficult endeavor due to the associated runtime overheads. We introduce parasitic threads, a novel mechanism for expressing asynchronous computation, aimed at reducing runtime overheads. Parasitic threads are implemented as raw stack frames within the context of its host — a lightweight thread. Parasitic threads are self-scheduled and migrated based on their communication patterns. This avoids scheduling overheads and localizes interaction between parasites. We describe an implementation of parasitic threads and illustrate their utility in building efficient asynchronous primitives. We present an extensive evaluation of parasitic threads in a large collection of benchmarks, including a full fledged web-server. Evaluation is performed on two garbage collection schemes — a stopthe-world garbage collector and a split-heap parallel garbage collector — and shows that parasitic threads improve the performance in both cases. 1.
Test Suite for Evaluating Performance of Multithreaded MPI Communication
"... As parallel systems are commonly being built out of increasingly large multicore chips, application programmers are exploring the use of hybrid programming models combining MPI across nodes and multithreading within a node. Many MPI implementations, however, are just starting to support multithreade ..."
Abstract
- Add to MetaCart
As parallel systems are commonly being built out of increasingly large multicore chips, application programmers are exploring the use of hybrid programming models combining MPI across nodes and multithreading within a node. Many MPI implementations, however, are just starting to support multithreaded MPI communication, often focussing on correctness first and performance later. As a result, both users and implementers need some measure for evaluating the multithreaded performance of an MPI implementation. In this paper, we propose a number of performance tests that are motivated by typical application scenarios. These tests cover the overhead of providing the MPI THREAD MULTIPLE level of thread safety for user programs, the amount of concurrency in different threads making MPI calls, the ability to overlap communication with computation, and other features. We present performance results with this test suite on several platforms (Linux cluster,
SCALABLE COLLECTIVE MESSAGE-PASSING ALGORITHMS BY
"... Governments, universities, and companies expend vast resources building the top supercomputers. The processors and interconnect networks become faster, while the number of nodes grows exponentially. Problems of scale emerge, not least of which is collective performance. This thesis identifies and pr ..."
Abstract
- Add to MetaCart
Governments, universities, and companies expend vast resources building the top supercomputers. The processors and interconnect networks become faster, while the number of nodes grows exponentially. Problems of scale emerge, not least of which is collective performance. This thesis identifies and proposes solutions for two major scalability problems. Our first contribution is a novel algorithm for process-partitioning and remapping for exascale systems that has far better time and space scaling than known algorithms. Our evaluations predict an improvement of up to 60x for large exascale systems and arbitrary reduction in the large temporary buffer space required for generating new communicators. Our second contribution consists of several novel collective algorithms for Clos and torus networks. Known allgather, reduce-scatter, and composite algorithms for Clos networks suffer the worst congestion when the largest messages are exchanged, damaging performance. Known algorithms for torus networks use only one network port, regardless of how many are available.
Efficient Multithreaded Context ID Allocation in MPI ⋆
"... Abstract. An important aspect of support for multithreaded MPI executions is the management of communication context identifiers (IDs), which are used to associate MPI communication operations with a communicator. New communicator creation functionality in MPI 3.0 adds complexity to this core resour ..."
Abstract
- Add to MetaCart
Abstract. An important aspect of support for multithreaded MPI executions is the management of communication context identifiers (IDs), which are used to associate MPI communication operations with a communicator. New communicator creation functionality in MPI 3.0 adds complexity to this core resource management problem. We present an efficient algorithm for multithreaded context ID allocation that builds on an existing production algorithm developed to support MPI 2.2. Through this work, we have discovered a subtle concurrency bug in the existing algorithm that can result in deadlock. We correct this bug and develop methods to overcome the performance impact of deadlock prevention. We evaluate the performance of the new algorithm and prove that it is free from deadlock. 1

