Results 1 - 10 of 13
Early Experiments with the OpenMP/MPI Hybrid Programming Model
"... Abstract. The paper describes some very early experiments on new architectures that support the hybrid programming model. Our results are promising in that OpenMP threads interact with MPI as desired, allowing OpenMP-agnostic tools to be used. We explore three environments: a “typical ” Linux cluste ..."
Cited by 25 (1 self)
Abstract. The paper describes some very early experiments on new architectures that support the hybrid programming model. Our results are promising in that OpenMP threads interact with MPI as desired, allowing OpenMP-agnostic tools to be used. We explore three environments: a “typical” Linux cluster, a new large-scale machine from SiCortex, and the new IBM BG/P, which have quite different compilers and runtime systems for both OpenMP and MPI. We look at a few simple, diagnostic programs, and one “application-like” test program. We demonstrate the use of a tool that can examine the detailed sequence of events in a hybrid program and illustrate that a hybrid computation might not always proceed as expected.
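As a companion to this abstract, the sketch below shows the basic hybrid pattern such experiments exercise: each MPI rank requests a thread-support level and then opens an OpenMP parallel region. It is a minimal illustration, not code from the paper; the choice of MPI_THREAD_FUNNELED and the printed output are assumptions made for the example.

/*
 * Minimal hybrid MPI+OpenMP sketch (illustrative only): each MPI rank
 * requests MPI_THREAD_FUNNELED and spawns an OpenMP team.
 * Build with: mpicc -fopenmp hybrid_hello.c
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for FUNNELED: only the master thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        fprintf(stderr, "warning: MPI library provides only thread level %d\n", provided);

    #pragma omp parallel
    {
        /* Each OpenMP thread reports where it runs; MPI calls stay outside. */
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Run under mpiexec, each rank prints one line per OpenMP thread, which is enough to confirm that the MPI library and the OpenMP runtime coexist as the paper's diagnostic programs check.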
Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs
"... Abstract – While merge sort is well-understood in parallel algorithms theory, relatively little is known of how to implement parallel merge sort with mainstream parallel programming platforms, such as OpenMP and MPI, and run it on mainstream SMP-based systems, such as multi-core computers and multi- ..."
Cited by 6 (0 self)
Abstract. While merge sort is well-understood in parallel algorithms theory, relatively little is known about how to implement parallel merge sort with mainstream parallel programming platforms, such as OpenMP and MPI, and run it on mainstream SMP-based systems, such as multi-core computers and multi-core clusters. This is unfortunate because merge sort is not only a fast and stable sorting algorithm, but also an easy-to-understand and popular representative of the rich class of divide-and-conquer methods; a better understanding of merge sort parallelization can therefore contribute to a better understanding of divide-and-conquer parallelization in general. In this paper, we investigate three parallel merge sorts: a shared-memory merge sort that runs on SMP systems with OpenMP; a message-passing merge sort that runs on computer clusters with MPI; and a combined hybrid merge sort, using both OpenMP and MPI, that runs on clustered SMPs. We have experimented with our parallel merge sorts on a dedicated Rocks SMP cluster and on a virtual SMP cluster in the Amazon Elastic Compute Cloud. In our experiments, the shared-memory merge sort with OpenMP achieved the best speedup. We believe we are the first to concurrently experiment with, and compare, shared-memory, message-passing, and hybrid merge sorts. Our results can help in the parallelization of specific practical merge sort routines and, even more importantly, in the practical parallelization of other divide-and-conquer algorithms for mainstream SMP-based systems.
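For readers unfamiliar with the shared-memory variant discussed here, the sketch below shows one common way to parallelize merge sort with OpenMP tasks. It is not the paper's implementation; the task cutoff (4096 elements) and the helper names are illustrative choices.

/* Shared-memory merge sort using OpenMP tasks (illustrative sketch). */
#include <stdlib.h>
#include <string.h>

static void merge(int *a, int *tmp, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void msort(int *a, int *tmp, size_t lo, size_t hi)
{
    if (hi - lo < 2) return;
    size_t mid = lo + (hi - lo) / 2;
    /* Sort the first half as a task; the serial cutoff limits task creation
     * for small subarrays. The second half is sorted by the current thread. */
    #pragma omp task shared(a, tmp) if (hi - lo > 4096)
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid, hi);
    #pragma omp taskwait
    merge(a, tmp, lo, mid, hi);
}

void parallel_merge_sort(int *a, size_t n)
{
    int *tmp = malloc(n * sizeof(int));
    #pragma omp parallel
    #pragma omp single
    msort(a, tmp, 0, n);
    free(tmp);
}

A message-passing or hybrid variant would additionally scatter blocks of the array across MPI ranks and merge the sorted blocks pairwise, which is the comparison the paper carries out.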
HYBRID MESSAGE-PASSING AND SHARED-MEMORY PROGRAMMING IN A MOLECULAR DYNAMICS APPLICATION ON MULTICORE CLUSTERS
"... Hybrid programming, whereby shared-memory and mes-sage-passing programming techniques are combined within a single parallel application, has often been dis-cussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model ..."
Cited by 3 (0 self)
Hybrid programming, whereby shared-memory and message-passing programming techniques are combined within a single parallel application, has often been discussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model brings any performance benefits for clusters based on multicore processors. A molecular dynamics application has been parallelized using both MPI and hybrid MPI/OpenMP programming models. The performance of this application has been examined on two high-end multicore clusters using both Infiniband and Gigabit Ethernet interconnects. The hybrid model has been found to perform well on the higher-latency Gigabit Ethernet connection, but offers no performance benefit on low-latency Infiniband interconnects. The changes in performance are attributed to the differing communication profiles of the hybrid and MPI codes. Key words: message passing, shared memory, multicore, clusters, hybrid programming
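The sketch below illustrates the general shape of such a hybrid MPI/OpenMP decomposition, not the paper's molecular dynamics kernel: MPI ranks own disjoint blocks of particles, OpenMP threads share the local force loop, and one collective per step combines per-rank results. The toy pair force and the array names are placeholders.

/* Schematic hybrid MPI/OpenMP force step (illustrative, 1-D toy model). */
#include <mpi.h>

void compute_forces(int n_local, int n_global,
                    const double *x_global,   /* positions of all particles */
                    const double *x_local,    /* this rank's particles      */
                    double *f_local,          /* forces on local particles  */
                    double *total_energy)
{
    double e_local = 0.0;

    /* Threads split the local particles; each iteration is independent,
     * so a parallel-for with an energy reduction is sufficient. */
    #pragma omp parallel for reduction(+ : e_local)
    for (int i = 0; i < n_local; i++) {
        f_local[i] = 0.0;
        for (int j = 0; j < n_global; j++) {
            double r = x_global[j] - x_local[i];
            if (r != 0.0) {
                f_local[i] += 1.0 / (r * r);   /* toy pair force */
                e_local    += 0.5 / (r * r);
            }
        }
    }

    /* One collective per step combines the per-rank energies. */
    MPI_Allreduce(&e_local, total_energy, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

The trade-off the paper measures is between this coarse per-step communication and the finer-grained exchanges an MPI-only decomposition would need on the same node counts.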
Speeding up distributed MapReduce applications using hardware accelerators
In 38th International Conference on Parallel Processing (ICPP), 2009
"... Abstract-In an attempt to increase the performance/cost ratio, large compute clusters are becoming heterogeneous at multiple levels: from asymmetric processors, to different system architectures, operating systems and networks. Exploiting the intrinsic multi-level parallelism present in such a comp ..."
Cited by 2 (2 self)
Abstract. In an attempt to increase the performance/cost ratio, large compute clusters are becoming heterogeneous at multiple levels: from asymmetric processors to different system architectures, operating systems, and networks. Exploiting the intrinsic multi-level parallelism present in such a complex execution environment has become a challenging task with traditional parallel and distributed programming models. As a result, an increasing need for novel approaches to exploiting parallelism has arisen in these environments. MapReduce is a data-driven programming model originally proposed by Google in 2004 as a flexible alternative to the existing models, specifically aimed at hiding the complexity of both developing and running massively distributed applications in large compute clusters. In some recent work, the MapReduce model has also been used to exploit parallelism in non-distributed environments, such as multi-cores, heterogeneous processors, and GPUs. In this paper we introduce a novel approach for exploiting the heterogeneity of a Cell BE cluster by linking an existing MapReduce runtime implementation for distributed clusters with a runtime that exploits the parallelism of the Cell BE nodes. The novel contribution of this work is the design and evaluation of a MapReduce execution environment that effectively exploits the parallelism available at both the Cell BE cluster level and the heterogeneous-processor level.
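The sketch below illustrates only the MapReduce programming model mentioned in this abstract, not the Cell BE runtime the paper builds: a user supplies a map function and an associative reduce function, and a small driver applies them in parallel with OpenMP. All names here are hypothetical.

/* Tiny shared-memory illustration of the map/reduce pattern (not the paper's runtime). */
#include <stdio.h>

typedef long (*map_fn)(long);
typedef long (*reduce_fn)(long, long);

/* Apply `map` to every input element, then fold the results with `reduce`.
 * `reduce` is assumed associative and commutative with identity `init`. */
static long map_reduce(const long *in, int n, map_fn map, reduce_fn reduce, long init)
{
    long acc = init;
    #pragma omp parallel
    {
        long local = init;
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            local = reduce(local, map(in[i]));
        #pragma omp critical
        acc = reduce(acc, local);
    }
    return acc;
}

static long square(long x)      { return x * x; }
static long add(long a, long b) { return a + b; }

int main(void)
{
    long data[] = {1, 2, 3, 4, 5};
    printf("sum of squares = %ld\n", map_reduce(data, 5, square, add, 0));
    return 0;
}

The runtimes discussed in the paper distribute the map and reduce phases across cluster nodes and accelerator cores, but the user-visible contract is essentially this pair of functions.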
A Case for Kernel Level Implementation of Inter Process Communication Mechanisms
"... Abstract Distributed systems must provide some kind of inter process communication (IPC) mechanisms to enable communication between local and especially geographically dispersed and physically distributed processes. These mechanisms may be implemented at different levels of distributed systems na ..."
Abstract. Distributed systems must provide some kind of inter-process communication (IPC) mechanism to enable communication between local and, especially, geographically dispersed and physically distributed processes. These mechanisms may be implemented at different levels of a distributed system, namely at the application level, library level, operating system interface level, or kernel level. Upper-level implementations are intuitively simpler to develop but are less efficient. This paper provides hard evidence for this intuition. It considers two well-known IPC mechanisms, one implemented at the library level, called MPI, and the other implemented at the kernel level, called DIPC. It shows that computing Pi in parallel on a distributed system that uses MPI takes on average 35% longer than on the same distributed system using DIPC. It is concluded that if distributed systems are to become an appropriate platform for high-performance scientific computing of all kinds, it is necessary to try harder and implement IPC mechanisms at the kernel level, even setting aside the many other factors that favor kernel-level implementations, such as safety, privilege, reliability, and primitiveness.
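The kind of MPI-based Pi computation referred to here is typically written as a numerical integration of 4/(1+x^2) over [0,1]; a minimal version is sketched below. The interval count and the cyclic work distribution are illustrative, and the DIPC counterpart is not shown.

/* Classic MPI Pi benchmark by midpoint integration (illustrative sketch). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    const long n = 10000000;        /* number of integration intervals */
    double h, local = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    h = 1.0 / (double)n;
    for (long i = rank; i < n; i += size) {    /* cyclic work distribution */
        double x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    /* Combine partial sums on rank 0. */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %.12f\n", pi);

    MPI_Finalize();
    return 0;
}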
Using Hybrid Parallel Programming Techniques for the Computation, Assembly and Solution Stages in Finite Element Codes
2016
"... Using hybrid parallel programming techniques for the computation, assembly and solution stages in finite element codes Article in Latin American applied research Pesquisa aplicada latino americana = Investigacio ́n aplicada ..."
Article in Latin American Applied Research (Pesquisa aplicada latino americana = Investigación aplicada).
MPI at Exascale
"... With petascale systems already available, researchers are devoting their attention to the issues needed to reach the next major level in performance, namely, exascale. Explicit message passing using the Message Passing Interface (MPI) is the most commonly used model for programming petascale systems ..."
With petascale systems already available, researchers are devoting their attention to the issues that must be addressed to reach the next major level in performance, namely exascale. Explicit message passing using the Message Passing Interface (MPI) is the most commonly used model for programming petascale systems today. In this paper, we investigate what is needed to enable MPI to scale to exascale, both in the MPI specification and in MPI implementations, focusing on issues such as memory consumption and performance. We also present results of experiments related to MPI memory consumption at scale on the IBM Blue Gene/P at Argonne National Laboratory.
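One crude, Linux-specific way to observe the kind of per-process memory footprint discussed here is to read VmRSS from /proc/self/status before and after MPI_Init, as sketched below. This is only an illustration of the measurement idea, not the instrumentation used on the Blue Gene/P in the paper.

/* Report resident memory around MPI_Init (Linux /proc interface; illustrative only). */
#include <mpi.h>
#include <stdio.h>

static long vmrss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    if (!f) return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}

int main(int argc, char **argv)
{
    long before = vmrss_kb();
    MPI_Init(&argc, &argv);
    long after = vmrss_kb();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: RSS %ld kB before MPI_Init, %ld kB after\n",
           rank, before, after);

    MPI_Finalize();
    return 0;
}

Growth of this per-rank footprint with the number of ranks (for example, from internal connection or group structures) is exactly the kind of scaling issue the paper examines.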
Applications
"... Abstract – Monte Carlo simulations are extensively used in wide of application areas. Although the basic framework of these is simple, they can be extremely computationally intensive. In this paper we present a software framework partitions a generic Monte Carlo simulation into two asynchronous part ..."
Abstract. Monte Carlo simulations are extensively used in a wide range of application areas. Although the basic framework of these simulations is simple, they can be extremely computationally intensive. In this paper we present a software framework that partitions a generic Monte Carlo simulation into two asynchronous parts: (a) a threaded, GPU-accelerated pseudo-random number generator (or producer), and (b) a multi-threaded Monte Carlo application (or consumer). The advantage of this approach is that the software framework can be used directly in almost any Monte Carlo application without requiring application-specific programming of the GPU. We present an analysis of the performance of this software framework. Finally, we compare this analysis to experimental results obtained from our implementation of the framework.
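The sketch below shows the producer/consumer split described in this abstract in its simplest possible form, with a CPU thread standing in for the GPU-accelerated generator and a Monte Carlo estimate of Pi as the consumer. The batch-at-once hand-off, the buffer size, and the use of rand_r are simplifications, not the paper's framework.

/* Producer/consumer Monte Carlo sketch: one thread generates random numbers,
 * the main thread consumes them to estimate Pi (illustrative only). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N_SAMPLES 1000000

static double buffer[2 * N_SAMPLES];      /* x,y pairs produced in advance */

static void *producer(void *arg)
{
    unsigned seed = 12345;
    for (long i = 0; i < 2 * N_SAMPLES; i++)
        buffer[i] = (double)rand_r(&seed) / RAND_MAX;
    return NULL;
}

int main(void)
{
    pthread_t prod;
    long hits = 0;

    /* In the framework described above, producer and consumers overlap through
     * a shared queue; here we simply wait for the whole batch to be ready. */
    pthread_create(&prod, NULL, producer, NULL);
    pthread_join(prod, NULL);

    for (long i = 0; i < N_SAMPLES; i++) {
        double x = buffer[2 * i], y = buffer[2 * i + 1];
        if (x * x + y * y <= 1.0)
            hits++;
    }
    printf("pi ~= %f\n", 4.0 * hits / N_SAMPLES);
    return 0;
}

Replacing the producer thread with a GPU kernel that refills the buffer asynchronously, as the paper proposes, leaves the consumer code untouched, which is the portability argument made above.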
Characterizing MPI and Hybrid MPI+Threads Applications at Scale: Case Study with BFS
"... Abstract—With the increasing prominence of many-core archi-tectures and decreasing per-core resources on large supercomput-ers, a number of applications developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that ..."
Abstract. With the increasing prominence of many-core architectures and decreasing per-core resources on large supercomputers, a number of application developers are investigating the use of hybrid MPI+threads programming to utilize computational units while sharing memory. An MPI-only model that uses one MPI process per system core is capable of effectively utilizing the processing units, but it fails to fully utilize the memory hierarchy and relies on fine-grained internode communication. Hybrid MPI+threads models, on the other hand, can handle intranode parallelism more effectively and alleviate some of the overheads associated with internode communication by allowing more coarse-grained data movement between address spaces. The hybrid model, however, can suffer from locking and memory consistency overheads associated with data sharing. In this paper, we use a distributed implementation of the breadth-first search algorithm to understand the performance characteristics of MPI-only and MPI+threads models at scale. We start with a baseline MPI-only implementation and propose MPI+threads extensions in which threads independently communicate with remote processes while cooperating on local computation. We demonstrate how the coarse-grained communication of MPI+threads considerably reduces time and space overheads that grow with the number of processes. At large scale, however, these overheads constitute performance barriers for both models and require fixing the root causes, such as excessive polling for communication progress and inefficient global synchronizations. To this end, we demonstrate various techniques to reduce such overheads and show performance improvements on up to 512K cores of a Blue Gene/Q system.
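For context, the sketch below outlines a level-synchronous BFS in which OpenMP threads expand the current frontier of a CSR graph; the marked comment shows where a distributed version, like the one studied here, would exchange non-local vertices over MPI between levels. The CSR layout and the atomics are illustrative choices, not the paper's code.

/* Level-synchronous BFS over a CSR graph, with threads sharing the frontier
 * (shared-memory sketch; the distributed exchange point is marked below). */
#include <stdlib.h>

/* CSR form: the neighbors of vertex v are col[row_ptr[v] .. row_ptr[v+1]-1]. */
void bfs_levels(int n, const int *row_ptr, const int *col, int root, int *level)
{
    int *frontier = malloc(n * sizeof(int));
    int *next     = malloc(n * sizeof(int));
    int f_size = 1, depth = 0;

    for (int v = 0; v < n; v++) level[v] = -1;
    level[root] = 0;
    frontier[0] = root;

    while (f_size > 0) {
        int n_size = 0;
        /* Threads share the frontier; a compare-and-swap claims each unvisited vertex. */
        #pragma omp parallel for
        for (int i = 0; i < f_size; i++) {
            int v = frontier[i];
            for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
                int w = col[e];
                /* Distributed variant: a vertex owned by another rank would be
                 * queued for an MPI exchange here instead of visited locally. */
                if (__sync_bool_compare_and_swap(&level[w], -1, depth + 1))
                    next[__sync_fetch_and_add(&n_size, 1)] = w;
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;   /* swap frontiers */
        f_size = n_size;
        depth++;
    }
    free(frontier);
    free(next);
}

In the hybrid model discussed above, each rank runs this kind of threaded expansion over its partition while threads also drive the per-level MPI exchange, which is where the locking and progress overheads analyzed in the paper arise.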