Strategies to Parallelize ILP Systems
- Proceedings of the 15th International Conference on Inductive Logic Programming (ILP 2005), volume 3625 of LNAI, 2005
Cited by 7 (1 self)
It is well known among Inductive Logic Programming (ILP) practitioners that ILP systems usually take a long time to find valuable models (theories). The problem is especially critical for large datasets, preventing ILP systems from scaling up to larger applications. One approach to reducing execution time has been the parallelization of ILP systems. In this paper we survey the state of the art in parallel ILP implementations and present work on the evaluation of some major parallelization strategies for ILP. Conclusions about the applicability of each strategy are presented. Key words: Parallelism, Scaling-up
An approach to formalization and analysis of message passing libraries
- In: Proceedings of the 12th Intl. Workshop on Formal Methods for Industrial Critical Systems (FMICS), 2007
Cited by 7 (3 self)
Message passing using libraries implementing the Message Passing Interface (MPI) standard is the dominant communication mechanism in high performance computing (HPC) applications. Yet, the lack of an implementation independent formal semantics for MPI is a huge void that must be filled, especially given the fact that MPI will be implemented on novel hardware platforms in the near future. To help reason about programs that use MPI for communication, we have developed a formal TLA+ semantic definition of the point to point communication operations to augment the existing standard. The proposed semantics includes 42 MPI functions, including all 35 point to point operations, many of which have not been formally modeled previously. We also present a framework to extract models from SPMD-style C programs, so that designers may understand the semantics of MPI by exercising short, yet pithy, communication scenarios written in C/MPI. In this paper, we describe (i) the TLA+ MPI model features, such as handling the explicit memory for each process to facilitate the modeling of C pointers, and some of the widely used MPI operations, (ii) the model extraction framework and the simplifications made to the model that help facilitate explicit-state model checking of formal semantic definitions, (iii) a customized model checker for MPI that performs much faster model checking, and features a dynamic partial-order reduction algorithm whose correctness is directly based on the formal semantics, and (iv) an error trail replay facility in the Visual Studio environment. Our effort has helped identify a few omissions in the MPI reference standard document. These benefits suggest that a formal semantic definition and exploration approach as described here must accompany every future effort in creating parallel and distributed programming libraries.
Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems
- ICS06, 2006
Cited by 7 (6 self)
Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM’s Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.
Flexible collective communication tuning architecture applied to Open MPI
- In 2006 Euro PVM/MPI, 2006
Cited by 6 (0 self)
Collective communications are invaluable to modern high performance applications, although most users of these communication patterns do not always want to know their innermost workings. The implementation of the collectives is often left to the middleware developer, such as those providing an MPI library. As many of these libraries are designed to be both generic and portable, the MPI developers commonly offer internal tuning options, suitable only for knowledgeable users, that allow some level of customization. The work presented in this paper aims not only to provide a very efficient set of collective operations for use with the Open MPI implementation but also to make the control and tuning of them straightforward and flexible. Additionally, this paper demonstrates a novel example of the proposed framework's flexibility by dynamically tuning an MPI Alltoallv algorithm during runtime.
Design and Implementation of Open MPI over Quadrics/Elan4
Cited by 3 (1 self)
Open MPI is a project recently initiated to provide a fault-tolerant, multi-network capable, and production-quality implementation of the MPI-2 [20] interface based on experiences gained from the FT-MPI [8], LA-MPI [10], LAM/MPI [28], and MVAPICH [23] projects. Its initial communication architecture is layered on top of TCP/IP. In this paper, we have designed and implemented the Open MPI point-to-point layer on top of a high-end interconnect, Quadrics/Elan4 [26]. Design challenges related to dynamic process/connection management, utilizing Quadrics RDMA capabilities, and supporting asynchronous communication progression are overcome with salient strategies to utilize Quadrics Queue-based Direct Memory Access (QDMA) and Remote Direct Memory Access (RDMA) operations, along with the chained event mechanism. Experimental results indicate that the resulting point-to-point transport layer implementation is able to achieve performance comparable to that of the native Quadrics QDMA operations from which it is derived. While not taking advantage of Quadrics/Elan4 [26, 2] NIC-based tag matching due to its design requirements, this point-to-point transport layer provides a high performance implementation of MPI-2 [20] compliant message passing over Quadrics/Elan4.
An Event-driven Architecture for MPI Libraries
- In Proceedings of the 2004 Los Alamos Computer Science Institute Symposium, 2004
Cited by 3 (0 self)
Existing MPI libraries couple the progress of message transmission or reception with library invocations by the user application. Such coupling allows for simplicity of implementation, but may increase communication latency and waste CPU resources. This paper proposes the addition of an event-driven communication thread to make messaging progress in the library separately from the application thread, thus decoupling communication progress from library invocations by the application. The asynchronous event thread allows messages to be sent and received concurrently with application execution. This technique dramatically improves the responsiveness of the library to network communication. Microbenchmark results show that the time spent waiting for non-blocking receives to complete can be significantly reduced or even eliminated entirely. Application performance as measured by the NAS benchmarks shows an average performance improvement of 4.5%, with a peak improvement of 9.2%.
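The decoupling described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation and uses no MPI library: the queues `incoming` and `completed`, the `progress_thread` function, and the `STOP` token are hypothetical stand-ins for the network, the list of completed receives, and the library's shutdown signal.

```python
import queue
import threading

def progress_thread(incoming, completed, stop_token):
    # Hypothetical event loop: it drains "network" events on its own thread,
    # so message progress does not depend on calls made by the application.
    while True:
        msg = incoming.get()      # blocks until an event arrives
        if msg is stop_token:
            break
        completed.put(msg)        # mark the receive as complete

incoming = queue.Queue()    # stands in for the network (assumption)
completed = queue.Queue()   # completed receives visible to the application
STOP = object()             # sentinel used to shut the thread down

worker = threading.Thread(target=progress_thread,
                          args=(incoming, completed, STOP))
worker.start()

# Messages arrive while the application is busy computing.
for i in range(3):
    incoming.put(f"msg-{i}")

# Application work that never calls into the "library".
application_result = sum(x * x for x in range(1000))

incoming.put(STOP)
worker.join()

# By the time the application looks, the receives have already completed.
received = []
while not completed.empty():
    received.append(completed.get())
```

In this sketch the application thread performs its computation without polling for messages, which is the coupling the abstract says existing libraries impose.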
High performance RDMA protocols in HPC
- In Proceedings of EuroPVM-MPI 2006, 2006
Cited by 3 (0 self)
Modern network interconnects that leverage Remote Direct Memory Access (RDMA) and OS bypass, such as Infiniband [2], Myrinet [9], and iWARP over TCP [3], can offer significant performance advantages over conventional send/receive network semantics. However, the high performance of RDMA often comes with hidden costs. RDMA-based interconnects generally fail to provide true one-sided semantics, requiring an exchange of information prior to initiating a one-sided RDMA operation. In addition, both the initiator and target must typically preserve the physical-to-virtual memory mappings during the RDMA operation. This paper describes a unique user-level ‘pipeline’ protocol that addresses these constraints while avoiding some of the pitfalls of existing techniques. By effectively overlapping the cost of memory registration with RDMA operations, this protocol provides good performance even in the absence of memory buffer reuse. This protocol may also take advantage of memory buffers that have already been used in RDMA operations by avoiding the cost of memory registration. Through this approach, bandwidth may be increased by up to 67% when memory buffers are not effectively reused, while providing performance equal to that of existing techniques, as demonstrated by both Linpack and NPB benchmark results. Several user-level protocols are explored using Open MPI’s PML (Point to point messaging layer) and compared and contrasted with this ‘pipeline’ protocol.
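The benefit of overlapping memory registration with RDMA transfers can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the protocol from the paper: the message is assumed to be split into equal chunks, each with a fixed registration cost `reg` and transfer cost `xfer`, and the function names and the numbers used are hypothetical.

```python
def serial_time(chunks, reg, xfer):
    # Register each chunk, then transfer it, with no overlap.
    return chunks * (reg + xfer)

def pipelined_time(chunks, reg, xfer):
    # Two-stage pipeline (assumes chunks >= 1): chunk i+1 is registered
    # while chunk i is in flight, so the slower stage sets the pace.
    return reg + (chunks - 1) * max(reg, xfer) + xfer

# Illustrative values only (arbitrary time units, not measurements).
chunks, reg, xfer = 8, 2, 3
t_serial = serial_time(chunks, reg, xfer)        # 8 * (2 + 3) = 40
t_pipelined = pipelined_time(chunks, reg, xfer)  # 2 + 7 * 3 + 3 = 26
```

Under this model, only the first registration and the last transfer are fully exposed; every other registration is hidden behind a transfer whenever `reg <= xfer`.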
Formal specification of MPI 2.0: Case study in specifying a practical concurrent programming API
, 2009
Cited by 3 (3 self)
We describe the first formal specification of a non-trivial subset of MPI, the dominant communication API in high performance computing. Engineering a formal specification for a non-trivial concurrency API requires the right combination of rigor, executability, and traceability, while also serving as a smooth elaboration of a pre-existing informal specification. It also requires the modularization of reusable specification components to keep the length of the specification in check. Long-lived APIs such as MPI are not usually ‘textbook minimalistic’ because they support a diverse array of applications, a diverse community of users, and have efficient implementations over decades of computing hardware. We choose the TLA+ notation to write our specifications, and describe how we organized the specification of around 200 of the 300 MPI 2.0 functions. We detail a handful of these functions in this paper, and assess our specification with respect to the aforementioned requirements. We close with a description of possible approaches that may help render the act of writing, understanding, and validating the specifications of concurrency APIs much more productive.
Infiniband Scalability in Open MPI
- In Proceedings, 20th IEEE International Parallel & Distributed Processing Symposium, 2006
Cited by 2 (2 self)
Infiniband is becoming an important interconnect technology in high performance computing. Recent efforts in large-scale Infiniband deployments are raising scalability questions in the HPC community. Open MPI, a new open source implementation of the MPI standard targeted for production computing, provides several mechanisms to enhance Infiniband scalability. Initial comparisons with MVAPICH, the most widely used Infiniband MPI implementation, show similar performance but with much better scalability characteristics. Specifically, small message latency is improved by up to 10% in medium/large jobs and memory usage per host is reduced by as much as 300%. In addition, Open MPI provides predictable latency that is close to optimal without sacrificing bandwidth performance.
Optical Interconnect Opportunities for Future Server Memory Systems
Cited by 2 (0 self)
This paper deals with alternative server memory architecture options in multicore CPU generations using optically-attached memory systems. Thanks to its large bandwidth-distance product, optical interconnect technology enables CPUs and local memory to be placed meters away from each other without sacrificing bandwidth. This topologically-local but physically-remote main memory attached via an ultra-high-bandwidth parallel optical interconnect can lead to flexible memory architecture options using low-cost commodity memory technologies.

