Dyn-MPI: Supporting MPI on non dedicated clusters (extended version)
, 2003
Cited by 4 (1 self)
Distributing data is a fundamental problem in implementing efficient distributed-memory parallel programs. The problem becomes more difficult in environments where the participating nodes are not dedicated to a parallel application. We are investigating the data distribution problem in non dedicated environments in the context of explicit message-passing programs. To address this problem, we have designed and implemented an extension to MPI called Dynamic MPI (Dyn-MPI). The key component of Dyn-MPI is its run-time system, which efficiently and automatically redistributes data on the fly when there are changes in the application or the underlying environment. Dyn-MPI supports efficient memory allocation, precise measurement of system load and computation time, and node removal. Performance results show that programs that use Dyn-MPI execute efficiently in non dedicated environments, including up to almost a three-fold improvement compared to programs that do not redistribute data and a 25% improvement over standard adaptive load balancing techniques.
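The idea of redistributing data when node load changes can be illustrated with a minimal sketch. This is not Dyn-MPI's actual algorithm or API; the function names `speeds_from_timings` and `proportional_blocks` are hypothetical, and the sketch assumes a one-dimensional block distribution of rows and that each node's throughput can be estimated as rows processed per second in the last iteration.

```python
def speeds_from_timings(rows_done, seconds):
    """Estimate each node's throughput (rows per second) from the last iteration."""
    if len(rows_done) != len(seconds) or any(t <= 0 for t in seconds):
        raise ValueError("need one positive timing per node")
    return [r / t for r, t in zip(rows_done, seconds)]


def proportional_blocks(n_rows, speeds):
    """Split n_rows contiguous rows among nodes in proportion to their speeds.

    Returns one half-open (start, end) range per node.
    """
    if n_rows < 0 or not speeds or any(s <= 0 for s in speeds):
        raise ValueError("need n_rows >= 0 and positive speeds")
    total = sum(speeds)
    bounds = [0]
    acc = 0.0
    for s in speeds:
        acc += s
        # Rounding cumulative boundaries keeps the ranges contiguous.
        bounds.append(round(n_rows * acc / total))
    bounds[-1] = n_rows  # guard against floating-point drift on the last bound
    return [(bounds[i], bounds[i + 1]) for i in range(len(speeds))]


# Two nodes each held 50 rows; the second took twice as long
# (for example, because a competing job appeared on it).
speeds = speeds_from_timings([50, 50], [1.0, 2.0])
new_ranges = proportional_blocks(90, speeds)
```

A real run-time system would also have to weigh the cost of moving the data against the expected gain, which this sketch ignores.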
Impostors for Parallel Interactive Computer Graphics
, 2004
Cited by 2 (2 self)
We demonstrate an interactive parallel rendering system based on the impostors technique. Impostors increase the latency tolerance of an interactive rendering system, which allows us to use the power of a parallel machine even at high resolutions and framerates. Impostors also decrease the required rendering bandwidth, which makes possible the interactive use of a variety of advanced rendering techniques. These techniques are demonstrated by the interactive high-quality rendering of very large detailed models on large distributed-memory parallel machines.
To TRUTH, without which everybody would be lying.
Acknowledgments: This work was made possible by the efforts of hundreds of teachers and friends over the span of nearly three decades. I can only mention a few here. Thanks to my adviser, Dr. Kale, who provided me continual support and a steady stream of good ideas. May you always have enough good students to implement your grand designs. Thanks to my committee
A Multi-layer Resource Reconfiguration Framework for Grid Computing
Cited by 2 (0 self)
A grid is a non-dedicated and dynamic computing environment. Consequently, different programs have to compete with each other for the same resources, and resource availability varies over time. As a result, the performance of user programs is degraded and unpredictable. To resolve this problem, we propose a multi-layer resource reconfiguration framework for grid computing. As the name suggests, this framework adopts different resource reconfiguration mechanisms for different workloads of resources. We have implemented this framework on a grid-enabled DSM system called Teamster-G. Our experimental results show that the proposed framework allows Teamster-G not only to fully utilize abundant CPU cycles but also to minimize resource contention between the jobs of resource consumers and those of resource providers. As a result, the job throughput of Teamster-G is effectively increased.
Approaches to architecture-aware parallel scientific computation
- Williams College Department of Computer Science
, 2005
Cited by 1 (1 self)
Modern large-scale scientific computation problems must execute in a parallel computational environment to achieve acceptable performance. Target parallel environments range from the largest tightly-coupled supercomputers to heterogeneous clusters of workstations. Grid technologies make Internet execution more likely. Hierarchical and heterogeneous systems are increasingly common. Processing and communication capabilities can be nonuniform, non-dedicated, transient or unreliable. Even when targeting homogeneous computing environments, each environment may differ in the number of processors per node, the relative costs of computation, communication, and memory access, and the availability of programming paradigms and software tools. Architecture-aware computation requires knowledge of the computing environment and software performance characteristics, and tools to make use of this knowledge. These challenges may be addressed by compilers, low-level tools, dynamic load balancing or solution procedures, middleware layers, high-level software development techniques, and choice of programming languages and paradigms. Computation and communication may be reordered. Data or computation may be replicated or a load imbalance may be tolerated to avoid costly communication. This paper samples a variety of approaches to architecture-aware parallel computation.
Malleable applications for scalable high performance computing
, 2007
Cited by 1 (1 self)
Iterative applications are known to run as slow as their slowest computational component. This paper introduces malleability, a new dynamic reconfiguration strategy to overcome this limitation. Malleability is the ability to dynamically change the data size and number of computational entities in an application. Malleability can be used by middleware to autonomously reconfigure an application in response to dynamic changes in resource availability in an architecture-aware manner, allowing applications to optimize the use of multiple processors and diverse memory hierarchies in heterogeneous environments. The modular Internet Operating System (IOS) was extended to reconfigure applications autonomously using malleability. Two different iterative applications were made malleable. The first, which is used in astronomical modeling and is representative of maximum-likelihood applications, was made malleable in the SALSA programming language. The second models the diffusion of heat over a two-dimensional object, and is representative of applications such as partial differential equations and some types of distributed simulations. Versions of the heat application were made malleable both in SALSA and MPI. Algorithms for concurrent data redistribution are given for each type of application. Results show that using malleability for reconfiguration is 10 to 100 times faster on the tested environments. The algorithms are
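Changing the number of computational entities while keeping the data intact can be pictured with a small sketch. This is only an illustration of the split and merge idea, not the concurrent redistribution algorithms from the paper and not the IOS or SALSA API; the function name `resize` is hypothetical, and the sketch assumes the data is an ordered sequence that can be gathered in one place.

```python
def resize(chunks, k):
    """Re-partition ordered data held in `chunks` into k near-equal chunks.

    Increasing k corresponds to splitting entities, decreasing k to merging them.
    Element order is preserved.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    flat = [x for chunk in chunks for x in chunk]
    base, extra = divmod(len(flat), k)
    out, start = [], 0
    for i in range(k):
        # The first `extra` chunks take one additional element each.
        size = base + (1 if i < extra else 0)
        out.append(flat[start:start + size])
        start += size
    return out


two_workers = [[1, 2, 3], [4, 5, 6]]
three_workers = resize(two_workers, 3)   # split: a new resource became available
one_worker = resize(three_workers, 1)    # merge: resources were withdrawn
```

Gathering everything into one list is the simplest possible choice; a distributed version would exchange only the elements whose owner changes.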
An orchestration language for parallel objects
- In Proceedings of the Seventh Workshop on Languages, Compilers, and Run-time Support for Scalable Systems (LCR 04)
, 2004
Cited by 1 (0 self)
Charm++, a parallel object language based on the idea of virtual processors, has attained significant success in efficient parallelization of applications. Requiring the user to only decompose the computation into a large number of objects (“virtual processors” or VPs), Charm++ empowers its intelligent adaptive runtime system to assign and reassign the objects to processors at runtime. This facility is used to optimize execution, including via dynamic load balancing. Having multiple sets of VPs for distinct parts of a simulation leads to improved modularity and performance. However, it also tends to obscure the global flow of control: One must look at the code of multiple objects to discern how the sets of objects are orchestrated in a given application. In this paper, we present an orchestration notation that allows expression of Charm++ functionality without its fragmented flow of control.
Cluster Survivability with ByzwATCh: A Byzantine Hardware Fault Detector for Parallel Machines with Charm++
, 2006
Cited by 1 (1 self)
Modern high-performance computing relies heavily on the use of commodity processors arranged together in clusters. These clusters consist of individual nodes (typically off-the-shelf single or dual processor machines) connected together with a high speed interconnect. Using cluster computation has many benefits, but also carries the liability of being failure prone due to the sheer number of components involved. Many effective solutions have been proposed to aid failure recovery in clusters; however, they depend on these failures being detectable. At present, effectively detecting Byzantine faults is an open problem. We describe the operation of ByzwATCh, a module for detecting Byzantine hardware errors at run time as part of the Charm++ parallel programming framework.
Automatic Dynamic Load Balancing for a Crack Propagation Application
Cited by 1 (0 self)
Automatic, adaptive load balancing is essential for handling load imbalance that may occur during parallel finite element simulations involving mesh adaptivity, nonlinear material behavior and other localized effects. This paper demonstrates the successful application of a measurement-based dynamic load balancing concept to the finite element analysis of elasto-plastic wave propagation and dynamic fracture events. The simulations are performed with the aid of a parallel framework for unstructured meshes called ParFUM, which is based on Charm++ and Adaptive MPI (AMPI) and involves migratable user-level threads. The performance was analyzed using Projections, a performance analysis and post factum visualization tool. The bottlenecks to scalability are identified and eliminated using a variety of strategies resulting in performance gains ranging from moderate to highly significant.
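Measurement-based load balancing can be sketched as follows. This is a generic greedy heuristic chosen for illustration, not the strategy used in the paper and not the Charm++ or ParFUM API; the function name `greedy_assign` and the object identifiers are hypothetical. It assumes that the load each migratable object incurred in recent iterations has been measured and predicts its load in the next ones.

```python
import heapq


def greedy_assign(measured_load, n_procs):
    """Map objects to processors, heaviest object first, onto the least-loaded processor.

    measured_load: dict of object id -> measured compute time (seconds).
    Returns a dict of object id -> processor index.
    """
    if n_procs <= 0:
        raise ValueError("n_procs must be positive")
    # Min-heap of (accumulated load, processor index).
    heap = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending load; break ties by id so the result is deterministic.
    for obj in sorted(measured_load, key=lambda o: (-measured_load[o], o)):
        load, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (load + measured_load[obj], proc))
    return assignment


loads = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
placement = greedy_assign(loads, 2)
```

This sketch ignores migration cost and communication locality, both of which matter for unstructured meshes.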
A FRAMEWORK FOR THE DYNAMIC RECONFIGURATION OF SCIENTIFIC APPLICATIONS IN GRID ENVIRONMENTS
, 2007
Oversubscription on Multicore Processors
Cited by 1 (0 self)
Existing multicore systems already provide deep levels of thread parallelism. Hybrid programming models and composability of parallel libraries are very active areas of research within the scientific programming community. As more applications and libraries become parallel, scenarios where multiple threads compete for a core are unavoidable. In this paper we evaluate the impact of task oversubscription on the performance of MPI, OpenMP and UPC implementations of the NAS Parallel Benchmarks on UMA and NUMA multisocket architectures. We evaluate explicit thread affinity management against the default Linux load balancing and discuss sharing and partitioning system management techniques. Our results indicate that oversubscription provides beneficial effects for applications running in competitive environments. Sharing all the available cores between applications provides better throughput than explicit partitioning. Modest levels of oversubscription improve system throughput by 27% and provide better performance isolation of applications from their co-runners: best overall throughput is always observed when applications share cores and each is executed with multiple threads per core. Rather than “resource” symbiosis, our results indicate that the determining behavioral factor when applications share a system is the granularity of the synchronization operations.

