Nonuniformly communicating noncontiguous data: A case study with PETSc and MPI
- In 21st International Parallel and Distributed Processing Symposium (IPDPS 2007)
, 2007
Cited by 6 (3 self)
Due to the complexity associated with developing parallel applications, scientists and engineers rely on high-level software libraries such as PETSc, ScaLAPACK and PESSL to ease this task. Such libraries assist application developers by providing abstractions for mathematical operations, data representation and management of parallel layouts of the data, while internally using communication libraries such as MPI and PVM. With high-level libraries managing data layout and communication internally, it can be expected that they organize application data suitably for performing the library operations optimally. However, this places additional overhead on the underlying communication library by making the data layout noncontiguous in memory and communication volumes (data transferred to each process) nonuniform. In this paper, we analyze the overheads associated with these two aspects (noncontiguous data layouts and nonuniform communication volumes) in the context of the PETSc software toolkit over the MPI communication library. We describe the issues with the current approaches used by MPICH2 (an implementation of MPI), propose different approaches to handle these issues and evaluate these approaches with microbenchmarks as well as an application over the PETSc software library. Our experimental results demonstrate close to an order of magnitude improvement in the per-...
A High Performance Message Passing System for Network of Workstations
Cited by 5 (4 self)
With the proliferation of Network of Workstations (NOW) environments, there has been a great demand for a high performance message passing system to implement High Performance Distributed Computing (HPDC) applications over NOW environments. NYNET (ATM wide area network testbed in New York state) Communication System (NCS) is a multithreaded message passing system developed at Syracuse University that provides low-latency and high-throughput communication services over the Asynchronous Transfer Mode (ATM) based HPDC environment. NCS provides a High Performance Application Communication Interface (HPI) to support applications that demand high-throughput and low-latency communication services. This paper outlines the general architecture of NCS and presents the implementation approach of NCS HPI over an ATM network. This interface has been developed by modifying Fore Systems' ATM Application Programming Interface (API) and its device driver. NCS HPI uses read/write trap routines to bypass t...
Accelerated Waveform Methods for Parallel Transient Simulation of Semiconductor Devices
- In Proceedings of the International Conference on Computer-Aided Design
, 1996
Cited by 4 (1 self)
Simulating transients in semiconductor devices involves numerically solving the time-dependent drift-diffusion equations, usually in two or three space dimensions. Because of the computation cost of these simulations, methods that perform careful domain decomposition so as to exploit parallel processing have received much recent attention. In this paper, we describe using accelerated waveform relaxation (WR) to perform parallel device transient simulation using both clusters of workstations and the IBM SP-2. The accelerated WR algorithms are compared to pointwise direct and iterative methods, and it is shown that the accelerated WR method is competitive on a single processor. In addition, it is shown that with a domain decomposition chosen for rapid iterative method convergence rather than parallel efficiency, the pointwise methods parallelize poorly but the WR method achieves near linear speedup (with respect to the number of processors) on the IBM SP-2.
Running Highly-Coupled Parallel Applications in a Computational Grid
- In: Proceedings of the 22nd Brazilian Symposium on Computer Networks
, 2004
Cited by 4 (3 self)
InteGrade is an object-oriented grid middleware infrastructure whose goal is to leverage existing computational resources in organizations. Rather than relying on dedicated hardware such as reserved clusters, InteGrade focuses on using user desktops, machines in instructional laboratories, shared workstations, as well as dedicated clusters.
Portable checkpointing and communication for BSP applications on dynamic heterogeneous Grid environments
- In SBAC-PAD’05: The 17th International Symposium on Computer Architecture and High Performance Computing (Rio de Janeiro
, 2005
Cited by 4 (2 self)
Executing long-running parallel applications in Opportunistic Grid environments composed of heterogeneous, shared user workstations is a daunting task. Machines may fail, become inaccessible, or may switch from idle to busy unexpectedly, compromising the execution of applications. A mechanism for fault-tolerance that supports these heterogeneous architectures is an important requirement for such a system. In this paper, we describe the support for fault-tolerant execution of BSP parallel applications on heterogeneous, shared workstations. A precompiler instruments application source code to save state periodically into checkpoint files. In case of failure, it is possible to recover the stored state from these files. Generated checkpoints are portable and can be recovered on a machine of a different architecture, with data representation conversions being performed at recovery time. The precompiler also modifies BSP parallel applications to allow execution on a Grid composed of machines with different architectures. We implemented a monitoring and recovery infrastructure in the InteGrade Grid middleware. Experimental results evaluate the overhead incurred and the viability of using this approach in a Grid environment.
DiscFinder: A Data-Intensive Scalable Cluster Finder for Astrophysics
Cited by 4 (2 self)
DiscFinder is a scalable approach for identifying large-scale astronomical structures, such as galaxy clusters, in massive observation and simulation astrophysics datasets. It is designed to operate on datasets with tens of billions of astronomical objects, even in the case when the dataset is much larger than the aggregate memory of the compute cluster used for the processing.
CRPC Research into Linear Algebra Software for High Performance Computers
, 1994
Cited by 4 (2 self)
In this paper we look at a number of approaches being investigated in the Center for Research on Parallel Computation (CRPC) to develop linear algebra software for high-performance computers. These approaches are exemplified by the LAPACK, templates, and ARPACK projects. LAPACK is a software library for performing dense and banded linear algebra computations, and was designed to run efficiently on high performance computers. We focus on the design of the distributed memory version of LAPACK, and on an object-oriented interface to LAPACK. The templates project aims at making the task of developing sparse linear algebra software simpler and easier. Reusable software templates are provided that the user can then customize to modify and optimize a particular algorithm, and hence build more complex applications. ARPACK is a software package for solving large scale eigenvalue problems, and is based on an implicitly restarted variant of the Arnoldi scheme. The paper focuses on issues impact...
Advanced Message Routing for Scalable Distributed Simulations
- Proceedings of the Interservice / Industry Training, Simulation and Education Conference
, 2004
Cited by 3 (2 self)
On large Linux clusters, scalability is the ability of the program to utilize additional processors in a way that provides a near-linear increase in computational capacity for each node employed. Without scalability, the cluster may cease to be useful after adding a very small number of nodes. The Joint Forces Command (JFCOM) Experimentation Directorate (J9) has recently been engaged in Joint Urban Operations (JUO) experiments and counter mortar analyses. Both required scalable codes to simulate over 1 million SAF clutter entities, using hundreds of CPUs. The JSAF application suite, utilizing the redesigned RTI-s communications system, provides the ability to run distributed simulations with sites located across the United States, from
The Grid architectural pattern: Leveraging distributed processing capabilities
- In Article 1 Pattern Languages of Program Design 5 (2005
Cited by 3 (3 self)
The Grid Middleware pattern describes the software infrastructure necessary to allow the sharing of distributed and potentially heterogeneous computational resources for execution of applications. The middleware deals with several areas, such as resource management, scheduling, and security in an efficient and transparent manner. This pattern addresses both the architecture and implementation of the middleware. Example: Weather forecasting is a typical computationally intensive problem. Briefly, data regarding the area subject to forecasting is split into smaller pieces, each one corresponding to a fraction of the total area. Each fragment is then assigned to a computing resource, typically a node on a cluster, or a processor in a parallel machine. During the computation, nodes need to exchange data, since the forecasting in each of the fragments is influenced by its neighbors. After several hours of processing, the results of the computation are expected to reflect the weather on the given area for a certain period, a few days for example. Figure 1 shows the results of a simulation where the total area was divided into 16 fragments. There are several other applications that can be broken into smaller pieces and require large
A Multithreaded Message-Passing System for High Performance Distributed Computing Applications
- in Proceedings of the IEEE 18th International Conference on Distributed Systems
, 1998
Cited by 3 (0 self)
High Performance Distributed Computing (HPDC) applications require low-latency and high-throughput communication services, and HPDC applications have different Quality of Service (QOS) requirements (e.g., bandwidth requirement, flow/error control algorithms, etc.). The communication services provided by traditional message-passing systems are fixed and thus cannot be changed to meet the requirements of different HPDC applications. NYNET (ATM wide area network testbed in New York state) Communication System (NCS) is a multithreaded message-passing system developed at Syracuse University that provides high-performance and flexible communication services. In this paper, we give an overview of the general architecture of NCS and present how NCS communication services are implemented. NCS point-to-point communication is flexible in that users can configure efficient point-to-point primitives by selecting suitable flow control, error control algorithms, and communication interfaces on a per-connection...

