High-Performance Parallel Programming in Java: Exploiting Native Libraries
, 1998
Cited by 67 (3 self)
With most of today's fast scientific software written in Fortran and C, Java has a lot of catching up to do. In this paper we discuss how new Java programs can capitalize on high-performance libraries written in other languages. With the help of a tool we have automatically created Java bindings for several standard libraries: MPI, BLAS, BLACS, PBLAS, and ScaLAPACK. Performance results are presented for Java versions of two benchmarks from the NPB and PARKBENCH suites on an IBM SP2 distributed memory machine using JDK and IBM's high-performance Java compiler. The results confirm that fast parallel computing in Java is indeed possible.
Experiments with Scheduling Using Simulated Annealing in a Grid Environment
, 2002
Cited by 38 (5 self)
Generating high quality schedules for distributed applications on a Computational Grid is a challenging problem.
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
- INTERN. J. HIGH PERF. COMP. APPLICATIONS
, 2005
Cited by 13 (8 self)
This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to that available when programming on a single processor. The goal of GA is to free the programmer from the low level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of the existing MPI software/libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the
Automatic Binding of Native Scientific Libraries to Java
- In Scientific Computing in Object-Oriented Parallel Environments (New
, 1997
Cited by 11 (0 self)
We have created a tool for automatically binding existing native C libraries to Java. With the aid of the Java-to-C Interface generating tool (JCI), the abundance of existing C and Fortran-77 scientific libraries can more easily be made available to Java programmers. We have applied JCI to bind MPI, PBLAS, ScaLAPACK and other libraries to Java. The approach of automatic binding ensures both portability across different platforms and full compatibility with the library specifications. In order to evaluate the performance of Java code which accesses native libraries, we have run Java versions of parallel benchmarks from the ParkBench suite. The results obtained on a distributed-memory IBM SP2 machine demonstrate the viability of our approach. 1 Introduction As a programming language, Java has the basic qualities needed for writing high-performance applications. With the maturing of compilation technology, such applications written in Java will doubtlessly appear. Since Java is a fa...
DAGuE: A generic distributed DAG engine for high performance computing
, 2010
Cited by 10 (9 self)
The frenetic development of the current architectures places a strain on the current state-of-the-art programming environments. Harnessing the full potential of such architectures has been a tremendous task for the whole scientific computing community. We present DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. The applications we consider can be represented as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size independent format that can be queried on demand to discover data dependencies, in a totally distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations, and uses a dynamic, fully distributed scheduler based on cache awareness, data locality and task priority. We demonstrate the efficiency of our approach, using several micro-benchmarks to analyze the performance of different components of the framework, and a Linear Algebra factorization as a use case.
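The idea of a problem-size independent DAG that is queried on demand can be illustrated with a minimal sketch. This is not DAGuE's actual format or scheduler: the wavefront dependency pattern, the function names `predecessors`, `successors` and `run`, and the priority rule are all assumptions chosen for illustration. The point is only that dependencies are computed from task indices when needed, so the full graph is never stored.

```python
import heapq

def predecessors(task, n):
    """Tasks that must finish before `task` on a hypothetical n x n wavefront grid."""
    i, j = task
    return [(a, b) for a, b in ((i - 1, j), (i, j - 1)) if a >= 0 and b >= 0]

def successors(task, n):
    """Tasks that consume the output of `task`, computed from indices alone."""
    i, j = task
    return [(a, b) for a, b in ((i + 1, j), (i, j + 1)) if a < n and b < n]

def run(n):
    """Execute all n*n tasks in dependency order and return that order."""
    pending = {}            # task -> unfinished predecessor count, filled lazily
    ready = [(0, (0, 0))]   # priority queue; assumed priority is i + j (lower first)
    order = []
    while ready:
        _, task = heapq.heappop(ready)
        order.append(task)  # stand-in for actually executing the task
        for succ in successors(task, n):
            if succ not in pending:
                pending[succ] = len(predecessors(succ, n))
            pending[succ] -= 1
            if pending[succ] == 0:
                del pending[succ]
                heapq.heappush(ready, (sum(succ), succ))
    return order
```

Calling `run(3)` visits all nine tasks, and the only state kept is the set of tasks whose dependencies are partly satisfied, regardless of how large `n` is.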
A Case for Asynchronous Active Memories
- Proc. ISCA 2000 Solving the Memory Wall Problem Workshop
, 2000
Cited by 9 (4 self)
One of the biggest challenges facing modern computer architects is overcoming the memory wall. Technology trends dictate that the gap between processor and memory performance is widening. Even though good cache behavior mitigates this problem to some extent, memory latency remains a critical performance bottleneck in modern high-performance processors. Although current high-speed systems have improved memory bandwidth by using heavily pipelined clocked architectures, these techniques do not improve memory latency and they burden the memory controller designer with a number of complex timing constraints. We propose to tackle several challenges facing modern memory system designers by studying asynchronous active memories: pipelined memory systems that do not use clocks for their operation. We believe that our approach addresses the shortcomings in current designs and provides the benefits of simple controller design, average-case performance, and support for non-uniform memory access times. The latter benefit is the key to transparent support for active memories.
Transporting Distributed BLAS to the Fujitsu AP3000 and VPP-300
, 1998
Cited by 7 (5 self)
The DBLAS Distributed BLAS Library is a portable version of parallel BLAS that has been highly tuned for the Fujitsu AP1000 and AP+. In this paper, we describe performance enhancements made for two very different high performance distributed memory platforms, the Fujitsu AP3000 and the Fujitsu VPP-300. Even with the provision of highly tuned (vendor-supplied) serial BLAS implementations, attention must be given to cell computation speed issues, since serial BLAS does not supply a local matrix transpose routine (which is needed in many places), nor does it supply routines to adequately handle the triangular matrices which arise in the parallel context. We will describe the differing principles used on the UltraSPARC and VPP-300 nodes to optimise memory access patterns for the local matrix transpose operation and the large matrix multiply. The former uses partitioning methods which can yield a factor of 3-4 improvement over naive methods. The latter simultaneously optimizes usage of two leve...
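A partitioned (blocked) local transpose, of the general kind the abstract mentions, might look like the sketch below. It is not the DBLAS routine: the function name `transpose_blocked`, the `block` parameter and its default are assumptions, and pure Python will not show the cache benefit. It only illustrates the access pattern, in which the matrix is walked tile by tile so that reads and writes both stay within a small region.

```python
def transpose_blocked(a, block=2):
    """Return the transpose of a rectangular list-of-lists matrix, tile by tile."""
    rows, cols = len(a), len(a[0])
    out = [[None] * rows for _ in range(cols)]
    # Outer loops pick a tile; inner loops transpose the elements inside it.
    for i0 in range(0, rows, block):
        for j0 in range(0, cols, block):
            for i in range(i0, min(i0 + block, rows)):
                for j in range(j0, min(j0 + block, cols)):
                    out[j][i] = a[i][j]
    return out
```

In a compiled implementation the tile size would typically be chosen so that a source tile and a destination tile fit in cache together; the value used here is arbitrary.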
Building and Using a Fault Tolerant MPI Implementation
- International Journal of High Performance Computing Applications
, 2004
Cited by 7 (2 self)
In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in a way beyond that of the original MPI static process model. FT-MPI allows the semantics and associated modes of failures to be explicitly controlled by an application via a modified functionality within the standard MPI 1.2 API. We give an overview of the FT-MPI semantics, architecture design, example usage and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures at multiple levels internally and how existing applications can use new features while still remaining within the MPI standard.
A Fault-Tolerant Communication Library for Grid Environments
, 2003
Cited by 7 (3 self)
With increasing numbers of processors and applications running in virtual Grid environments, application-level fault tolerance is becoming an increasingly important issue. This paper presents the semantics of a fault-tolerant version of the Message Passing Interface, the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. The architecture of FT-MPI, an implementation of MPI using these semantics, is presented, together with some tools that support end users during application development with FT-MPI. Furthermore, a performance comparison of FT-MPI with the most relevant MPI libraries is shown for point-to-point benchmarks and the High Performance Linpack Benchmark.

