Results 1–10 of 13
Efficient, Flexible, and Typed Group Communications in Java
, 2002
Abstract

Cited by 22 (10 self)
Group communication is a crucial feature for high-performance and Grid computing. While previous works and libraries proposed such a characteristic (e.g. MPI, or object-oriented frameworks), the use of groups imposed specific constraints on programmers, for instance the use of dedicated interfaces to trigger group communications. We aim at a more flexible mechanism...
Codesign tradeoffs for high-performance, low-power linear algebra architectures
 IEEE Transactions on Computers
, 2012
Abstract

Cited by 5 (1 self)
As technology is reaching physical limits, reducing power consumption is the key issue on our path to sustained performance. In this paper, we study fundamental tradeoffs and limits in efficiency (as measured in energy per operation) that can be achieved for an important class of kernels, namely the level-3 Basic Linear Algebra Subroutines (BLAS). It is well accepted that specialization is the key to efficiency. This paper establishes a baseline by studying general matrix-matrix multiplication (GEMM) on a variety of custom and general-purpose CPU and GPU architectures. Our analysis shows that orders-of-magnitude improvements in efficiency are possible with relatively simple customizations and fine-tuning of memory hierarchy configurations. We argue that these customizations can be generalized to perform other representative linear algebra subroutines. In addition to indicating the sources of inefficiencies in current CPUs and GPUs, our results show that our prototype linear algebra processor (LAP) can achieve 600 GFLOPS of double-precision GEMM (DGEMM) while consuming less than 25 Watts in standard 45nm technology, which is up to 50× better than CPUs in terms of energy efficiency.
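The energy-per-operation claim above can be sanity-checked with a back-of-envelope calculation. This short Python sketch (the helper name is illustrative, not from the paper) converts the reported figures into GFLOPS per watt:

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency expressed as GFLOPS per watt."""
    return gflops / watts

# Figures reported in the abstract: 600 GFLOPS DGEMM at under 25 W.
lap_efficiency = gflops_per_watt(600.0, 25.0)
print(lap_efficiency)  # 24.0 GFLOPS/W
```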
Generalizing matrix multiplication for efficient computations on modern computers
 In PPAM
, 2011
Abstract

Cited by 4 (3 self)
Abstract. Recent advances in computing allow taking a new look at matrix multiplication, where the key ideas are: decreasing interest in recursion, development of processors with thousands (potentially millions) of processing units, and influences from the Algebraic Path Problem. In this context, we propose a generalized matrix-matrix multiply-add (MMA) operation and illustrate its usability. Furthermore, we elaborate on the interrelation between this generalization and the BLAS standard.
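As a minimal sketch of what such a generalization can look like, the loop below abstracts GEMM's scalar (+, ×) into a user-supplied semiring; with (min, +), the same kernel performs a step of all-pairs shortest paths, the classic Algebraic Path Problem instance. The function name and interface here are illustrative assumptions, not the paper's actual API.

```python
import math

def mma(C, A, B, oplus, otimes):
    """Generalized multiply-add: C[i][j] = oplus_k otimes(A[i][k], B[k][j]),
    accumulated into the initial contents of C. (Illustrative sketch.)"""
    for i in range(len(A)):
        for j in range(len(B[0])):
            acc = C[i][j]
            for k in range(len(B)):
                acc = oplus(acc, otimes(A[i][k], B[k][j]))
            C[i][j] = acc
    return C

# Ordinary GEMM over the (+, *) semiring:
C = [[0, 0], [0, 0]]
mma(C, [[1, 2], [3, 4]], [[5, 6], [7, 8]],
    lambda x, y: x + y, lambda x, y: x * y)   # C == [[19, 22], [43, 50]]

# Shortest paths over the (min, +) "tropical" semiring: squaring the
# adjacency matrix D yields shortest distances using at most two edges.
INF = math.inf
D = [[0, 3, INF], [INF, 0, 4], [INF, INF, 0]]
S = [[INF] * 3 for _ in range(3)]
mma(S, D, D, min, lambda x, y: x + y)         # S[0][2] == 7 (path 0 -> 1 -> 2)
```

Only the accumulation operators change between the two calls; the loop structure, and hence the hardware mapping, stays identical, which is the point of the generalization.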
The Optimal Effectiveness Metric for Parallel Application Analysis
 In Special Issue on Parallel Models
, 1998
Abstract

Cited by 2 (1 self)
This paper discusses a scalability metric based on the cost effectiveness of parallel algorithms. Unlike other scalability measures, this metric can be used to compare different parallel algorithms and identify specific conditions of problem size and processor allocation that characterize "crossover" points and intervals where one algorithm becomes more cost effective than another. Finally, this paper presents a series of examples to illustrate the measurement methodology in practice.

1 Introduction
The measurement of parallel applications is of significant interest to the evaluation and categorization of various parallel algorithms. This paper argues that a useful metric for parallel algorithm analysis should be consistent, quantitative, predictive, and relevant. A metric is consistent if independent researchers analyzing the same algorithm on the same architecture will arrive at similar conclusions. A metric is quantitative if it can be used to quantify the benefit of disparate algo...
On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators
Abstract

Cited by 1 (0 self)
Abstract—Reducing power consumption and increasing efficiency are key concerns for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power-hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture codesign for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK and matrix transposition. We compare a broadcast-bus-based architecture with conventional SIMD, 2D-SIMD and flat register file architectures for these operations in terms of area and energy efficiency. Results show that fast broadcast data movement abilities in a prototypical linear algebra core can achieve up to 75× better power efficiency and up to 10× better area efficiency compared to traditional SIMD architectures.
A Metric for Parallel Poly-Algorithm Design
, 1997
Abstract

Cited by 1 (0 self)
This paper discusses a scalability metric based on the cost effectiveness of parallel algorithms. Unlike other scalability measures, this metric can be used to compare different parallel algorithms and identify specific conditions of problem size and processor allocation that characterize "crossover" points and intervals where one algorithm becomes more cost effective than another. Finally, this paper presents a series of examples to illustrate the measurement methodology in practice.

1 Introduction
Consider the development of an algorithm that multiplies matrices. Of the many algorithms that might be employed, two of the most popular methods are the naive algorithm and the Strassen algorithm. Asymptotically, the naive algorithm is O(n^3) while the Strassen algorithm is O(n^2.81). Although the Strassen algorithm is asymptotically better than the naive algorithm, the setup cost of the Strassen algorithm makes it inefficient for small matrices. An optimal algorithm might employ bo...
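The crossover idea in this introduction can be illustrated with a small poly-algorithm that dispatches between the two methods based on problem size. This is a hedged sketch, not the paper's code: the crossover value of 128 is a hypothetical placeholder that would in practice be measured empirically, and for simplicity the recursion falls back to the naive kernel for odd dimensions.

```python
import numpy as np

CROSSOVER = 128  # hypothetical crossover size; found empirically in practice

def naive_matmul(A, B):
    """O(n^3) schoolbook multiplication (delegated to NumPy)."""
    return A @ B

def strassen(A, B):
    """Strassen's O(n^2.81) recursion, switching to the naive kernel
    below the crossover point (and for odd sizes, for simplicity)."""
    n = A.shape[0]
    if n <= CROSSOVER or n % 2:
        return naive_matmul(A, B)
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's seven products (instead of the naive eight):
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

The metric discussed in the abstract would identify where to place the `CROSSOVER` constant: below it the naive algorithm is more cost effective, above it Strassen's reduced operation count wins.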
The Parallel Mathematical Libraries Project (PMLP): Overview, Design Innovations, and Preliminary Results
 In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing
, 1999
Abstract
In this paper, we present a new, parallel, mathematical library suite for sparse matrices. The Parallel Mathematical Libraries Project (PMLP), a joint effort of Intel, Lawrence Livermore National Laboratory, the Russian Federal Nuclear Laboratory (VNIIEF), and Mississippi State University (MSU), constitutes a concerted effort to create a supportable, comprehensive "Sparse Object-oriented Mathematical Library Suite." With overall design and software validation work at MSU, most software development and testing at VNIIEF, and logistics and other miscellaneous support provided by LLNL and Intel, this international collaboration brings object-oriented programming techniques and C++ to the task of providing linear and nonlinear algebraic-oriented algorithms for scientists and engineers. Language bindings for C, Fortran 77, and C++ are provided.
Component-Based Derivation of a Parallel Stiff ODE Solver Implemented in a Cluster of Computers
, 2000
Abstract
A component-based methodological approach to derive distributed implementations of parallel ODE solvers is proposed. The proposal is based on the incorporation of explicit constructs for performance polymorphism into a methodology to derive group parallel programs of numerical methods from SPMD modules. These constructs enable the structuring of the derivation process into clearly defined steps, each one associated with a different type of optimization. The approach makes it possible to obtain a flexible tuning of a parallel ODE solver for several execution contexts and applications. Following this methodological approach, a relevant parallel numerical scheme for solving stiff ODEs has been optimized and implemented on a PC cluster. This numerical scheme is obtained from a Radau IIA Implicit Runge–Kutta method and exhibits a high degree of potential parallelism. Several numerical experiments have been performed using several test problems with different structural characteristics. These experiments show satisfactory speedup results.
KEY WORDS: Component-based software development; numerical algorithms with multilevel parallelism; parallel linear algebra libraries; stiff ordinary differential equations; distributed memory machines.
GENERALIZED MATRIX MULTIPLICATION AND ITS OBJECT-ORIENTED MODEL
Abstract
Since the beginning of the 21st century, we observe rapid changes in the area of, broadly understood, computational sciences. One of the interesting effects of these changes is the need for re-evaluation of the role of dense matrix multiplication. The aim of this paper is twofold. First, to summarize developments that point toward a need for reconsidering the usefulness of matrix multiplication generalized on the basis of the theory of algebraic semirings. Second, to propose a generalized matrix-matrix multiply-and-update (MMU) operation and its object-oriented model.
Key words: matrix multiplication, algebraic semirings, algebraic path problem
AMS subject classifications: 65F30, 13A99