Results 1–10 of 34
A class of parallel tiled linear algebra algorithms for multicore architectures
Cited by 169 (58 self)
Abstract:
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU, and QR factorizations in which the operations are represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks that completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented against the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and against vendor implementations.
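The decomposition this abstract describes can be made concrete with a short sketch. The Python below is illustrative only (not the authors' code): it generates the usual POTRF/TRSM/SYRK/GEMM task stream for a tiled Cholesky factorization, derives the dependencies among tasks by tracking which tiles each task reads and writes, and then executes any task whose prerequisites are complete, so tasks may run out of order.

```python
from collections import defaultdict

def tiled_cholesky_tasks(T):
    """Sequential task stream for a tiled Cholesky of a T x T grid of tiles.
    Each item is (task, tiles_read, tiles_written); tiles are (row, col)."""
    for k in range(T):
        yield ("POTRF", k), [(k, k)], [(k, k)]
        for i in range(k + 1, T):
            yield ("TRSM", i, k), [(k, k), (i, k)], [(i, k)]
        for i in range(k + 1, T):
            yield ("SYRK", i, k), [(i, k), (i, i)], [(i, i)]
            for j in range(k + 1, i):
                yield ("GEMM", i, j, k), [(i, k), (j, k), (i, j)], [(i, j)]

def build_dag(stream):
    """Derive task dependencies by dataflow analysis: a reader waits for the
    tile's last writer; a writer also waits for readers since that write."""
    deps, last_writer, readers = defaultdict(set), {}, defaultdict(list)
    for task, reads, writes in stream:
        for t in reads:
            if t in last_writer:
                deps[task].add(last_writer[t])
        for t in writes:
            if t in last_writer:
                deps[task].add(last_writer[t])
            deps[task].update(r for r in readers[t] if r != task)
            last_writer[t] = task
            readers[t] = []
        for t in reads:
            readers[t].append(task)
        deps.setdefault(task, set())
    return deps

def schedule(deps):
    """Run every task whose prerequisites are complete -- tasks may finish
    out of order relative to the sequential stream."""
    done, order, pending = set(), [], {t: set(d) for t, d in deps.items()}
    while pending:
        ready = [t for t, d in pending.items() if d <= done]
        assert ready, "cyclic dependencies"
        for t in ready:
            order.append(t)
            done.add(t)
            del pending[t]
    return order
```

For a 3 × 3 tile grid this yields 10 tasks; only ("POTRF", 0) is initially ready, and each TRSM, SYRK, and GEMM becomes ready as soon as its own inputs are final, independently of program order.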
Parallel tiled QR factorization for multicore architectures
, 2007
Cited by 81 (41 self)
Abstract:
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms developed, to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization in which the operations are represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks that completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization, where parallelism can only be exploited at the level of the BLAS operations.
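The tiled QR task stream the abstract refers to is commonly written with four kernels. The sketch below uses the conventional GEQRT/UNMQR/TSQRT/TSMQR names (the paper's own kernel names may differ) and only enumerates the tasks, leaving scheduling to a dependency-driven runtime of the kind described above.

```python
def tiled_qr_tasks(T):
    """Sequential task stream for a tiled QR on a T x T tile grid; kernel
    names follow the common GEQRT/UNMQR/TSQRT/TSMQR convention."""
    for k in range(T):
        yield ("GEQRT", k)                # QR of the diagonal tile A[k][k]
        for j in range(k + 1, T):
            yield ("UNMQR", k, j)         # apply Q_k^T to tile A[k][j]
        for i in range(k + 1, T):
            yield ("TSQRT", i, k)         # QR of [R_kk; A[i][k]] (triangle on square)
            for j in range(k + 1, T):
                yield ("TSMQR", i, j, k)  # update the tile pair A[k][j], A[i][j]
```

For T tiles this gives T GEQRT tasks, T(T-1)/2 each of UNMQR and TSQRT, and (T-1)T(2T-1)/6 TSMQR tasks; the TSMQR updates dominate and carry most of the exploitable parallelism.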
Towards dense linear algebra for hybrid GPU-accelerated manycore systems
, Parallel Computing
Cited by 67 (20 self)
Abstract:
We highlight the trends leading to the increased appeal of using hybrid multicore + GPU systems for high performance computing. We present a set of techniques that can be used to develop efficient dense linear algebra algorithms for these systems. We illustrate the main ideas with the development of a hybrid LU factorization algorithm in which we split the computation between a multicore processor and a graphics processor, and use particular techniques to reduce the amount of pivoting and of communication between the hybrid components. This results in an efficient algorithm with balanced use of a multicore processor and a graphics processor.
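As an illustration of the split described above (a sketch, not the authors' algorithm), here is a plain right-looking blocked LU in Python with comments marking which phase each hybrid component would typically own; pivoting is omitted entirely for brevity, whereas the paper only reduces it.

```python
def blocked_lu(A, nb):
    """In-place right-looking blocked LU without pivoting. A is an n x n
    list of lists; nb is the block size. Comments mark the natural CPU/GPU
    split of each phase in a hybrid scheme of the kind described above."""
    n = len(A)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization: small, latency-bound -> multicore CPU.
        for j in range(k, k + kb):
            for i in range(j + 1, n):
                A[i][j] /= A[j][j]
                for c in range(j + 1, k + kb):
                    A[i][c] -= A[i][j] * A[j][c]
        # Triangular solve for the U block row: U12 = L11^{-1} * A12.
        for j in range(k + 1, k + kb):
            for c in range(k + kb, n):
                for t in range(k, j):
                    A[j][c] -= A[j][t] * A[t][c]
        # Trailing-matrix update: one large matrix multiply -> GPU.
        for i in range(k + kb, n):
            for c in range(k + kb, n):
                for t in range(k, k + kb):
                    A[i][c] -= A[i][t] * A[t][c]
    return A
```

The panel step is small and control-heavy, which suits the CPU; the trailing update is a large matrix multiply per step, which is exactly what a GPU does best.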
Comparative study of one-sided factorizations with multiple software packages on multicore hardware. LAPACK Working Note 217
, 2009
Cited by 34 (19 self)
Abstract:
The emergence and continuing use of multicore architectures require changes in existing software, and sometimes even a redesign of established algorithms, in order to take advantage of the now-prevailing parallelism. Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) is a project that aims to achieve both high performance and portability across a wide range of multicore architectures. We present in this paper a comparative study of PLASMA's performance against established linear algebra packages (LAPACK and ScaLAPACK), against new approaches to parallel execution (Task Based Linear Algebra Subroutines, TBLAS), and against equivalent commercial software offerings (MKL, ESSL, and PESSL). Our experiments were conducted on one-sided linear algebra factorizations (LU, QR, and Cholesky) on multicore architectures based on Intel Xeon EMT64 and IBM Power6. The performance results show improvements brought by the new algorithms on up to 32 cores, the largest multicore system we could access.
Design of multicore sparse Cholesky factorization using DAGs
, 2010
Cited by 16 (8 self)
Abstract:
The rapid emergence of multicore machines has led to the need to design new algorithms that are efficient on these architectures. Here, we consider the solution of sparse symmetric positive-definite linear systems by Cholesky factorization. We were motivated by the successful division of the computation, in the dense case, into tasks on blocks, and by the use of a task manager to exploit all the parallelism available between these tasks, whose dependencies may be represented by a directed acyclic graph (DAG). Our sparse algorithm is built on the assembly tree and subdivides the work at each node into tasks on blocks of the Cholesky factor. The dependencies between these tasks may again be represented by a DAG. To limit memory requirements, blocks are updated directly rather than through generated-element matrices. Our algorithm is implemented within a new efficient and portable solver, HSL_MA87. It is written in Fortran 95 plus OpenMP and is available as part of the software library HSL. Using problems arising from a range of applications, we present experimental results that support our design choices and demonstrate that HSL_MA87 obtains good serial and parallel times on our 8-core test machines. Comparisons are made with existing modern solvers and show that HSL_MA87 performs well, particularly in the case of very large problems.
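A coarse sketch of the dependency structure described here, with one task per assembly-tree node (HSL_MA87 itself subdivides each node's work into tasks on blocks, which this illustration does not attempt):

```python
def assembly_tree_dag(parent):
    """Build a node-level task DAG from an assembly tree given as a
    child -> parent map: factor(parent) must wait for factor(child)."""
    nodes = set(parent) | set(parent.values())
    deps = {node: set() for node in nodes}
    for child, par in parent.items():
        deps[par].add(child)
    return deps

def schedule(deps):
    """Execute any task whose prerequisites are done (list scheduling)."""
    done, order, pending = set(), [], {t: set(d) for t, d in deps.items()}
    while pending:
        ready = [t for t, d in pending.items() if d <= done]
        assert ready, "cyclic dependencies"
        for t in ready:
            order.append(t)
            done.add(t)
            del pending[t]
    return order
```

Leaves of the tree are immediately ready and can be factorized in parallel; each parent becomes ready as soon as its last child finishes.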
Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
, 2010
Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems
, 2012
Cited by 14 (4 self)
Abstract:
We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multi-core and multi-GPU systems to support dense matrix computations efficiently. The main idea is that we treat a heterogeneous system as a distributed-memory machine and use a heterogeneous multilevel block-cyclic distribution method to allocate data to the host and multiple GPUs so as to minimize communication. We design heterogeneous algorithms with hybrid tiles to accommodate the processor heterogeneity, and introduce an auto-tuning method to determine the hybrid tile sizes that attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our approach is designed to achieve four objectives: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our experiments on a compute node (with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs), as well as on up to 100 compute nodes on the Keeneland system [31], demonstrate great scalability, good load balancing, and the efficiency of our approach.
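The heterogeneous multilevel block-cyclic idea can be sketched as a simple ownership map: columns are dealt out in a repeating cycle of per-device block widths, wide blocks for the GPUs and a narrow block for the CPU cores. The device names and widths below are illustrative placeholders, not the paper's auto-tuned values.

```python
def multilevel_block_cyclic(n_cols, cycle):
    """Assign n_cols matrix columns to devices with a heterogeneous
    block-cyclic pattern; `cycle` is a list of (device, block_width) pairs."""
    owner, col = [], 0
    while col < n_cols:
        for dev, width in cycle:
            take = min(width, n_cols - col)
            owner.extend([dev] * take)
            col += take
            if col >= n_cols:
                break
    return owner

# Example: three GPUs each get a wide block, the CPU cores a narrow one.
layout = multilevel_block_cyclic(
    16, [("gpu0", 3), ("gpu1", 3), ("gpu2", 3), ("cpu", 1)])
```

With hybrid tile widths chosen roughly in proportion to device speed, each pass over the cycle hands every device a matching share of work, which is the load-balancing goal stated in the abstract.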
FEAST – Realisation of hardware-oriented Numerics for HPC simulations with Finite Elements, Concurrency and Computation: Practice and Experience 22 (6) (2010) 2247–2265, doi:10.1002/cpe.1584
Cited by 13 (5 self)
Abstract:
FEAST (Finite Element Analysis & Solutions Tools) is a finite-element-based solver toolkit for the simulation of PDE problems on parallel HPC systems which implements the concept of ‘hardware-oriented numerics’, a holistic approach aiming at optimal performance for modern numerics. In this paper, we describe this concept and the modular design which enables applications built on top of FEAST to execute efficiently, without any code modifications, on commodity-based clusters, the NEC SX-8, and GPU-accelerated clusters. We demonstrate good performance and weak and strong scalability for the prototypical Poisson problem and for more challenging applications from solid mechanics and fluid dynamics.
Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
Cited by 12 (8 self)
Abstract:
This paper introduces a novel implementation for reducing a symmetric dense matrix to tridiagonal form, the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, in which the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application's overall high performance. A dynamic runtime environment then schedules the different tasks in an out-of-order fashion. The performance for the tridiagonal reduction reported in this paper is unprecedented: our implementation results in up to 50-fold and 12-fold improvements (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000.
Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing
, 2009
Cited by 11 (8 self)
Abstract:
We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that achieves more than a 16× performance improvement over the current LAPACK algorithm running on current multicores alone (in double-precision arithmetic). This enormous acceleration is due to the proper matching of algorithmic requirements to the architectural strengths of the hybrid components. The reduction itself is an important linear algebra problem, especially given its relevance to eigenvalue problems. The results described in this paper are significant because Hessenberg reduction has not previously been accelerated on multicore architectures, and it plays a significant role in solving nonsymmetric eigenvalue problems. The approach can be applied to the symmetric problem and, in general, to two-sided matrix transformations. The work further motivates and highlights the strength of hybrid computing: harnessing the strengths of the components of a hybrid architecture to obtain significant computational acceleration which would otherwise have been impossible.