Results 1–10 of 33
QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
"... Abstract—One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid acceleratorsbased node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we pre ..."
Abstract

Cited by 24 (11 self)
 Add to MetaCart
(Show Context)
One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method has three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity that will aim at being executed on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already …
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
"... Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant ..."
Abstract

Cited by 20 (7 self)
 Add to MetaCart
(Show Context)
Previous studies have reported that common dense linear algebra operations do not achieve speedup by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing, and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization – one of the main dense linear algebra kernels – of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication-Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid’5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK’s).
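The tall-skinny QR (TSQR) idea behind Communication-Avoiding QR can be illustrated with a minimal NumPy sketch: each row block is factored independently (think one block per geographical site), and only the small R factors are combined in a second QR. The `tsqr` function and its one-level reduction tree are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def tsqr(A, n_blocks=4):
    """Communication-avoiding QR sketch for a tall-skinny matrix:
    factor row blocks independently, then combine the small R
    factors with one more QR (a one-level reduction tree)."""
    blocks = np.array_split(A, n_blocks, axis=0)
    Rs = [np.linalg.qr(B)[1] for B in blocks]   # local QRs, no communication
    # Stack the small n x n R factors and factor once more: global R.
    _, R = np.linalg.qr(np.vstack(Rs))
    return R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 8))              # tall and skinny: m >> n
R = tsqr(A)
# R agrees with classical QR up to row signs:
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1]), atol=1e-8)
```

Only the n-by-n R factors travel between blocks, which is what confines the intensive communication within each site.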
Two-Stage Tridiagonal Reduction for Dense Symmetric Matrices using Tile Algorithms on Multicore Architectures
"... Abstract—While successful implementations have already been written for onesided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architecture, getting high performance for twosided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and dif ..."
Abstract

Cited by 15 (9 self)
 Add to MetaCart
(Show Context)
While successful implementations have already been written for one-sided transformations (e.g., QR, LU and Cholesky factorizations) on multicore architectures, getting high performance for two-sided reductions (e.g., Hessenberg, tridiagonal and bidiagonal reductions) is still an open and difficult research problem due to expensive memory-bound operations occurring during the panel factorization. The processor-memory speed gap continues to widen, which has further exacerbated the problem. This paper focuses on an efficient implementation of the tridiagonal reduction, which is the first algorithmic step toward computing the spectral decomposition of a dense symmetric matrix. The original matrix is translated into a tile layout, i.e., a high-performance data representation, which substantially enhances data locality. Following a two-stage approach, the tile matrix is then transformed into band tridiagonal form using compute-intensive kernels. The band form is further reduced to the required tridiagonal form using a left-looking bulge-chasing technique to reduce memory traffic and memory contention. A dependence translation layer associated with a dynamic runtime system allows for scheduling and overlapping tasks generated from both stages. The obtained tile tridiagonal reduction significantly outperforms state-of-the-art numerical libraries (10x against multithreaded LAPACK with optimized MKL BLAS and 2.5x against the commercial numerical software Intel MKL) from medium to large matrix sizes.
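For contrast with the two-stage scheme above, a minimal sketch of the classic one-stage Householder tridiagonalization — the memory-bound baseline that tile algorithms improve on — might look like the following. This is illustrative NumPy code under textbook assumptions, not the paper's implementation.

```python
import numpy as np

def householder_tridiag(A):
    """Classic one-stage Householder tridiagonalization: each step applies
    a two-sided similarity transform that zeroes one column below the
    subdiagonal. Returns T with the same eigenvalues as symmetric A."""
    T = A.astype(float).copy()
    n = T.shape[0]
    for k in range(n - 2):
        x = T[k+1:, k].copy()
        v = x.copy()
        v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
        if np.linalg.norm(v) == 0:
            continue                     # column already reduced
        v /= np.linalg.norm(v)
        # Similarity update T <- H T H with H = I - 2 v v^T (on the trailing part)
        T[k+1:, :] -= 2.0 * np.outer(v, v @ T[k+1:, :])
        T[:, k+1:] -= 2.0 * np.outer(T[:, k+1:] @ v, v)
    return T
```

Each step touches the whole trailing matrix with matrix-vector work, which is exactly the memory-bound behavior the two-stage band-then-bulge-chase approach avoids.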
Enabling and scaling matrix computations on heterogeneous multicore and multi-GPU systems
In 6th ACM International Conference on Supercomputing (ICS 2012)
"... We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and multiGPU systems to support dense matrix computations efficiently. The main idea is that we treat a heterogeneous system as a distributedmemory machine, and use a heterogeneous multilevel block cyclic ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
(Show Context)
We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and multi-GPU systems to support dense matrix computations efficiently. The main idea is that we treat a heterogeneous system as a distributed-memory machine, and use a heterogeneous multi-level block-cyclic distribution method to allocate data to the host and multiple GPUs to minimize communication. We design heterogeneous algorithms with hybrid tiles to accommodate the processor heterogeneity, and introduce an auto-tuning method to determine the hybrid tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our approach is designed to achieve four objectives: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our experiments on a compute node (with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs), as well as on up to 100 compute nodes of the Keeneland system [31], demonstrate great scalability, good load balancing, and the efficiency of our approach.
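The block-cyclic idea underlying such a distribution can be sketched with a toy owner-mapping function. This is a plain homogeneous 2D block-cyclic map; the paper's heterogeneous multi-level variant additionally sizes tiles differently for CPUs and GPUs. The name `block_owner` is hypothetical.

```python
def block_owner(i, j, P, Q):
    """Owner of tile (i, j) under a 2D block-cyclic distribution over a
    P x Q process grid: tiles wrap around in both dimensions, so each
    process holds an interleaved, load-balanced subset of the matrix."""
    return (i % P, j % Q)

# On a 2 x 3 grid, the six processes each own every 6th tile of a
# row-major sweep, so work stays balanced as the factorization shrinks.
owners = [block_owner(i, j, 2, 3) for i in range(4) for j in range(6)]
assert owners[0] == (0, 0) and owners[7] == (1, 1)
```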
Parallel Reduction to Condensed Forms for Symmetric Eigenvalue Problems using Aggregated Fine-Grained and Memory-Aware Kernels
"... This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a twostage approach, where the tile matrix is first reduced to symmetr ..."
Abstract

Cited by 12 (8 self)
 Add to MetaCart
(Show Context)
This paper introduces a novel implementation for reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application's overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance of the tridiagonal reduction reported in this paper is unprecedented. Our implementation results in up to 50-fold and 12-fold improvements (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000.
Scaling LAPACK Panel Operations Using Parallel Cache Assignment
In PPoPP 2010
"... In LAPACK many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus ..."
Abstract

Cited by 11 (1 self)
 Add to MetaCart
In LAPACK, many matrix operations are cast as block algorithms which iteratively process a panel using an unblocked algorithm and then update a remainder matrix using the high-performance Level 3 BLAS. The Level 3 BLAS have excellent weak scaling, but panel processing tends to be bus bound, and thus scales with bus speed rather than the number of processors (p). Amdahl’s law therefore ensures that as p grows, the panel computation will become the dominant cost of these LAPACK routines. Our contribution is a novel parallel cache assignment approach which we show scales well with p. We apply this general approach to the QR and LU panel factorizations on two commodity 8-core platforms with very different cache structures, and demonstrate superlinear panel factorization speedups on both machines. Other approaches to this problem demand complicated reformulations of the computational approach, new kernels to be tuned, new mathematics, and an inflation of the high-order flop count, and they do not perform as well. By demonstrating a straightforward alternative that avoids all of these contortions and scales with p, we address a critical stumbling block for dense linear algebra in the age of massive parallelism.
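The panel/update structure described above can be sketched for LU in a few lines of NumPy. Pivoting is omitted so the panel and trailing-update phases stay visible; `blocked_lu` is an illustrative toy, not the paper's parallel cache assignment code.

```python
import numpy as np

def blocked_lu(A, nb=4):
    """Right-looking blocked LU without pivoting: factor an nb-wide panel
    with the unblocked algorithm (the bus-bound part), then update the
    trailing matrix with Level-3-style matrix products (the part that
    scales with cores)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Panel: unblocked LU on columns k..e over all remaining rows.
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        # Trailing update: triangular solve for U12, then a GEMM-style
        # rank-nb update of the Schur complement.
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return np.tril(A, -1) + np.eye(n), np.triu(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8)) + 8 * np.eye(8)   # diagonally dominant
L, U = blocked_lu(M, nb=3)
assert np.allclose(L @ U, M)
```

As nb shrinks relative to n, almost all flops land in the GEMM-style update, which is why the serial panel loop becomes the Amdahl bottleneck the paper attacks.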
Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization
"... Abstract. The LU factorization is an important numerical algorithm for solving ..."
Abstract

Cited by 9 (4 self)
 Add to MetaCart
(Show Context)
The LU factorization is an important numerical algorithm for solving …
LU Factorization for Accelerator-based Systems
"... Abstract—Multicore architectures enhanced with multiple GPUs are likely to become mainstream High Performance Computing (HPC) platforms in a near future. In this paper, we present the design and implementation of an LU factorization using tile algorithm that can fully exploit the potential of such p ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
(Show Context)
Multicore architectures enhanced with multiple GPUs are likely to become mainstream High Performance Computing (HPC) platforms in the near future. In this paper, we present the design and implementation of an LU factorization using tile algorithms that can fully exploit the potential of such platforms in spite of their complexity. We use a methodology derived from previous work on Cholesky and QR factorizations. Our contributions essentially consist of providing new CPU/GPU hybrid LU kernels, studying the impact on performance of the looking variants as well as the storage layout in the presence of pivoting, and tuning the kernels for two different machines composed of multiple recent NVIDIA Tesla S1070 GPUs (four GPUs total) and Fermi-based S2050 GPUs (three GPUs total), respectively. The hybrid tile LU asymptotically achieves 1 Tflop/s in single precision on both machines. The performance in double precision arithmetic reaches 500 Gflop/s on the Fermi-based system, twice as fast as the older Tesla S1070 GPU generation. We also discuss the impact of the number of tiles on the numerical stability. We show that the numerical results of the tile LU factorization will be accurate enough for most applications as long as the computations are performed in double precision arithmetic.
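Numerical-stability claims of this kind are typically backed by a normwise backward-error check, which can be sketched as follows. SciPy's pivoted LU stands in for the tile factorization here; `lu_backward_error` is a hypothetical helper, not the paper's code.

```python
import numpy as np
from scipy.linalg import lu

def lu_backward_error(A):
    """Normwise backward error ||A - P L U|| / ||A|| of an LU
    factorization with partial pivoting; a tile LU would be assessed
    with the same kind of residual."""
    P, L, U = lu(A)              # SciPy convention: A = P @ L @ U
    return np.linalg.norm(P @ L @ U - A) / np.linalg.norm(A)

rng = np.random.default_rng(1)
err = lu_backward_error(rng.standard_normal((200, 200)))
assert err < 1e-12               # double precision: accurate to roundoff
```

In double precision the residual sits near machine epsilon for well-behaved matrices, which is the sense in which the abstract's accuracy claim is usually quantified.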
Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures
In Proceedings of the 9th international conference on High …, 2011
Data-driven execution of fast multipole methods
arXiv preprint arXiv:1203.0889, 2012
"... Abstract. Fast multipole methods have O(N) complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on nextgeneration supercomputers. Their most common application is to accelerate Nbody problems, but they can also be used to solve boundary in ..."
Abstract

Cited by 7 (1 self)
 Add to MetaCart
Fast multipole methods have O(N) complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load balancing becomes a non-trivial question. A common strategy for load-balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. In this paper, the authors discuss another approach, based on data-driven execution, to efficiently tackle this challenging load-balancing problem. The core idea consists of breaking the most time-consuming stages of the FMM into smaller tasks. The algorithm can then be represented as a Directed Acyclic Graph (DAG) where nodes represent tasks and edges represent dependencies among them. The execution of the algorithm is performed by asynchronously scheduling the tasks using the QUARK runtime environment, in a way such that data dependencies are not violated for numerical correctness purposes. This asynchronous scheduling results in an out-of-order execution. The performance results of the data-driven FMM execution outperform the previous strategy and show linear speedup on a quad-socket quad-core Intel Xeon system.
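The data-driven scheduling idea can be sketched with a toy DAG executor: a task becomes ready as soon as its last dependency completes. This is a sequential stand-in for QUARK's asynchronous runtime; `run_dag` and the linear chain of FMM stage names are illustrative.

```python
from collections import defaultdict, deque

def run_dag(tasks, deps):
    """Data-driven execution sketch: run each task once all of its
    dependencies are satisfied, never violating the DAG's edges
    (a serial stand-in for an out-of-order task runtime)."""
    indegree = {t: 0 for t in tasks}
    children = defaultdict(list)
    for before, after in deps:
        indegree[after] += 1
        children[before].append(after)
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)            # a real runtime would execute t here
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:   # last dependency satisfied -> ready
                ready.append(c)
    return order

# The classic FMM stage chain: particle-to-multipole, multipole-to-multipole,
# multipole-to-local, local-to-local, local-to-particle.
order = run_dag(["P2M", "M2M", "M2L", "L2L", "L2P"],
                [("P2M", "M2M"), ("M2M", "M2L"),
                 ("M2L", "L2L"), ("L2L", "L2P")])
assert order == ["P2M", "M2M", "M2L", "L2L", "L2P"]
```

Breaking each stage into per-subtree tasks turns this single chain into a wide DAG, which is what gives the runtime the freedom to overlap work and balance irregular trees.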