QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
Cited by 6 (4 self)
Abstract: One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method has three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity, each intended to be executed on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already
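The first step described above, expressing the factorization as a sequence of tasks, can be illustrated with a minimal sketch. The abstract does not name the algorithm, so the tile QR formulation and the kernel names (GEQRT, UNMQR, TSQRT, TSMQR) used here are assumptions borrowed from the tile-algorithm literature for multicore machines; the function `tile_qr_tasks` is hypothetical and only enumerates tasks, it does no numerical work.

```python
def tile_qr_tasks(p, q):
    """List tile-QR tasks, in one valid sequential order, for a p x q grid of tiles.

    Kernel names follow a common tile-algorithm convention (an assumption,
    not taken from the abstract).
    """
    tasks = []
    for k in range(min(p, q)):
        # Factor the diagonal tile (k, k).
        tasks.append(("GEQRT", k, k))
        # Apply the resulting reflectors to the tiles to its right.
        for n in range(k + 1, q):
            tasks.append(("UNMQR", k, n))
        for m in range(k + 1, p):
            # Annihilate tile (m, k) against the triangular tile (k, k).
            tasks.append(("TSQRT", m, k))
            # Update the tile pair (k, n) and (m, n) accordingly.
            for n in range(k + 1, q):
                tasks.append(("TSMQR", m, n, k))
    return tasks


# Example: a 2 x 2 tile grid yields five tasks.
for task in tile_qr_tasks(2, 2):
    print(task)
```

A runtime scheduler would then map each task to a CPU core or a GPU according to its data dependencies; the tile size controls the task granularity mentioned in the abstract.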
From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers' optimizations, and other issues that affect performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning to better explore these kernels' parameter space using a search harness.
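The auto-tuning idea in the last sentence can be sketched as an exhaustive search over blocking parameters. This is only an illustration under stated assumptions: a pure-Python blocked GEMM stands in for an OpenCL kernel, and the names `blocked_gemm`, `autotune`, and the candidate block sizes are hypothetical, not taken from the paper.

```python
import itertools
import time


def blocked_gemm(A, B, bm, bn, bk):
    """Compute C = A * B with loop blocking (bm, bn, bk); stand-in for a tuned kernel."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i0 in range(0, n_rows, bm):
        for j0 in range(0, n_cols, bn):
            for k0 in range(0, n_inner, bk):
                for i in range(i0, min(i0 + bm, n_rows)):
                    for k in range(k0, min(k0 + bk, n_inner)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + bn, n_cols)):
                            C[i][j] += a * B[k][j]
    return C


def autotune(A, B, candidates=(4, 8, 16), repeats=2):
    """Time every (bm, bn, bk) combination and return the fastest one."""
    best_time, best_cfg = float("inf"), None
    for cfg in itertools.product(candidates, repeat=3):
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_gemm(A, B, *cfg)
            # Keep the best of several runs to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_time, best_cfg = elapsed, cfg
    return best_cfg
```

Because the winning configuration depends on the machine, the same harness would need to be rerun on each target, which matches the abstract's observation that kernel performance is not highly portable.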
QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
Author manuscript, published in "25th IEEE International Parallel & Distributed Processing Symposium (2011)"
sition) of an m × n real matrix A has the form A = QR, where Q is an m × m real orthogonal matrix and R is an m × n real upper triangular matrix. If the diagonal entries of R are required to be positive, this decomposition is unique when A is nonsingular. Different structures for A may arise depending on the application. The most important structural property is whether the matrix is sparse or dense. The matrix is usually not square (m ≠ n). For instance, the first, predominant step of the standard algorithm for solving dense least squares problems (as implemented in LAPACK) is the QR factorization of a dense matrix A representing an overdetermined system (m > n). When applied to a square matrix (m = n), the same algorithms become a stable method to solve the corresponding linear system. Although the LU decomposition is usually preferred
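The definition above can be made concrete with a small Householder-based sketch in pure Python. It is an illustrative assumption, not the method of the paper: the function `householder_qr` is hypothetical, it returns a full m × m orthogonal Q and an m × n upper triangular R, and it flips signs at the end so that the diagonal of R is nonnegative (positive when A has full column rank).

```python
import math


def householder_qr(A):
    """Return (Q, R) with A = Q R, Q (m x m) orthogonal, R (m x n) upper triangular."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Build the Householder vector v for column k, rows k..m-1.
        x = [R[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already zero below the diagonal
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # R <- H R, where H = I - 2 v v^T / (v^T v) acts on rows k..m-1.
        for j in range(n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                R[k + i][j] -= s * v[i]
        # Q <- Q H, so that Q accumulates H_1 H_2 ... H_k.
        for i in range(m):
            s = 2.0 * sum(Q[i][k + j] * v[j] for j in range(m - k)) / vtv
            for j in range(m - k):
                Q[i][k + j] -= s * v[j]
        # Clear rounding residue below the diagonal.
        for i in range(k + 1, m):
            R[i][k] = 0.0
    # Normalize signs so the diagonal of R is nonnegative.
    for k in range(min(m, n)):
        if R[k][k] < 0.0:
            for j in range(n):
                R[k][j] = -R[k][j]
            for i in range(m):
                Q[i][k] = -Q[i][k]
    return Q, R


# Example: an overdetermined 3 x 2 matrix (m > n).
Q, R = householder_qr([[3.0, 1.0], [4.0, 2.0], [0.0, 5.0]])
```

For a least squares problem with m > n, the top n × n block of R and the first n entries of Qᵀb are what a back substitution would use to recover the solution.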

