Results 1 - 10
of
16
Accelerating the reduction to upper Hessenberg form through hybrid GPU-based computing
, 2009
"... Abstract. We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that gets more than 16 × performance improvement over the current LAPACK algorithm running just on current multicores (in double precision arithmetic). This enormous acceleration is due to proper matching o ..."
Abstract
-
Cited by 7 (5 self)
- Add to MetaCart
Abstract. We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that gets more than 16 × performance improvement over the current LAPACK algorithm running just on current multicores (in double precision arithmetic). This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the hybrid components. The reduction itself is an important linear algebra problem, especially with its relevance to eigenvalue problems. The results described in this paper are significant because Hessenberg reduction has not yet been accelerated on multicore architectures, and it plays a significant role in solving nonsymmetric eigenvalue problems. The approach can be applied to the symmetric problem and in general, to two-sided matrix transformations. The work further motivates and highlights the strengths of hybrid computing: to harness the strengths of the components of a hybrid architecture to get significant computational acceleration which otherwise may have been impossible.
Improving communication performance in dense linear algebra
"... via topology aware collectives ..."
Brief Announcement: Lower Bounds on Communication for Sparse Cholesky Factorization of a Model Problem
"... Previous work has shown that a lower bound on the number of words moved between large, slow memory and small, fast memory of size M by any conventional (non-Strassen like) direct linear algebra algorithm (matrix multiply, the LU, Cholesky, QR factorizations,...) is Ω(#flops / p (M)). This holds for ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Previous work has shown that a lower bound on the number of words moved between large, slow memory and small, fast memory of size M by any conventional (non-Strassen like) direct linear algebra algorithm (matrix multiply, the LU, Cholesky, QR factorizations,...) is Ω(#flops / p (M)). This holds for dense or sparse matrices. There are analogous lower bounds for the number of messages, and for parallel algorithms instead of sequential algorithms. Our goal here is to find algorithms that attain these lower bounds on interesting classes of sparse matrices. We focus on matrices for which there is a lower bound on the number of flops of their Cholesky factorization. Our Cholesky lower bounds on communication hold for any possible ordering of the rows and columns of the matrix, and so are globally optimal in this sense. For matrices arising from discretization on two dimensional and three dimensional regular grids, we discuss sequential and parallel algorithms that are optimal in terms of communication. The algorithms turn out to require combining previously known sparse and dense Cholesky algorithms in simple ways.
permission. LU FACTORIZATION WITH PANEL RANK REVEALING PIVOTING AND ITS COMMUNICATION AVOIDING VERSION
"... personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires pri ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
unknown title
"... Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing ..."
Abstract
- Add to MetaCart
Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
OPTIMIZING HALLEY’S ITERATION FOR COMPUTING THE MATRIX POLAR DECOMPOSITION ∗
"... Abstract. We introduce a dynamically weighted Halley (DWH) iteration for computing the polar decomposition of a matrix, and prove that the new method is globally and asymptotically cubically convergent. For matrices with condition number no greater than 1016, the DWH method needs at most 6 iteration ..."
Abstract
- Add to MetaCart
Abstract. We introduce a dynamically weighted Halley (DWH) iteration for computing the polar decomposition of a matrix, and prove that the new method is globally and asymptotically cubically convergent. For matrices with condition number no greater than 1016, the DWH method needs at most 6 iterations for convergence with the tolerance 10−16. The Halley iteration can be implemented via QR decompositions without explicit matrix inversions. Therefore, it is an inverse free communication friendly algorithm for the emerging multicore and hybrid high performance computing systems. Key words. Polar decomposition, Halley’s iteration, Newton’s iteration, inverse free iterations, QR decomposition, numerical stability AMS subject classifications. 15A15, 15A23, 65F30
Minimizing Communication for Eigenproblems and the Singular Value Decomposition
, 2010
"... Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we ..."
Abstract
- Add to MetaCart
Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In [4] lower bounds were presented on the amount of communication required for essentially all O(n 3)-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and analyze their convergence and communication costs. 1
unknown title
"... This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or sel ..."
Abstract
- Add to MetaCart
This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution and sharing with colleagues. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit:
unknown title
, 2012
"... LU factorization with panel rank revealing pivoting and its communication avoiding version ..."
Abstract
- Add to MetaCart
LU factorization with panel rank revealing pivoting and its communication avoiding version

