Results 1–10 of 20
Multishift Variants of the QZ Algorithm with Aggressive Early Deflation
Cited by 20 (14 self)
New variants of the QZ algorithm for solving the generalized eigenvalue problem are proposed. An extension of the small-bulge multishift QR algorithm is developed, which chases chains of many small bulges instead of only one bulge in each QZ iteration. This allows the effective use of level 3 BLAS operations, which in turn can provide efficient utilization of high performance computing systems with deep memory hierarchies. Moreover, an extension of the aggressive early deflation strategy is proposed, which can identify and deflate converged eigenvalues long before classic deflation strategies would. Consequently, the number of overall QZ iterations needed until convergence is considerably reduced. As a third ingredient, we reconsider the deflation of infinite eigenvalues and present a new deflation algorithm, which is particularly effective in the presence of a large number of infinite eigenvalues. Combining all these developments, our implementation significantly improves existing implementations of the QZ algorithm. This is demonstrated by numerical experiments with random matrix pairs as well as with matrix pairs arising from various applications.
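The generalized eigenvalue problem and the QZ (generalized Schur) factorization this abstract refers to can be illustrated with SciPy's standard `qz` driver; this is a minimal sketch of the problem being solved, not the multishift implementation the paper proposes:

```python
import numpy as np
from scipy.linalg import qz

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# QZ decomposition: A = Q @ AA @ Z^T, B = Q @ BB @ Z^T, with AA
# quasi-upper-triangular and BB upper triangular (real output).
AA, BB, Q, Z = qz(A, B, output="real")
assert np.allclose(Q @ AA @ Z.T, A)
assert np.allclose(Q @ BB @ Z.T, B)

# In the complex QZ form both factors are triangular, and the generalized
# eigenvalues are the ratios of the diagonal entries.
AAc, BBc, Qc, Zc = qz(A, B, output="complex")
eigs = np.diag(AAc) / np.diag(BBc)

# Each lambda should make A - lambda*B (numerically) singular.
for lam in eigs:
    smallest_sv = np.linalg.svd(A - lam * B, compute_uv=False)[-1]
    assert smallest_sv < 1e-8
```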
Block algorithms for reordering standard and generalized Schur forms
 ACM Transactions on Mathematical Software
, 2006
Cited by 10 (6 self)
Block algorithms for reordering a selected set of eigenvalues in a standard or generalized Schur form are proposed. Efficiency is achieved by delaying orthogonal transformations and (optionally) making use of level 3 BLAS operations. Numerical experiments demonstrate that existing algorithms, as currently implemented in LAPACK, are outperformed by up to a factor of four. Key words: Schur form, reordering, invariant subspace, deflating subspace. AMS subject classifications: 65F15, 65Y20.
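Schur reordering moves a selected group of eigenvalues to the leading block so that the leading Schur vectors span the corresponding invariant (or deflating) subspace. A minimal sketch using SciPy's standard drivers (`schur` with a sort criterion, and `ordqz` for the generalized case), not the block algorithms the paper proposes:

```python
import numpy as np
from scipy.linalg import schur, ordqz

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))

# Real Schur form reordered so left-half-plane ('lhp') eigenvalues come
# first; sdim is how many were selected, and the first sdim columns of Zs
# span the associated invariant subspace.
T, Zs, sdim = schur(A, output="real", sort="lhp")
assert np.allclose(Zs @ T @ Zs.T, A)
assert sdim == int(np.sum(np.linalg.eigvals(A).real < 0))

# Generalized analogue: reorder a generalized Schur (QZ) form of (A, B).
AA, BB, alpha, beta, Q, Z = ordqz(A, B, sort="lhp", output="real")
assert np.allclose(Q @ AA @ Z.T, A)
assert np.allclose(Q @ BB @ Z.T, B)
```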
The SBR Toolbox – Software for Successive Band Reduction
, 1996
Cited by 9 (3 self)
this paper. Their single-precision twins are identical except for a leading "S" instead of "D" in the routine's name and REAL instead of DOUBLE PRECISION scalars and arrays in the parameter list.
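This "S"/"D" prefix convention follows LAPACK naming, where the precision variant is selected by the leading letter of the routine name. A small sketch of the convention using SciPy's LAPACK wrappers (not the SBR toolbox itself): `get_lapack_funcs` picks the `d`- or `s`-prefixed variant of a routine based on the array's dtype.

```python
import numpy as np
from scipy.linalg import get_lapack_funcs

# 'sytrd' reduces a symmetric matrix to tridiagonal form; the wrapper
# resolves to DSYTRD for float64 input and SSYTRD for float32 input.
Ad = np.eye(3, dtype=np.float64)
As = np.eye(3, dtype=np.float32)
d_routine = get_lapack_funcs("sytrd", (Ad,))
s_routine = get_lapack_funcs("sytrd", (As,))
assert d_routine.typecode == "d"   # DOUBLE PRECISION variant
assert s_routine.typecode == "s"   # REAL (single-precision) variant
```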
Efficient Eigenvalue and Singular Value Computations on Shared Memory Machines
, 1998
Cited by 8 (0 self)
We describe two techniques for speeding up eigenvalue and singular value computations on shared memory parallel computers. Depending on the information that is required, different steps in the overall process can be made more efficient. If only the eigenvalues or singular values are sought, then the reduction to condensed form may be done in two or more steps to make best use of optimized level 3 BLAS. If eigenvectors and/or singular vectors are required, too, then their accumulation can be sped up by another blocking technique. The efficiency of the blocked algorithms depends heavily on the values of certain control parameters. We also present a very simple performance model that allows selecting these parameters automatically. Keywords: Linear algebra; Eigenvalues and singular values; Reduction to condensed form; Hessenberg QR iteration; Blocked algorithms.
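The eigenvalues-only versus eigenvalues-plus-vectors distinction the abstract draws is visible even at the level of standard library drivers: the values-only path can skip accumulating the transformations that produce eigenvectors. A minimal sketch with NumPy (not the blocked algorithms of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((100, 100))
A = (M + M.T) / 2  # symmetric test matrix

# eigvalsh: eigenvalues only (the cheaper path); eigh: values and vectors.
w_only = np.linalg.eigvalsh(A)
w, V = np.linalg.eigh(A)

assert np.allclose(w_only, w)            # both paths agree on the spectrum
assert np.allclose(V @ np.diag(w) @ V.T, A)  # vectors reconstruct A
```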
Computing Approximate Eigenpairs of Symmetric Block Tridiagonal Matrices
, 2003
Cited by 4 (4 self)
A divide-and-conquer method for computing approximate eigenvalues and eigenvectors of a block tridiagonal matrix is presented. In contrast to a method described earlier [W. N. Gansterer, R. C. Ward, and R. P. Muller, ACM Trans. Math. Software, 28 (2002), pp. 45–58], the off-diagonal blocks can have arbitrary ranks. It is shown that lower-rank approximations of the off-diagonal blocks as well as relaxation of deflation criteria permit the computation of approximate eigenpairs with prescribed accuracy at significantly reduced computational cost compared to standard methods such as those implemented in LAPACK.
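The key idea that low-rank approximation of off-diagonal blocks perturbs the eigenvalues only by a bounded amount can be sketched in a few lines: by Weyl's inequality, replacing an off-diagonal block with its rank-k SVD truncation moves each eigenvalue by at most the largest dropped singular value. This is an illustrative toy construction, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
# Symmetric 2x2-block matrix whose off-diagonal block B has rapidly
# decaying singular values, so a low-rank approximation of B is accurate.
D1 = rng.standard_normal((n, n)); D1 = (D1 + D1.T) / 2
D2 = rng.standard_normal((n, n)); D2 = (D2 + D2.T) / 2
U, _, Vt = np.linalg.svd(rng.standard_normal((n, n)))
sv = 2.0 ** -np.arange(n)                 # decaying singular values
B = U @ np.diag(sv) @ Vt

def assemble(Boff):
    return np.block([[D1, Boff], [Boff.T, D2]])

A = assemble(B)

# Truncate B to rank k; ||A - Ak||_2 equals the first dropped singular
# value sv[k], which by Weyl bounds every eigenvalue perturbation.
k = 8
Bk = U[:, :k] @ np.diag(sv[:k]) @ Vt[:k, :]
Ak = assemble(Bk)

err = np.max(np.abs(np.linalg.eigvalsh(A) - np.linalg.eigvalsh(Ak)))
assert err <= sv[k] + 1e-12
```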
DIVIDE & CONQUER ON HYBRID GPU-ACCELERATED MULTICORE SYSTEMS
Cited by 3 (2 self)
With the raw compute power of GPUs being more widely available in commodity multicore systems, there is an imminent need to harness their power for important numerical libraries such as LAPACK. In this paper, we consider the solution of dense symmetric and Hermitian eigenproblems by LAPACK’s Divide & Conquer algorithm on such modern heterogeneous systems. We focus on how to make the best use of the individual strengths of the massively parallel many-core GPUs and multicore CPUs. The resulting algorithm overcomes performance bottlenecks that current implementations, optimized for homogeneous multicore systems, face. On a dual-socket quad-core Intel Xeon 2.33 GHz with an NVIDIA GTX 280 GPU, we typically obtain up to about 10-fold improvement in performance for the complete dense problem. The techniques described here thus represent an example of how to develop numerical software to efficiently use heterogeneous architectures. As heterogeneity becomes common in architecture design, the significance and need of this work is expected to grow.
Prospectus for the Next LAPACK and ScaLAPACK Libraries
Cited by 2 (0 self)
Dense linear algebra (DLA) forms the core of many scientific computing applications. Consequently, there is continuous interest and demand for the development of increasingly better algorithms in the field. Here 'better' has a broad meaning, and includes improved reliability, accuracy, robustness, ease of use, and ...
Parallel Block Tridiagonalization of Real Symmetric Matrices
, 2006
Cited by 2 (0 self)
Two parallel block tridiagonalization algorithms and implementations for dense real symmetric matrices are presented. Block tridiagonalization is a critical preprocessing step for the block-tridiagonal divide-and-conquer algorithm for computing eigensystems and is useful for many algorithms desiring the efficiencies of block structure in matrices. For an "effectively" sparse matrix, which frequently results from applications with strong locality properties, a heuristic parallel algorithm is used to transform it into a block tridiagonal matrix such that the eigenvalue errors remain bounded by some prescribed accuracy tolerance. For a dense matrix without any usable structure, orthogonal transformations are used to reduce it to block tridiagonal form using mostly level 3 BLAS operations. Numerical experiments show that the block-tridiagonal structure obtained from this algorithm directly affects the computational complexity of the parallel block-tridiagonal divide-and-conquer eigensolver.
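For an "effectively" sparse matrix with strong locality, discarding the tiny blocks outside the block-tridiagonal band changes the eigenvalues by at most the spectral norm of the discarded part (Weyl's inequality), which is the kind of bounded-error truncation the abstract describes. A toy sketch of that principle, not the paper's heuristic parallel algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)
nb, b = 6, 5                 # 6 diagonal blocks of size 5
n = nb * b
# Symmetric matrix with strong locality: entries decay away from the
# diagonal, so blocks outside the block-tridiagonal band are tiny.
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
A = rng.standard_normal((n, n)) * 0.5 ** np.abs(i - j)
A = (A + A.T) / 2

# Keep only the block-tridiagonal part: block (p, q) survives iff |p-q|<=1.
p, q = i // b, j // b
T = np.where(np.abs(p - q) <= 1, A, 0.0)

# Weyl's inequality: every eigenvalue moves by at most ||A - T||_2.
drop = np.linalg.norm(A - T, 2)
err = np.max(np.abs(np.linalg.eigvalsh(A) - np.linalg.eigvalsh(T)))
assert err <= drop + 1e-12
```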
DIVIDE & CONQUER ON HYBRID GPU-ACCELERATED MULTICORE SYSTEMS
, 2012
Cited by 2 (1 self)
With the raw compute power of GPUs being more widely available in commodity multicore systems, there is an imminent need to harness their power for important numerical libraries such as LAPACK. In this paper, we consider the solution of dense symmetric and Hermitian eigenproblems by LAPACK’s Divide & Conquer algorithm on such modern heterogeneous systems. We focus on how to make the best use of the individual strengths of the massively parallel many-core GPUs and multicore CPUs. The resulting algorithm overcomes performance bottlenecks that current implementations, optimized for homogeneous multicore systems, face. On a dual-socket quad-core Intel Xeon 2.33 GHz with an NVIDIA GTX 280 GPU, we typically obtain up to about 10-fold improvement in performance for the complete dense problem. The techniques described here thus represent an example of how to develop numerical software to efficiently use heterogeneous architectures. As heterogeneity becomes common in architecture design, the significance and need of this work is expected to grow.