## Efficient Eigenvalue and Singular Value Computations on Shared Memory Machines (1998)

Citations: 8 (0 self)

### BibTeX

@MISC{Bughw-sc98efficienteigenvalue,

author = {Bruno Lang},

title = {Efficient Eigenvalue and Singular Value Computations on Shared Memory Machines},

year = {1998}

}

### Abstract

We describe two techniques for speeding up eigenvalue and singular value computations on shared memory parallel computers. Depending on the information that is required, different steps in the overall process can be made more efficient. If only the eigenvalues or singular values are sought, then the reduction to condensed form may be done in two or more steps to make best use of optimized level-3 BLAS. If eigenvectors and/or singular vectors are also required, their accumulation can be sped up by another blocking technique. The efficiency of the blocked algorithms depends heavily on the values of certain control parameters. We also present a very simple performance model that allows selecting these parameters automatically.

Keywords: Linear algebra; Eigenvalues and singular values; Reduction to condensed form; Hessenberg QR iteration; Blocked algorithms.

### Citations

743 | A set of level 3 basic linear algebra subprograms
- Dongarra, Croz, et al.
- 1990
Citation Context: ...Equipment Grant program. In this paper we present two techniques, each of which can speed up one of the two above-mentioned phases by increasing the portion of operations done within the level-3 BLAS [8]. Since the level-3 BLAS (matrix-matrix operations) allow significantly more floating-point operations (flop) to be done per data access than either matrix-vector or vector-vector operations or purely...
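The flop-per-data-access argument in this snippet can be made concrete. A small sketch (not from the paper) computing the arithmetic intensity of representative level-1, level-2, and level-3 BLAS operations; the element-count formulas are the standard textbook estimates:

```python
# Arithmetic intensity: flop performed per matrix/vector element moved,
# for one representative operation per BLAS level.

def intensity_axpy(n):
    # level 1: y := a*x + y  -> 2n flop, about 3n elements moved
    return (2 * n) / (3 * n)

def intensity_gemv(n):
    # level 2: y := A*x + y  -> 2n^2 flop, about n^2 + 3n elements moved
    return (2 * n**2) / (n**2 + 3 * n)

def intensity_gemm(n):
    # level 3: C := A*B + C  -> 2n^3 flop, about 4n^2 elements moved
    return (2 * n**3) / (4 * n**2)

for n in (100, 1000):
    print(n, intensity_axpy(n), intensity_gemv(n), intensity_gemm(n))
```

Level-1 and level-2 intensity stays bounded by a small constant, while level-3 intensity grows linearly with n, which is why blocked algorithms push as much work as possible into matrix-matrix operations.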

492 | Solving Least Squares Problems
- Lawson
- 1974
Citation Context: ...(P_1, ..., P_r)^T ∈ R^r, and "∘" denotes component-wise multiplication. From this equation we obtain the required parameters b_P and ℓ^{1/2} by solving an r-by-2 nonnegative least squares problem [17]. 5 Numerical results The timings presented in this section were obtained on a Sun Ultra 2 workstation with two processors and on a Sun Enterprise 450 with four processors. Both machines feature a share...
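The parameter fit described here is a small nonnegative least squares (NNLS) problem. A minimal sketch using SciPy's NNLS solver; the design matrix and timings below are made-up stand-ins, not the paper's performance model:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical r-by-2 design matrix and observed timings: r measurements,
# 2 unknown nonnegative model parameters.
r = 6
rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((r, 2)))
true_params = np.array([1.5, 0.25])
b = A @ true_params + 0.01 * rng.standard_normal(r)

# Minimize ||A x - b||_2 subject to x >= 0
params, residual = nnls(A, b)
print(params)  # both entries are nonnegative by construction
```

The nonnegativity constraint matters because the fitted quantities are physical (time per operation), so an unconstrained fit could produce meaningless negative coefficients.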

446 | An extended set of fortran basic linear algebra subprograms
- Dongarra, Croz, et al.
- 1988
Citation Context: ...2(n − j)^2 flop (because of the symmetry only the lower or the upper triangle of A_j must be stored and updated) and can be accomplished by a call to an appropriate routine from the level-2 BLAS [9]. Therefore the majority of the roughly (4/3)n^3 flop in the overall reduction is done within the level-2 BLAS. In the LAPACK [1] routine xSYTRD (x stands for either Single or Double precision) a blo...
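The update described in this snippet is the classical unblocked tridiagonalization step: a Householder reflector applied symmetrically as a rank-2 update that only needs one triangle of the matrix. A NumPy sketch of one such step (an illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric test matrix
A0 = A.copy()                          # keep a copy to check the spectrum

# Householder vector zeroing A[2:, 0]; v[0] = 0 so row/column 0 is only
# touched through the new subdiagonal entry.
x = A[1:, 0].copy()
v = np.zeros(n)
v[1:] = x
v[1] += np.sign(x[0]) * np.linalg.norm(x)
tau = 2.0 / (v @ v)

# Symmetric application Q A Q with Q = I - tau v v^T as a rank-2 update:
p = tau * (A @ v)
w = p - (tau / 2) * (p @ v) * v
A -= np.outer(v, w) + np.outer(w, v)

print(np.round(A[2:, 0], 10))          # zeros below the subdiagonal
```

Because the update A := A − v wᵀ − w vᵀ is symmetric, an implementation only stores and updates one triangle, which is exactly the level-2 BLAS xSYR2 operation the snippet refers to.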

423 | LAPACK Users' Guide
- Anderson, Bai, et al.
- 1999
Citation Context: ...accomplished by a call to an appropriate routine from the level-2 BLAS [9]. Therefore the majority of the roughly (4/3)n^3 flop in the overall reduction is done within the level-2 BLAS. In the LAPACK [1] routine xSYTRD (x stands for either Single or Double precision) a blocked version of the above algorithm is implemented; cf. [10]. In this variant not all of the A_j are built explicitly. Given A_1...
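The LAPACK routine xSYTRD mentioned here is exposed through SciPy's low-level LAPACK wrappers, which gives a quick way to inspect the condensed (tridiagonal) form. The return layout assumed below (reduced matrix, diagonal d, off-diagonal e, reflector scalars tau, info) is SciPy's wrapper convention:

```python
import numpy as np
from scipy.linalg import eigvalsh, eigvalsh_tridiagonal, lapack

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n))
A = (A + A.T) / 2

# dsytrd reduces A to tridiagonal T = Q^T A Q; d and e hold T's
# diagonal and off-diagonal.
c, d, e, tau, info = lapack.dsytrd(A)
assert info == 0

# The orthogonal reduction preserves the eigenvalues:
print(np.allclose(np.sort(eigvalsh(A)), np.sort(eigvalsh_tridiagonal(d, e))))
```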

159 | ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers
- Choi, Dongarra, et al.
- 1992
Citation Context: ...implicit QR algorithm for bidiagonal matrices B [16]. In contrast to the algorithm described above the work on T and B is not affected by the blocking. The most recent releases of the ScaLAPACK library [6] for distributed memory machines contain a pipelined variant of the double-shift Hessenberg QR iteration that relies on a reordering of the transformations resembling the one described above [7]. In c...

109 | The WY representation for products of Householder matrices
- Bischof, Loan
- 1987
Citation Context: ...grey denotes elements that are made zero during this step. Because of the symmetry, only the lower triangle of A_j (shaded light grey) must be stored and transformed. The WY representation Q_j = I + WY^T [4] or the compact WY representation Q_j = I − YTY^T [18] may be used to perform the updates A'' := Q_j^T A'' (A'' comprises the b − n_b columns to the right of A'; see Figure 1) and A''...

78 | Block reduction of matrices to condensed forms for eigenvalue computations
- Dongarra, Hammarling, et al.
- 1989
Citation Context: ...the overall reduction is done within the level-2 BLAS. In the LAPACK [1] routine xSYTRD (x stands for either Single or Double precision) a blocked version of the above algorithm is implemented; cf. [10]. In this variant not all of the A_j are built explicitly. Given A_1 = A, the subsequent matrices A_2, ..., A_{n_b} are instead kept in a "factored representation" A_j = A_1 − Y_j V_j − V_j...

65 | A storage-efficient WY representation for products of Householder transformations
- Schreiber, Loan
- 1991
Citation Context: ...Because of the symmetry, only the lower triangle of A_j (shaded light grey) must be stored and transformed. The WY representation Q_j = I + WY^T [4] or the compact WY representation Q_j = I − YTY^T [18] may be used to perform the updates A'' := Q_j^T A'' (A'' comprises the b − n_b columns to the right of A'; see Figure 1) and A''' := Q_j^T A''' Q_j (remainder of the matrix) with matrix-m...

50 | On a block implementation of Hessenberg multishift QR iteration
- Bai, Demmel
- 1989
Citation Context: ...parallelizable nonsymmetric eigensolvers, the reduction of A to an upper Hessenberg matrix H, followed by double-shift Hessenberg QR iteration [12] or its generalization, the multishift QR iteration [2], is still the most efficient procedure for computing the eigenvalues of a nonsymmetric matrix A ∈ R^{n×n}. The double-shift algorithm condenses n_s = 2 successive steps of the basic QR iteration...
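The first phase described here, reduction to upper Hessenberg form, is available directly in SciPy and illustrates why the QR iteration then works on H rather than A (a usage sketch, not the paper's implementation):

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n))

# H = Q^T A Q is upper Hessenberg (zero below the first subdiagonal);
# the double-shift QR iteration then operates on H.
H, Q = hessenberg(A, calc_q=True)

print(np.allclose(np.tril(H, -2), 0))             # Hessenberg structure
print(np.allclose(Q @ H @ Q.T, A))                # similarity transform
print(np.allclose(np.sort(np.linalg.eigvals(A)),
                  np.sort(np.linalg.eigvals(H)))) # same spectrum
```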

42 | The QR transformation: a unitary analogue to the LR transformation
- Francis
Citation Context: ...iteration. Despite numerous efforts to devise more easily parallelizable nonsymmetric eigensolvers, the reduction of A to an upper Hessenberg matrix H, followed by double-shift Hessenberg QR iteration [12] or its generalization, the multishift QR iteration [2], is still the most efficient procedure for computing the eigenvalues of a nonsymmetric matrix A ∈ R^{n×n}. The double-shift algorithm conden...

36 | A parallel implementation of the nonsymmetric QR algorithm for distributed memory architectures
- Henry, Watkins, et al.
Citation Context: ...library [6] for distributed memory machines contain a pipelined variant of the double-shift Hessenberg QR iteration that relies on a reordering of the transformations resembling the one described above [7]. In contrast to our method, the transformations are applied directly to the matrices H and U. 4 Performance modelling and parameter estimation The performance of the blocked reduction algorithms dep...

28 | A framework for symmetric band reduction
- Bischof, Lang, et al.
Citation Context: ...ing hardware. In Section 4 we will describe a method that can yield good a priori estimates for n_b. To further increase the portion of matrix-matrix operations we split the reduction into two stages [5]. In the first stage the matrix A is reduced to a symmetric banded matrix A_b with some semi-bandwidth b > 1 (i.e., A_b(i, j) = 0 if |i − j| > b), and in the second stage the banded matrix A_b i...
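The second stage operates on the banded intermediate A_b. As an illustration of handling such matrices in practice, the sketch below builds a symmetric banded matrix, packs it into LAPACK band storage, and checks that the banded eigensolver agrees with the dense one (SciPy routines, not the paper's xSBTH):

```python
import numpy as np
from scipy.linalg import eig_banded, eigvalsh

rng = np.random.default_rng(5)
n, b = 8, 2                            # semi-bandwidth b > 1

# Random symmetric banded matrix: A[i, j] = 0 whenever |i - j| > b
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
A[np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > b] = 0.0

# Pack the lower bands into band storage: row k holds diagonal -k
ab = np.zeros((b + 1, n))
for k in range(b + 1):
    ab[k, : n - k] = np.diag(A, -k)

w, _ = eig_banded(ab, lower=True)
print(np.allclose(w, eigvalsh(A)))     # same spectrum as the dense matrix
```

Band storage needs only (b + 1)·n entries instead of n², which is why keeping the intermediate matrix banded pays off for both memory and the cost of the second stage.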

22 | Parallel tridiagonalization through two-step band reduction
- Bischof, Lang, et al.
- 1994
Citation Context: ...accumulated then some 2n^3 additional operations are required. A reordering technique---similar to the one described in Section 3---allows using the WY representation (i.e., level-3 BLAS) in this update [3]. From 2. we see that the work spent in the second stage is negligible as compared to Stage I. Therefore the overall two-stage reduction relies almost exclusively on the level-3 BLAS, and we may expec...

19 | A parallel algorithm for reducing symmetric banded matrices to tridiagonal form
- Lang
Citation Context: ...decomposition, setting up the W, Y or T, Y factors) being done mainly with level-2 BLAS. An efficient algorithm xSBTH for the second stage (tridiagonalization of the banded matrix) was presented in [14]. Here we only recall some of its properties that have a direct impact on the two-stage tridiagonalization approach. 1. For optimum performance, xSBTH requires the banded matrix A_b to be stored in pa...

12 | The multishift QR algorithm: is it worth the trouble? Palo Alto Scientific Center Report G320-3558x
- Dubrulle
- 1991
Citation Context: ...steps). For large n_s these savings approach 20%. In addition, the length of the Householder transformations is n_s + 1 instead of 3, leading to better data locality. Bai and Demmel [2] and Dubrulle [11] were able to further improve the data locality by introducing blocking techniques based on factored representations---similar to the one described in Section 2. In both variants roughly one half of t...

6 | Parallel Reduction of Banded Matrices to Bidiagonal Form
- Lang
- 1996
Citation Context: ...transformations Q_j and P_j are determined from a QR decomposition of a suitable block column and an LQ decomposition of a block row of A_j, and the WY representation is used for applying the transformations. In [15] the author presented an efficient algorithm for Stage II featuring similar properties as xSBTH. Therefore the conclusions drawn for the two-stage tridiagonalization carry over to the bidiagonal redu...

5 | Using level 3 BLAS in rotation based algorithms
- Lang
- 1995
Citation Context: ...similar blocking technique can also be used for the accumulation of the transformations in the QR iteration for symmetric tridiagonal matrices T and the implicit QR algorithm for bidiagonal matrices B [16]. In contrast to the algorithm described above the work on T and B is not affected by the blocking. The most recent releases of the ScaLAPACK library [6] for distributed memory machines contain a pipe...