## Parallel Bandreduction and Tridiagonalization (1993)

### Download Links

- [info.mcs.anl.gov]
- [www-unix.mcs.anl.gov]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings, Sixth SIAM Conference on Parallel Processing for Scientific Computing

Citations: 17 (5 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Bischof93parallelbandreduction,
  author    = {Christian Bischof and Mercedes Marques and Xiaobai Sun},
  title     = {Parallel Bandreduction and Tridiagonalization},
  booktitle = {Proceedings, Sixth SIAM Conference on Parallel Processing for Scientific Computing},
  year      = {1993},
  pages     = {22--24},
  publisher = {SIAM}
}
```

### Abstract

This paper presents a parallel implementation of a blocked band reduction algorithm for symmetric matrices suggested by Bischof and Sun. The reduction to tridiagonal or block tridiagonal form is a special case of this algorithm. A blocked double torus wrap mapping is used as the underlying data distribution, and the so-called WY representation is employed to represent block orthogonal transformations. Preliminary performance results on the Intel Delta indicate that the algorithm is well suited to a MIMD computing environment and that the use of a block approach significantly improves performance.

1 Introduction. Reduction to tridiagonal form is a major step in eigenvalue computations for symmetric matrices. If the matrix is full, the conventional Householder tridiagonalization approach [13, p. 276] or block variants thereof [12] is the method of choice. These two approaches also underlie the parallel implementations described, for example, in [15] and [10]. The approach described in this ...

### Citations

455 | LAPACK Users' Guide
- Anderson
- 1999
Citation Context: [Fig. 3: First Block Bandreduction Step. Fig. 4: Chasing the Bulge.] ...employing a memory hierarchy, since it greatly reduces the amount of data movement required [8, 4, 5, 1]. To see how this algorithmic primitive is used, let us consider an example. Figure 3 shows the reduction of d subdiagonals in nb columns of a matrix with initial bandwidth b. In this figure, as in th...
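The citation context above names the algorithm's key primitive. A minimal NumPy sketch of it (my own illustration, not the authors' code; the sizes n, b, d are arbitrary choices) follows: a Householder similarity transform zeroes d band entries in the first column of a symmetric band matrix, and the fill-in ("bulge") it creates below the band is what the chasing step of Fig. 4 would then eliminate.

```python
import numpy as np

def symmetric_band(n, b, seed=0):
    """Random symmetric matrix with bandwidth b (entries with |i - j| > b are zero)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    A = A + A.T
    A[np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > b] = 0.0
    return A

n, b, d = 8, 3, 2
A = symmetric_band(n, b)

# Householder reflector that zeroes the bottom d band entries of column 0
# (rows b-d+1 .. b), shrinking that column's bandwidth from b to b-d.
x = A[b - d:b + 1, 0].copy()
v = x.copy()
v[0] += np.copysign(np.linalg.norm(x), x[0])
beta = 2.0 / (v @ v)
H = np.eye(n)
H[b - d:b + 1, b - d:b + 1] -= beta * np.outer(v, v)

B = H @ A @ H.T  # two-sided similarity transform preserves symmetry
# B[b-d+1 : b+1, 0] is now ~0, but the transform fills in new nonzeros
# below the band (the "bulge"), which must then be chased off.
```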

117 | The WY representation for products of Householder matrices
- Bischof, Van Loan
- 1987
Citation Context: ...older reductions and the accumulation of transformations W and Y such that T = (I − WY) [R; 0]. W and Y are of the same size as T and are accumulated according to what is called "method 1" in [8]. PRE: The application of the block orthogonal transformation I − WY to a matrix B (say) from the left, i.e. B ← (I − WY)^T B. SYM: The application of I − WY to a matrix C (say) from t...
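As a hedged illustration of the "method 1" accumulation described above (my reconstruction, not the paper's code; the paper's I − WY notation is written here as I − W·Yᵀ with the transpose made explicit): each new Householder reflector H = I − β·v·vᵀ is folded into the pair (W, Y) by appending v to Y and β·(I − W·Yᵀ)·v to W.

```python
import numpy as np

def householder(x):
    """Reflector data (v, beta) with (I - beta v v^T) x proportional to e1."""
    v = x.astype(float).copy()
    v[0] += np.copysign(np.linalg.norm(x), x[0])
    beta = 2.0 / (v @ v)
    return v, beta

def wy_accumulate(reflectors, n):
    """Fold reflectors H1, H2, ... into (W, Y) with H1 H2 ... Hk = I - W @ Y.T."""
    W = np.zeros((n, 0))
    Y = np.zeros((n, 0))
    for v, beta in reflectors:
        Qv = v - W @ (Y.T @ v)             # apply the current Q = I - W Y^T to v
        W = np.column_stack([W, beta * Qv])
        Y = np.column_stack([Y, v])
    return W, Y

# check against the explicit product of reflectors
rng = np.random.default_rng(3)
n = 6
refl = [householder(rng.standard_normal(n)) for _ in range(3)]
W, Y = wy_accumulate(refl, n)
Q = np.eye(n)
for v, beta in refl:
    Q = Q @ (np.eye(n) - beta * np.outer(v, v))
```

Applying the block transform from the left, as in the PRE step above, is then B ← B − Y (Wᵀ B): two matrix-matrix products rather than a sequence of rank-1 updates, which is what makes the blocked approach cache-friendly.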

82 | Block reduction of matrices to condensed forms for eigenvalue computations
- Dongarra, Sorensen, et al.
- 1989
Citation Context: ...diagonal form is a major step in eigenvalue computations for symmetric matrices. If the matrix is full, the conventional Householder tridiagonalization approach [13, p. 276] or block variants thereof [12] is the method of choice. These two approaches also underlie the parallel implementations described for example in [15] and [10]. The approach described in this paper, on the other hand, follows the b...

53 | A look at scalable dense linear algebra libraries
- Dongarra, van de Geijn, et al.
- 1992
Citation Context: ...is is the same kernel as in the POST step. 3 A Parallel Implementation. To distribute the matrix across a distributed-memory machine, we chose a blocked two-dimensional torus wrapping (see, for example, [17, 11, 10]). A scalar wrap mapping and one-dimensional (i.e. row- or column-oriented) distributions are special cases of this mapping. This mapping also has been selected in other efforts to develop linear algebr...
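The blocked two-dimensional torus wrap mapping mentioned above can be sketched in a few lines (an illustration under assumed sizes, not the paper's implementation): block (bi, bj) of the matrix is assigned to process (bi mod p, bj mod p) on a p × p mesh.

```python
from collections import defaultdict

def torus_wrap_owners(n, nb, p):
    """Map every nb-by-nb block of an n-by-n matrix to its owning process
    under a blocked 2D torus wrap (block-cyclic) distribution on a p x p mesh."""
    owners = defaultdict(list)
    nblocks = (n + nb - 1) // nb
    for bi in range(nblocks):
        for bj in range(nblocks):
            owners[(bi % p, bj % p)].append((bi, bj))
    return owners

owners = torus_wrap_owners(n=16, nb=2, p=4)
# An 8x8 grid of blocks on a 4x4 mesh: each process owns 4 blocks. Setting
# nb=1 recovers the scalar wrap, and wrapping in only one mesh dimension
# gives row/column distributions -- the special cases noted in the text.
```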

31 | Reduction to condensed form for the eigenvalue problem on distributed memory computers. Computer Science Dept
- Dongarra, van de Geijn
- 1991
Citation Context: ...lder tridiagonalization approach [13, p. 276] or block variants thereof [12] is the method of choice. These two approaches also underlie the parallel implementations described for example in [15] and [10]. The approach described in this paper, on the other hand, follows the band reduction framework suggested by Bischof and Sun [7]. The standard approach, which eliminates all subdiagonals at one time, ...

28 | Optimal broadcasting in mesh-connected architectures
- Barnett, Payne, van de Geijn
- 1991
Citation Context: ...e that the matrix size divides evenly by the block size. We also note that the Chamelon tools currently provide only unoptimized "fan-in/fan-out" broadcast and global sum primitives (see, for example, [22, 3]) which are substantially slower than primitives that are optimized for the Intel Delta (e.g. [2, 16]). These issues will be addressed in future versions of our code. 4 Preliminary Performance Results...

26 | On Jacobi rotation patterns
- Rutishauser
- 1963
Citation Context: ...ial case, but "piecemeal" approaches are possible as well, as illustrated in Figure 1. The "piecemeal" approach was shown to be attractive in comparison to previously suggested band reduction schemes [20, 21, 18, 19] in that it allows tradeoffs between flops and storage. This work was supported by the Applied and Computational Mathematics Program, Defense Advanced Research Projects Agency, under contract DM28E041...

18 | On global combine operations
- van de Geijn
- 1994
Citation Context: ...e that the matrix size divides evenly by the block size. We also note that the Chamelon tools currently provide only unoptimized "fan-in/fan-out" broadcast and global sum primitives (see, for example, [22, 3]) which are substantially slower than primitives that are optimized for the Intel Delta (e.g. [2, 16]). These issues will be addressed in future versions of our code. 4 Preliminary Performance Results...

17 | Matrix Multiplication on the Intel Touchstone DELTA
- Huss-Lederman, Jacobson, et al.
- 1994
Citation Context: ...ly provide only unoptimized "fan-in/fan-out" broadcast and global sum primitives (see, for example, [22, 3]) which are substantially slower than primitives that are optimized for the Intel Delta (e.g. [2, 16]). These issues will be addressed in future versions of our code. 4 Preliminary Performance Results. In this section, we present preliminary performance results that we have obtained with a double-prec...

15 | Banded eigenvalue solvers on vector machines
- Kaufman
- 1984
Citation Context: ...ial case, but "piecemeal" approaches are possible as well, as illustrated in Figure 1. The "piecemeal" approach was shown to be attractive in comparison to previously suggested band reduction schemes [20, 21, 18, 19] in that it allows tradeoffs between flops and storage. This work was supported by the Applied and Computational Mathematics Program, Defense Advanced Research Projects Agency, under contract DM28E041...

14 | Tridiagonalization of a symmetric band matrix
- Schwarz
- 1968
Citation Context: ...ial case, but "piecemeal" approaches are possible as well, as illustrated in Figure 1. The "piecemeal" approach was shown to be attractive in comparison to previously suggested band reduction schemes [20, 21, 18, 19] in that it allows tradeoffs between flops and storage. This work was supported by the Applied and Computational Mathematics Program, Defense Advanced Research Projects Agency, under contract DM28E041...

12 | A parallel implementation of the invariant subspace decomposition algorithm for dense symmetric matrices
- Huss-Lederman, Tsao, et al.
- 1993
Citation Context: ...is is the same kernel as in the POST step. 3 A Parallel Implementation. To distribute the matrix across a distributed-memory machine, we chose a blocked two-dimensional torus wrapping (see, for example, [17, 11, 10]). A scalar wrap mapping and one-dimensional (i.e. row- or column-oriented) distributions are special cases of this mapping. This mapping also has been selected in other efforts to develop linear algebr...

8 | Computing the singular value decomposition on a distributed system of vector processors
- Bischof
- 1989
Citation Context: [Fig. 3: First Block Bandreduction Step. Fig. 4: Chasing the Bulge.] ...employing a memory hierarchy, since it greatly reduces the amount of data movement required [8, 4, 5, 1]. To see how this algorithmic primitive is used, let us consider an example. Figure 3 shows the reduction of d subdiagonals in nb columns of a matrix with initial bandwidth b. In this figure, as in th...

6 | The torus-wrap mapping for dense matrix calculations on massively parallel computers
- Hendrickson, Womble
- 1992
Citation Context: ...l Householder tridiagonalization approach [13, p. 276] or block variants thereof [12] is the method of choice. These two approaches also underlie the parallel implementations described for example in [15] and [10]. The approach described in this paper, on the other hand, follows the band reduction framework suggested by Bischof and Sun [7]. The standard approach, which eliminates all subdiagonals at o...

5 | Efficient communication primitives on mesh architectures with hardware routing
- Barnett, Littlefield, van de Geijn, et al.
- 1993
Citation Context: ...ly provide only unoptimized "fan-in/fan-out" broadcast and global sum primitives (see, for example, [22, 3]) which are substantially slower than primitives that are optimized for the Intel Delta (e.g. [2, 16]). These issues will be addressed in future versions of our code. 4 Preliminary Performance Results. In this section, we present preliminary performance results that we have obtained with a double-prec...

5 | A divide-and-conquer method for computing complementary invariant subspaces of symmetric matrices
- Bischof, Sun
- 1992
Citation Context: ...ming system that our implementation builds on. Lastly we mention that, in the end, we plan to exploit the divide-and-conquer nature of the tridiagonalization of matrices with only 0 and 1 eigenvalues [6] as it arises in the ISDA eigenvalue solver framework [17]. With the 2D torus-wrapped data mapping, this approach would not significantly reduce the communication requirements of the code, but would a...
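A small sketch of why matrices with only 0 and 1 eigenvalues invite divide and conquer (my illustration; the matrix here is an arbitrary orthogonal projector, not data from the paper): such a matrix is idempotent, and its range and null space are complementary invariant subspaces on which the eigenproblem splits into smaller independent pieces.

```python
import numpy as np

# A symmetric matrix whose eigenvalues are all 0 or 1 is an orthogonal
# projector P = Q Q^T onto the column space of Q (Q has orthonormal columns).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # orthonormal basis, n=6, k=3
P = Q @ Q.T

assert np.allclose(P @ P, P)      # idempotent: P acts as identity on range(P)
evals = np.linalg.eigvalsh(P)     # eigenvalues are 0 (x3) and 1 (x3)
# range(P) and null(P) are complementary invariant subspaces, so the
# problem decouples into two independent, smaller subproblems.
```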

2 | Parallele Reduktion symmetrischer Bandmatrizen auf Tridiagonalgestalt [Parallel reduction of symmetric band matrices to tridiagonal form]
- Lang
- 1991

1 | LAPACK for distributed-memory machines: The next generation
- Demmel, Dongarra, et al.
- 1993
Citation Context: ...ns) are special cases of this mapping. This mapping also has been selected in other efforts to develop linear algebra basis software for massively parallel machines, for example the ScaLAPACK project [9, 11]. We also assume that our underlying hardware is logically configured as a p × p mesh, and that only one process is active on every processor. While the code relies heavily on the key primitive s...

1 | Chamelon: Parallel programming tools user manual, tech
- Gropp, Smith
- 1993
Citation Context: ...e row or column of the mesh. Hence, to develop a portable code, and to allow a maintainable implementation of this code, we chose to base our implementation on the Chamelon parallel programming tools [14]. Chamelon's primitives (such as broadcast or global summation) support arbitrary process groups, and several such "computational contexts" may be active at any given point in time. This greatly simpl...