The spectral decomposition of nonsymmetric matrices on distributed memory parallel computers
- SIAM J. Sci. Comput
, 1997
Cited by 29 (10 self)
Abstract. The implementation and performance of a class of divide-and-conquer algorithms for computing the spectral decomposition of nonsymmetric matrices on distributed memory parallel computers are studied in this paper. After presenting a general framework, we focus on a spectral divide-and-conquer (SDC) algorithm with Newton iteration. Although the algorithm requires several times as many floating point operations as the best serial QR algorithm, it can be simply constructed from a small set of highly parallelizable matrix building blocks within Level 3 basic linear algebra subroutines (BLAS). Efficient implementations of these building blocks are available on a wide range of machines. In some ill-conditioned cases, the algorithm may lose numerical stability, but this can easily be detected and compensated for. The algorithm reached 31% efficiency with respect to the underlying PUMMA matrix multiplication and 82% efficiency with respect to the underlying ScaLAPACK matrix inversion on a 256 processor Intel Touchstone Delta system, and 41% efficiency with respect to the matrix multiplication in CMSSL on a 32 node Thinking Machines CM-5 with vector units. Our performance model predicts the performance reasonably accurately. To take advantage of the geometric nature of SDC algorithms, we have designed a graphical user interface to let the user choose the spectral decomposition according to specified regions in the complex plane.
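The abstract names an SDC algorithm with Newton iteration but gives no formulas. As a rough illustration only, the sketch below assumes the iteration is the standard Newton iteration for the matrix sign function, X <- (X + X^-1) / 2, and that the splitting region is the right half-plane. The function names (`mat_inv`, `matrix_sign`, `right_half_plane_projector`) are hypothetical, and the code is a small serial pure-Python toy, not the paper's distributed implementation.

```python
def mat_inv(a):
    """Invert a square matrix (list of rows) by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(a)
    # Augment A with the identity: [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("matrix is numerically singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matrix_sign(a, tol=1e-12, max_iter=100):
    """Newton iteration X <- (X + X^-1) / 2 starting from X = A.

    Assumption: A has no eigenvalues on the imaginary axis, in which
    case the iterates converge to sign(A).
    """
    x = [[float(v) for v in row] for row in a]
    for _ in range(max_iter):
        inv = mat_inv(x)
        nxt = [[0.5 * (xv + iv) for xv, iv in zip(xr, ir)]
               for xr, ir in zip(x, inv)]
        delta = max(abs(new - old)
                    for nr, xr in zip(nxt, x)
                    for new, old in zip(nr, xr))
        x = nxt
        if delta < tol:
            return x
    raise RuntimeError("Newton iteration did not converge")


def right_half_plane_projector(a):
    """Spectral projector (I + sign(A)) / 2 onto the invariant subspace
    for eigenvalues with positive real part; its trace counts them."""
    s = matrix_sign(a)
    n = len(s)
    return [[0.5 * (s[i][j] + (1.0 if i == j else 0.0)) for j in range(n)]
            for i in range(n)]
```

For the triangular matrix `[[2.0, 1.0], [0.0, -3.0]]`, with eigenvalues 2 and -3, the trace of the projector is 1, matching the single eigenvalue in the right half-plane. Each step needs only a matrix inverse and an addition, which is consistent with the abstract's point about a small set of parallelizable building blocks.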
A Parallel Object-Oriented System for Realizing Reusable and Efficient Data Abstractions
, 1993
Practical Issues Related to Developing Object-Oriented Numerical Libraries
- In OON-SKI'94 Proceedings of the Second Annual Object-Oriented Numerics Conference
, 1994
Cited by 6 (5 self)
In this paper a tool is presented for the development of numerical libraries for an object-oriented environment. The development of libraries has become more complicated as libraries have become larger and the subroutines have become more complex. In the object-oriented environment this development is further complicated by the object interface which usually resides between the application and the numerical subroutines. The tools being presented here are meant to help the development of high-performance libraries.
1 Introduction
In this paper we present a tool for the development of numerical libraries for an object-oriented environment. We believe that library designers need tools such as these in order to develop, and maintain, increasingly complex numerical libraries for high performance systems, including both sequential and parallel architectures. Without such tools it will be impossible for the library designer to be responsive to the requests of the application programmers fo...
Superscalar Performance in a Multithreaded Microprocessor
, 1993
Cited by 4 (0 self)
Multithreaded processors, having hardware support for the concurrent execution of fine-grained threaded computations, are noted for their latency tolerance and low-cost synchronization. Multithreading is a technique for improving the utilization of processing elements (PEs) in parallel processing systems, thereby reducing cost/performance ratios. With increasing integrated circuit densities it is becoming feasible to integrate several PEs onto a single die, and further diminish the physical dimensions of parallel systems. However, by eliminating the artificial on-chip PE boundaries and sharing expensive resources in a more tightly coupled multithreaded architecture, even greater performance can be achieved from similar hardware. A multithreaded processor architecture (Concurro) was designed for possible microprocessor implementation with the objective of multiple instruction issues per cycle (sustained superscalar performance) by means of multithreading. This thesis considers the tra...
Parallel image processing system on a cluster of personal computers
- in VECPAR 2000, 4th Int. Conf
, 2001
Cited by 2 (0 self)
Abstract. The most demanding image processing applications require real time processing, often using special purpose hardware. The work herein presented refers to the application of cluster computing for off-line image processing, where the end user benefits from the operation of otherwise idle processors in the local LAN. The virtual parallel computer is composed of off-the-shelf personal computers connected by a low-cost network, such as a 10 Mbits/s Ethernet. The aim is to minimise the processing time of a high level image processing package. The system developed to manage the parallel execution is described and some results obtained for the parallelisation of high level image processing algorithms are discussed, namely for active contour and modal analysis methods which require the computation of the eigenvectors of a symmetric matrix.
Evaluation And Exploitation Of Locality In The Data Driven Execution Model
, 1993
Abstract of Ph.D. dissertation. The advent of hybrid von Neumann-data driven architectures arose from a desire to combine the most salient features of coarse grain von Neumann and fine-grain data driven models. Hybrid architectures achieve high performance through concurrent execution and the exploitation of program and data locality in conjunction with a reduction in the execution overhead associated with instruction-level synchronization inherent in classical dataflow processors. This is accomplished by increasing the task granularity from one to multiple instructions and executing the resulting tasks as traditional von Neumann instruction threads. The addition of coarse grain features has brought significant improvements to the basic data driven model. Each of these coarse grain features addresses one aspect of the tradeoff between locality and fine grain parallelism. The characteristics and nature of these are relatively we...
Optimization of BLAS on the Cell Processor
The unique architecture of the heterogeneous multi-core Cell processor offers great potential for high performance computing. It offers features such as high memory bandwidth using DMA, user managed local stores and SIMD architecture. In this paper, we present strategies for leveraging these features to develop a high performance BLAS library. We propose techniques to partition and distribute data across SPEs for handling DMA efficiently. We show that suitable pre-processing of data leads to significant performance improvements, particularly when data is unaligned. In addition, we use a combination of two kernels – a specialized high performance kernel for the more frequently occurring cases and a generic kernel for handling boundary cases – to obtain better performance. Using these techniques for double precision, we obtain up to 70-80% of peak performance for different memory bandwidth bound BLAS level 1 and 2 routines and up to 80-90% for computation bound BLAS level 3 routines.
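The two-kernel idea in this abstract can be pictured with a small sketch. It assumes a BLAS level 1 AXPY operation (y <- alpha * x + y) and a hypothetical block width of 4. All names here (`BLOCK`, `_axpy_block_kernel`, `_axpy_generic_kernel`, `axpy`) are illustrative and not taken from the paper, and the code does not model SPEs, DMA transfers, or SIMD instructions. It shows only how work is split between a fixed-size fast path and a generic path for the leftover elements.

```python
BLOCK = 4  # hypothetical block width handled by the specialized kernel


def _axpy_block_kernel(alpha, x, y, start):
    """Specialized kernel: exactly BLOCK elements, fully unrolled.

    Stands in for the hand-tuned fast path used for the common case.
    """
    y[start] += alpha * x[start]
    y[start + 1] += alpha * x[start + 1]
    y[start + 2] += alpha * x[start + 2]
    y[start + 3] += alpha * x[start + 3]


def _axpy_generic_kernel(alpha, x, y, start, stop):
    """Generic kernel: any element range, used for boundary cases."""
    for i in range(start, stop):
        y[i] += alpha * x[i]


def axpy(alpha, x, y):
    """Compute y <- alpha * x + y in place and return y.

    Full blocks go to the specialized kernel; the remainder that does
    not fill a block goes to the generic kernel.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    main_end = n - n % BLOCK  # end of the last full block
    for start in range(0, main_end, BLOCK):
        _axpy_block_kernel(alpha, x, y, start)
    _axpy_generic_kernel(alpha, x, y, main_end, n)
    return y
```

With six elements, for example, the first four go through the block kernel and the last two through the generic kernel. On real hardware the split would more likely follow alignment and transfer-size constraints than a plain element count.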

