Software libraries for linear algebra computations on high performance computers
SIAM Review, 1995
Cited by 68 (17 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct highe...
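As a rough illustration of the block-partitioned idea mentioned in this abstract, the sketch below multiplies two matrices one tile at a time, so that each tile is reused while it is still in fast memory. It is not taken from LAPACK or ScaLAPACK; the function name `blocked_matmul` and the `block` parameter are hypothetical, and plain Python lists stand in for real array storage.

```python
def blocked_matmul(a, b, block=2):
    """Multiply matrices a (n x m) and b (m x p), given as lists of lists,
    by iterating over block x block tiles (illustrative sketch only)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer loops walk over tiles; inner loops work inside one tile triple.
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(a, b))
```

The result is independent of `block`; only the order in which elements are touched changes, which is the property that matters for memory traffic.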
Hierarchical Scalable Photonic Architectures for High-Performance Processor Interconnection
IEEE Transactions on Computers, 1993
Cited by 34 (8 self)
This paper introduces two hierarchical optical structures for processor interconnection and compares their performance through analytic models and discrete-event simulation. Both architectures are based on wavelength division multiplexing (WDM), which enables multiple multi-access channels to be realized on a single optical fiber. The objective of the hierarchical architectures is to achieve scalability yet avoid the requirement of multiple wavelength tunable devices per node. Furthermore, both hierarchical architectures are single-hop: a packet remains in the optical form from source to destination and does not require cross dimensional intermediate routing. The first structure is physically hierarchical but wavelength flat: all nodes share the same wavelength space. The second structure is a wavelength multiplexed hierarchical structure with wavelength channel reuse at each level, allowing it to be scaled to very large system sizes. It employs acousto-optic tunable filters in conjunc...
k-ary n-trees: High Performance Networks for Massively Parallel Architectures
In Proceedings of the 11th International Parallel Processing Symposium, IPPS'97, 1997
Cited by 27 (8 self)
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we study the communication performance of a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of an adaptive algorithm that utilize wormhole routing with one, two and four virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy. In all these cases, the saturation points are between 35-40% of the network capacity with one virtual channel, 55-60% with two virtual channels and around 75% with four virtual channels. The complement traffic, a representative of the class of the congestion-free communication patterns, reaches an optimal performance, with a saturation point at 97% of the capacity for all flow control strategies.
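For readers unfamiliar with the traffic patterns named in this abstract, the sketch below computes destinations under the commonly used textbook definitions of complement, bit reversal and transpose traffic. These definitions are an assumption here, since the abstract does not spell them out, and the function names are hypothetical. With 256 nodes, each address has 8 bits.

```python
def complement(src, bits):
    """Destination whose address is the bitwise complement of the source."""
    return ~src & ((1 << bits) - 1)


def bit_reversal(src, bits):
    """Destination whose address is the source address read backwards."""
    return int(format(src, "0{}b".format(bits))[::-1], 2)


def transpose(src, bits):
    """Destination obtained by swapping the upper and lower halves of the
    address (assumes an even number of address bits)."""
    half = bits // 2
    low = src & ((1 << half) - 1)
    high = src >> half
    return (low << half) | high


if __name__ == "__main__":
    bits = 8  # 256 nodes, as in the simulated 4-ary 4-tree
    for src in (0, 1, 6, 15):
        print(src, complement(src, bits), bit_reversal(src, bits),
              transpose(src, bits))
```

The uniform pattern is not shown because it simply picks each destination at random with equal probability.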
LIGHTNING Network and Systems Architecture
1996
Cited by 16 (0 self)
LIGHTNING is a dynamically reconfigurable WDM network testbed project for supercomputer interconnection. This paper describes a hierarchical WDM-based optical network testbed project that is being constructed to interconnect a large number of supercomputers and create a distributed shared memory environment. The objective of the hierarchical architecture is to achieve scalability yet avoid the requirement of multiple wavelength tunable devices per node. Furthermore, single-hop all-optical communication is achieved: a packet remains in the optical form from source to destination and does not require intermediate routing. The wavelength multiplexed hierarchical structure features wavelength channel reuse at each level, allowing scalability to very large system sizes. It partitions the traffic between different levels of the hierarchy without electronic intervention in a combination of wavelength and space-division multiplexing. A significant advantage of this approach is its ability to ...
The Design of Linear Algebra Libraries for High Performance Computers
1993
Cited by 16 (1 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct ...
Communication and Computation Performance of the CM-5
In Supercomputing'93, 1993
Cited by 14 (0 self)
The Thinking Machines CM-5 is one of the first of a new generation of massively parallel systems. To assess the scalability of the CM-5's computation and interprocessor communication rates, we used a series of benchmarks to measure the performance of the CM-5 data and control networks, the node vector units, and the balance of computation and communication. At the application level, we found the achievable communication bandwidth and processing rates to be roughly fifty percent and forty percent, respectively, of the corresponding theoretical peak rates. Our early assessment is that the CM-5 is scalable but that a better balance of communication and processing rates would increase its effectiveness. 1 Introduction The Thinking Machines CM-5 is a member of the new generation of massively parallel systems. Although the CM-5 was announced in 1991, only recently have large CM-5 configurations appeared with vector units, compilers that allow Fortran programs to execute independently on eac...
Scotch 3.1 User's Guide
1996
Cited by 13 (0 self)
The efficient execution of a parallel program on a parallel machine requires good placement of the communicating processes of the program onto the processors of the machine. When both the program and the machine are modeled in terms of weighted unoriented graphs, this problem amounts to static graph mapping. This document describes the capabilities and operations of Scotch, a software package devoted to graph mapping, based on the Dual Recursive Bipartitioning algorithm. Predefined mapping strategies allow for recursive application of any of several graph bipartitioning methods, including Fiduccia-Mattheyses, Gibbs-Poole-Stockmeyer, and multilevel methods. Scotch can map any weighted process graph onto any weighted target graph, whether they are connected or not. We give brief descriptions of the algorithm and bipartitioning methods, detail the input/output formats, instructions for use, and installation procedures, and provide a number of examples.
Performance Analysis of Wormhole Routed k-ary n-trees
1998
Cited by 10 (5 self)
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we formalize a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. A simple adaptive routing algorithm for k-ary n-trees sends each message to one of the nearest common ancestors of both source and destination, choosing the less loaded physical channels, and then reaches the destination following the unique available path. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of the adaptive algorithm that utilize wormhole routing with 1, 2 and 4 virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy.
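The two-phase routing described in this abstract (ascend to a nearest common ancestor, then descend along the unique path) can be sketched as follows. The sketch assumes, as a labeling convention not stated in the abstract, that each of the k^n processing nodes is identified by n base-k digits and that the number of levels a message must climb equals n minus the length of the digit prefix shared by source and destination. The function names are hypothetical, and the adaptive choice among less loaded channels is not modeled.

```python
def to_digits(node, k, n):
    """Return the n base-k digits of a node index, most significant first."""
    digits = []
    for _ in range(n):
        digits.append(node % k)
        node //= k
    return digits[::-1]


def ascent_levels(src, dst, k, n):
    """Levels to climb before reaching a nearest common ancestor
    (assumed labeling: shared digit prefix selects a common subtree)."""
    if not (0 <= src < k ** n and 0 <= dst < k ** n):
        raise ValueError("node index out of range")
    if src == dst:
        return 0
    s, d = to_digits(src, k, n), to_digits(dst, k, n)
    shared = 0
    while s[shared] == d[shared]:
        shared += 1
    return n - shared


def path_links(src, dst, k, n):
    """Links traversed: the ascent plus an equally long descent."""
    return 2 * ascent_levels(src, dst, k, n)


if __name__ == "__main__":
    k, n = 4, 4  # the 4-ary 4-tree with 4**4 = 256 nodes from the abstract
    for src, dst in ((0, 1), (0, 4), (0, 255)):
        print(src, dst, ascent_levels(src, dst, k, n), path_links(src, dst, k, n))
```

Under this assumption, neighbours that differ only in the last digit meet at their shared leaf switch, while nodes that differ in the first digit must climb to the top level.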
Scalable Data Parallel Algorithms for Texture Synthesis using Gibbs Random Fields
University of Maryland, College Park, MD, 1993
Cited by 7 (2 self)
This paper introduces scalable data parallel algorithms for image processing. Focusing on Gibbs and Markov Random Field model representation for textures, we present parallel algorithms for texture synthesis, compression, and maximum likelihood parameter estimation, currently implemented on Thinking Machines CM-2 and CM-5. Use of fine-grained, data parallel processing techniques yields real-time algorithms for texture synthesis and compression that are substantially faster than the previously known sequential implementations. Although current implementations are on Connection Machines, the methodology presented here enables machine independent scalable algorithms for a number of problems in image processing and analysis.
Efficient Personalized Communication on Wormhole Networks
1997 International Conference on Parallel Architectures and Compilation Techniques, PACT'97, 1997
Cited by 4 (4 self)
Bridging models, such as the BSP, tend to abstract the characteristics of the interconnection networks using a small set of parameters, by dividing the computation in supersteps and organizing the communication in global patterns called h-relations. In this paper we evaluate, through experimental results conducted on a wormhole-routed bidimensional torus and a quaternary fat-tree with 256 processing nodes, the execution time of three families of h-relations with variable degree of imbalance. We also prove a strong result that links the communication performance of the fat-tree with the BSP abstraction of the interconnection network. Given a generic h-relation, we can provide a value of g that, in the worst case, slightly overestimates the completion time and is very close to optimality.
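To make the role of g concrete, the sketch below uses the standard BSP cost model, in which an h-relation is a communication pattern where every processor sends and receives at most h messages, and its cost is estimated as g * h + l. The traffic matrix and the values of g and l are made up for illustration and are not results from the paper; the function names are hypothetical.

```python
def h_relation_degree(traffic):
    """Return h for a traffic matrix where traffic[i][j] is the number of
    messages processor i sends to processor j."""
    p = len(traffic)
    sent = [sum(row) for row in traffic]
    received = [sum(traffic[i][j] for i in range(p)) for j in range(p)]
    # h is the largest number of messages any processor sends or receives.
    return max(max(sent), max(received))


def bsp_comm_time(traffic, g, l=0.0):
    """Estimated communication time of one superstep under the BSP model:
    g per message of the h-relation, plus the synchronization cost l."""
    return g * h_relation_degree(traffic) + l


if __name__ == "__main__":
    # Hypothetical 3-processor pattern; g and l are illustrative values only.
    traffic = [
        [0, 2, 1],
        [1, 0, 0],
        [3, 0, 0],
    ]
    print(h_relation_degree(traffic))
    print(bsp_comm_time(traffic, g=2.5, l=10.0))
```

An imbalanced pattern, where one processor receives far more than the others, raises h and therefore the estimate even if the total message count stays the same.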