Software libraries for linear algebra computations on high performance computers
- SIAM REVIEW
, 1995
Cited by 66 (17 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct highe...
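The block-partitioning idea in this abstract can be illustrated with a small sketch. This is not code from LAPACK or ScaLAPACK; the function name `blocked_matmul` and the `block` parameter are assumptions made for illustration. The point it shows is that the multiplication is organized tile by tile, so each tile of the operands is reused many times once it has been brought into fast memory.

```python
def blocked_matmul(a, b, block=2):
    """Multiply a (n x m) by b (m x p) one tile at a time.

    Illustrative sketch only: `block` is a hypothetical tile size that,
    on a real machine, would be tuned to the cache or local memory.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops do the
    # arithmetic inside one (i0, j0, k0) tile triple.
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(a, b))
```

The result is the same as an unblocked triple loop; only the order of the operations changes, which is what reduces data movement between memory levels.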
Hierarchical Scalable Photonic Architectures for High-Performance Processor Interconnection
- IEEE Transactions on Computers
, 1993
Cited by 33 (8 self)
This paper introduces two hierarchical optical structures for processor interconnection and compares their performance through analytic models and discrete-event simulation. Both architectures are based on wavelength division multiplexing (WDM) which enables multiple multi-access channels to be realized on a single optical fiber. The objective of the hierarchical architectures is to achieve scalability yet avoid the requirement of multiple wavelength tunable devices per node. Furthermore, both hierarchical architectures are single-hop: a packet remains in the optical form from source to destination and does not require cross dimensional intermediate routing. The first structure is physically hierarchical but wavelength flat: all nodes share the same wavelength space. The second structure is a wavelength multiplexed hierarchical structure with wavelength channel re-use at each level, allowing it to be scaled to very large system sizes. It employs acousto-optic tunable filters in conjunc...
k-ary n-trees: High Performance Networks for Massively Parallel Architectures
- In Proceedings of the 11th International Parallel Processing Symposium, IPPS'97
, 1997
Cited by 16 (8 self)
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we study the communication performance of a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of an adaptive algorithm that utilize wormhole routing with one, two and four virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy. In all these cases, the saturation points are between 35-40% of the network capacity with one virtual channel, 55-60% with two virtual channels and around 75% with four virtual channels. The complement traffic, a representative of the class of the congestion-free communication patterns, reaches an optimal performance, with a saturation point at 97% of the capacity for all flow control strategies.
LIGHTNING Network and Systems Architecture
, 1996
Cited by 14 (0 self)
LIGHTNING is a dynamically reconfigurable WDM network testbed project for supercomputer interconnection. This paper describes a hierarchical WDM-based optical network testbed project that is being constructed to interconnect a large number of supercomputers and create a distributed shared memory environment. The objective of the hierarchical architecture is to achieve scalability yet avoid the requirement of multiple wavelength tunable devices per node. Furthermore, single-hop all-optical communication is achieved: a packet remains in the optical form from source to destination and does not require intermediate routing. The wavelength multiplexed hierarchical structure features wavelength channel re-use at each level, allowing scalability to very large system sizes. It partitions the traffic between different levels of the hierarchy without electronic intervention in a combination of wavelength- and space-division multiplexing. A significant advantage of this approach is its ability to ...
Communication and Computation Performance of the CM-5
- in Supercomputing'93
, 1993
Cited by 13 (0 self)
The Thinking Machines CM-5 is one of the first of a new generation of massively parallel systems. To assess the scalability of the CM-5's computation and interprocessor communication rates, we used a series of benchmarks to measure the performance of the CM-5 data and control networks, the node vector units, and the balance of computation and communication. At the application level, we found the achievable communication bandwidth and processing rates to be roughly fifty percent and forty percent, respectively, of the corresponding theoretical peak rates. Our early assessment is that the CM-5 is scalable but that a better balance of communication and processing rates would increase its effectiveness.
1 Introduction
The Thinking Machines CM-5 is a member of the new generation of massively parallel systems. Although the CM-5 was announced in 1991, only recently have large CM-5 configurations appeared with vector units, compilers that allow Fortran programs to execute independently on eac...
The Design of Linear Algebra Libraries for High Performance Computers
, 1993
Cited by 13 (1 self)
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct ...
Performance Analysis of Wormhole Routed k-ary n-trees
, 1998
Cited by 10 (5 self)
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we formalize a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. A simple adaptive routing algorithm for k-ary n-trees sends each message to one of the nearest common ancestors of both source and destination, choosing the less loaded physical channels, and then reaches the destination following the unique available path. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of the adaptive algorithm that utilize wormhole routing with 1, 2 and 4 virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy.
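The routing scheme described in this abstract (ascend to a nearest common ancestor, then descend along the unique path) can be sketched numerically. The sketch below is not taken from the paper. It assumes that the k^n processing nodes are labeled with n base-k digits and that two nodes sharing their first m digits meet at level m of the tree, with the roots at level 0. The function names are hypothetical.

```python
def to_digits(node, k, n):
    """Return the n base-k digits of a node label, most significant first."""
    digits = []
    for _ in range(n):
        digits.append(node % k)
        node //= k
    return digits[::-1]


def common_prefix_len(src, dst, k, n):
    """Number of leading base-k digits shared by the two node labels."""
    s, d = to_digits(src, k, n), to_digits(dst, k, n)
    m = 0
    while m < n and s[m] == d[m]:
        m += 1
    return m


def route_summary(src, dst, k, n):
    """Return (links traversed, number of nearest common ancestors).

    Assumed labeling: a shared prefix of length m means the message climbs
    n - m links and descends n - m links. During the ascent, each of the
    n - 1 - m switch-to-switch steps offers k up-links, which is where an
    adaptive algorithm can pick the less loaded channel.
    """
    if src == dst:
        return (0, 0)
    m = common_prefix_len(src, dst, k, n)
    return (2 * (n - m), k ** (n - 1 - m))


if __name__ == "__main__":
    k, n = 4, 4  # the 4-ary 4-tree with 256 nodes mentioned in the abstract
    print(k ** n)
    print(route_summary(0, 1, k, n))    # neighbours under the same leaf switch
    print(route_summary(0, 255, k, n))  # nodes that only meet at the roots
```

Under these assumptions, only the ascending phase involves a choice. Once a common ancestor is reached, the descent is fully determined by the destination's digits.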
Scotch 3.1 User's Guide
, 1996
Cited by 10 (0 self)
The efficient execution of a parallel program on a parallel machine requires good placement of the communicating processes of the program onto the processors of the machine. When both the program and the machine are modeled in terms of weighted unoriented graphs, this problem amounts to static graph mapping. This document describes the capabilities and operations of Scotch, a software package devoted to graph mapping, based on the Dual Recursive Bipartitioning algorithm. Predefined mapping strategies allow for recursive application of any of several graph bipartitioning methods, including Fiduccia-Mattheyses, Gibbs-Poole-Stockmeyer, and multi-level methods. Scotch can map any weighted process graph onto any weighted target graph, whether they are connected or not. We give brief descriptions of the algorithm and bipartitioning methods, detail the input/output formats, instructions for use, and installation procedures, and provide a number of examples.
Scalable Data Parallel Algorithms for Texture Synthesis using Gibbs Random Fields
- University of Maryland, College Park, MD
, 1993
Cited by 6 (2 self)
This paper introduces scalable data parallel algorithms for image processing. Focusing on Gibbs and Markov Random Field model representation for textures, we present parallel algorithms for texture synthesis, compression, and maximum likelihood parameter estimation, currently implemented on Thinking Machines CM-2 and CM-5. Use of fine-grained, data parallel processing techniques yields real-time algorithms for texture synthesis and compression that are substantially faster than the previously known sequential implementations. Although current implementations are on Connection Machines, the methodology presented here enables machine independent scalable algorithms for a number of problems in image processing and analysis.
Efficient Personalized Communication on Wormhole Networks
- 1997 International Conference on Parallel Architectures and Compilation Techniques, PACT'97
, 1997
Cited by 4 (4 self)
Bridging models, such as the BSP, tend to abstract the characteristics of the interconnection networks using a small set of parameters, by dividing the computation into supersteps and organizing the communication in global patterns called h-relations. In this paper we evaluate, through experimental results conducted on a wormhole-routed bi-dimensional torus and a quaternary fat-tree with 256 processing nodes, the execution time of three families of h-relations with variable degree of imbalance. We also prove a strong result that links the communication performance of the fat-tree with the BSP abstraction of the interconnection network. Given a generic h-relation, we can provide a value of g that, in the worst case, slightly overestimates the completion time and is very close to optimality.
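For readers unfamiliar with h-relations, the standard BSP definition can be written out as a short sketch. This follows the textbook BSP cost model rather than anything specific to the paper: h is the largest number of messages any processor sends or receives in a superstep, and the communication cost is estimated as g * h + l. The traffic-matrix representation and the function names are assumptions made for illustration.

```python
def h_relation_degree(traffic):
    """Return h for a superstep.

    traffic[i][j] is the number of messages processor i sends to j
    (hypothetical representation; diagonal entries are ignored).
    """
    p = len(traffic)
    sent = [sum(traffic[i][j] for j in range(p) if j != i) for i in range(p)]
    received = [sum(traffic[i][j] for i in range(p) if i != j) for j in range(p)]
    return max(max(sent), max(received))


def bsp_comm_cost(traffic, g, l):
    """Textbook BSP estimate g * h + l for one superstep's communication.

    g is the per-message gap and l the synchronization latency.
    """
    return g * h_relation_degree(traffic) + l


if __name__ == "__main__":
    traffic = [
        [0, 2, 1],  # processor 0 sends 3 messages
        [1, 0, 0],  # processor 1 sends 1 message
        [3, 0, 0],  # processor 2 sends 3 messages; processor 0 receives 4
    ]
    print(h_relation_degree(traffic))
    print(bsp_comm_cost(traffic, g=2, l=10))
```

In this example the pattern is imbalanced: the receive load on processor 0 determines h even though no single sender exceeds three messages.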

