Gossiping in Minimal Time
SIAM J. on Computing, 1992
Cited by 45 (2 self)
The gossip problem involves communicating a unique item from each node in a graph to every other node. We study the minimum time required to do this under the weakest model of parallel communication, which allows each node to participate in just one communication at a time as either sender or receiver. We study a number of topologies including the complete graph, grids, hypercubes and rings. Definitive new optimal time algorithms are derived for complete graphs, rings, regular grids and toroidal grids that significantly extend existing results. In particular, we settle an open problem about minimum time gossiping in complete graphs. Specifically, for a graph with $N$ nodes, at least $\log_{\rho} N$ communication steps, where the logarithm is in the base of the golden ratio $\rho$, are required by any algorithm under the weakest model of communication. This bound, which is approximately $1.44 \log_2 N$, can be realized for some networks and so the result is optimal. KEYWORDS: Gossiping, broadcasting. ...
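The relation between the golden-ratio logarithm and the quoted factor of about 1.44 can be checked numerically. The following is a minimal sketch; the function names are illustrative and do not come from the paper, and rounding the real-valued bound up to a whole number of steps is an assumption made here.

```python
import math

# Golden ratio, the base of the logarithm in the lower bound.
GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def log2_multiplier() -> float:
    """Constant c such that log_rho(N) = c * log_2(N)."""
    return 1 / math.log2(GOLDEN_RATIO)


def gossip_lower_bound(num_nodes: int) -> float:
    """Real-valued lower bound log_rho(N) on the number of steps."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be positive")
    return math.log(num_nodes) / math.log(GOLDEN_RATIO)


def gossip_lower_bound_steps(num_nodes: int) -> int:
    """The same bound rounded up, since steps are counted in whole numbers
    (an assumption of this sketch)."""
    return math.ceil(gossip_lower_bound(num_nodes))


if __name__ == "__main__":
    print(round(log2_multiplier(), 4))
    print(gossip_lower_bound_steps(1024))
```

The multiplier is independent of N, which is why the abstract can state the bound as a constant times the base-2 logarithm.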
Mapping Algorithms and Software Environment for Data Parallel PDE . . .
Journal of Distributed and Parallel Computing, 1994
Cited by 31 (19 self)
We consider computations associated with data parallel iterative solvers used for the numerical solution of Partial Differential Equations (PDEs). The mapping of such computations into load balanced tasks requiring minimum synchronization and communication is a difficult combinatorial optimization problem. Its optimal solution is essential for the efficient parallel processing of PDE computations. Determining data mappings that optimize a number of criteria, like workload balance, synchronization and local communication, often involves the solution of an NP-Complete problem. Although data mapping algorithms have been known for a few years, there is a lack of qualitative and quantitative comparisons based on the actual performance of the parallel computation. In this paper we present two new data mapping algorithms and evaluate them together with a large number of existing ones using the actual performance of data parallel iterative PDE solvers on the nCUBE II. Comparisons of the performance of data parallel iterative PDE solvers on medium and large scale problems demonstrate that some computationally inexpensive data block partitioning algorithms are as effective as the computationally expensive deterministic optimization algorithms. Also, these comparisons demonstrate that the existing approach to solving the data partitioning problem is inefficient for large scale problems. Finally, a software environment for the solution of the partitioning problem of data parallel iterative solvers is presented.
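As an illustration of what an inexpensive block partitioning can look like, here is a minimal sketch that splits a one-dimensional index range into nearly equal contiguous blocks. It is a generic textbook scheme, not one of the algorithms evaluated in the paper, and the name `block_partition` is hypothetical.

```python
def block_partition(num_points: int, num_procs: int) -> list[tuple[int, int]]:
    """Split indices 0..num_points-1 into num_procs contiguous half-open
    ranges whose sizes differ by at most one."""
    if num_procs < 1 or num_points < 0:
        raise ValueError("need num_procs >= 1 and num_points >= 0")
    base, extra = divmod(num_points, num_procs)
    ranges = []
    start = 0
    for proc in range(num_procs):
        # The first `extra` processors take one additional point.
        size = base + (1 if proc < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    print(block_partition(10, 3))
```

Contiguous blocks keep neighbouring grid points on the same processor, which tends to limit communication to block boundaries; the sketch balances only the workload and ignores the other criteria the abstract mentions.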
Building a High-Performance Collective Communication Library
In Supercomputing '94, Washington, D.C.
Cited by 18 (0 self)
In this paper, we report on a project to develop a unified approach for building a library of collective communication operations that performs well on a cross-section of problems encountered in real applications. The target architecture is a two-dimensional mesh with worm-hole routing, but the techniques are more general. The approach differs from traditional library implementations in that we address the need for implementations that perform well for various sized vectors and grid dimensions, including non-power-of-two grids. We show how a general approach to hybrid algorithms yields performance across the entire range of vector lengths. Moreover, many scalable implementations of application libraries require collective communication within groups of nodes. Our approach yields the same kind of performance for group collective communication. Results from the Intel Paragon system are included. To obtain this library for Intel systems contact intercom@cs.utexas.edu. 1 Introduction The I...
A Cost Analysis for a Higher-order Parallel Programming Model
1996
Cited by 17 (1 self)
Programming parallel computers remains a difficult task. An ideal programming environment should enable the user to concentrate on the problem solving activity at a convenient level of abstraction, while managing the intricate low-level details without sacrificing performance. This thesis investigates a model of parallel programming based on the Bird-Meertens Formalism (BMF). This is a set of higher-order functions, many of which are implicitly parallel. Programs are expressed in terms of functions borrowed from BMF. A parallel implementation is defined for each of these functions for a particular topology, and the associated execution costs are derived. The topologies which have been considered include the hypercube, 2-D torus, tree and the linear array. An analyser estimates the costs associated with different implementations of a given program and selects a cost-effective one for a given topology. All the analysis is performed at compile-time which has the advantage of reducing run-...
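To make the idea of higher-order functions with attached costs concrete, here is a minimal sequential sketch of three BMF-style functions (map, reduce, scan) together with a toy step count for a tree-shaped reduce. The function names and the cost formula are assumptions for illustration only; they are not the thesis's definitions or its cost model.

```python
from functools import reduce
from itertools import accumulate
import operator


def bmf_map(f, xs):
    """Apply f to every element; each application is independent."""
    return [f(x) for x in xs]


def bmf_reduce(op, xs):
    """Combine a non-empty list with an associative operator."""
    return reduce(op, xs)


def bmf_scan(op, xs):
    """All prefix reductions of xs under an associative operator."""
    return list(accumulate(xs, op))


def estimated_reduce_steps(n: int, p: int) -> int:
    """Toy cost (an assumption): each of p processors reduces its
    ceil(n/p) local elements, then partial results are combined in
    ceil(log2(p)) rounds along a tree."""
    if not 1 <= p <= n:
        raise ValueError("need 1 <= p <= n")
    local_steps = -(-n // p) - 1          # ceil(n / p) - 1
    combine_rounds = (p - 1).bit_length()  # ceil(log2(p)) for p >= 1
    return local_steps + combine_rounds


if __name__ == "__main__":
    squares = bmf_map(lambda x: x * x, [1, 2, 3, 4])
    print(bmf_reduce(operator.add, squares))
    print(bmf_scan(operator.add, [1, 2, 3, 4]))
    print(estimated_reduce_steps(16, 4))
```

A compile-time analyser of the kind the abstract describes would compare such estimates across candidate implementations; the single formula here only hints at that process.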
Selected problems of scheduling tasks in multiprocessor computing systems
PhD thesis, Instytut Informatyki Politechnika Poznanska, 1997
LOCCS: Low Overhead Communication and Computation Subroutines
1993
Cited by 12 (9 self)
Our aim is to provide one set of efficient basic subroutines for scientific computing which include both communications and computations. The overlap of communications and computations is done using asynchronous pipelining to minimize the overhead due to communications. With this set of routines, we provide users of parallel machines with an easy and efficient SPMD-style way of programming. The main purpose of these routines is to be used in linear algebra applications, but also in other fields like image processing or neural networks. This work was partially supported by ARCHIPEL S.A. under contract 820542, by the CNRS and the DRET. 1 Introduction Libraries of routines have been proven to be the only way for efficient and secure programming. In scientific parallel computing, the most commonly used libraries are the BLAS, BLACS, PICL and the ones provided by vendors. These building blocks allow the portability of codes and an efficient implementation on different machines. The devel...
Optimal Communication Primitives On The Generalized Hypercube Network
Journal of Parallel and Distributed Computing, 1994
Cited by 10 (2 self)
Efficient interprocessor communication is crucial to increasing the performance of parallel computers. In this paper, a special framework is developed on the generalized hypercube, a network that is currently receiving considerable attention. Using this framework as the basic tool, a number of spanning graphs with special properties to fit various communication needs are constructed on the network. The importance of these spanning graphs is demonstrated with the development of optimal algorithms for four fundamental communication problems, namely, single node and multinode broadcasting and single node and multinode scattering, on the generalized hypercube network. Broadcasting is the distribution of the same group of messages from a source processor to all other processors, and scattering is the distribution of distinct groups of messages from a source processor to each other processor. We consider broadcasting and scattering from a single processor of the network (single nod...
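The difference between broadcasting and scattering can be sketched on an ordinary binary hypercube. This is a simplification: the paper works on the generalized hypercube and builds its own spanning graphs, which are not reproduced here. The schedules below are the familiar dimension-by-dimension schemes, and all function names are hypothetical.

```python
def hypercube_broadcast(dim: int, source: int = 0) -> list[list[tuple[int, int]]]:
    """Single node broadcast: in step k every informed node forwards the
    same message across dimension k. Returns the (sender, receiver) pairs
    of each step."""
    informed = {source}
    schedule = []
    for k in range(dim):
        sends = [(u, u ^ (1 << k)) for u in sorted(informed)]
        informed.update(receiver for _, receiver in sends)
        schedule.append(sends)
    return schedule


def hypercube_scatter(dim: int, source: int = 0) -> dict[int, set[int]]:
    """Single node scatter: the source starts with one distinct message per
    destination; in each step a holder passes on the half of its messages
    destined for the other side of the current dimension. Returns, for each
    node, the set of destination ids of the messages it ends up holding."""
    holdings = {source: set(range(2 ** dim))}
    for k in reversed(range(dim)):
        for u in list(holdings):
            v = u ^ (1 << k)
            # Messages whose destination agrees with v in bit k move to v.
            moved = {d for d in holdings[u] if (d >> k) & 1 == (v >> k) & 1}
            holdings[u] -= moved
            holdings[v] = moved
    return holdings


if __name__ == "__main__":
    for step, sends in enumerate(hypercube_broadcast(3)):
        print(step, sends)
    print(hypercube_scatter(2, source=3))
```

In the broadcast every transfer carries the same data, while in the scatter the amount of data sent halves at each step because each message has a single destination.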
Scattering and Gathering Messages in Networks of Processors
1993
Cited by 9 (2 self)
The operations of scattering and gathering in a network of processors involve one processor of the network (call it $P_0$) communicating with all other processors. In scattering, $P_0$ sends (possibly) distinct messages to all other processors; in gathering, the other processors send (possibly) distinct messages to $P_0$. We consider networks that are trees of processors; we present algorithms for scattering messages from and gathering messages to the processor that resides at the root of the tree. The algorithms are:
- quite general, in that the messages transmitted can differ arbitrarily in length;
- quite strong, in that they send messages along noncolliding paths, hence do not require any buffering or queuing mechanisms in the processors;
- quite efficient: the algorithms for scattering in general trees are optimal, the algorithm for gathering in a path is optimal, and the algorithms for gathering in general trees are nearly optimal.
Our algorithms can easily be converte...
Performance complexity of LU factorization with efficient pipelining and overlap on a multiprocessor
1994
Cited by 6 (2 self)
In this paper, we make efficient use of pipelining on LU decomposition with pivoting and a column-scattered data decomposition to derive precise variations of the computational complexities. We then compare these results with experiments on the Intel iPSC/860 and Paragon machines.
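For readers unfamiliar with the setting, the following is a minimal sequential sketch of LU decomposition with partial pivoting, together with a cyclic column-to-processor map of the kind a column-scattered decomposition suggests. It does not model the pipelining or communication overlap the paper analyses; the names `column_owner` and `lu_partial_pivoting` are hypothetical, and reading "column-scattered" as a cyclic assignment is an assumption.

```python
def column_owner(col: int, num_procs: int) -> int:
    """Cyclic (scattered) assignment: column j lives on processor j mod P."""
    return col % num_procs


def lu_partial_pivoting(a: list[list[float]]) -> tuple[list[int], list[list[float]]]:
    """Factor P*A = L*U. Returns (perm, lu): row i of P*A is row perm[i] of A;
    lu stores U on and above the diagonal and the multipliers of the
    unit-lower-triangular L below it."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Pivot search: largest magnitude entry in column k at or below row k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Multipliers for column k (computed by that column's owner).
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
        # Trailing update; in a column-scattered layout each column j would
        # be updated by processor column_owner(j, num_procs).
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return perm, lu


if __name__ == "__main__":
    perm, lu = lu_partial_pivoting([[2.0, 1.0], [4.0, 3.0]])
    print(perm, lu)
    print([column_owner(j, 2) for j in range(5)])
```

With columns dealt out cyclically, every processor keeps some columns in the shrinking trailing submatrix, which is the usual motivation for scattered layouts in LU factorization.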
Performance Study of LU Factorization with Low Communication Overhead on Multiprocessors
Parallel Processing Letters, 1995
Cited by 5 (2 self)
In this paper, we make efficient use of asynchronous communications on the LU decomposition algorithm with pivoting and a column-scattered data decomposition to derive precise computational complexities. We then compare these results with experiments on the Intel iPSC/860 and Paragon machines and show that very good performance can be obtained on a ring with asynchronous communications. Keywords: LU factorization, Pipelining, Parallel Complexity. 1. Introduction This paper presents an analytical estimation of the LU factorization algorithm on a distributed-memory message-passing multiprocessor. Numerous methods have been proposed for LU factorization (see PBKP92 and the related works of Saa86b; Cap87; CTV87; RT88; CRT89; DO90; Rob90). For example, CG87 advocates partial pivoting and load balancing in row-wise methods with a straightforward parallel triangular solve algorithm, but LC89 shows that the parallel triangular solve algorithm can have the same performance with column-wi...

