Results 1-10 of 28
Gossiping in Minimal Time
 SIAM J. on Computing
, 1992
Abstract
Cited by 46 (2 self)
The gossip problem involves communicating a unique item from each node in a graph to every other node. We study the minimum time required to do this under the weakest model of parallel communication, which allows each node to participate in just one communication at a time, as either sender or receiver. We study a number of topologies including the complete graph, grids, hypercubes and rings. Definitive new optimal-time algorithms are derived for complete graphs, rings, regular grids and toroidal grids that significantly extend existing results. In particular, we settle an open problem about minimum-time gossiping in complete graphs. Specifically, for a graph with N nodes, at least log_φ N communication steps, where the logarithm is taken to the base of the golden ratio φ, are required by any algorithm under the weakest model of communication. This bound, which is approximately 1.44 log_2 N, can be realized for some networks, and so the result is optimal. KEYWORDS: Gossiping, broadcasting. ...
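To make the quoted bound concrete: since 1/log_2 φ ≈ 1.44, the log_φ N lower bound is roughly 1.44 log_2 N. A minimal sketch (my own illustration, not code from the paper) evaluating the bound for a few network sizes:

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, approx 1.618

def gossip_lower_bound(n):
    """Lower bound on communication steps to gossip among n nodes under the
    weakest model (each node in at most one communication per step):
    ceil(log base phi of n), per the bound quoted in the abstract above."""
    return math.ceil(math.log(n, PHI))

for n in (2, 16, 1024):
    print(n, gossip_lower_bound(n))
```

For example, for N = 1024 this gives 15 steps, versus the 10 steps that the familiar log_2 N argument would suggest, reflecting the factor of roughly 1.44.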
Mapping Algorithms and Software Environment for Data Parallel PDE . . .
 JOURNAL OF DISTRIBUTED AND PARALLEL COMPUTING
, 1994
Abstract
Cited by 35 (20 self)
We consider computations associated with data parallel iterative solvers used for the numerical solution of Partial Differential Equations (PDEs). The mapping of such computations into load-balanced tasks requiring minimum synchronization and communication is a difficult combinatorial optimization problem. Its optimal solution is essential for the efficient parallel processing of PDE computations. Determining data mappings that optimize a number of criteria, like workload balance, synchronization and local communication, often involves the solution of an NP-complete problem. Although data mapping algorithms have been known for a few years, there is a lack of qualitative and quantitative comparisons based on the actual performance of the parallel computation. In this paper we present two new data mapping algorithms and evaluate them together with a large number of existing ones using the actual performance of data parallel iterative PDE solvers on the nCUBE II. Comparisons of the performance of data parallel iterative PDE solvers on medium- and large-scale problems demonstrate that some computationally inexpensive data block partitioning algorithms are as effective as the computationally expensive deterministic optimization algorithms. These comparisons also demonstrate that the existing approach to solving the data partitioning problem is inefficient for large-scale problems. Finally, a software environment for the solution of the partitioning problem of data parallel iterative solvers is presented.
Building a High-Performance Collective Communication Library
 In Supercomputing '94, Washington, D.C.
Abstract
Cited by 20 (0 self)
In this paper, we report on a project to develop a unified approach for building a library of collective communication operations that performs well on a cross-section of problems encountered in real applications. The target architecture is a two-dimensional mesh with wormhole routing, but the techniques are more general. The approach differs from traditional library implementations in that we address the need for implementations that perform well for various vector sizes and grid dimensions, including non-power-of-two grids. We show how a general approach to hybrid algorithms yields performance across the entire range of vector lengths. Moreover, many scalable implementations of application libraries require collective communication within groups of nodes. Our approach yields the same kind of performance for group collective communication. Results from the Intel Paragon system are included. To obtain this library for Intel systems contact intercom@cs.utexas.edu. 1 Introduction The I...
Selected problems of scheduling tasks in multiprocessor computing systems
 PHD THESIS, INSTYTUT INFORMATYKI POLITECHNIKA POZNANSKA
, 1997
"... ..."
A Cost Analysis for a Higher-order Parallel Programming Model
, 1996
Abstract
Cited by 17 (1 self)
Programming parallel computers remains a difficult task. An ideal programming environment should enable the user to concentrate on the problem-solving activity at a convenient level of abstraction, while managing the intricate low-level details without sacrificing performance. This thesis investigates a model of parallel programming based on the Bird-Meertens Formalism (BMF). This is a set of higher-order functions, many of which are implicitly parallel. Programs are expressed in terms of functions borrowed from BMF. A parallel implementation is defined for each of these functions for a particular topology, and the associated execution costs are derived. The topologies which have been considered include the hypercube, 2D torus, tree and the linear array. An analyser estimates the costs associated with different implementations of a given program and selects a cost-effective one for a given topology. All the analysis is performed at compile-time, which has the advantage of reducing run...
Optimal Communication Primitives On The Generalized Hypercube Network
 Journal of Parallel and Distributed Computing
, 1994
Abstract
Cited by 13 (2 self)
Efficient interprocessor communication is crucial to increasing the performance of parallel computers. In this paper, a special framework is developed on the generalized hypercube, a network that is currently receiving considerable attention. Using this framework as the basic tool, a number of spanning graphs with special properties to fit various communication needs are constructed on the network. The importance of these spanning graphs is demonstrated with the development of optimal algorithms for four fundamental communication problems, namely, single-node and multi-node broadcasting and single-node and multi-node scattering, on the generalized hypercube network. Broadcasting is the distribution of the same group of messages from a source processor to all other processors, and scattering is the distribution of distinct groups of messages from a source processor to each other processor. We consider broadcasting and scattering from a single processor of the network (single nod...
LOCCS: Low Overhead Communication and Computation Subroutines
, 1993
Abstract
Cited by 12 (9 self)
Our aim is to provide one set of efficient basic subroutines for scientific computing which include both communications and computations. The overlap of communications and computations is achieved using asynchronous pipelining, to minimize the overhead due to communications. With this set of routines, we provide the user of parallel machines with an easy and efficient SPMD style of programming. These routines are intended mainly for linear algebra applications, but also for other fields like image processing or neural networks. This work was partially supported by ARCHIPEL S.A. under contract 820542, by the CNRS and the DRET. 1 Introduction Libraries of routines have been proven to be the only way for efficient and secure programming. In scientific parallel computing, the most commonly used libraries are the BLAS, BLACS, PICL and those provided by vendors. These building blocks allow the portability of codes and an efficient implementation on different machines. The devel...
Scattering and Gathering Messages in Networks of Processors
, 1993
Abstract
Cited by 10 (2 self)
The operations of scattering and gathering in a network of processors involve one processor of the network (call it P0) communicating with all other processors. In scattering, P0 sends (possibly) distinct messages to all other processors; in gathering, the other processors send (possibly) distinct messages to P0. We consider networks that are trees of processors; we present algorithms for scattering messages from and gathering messages to the processor that resides at the root of the tree. The algorithms are:
- quite general, in that the messages transmitted can differ arbitrarily in length;
- quite strong, in that they send messages along non-colliding paths, hence do not require any buffering or queuing mechanisms in the processors;
- quite efficient: the algorithms for scattering in general trees are optimal, the algorithm for gathering in a path is optimal, and the algorithms for gathering in general trees are nearly optimal.
Our algorithms can easily be converte...
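The non-colliding property is easiest to see on a path: if the root injects messages farthest-destination first, pipelined one per step, no two messages ever contend for the same link in the same step. A toy simulation of this idea (my own sketch, under simplifying assumptions: unit-length messages on a path P0-P1-...-P(n-1), not the paper's general-tree, arbitrary-length algorithms):

```python
def farthest_first_scatter(n):
    """Simulate P0 scattering one unit message to each of P1..P(n-1) along a
    path, injecting the message for the farthest destination first.
    Returns (completion_time, collision_free): the step at which the last
    message arrives, and whether every link carried at most one message
    per step (no buffering needed)."""
    link_use = {}   # (link_index, step) -> number of messages on that link
    arrival = {}    # destination -> step of arrival
    for rank, dest in enumerate(range(n - 1, 0, -1)):  # farthest first
        step = rank + 1  # step at which P0 pushes this message onto link 0
        for link in range(dest):  # hop across links 0 .. dest-1
            link_use[(link, step)] = link_use.get((link, step), 0) + 1
            step += 1
        arrival[dest] = step - 1
    collision_free = all(count == 1 for count in link_use.values())
    return max(arrival.values()), collision_free
```

With n processors this schedule finishes in n-1 steps with no link ever used twice in the same step, e.g. `farthest_first_scatter(8)` returns `(7, True)`; this matches the flavor of the optimality claims above, though only for this special unit-message, path-topology case.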
Efficient All-to-All Broadcast in All-Port Mesh and Torus Networks
 Proceedings of the 5th IEEE International Symposium on High-Performance Computer Architecture (HPCA-5)
, 1999
Abstract
Cited by 8 (0 self)
All-to-all communication is one of the densest communication patterns and occurs in many important applications in parallel computing. In this paper, we present a new all-to-all broadcast algorithm for all-port mesh and torus networks. Unlike existing all-to-all broadcast algorithms, the new algorithm takes advantage of overlapping message switching time and transmission time, and achieves optimal transmission time for all-to-all broadcast. In addition, in most cases, the total communication delay is close to the lower bound of all-to-all broadcast within a small constant range. Finally, the algorithm is conceptually simple, and symmetrical for every message and every node, so that it can be easily implemented in hardware and achieves the optimum in practice. Index Terms: Parallel computing, collective communication, all-to-all communication, all-to-all broadcast, routing, interprocessor communication. 1 Introduction Collective communication [1] involves global data movement and glob...
Performance complexity of LU factorization with efficient pipelining and overlap on a multiprocessor
, 1994
Abstract
Cited by 6 (2 self)
In this paper, we make efficient use of pipelining on LU decomposition with pivoting and a column-scattered data decomposition to derive precise variations of the computational complexities. We then compare these results with experiments on the Intel iPSC/860 and Paragon machines.