Results 1–10 of 47
LogGP: Incorporating Long Messages into the LogP Model – One Step Closer Towards a Realistic Model for Parallel Computation
, 1995
Abstract

Cited by 235 (1 self)
We present a new model of parallel computation, the LogGP model, and use it to analyze a number of algorithms, most notably, the single node scatter (one-to-all personalized broadcast). The LogGP model is an extension of the LogP model for parallel computation [CKP+93] which abstracts the communication of fixed-sized short messages through the use of four parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). As evidenced by experimental data, the LogP model can accurately predict communication performance when only short messages are sent (as on the CM-5) [CKP+93, CDMS94]. However, many existing parallel machines have special support for long messages and achieve a much higher bandwidth for long messages compared to short messages (e.g., IBM SP-2, Paragon, Meiko CS-2, Ncube/2). We extend the basic LogP model with a linear model for long messages. This combination, which we call the LogGP model of parallel computation, has o...
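A quick way to see what the long-message extension adds: under LogGP the end-to-end time for a single k-byte message is commonly modeled as o + (k-1)G + L + o, where G is the per-byte gap for long messages. A minimal sketch; the parameter values below are illustrative assumptions, not measured machine constants:

```python
def loggp_send_time(k, L, o, G):
    """Predicted end-to-end time for one k-byte message under LogGP:
    sender overhead, per-byte gap for the remaining k-1 bytes,
    network latency, then receiver overhead."""
    return o + (k - 1) * G + L + o

# Illustrative (made-up) parameters: microseconds and us/byte.
L_, o_, G_ = 10.0, 2.0, 0.05
short = loggp_send_time(8, L_, o_, G_)        # latency/overhead dominated
long_ = loggp_send_time(100_000, L_, o_, G_)  # per-byte gap (G) dominated
```

For short messages the 2o + L terms dominate, matching plain LogP; for long messages the (k-1)G term takes over, which is the linear long-message behavior the model adds.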
Development of Parallel Methods for a 1024-Processor Hypercube
 SIAM Journal on Scientific and Statistical Computing
, 1988
Parallel Sorting by Regular Sampling
, 1992
Abstract

Cited by 101 (7 self)
A new parallel sorting algorithm suitable for MIMD multiprocessors is presented. The algorithm reduces memory and bus contention, which many parallel sorting algorithms suffer from, by using a regular sampling of the data to ensure good pivot selection. For n data elements to be sorted and p processors, when n ≥ p³ the algorithm is shown to be asymptotically optimal. In theory, the algorithm is within a factor of two of achieving ideal load balancing. In practice, there is almost perfect partitioning of work. On a variety of shared and distributed memory machines, the algorithm achieves better than half-linear speedups.
1. Introduction
Sorting is one of the most studied problems in computer science because of its theoretical interest and practical importance. With the advent of parallel processing, parallel sorting has become an important area for algorithm research. Although considerable work has been done on the theory of parallel sorting and efficient implementations on SIMD arch...
Adaptive Fault-Tolerant Routing in Hypercube Multicomputers
 IEEE Transactions on Computers
, 1990
Abstract

Cited by 31 (2 self)
A connected hypercube with faulty links and/or nodes is called an injured hypercube. To enable any non-faulty node to communicate with any other non-faulty node in an injured hypercube, the information on component failures has to be made available to non-faulty nodes so as to route messages around the faulty components. We propose first a distributed adaptive fault-tolerant routing scheme for an injured hypercube in which each node is required to know only the condition of its own links. Despite its simplicity, this scheme is shown to be capable of routing messages successfully in an injured hypercube as long as the number of faulty components is less than n. Moreover, it is proved that this scheme routes messages via shortest paths with a rather high probability and the expected length of a resulting path is very close to that of a shortest path. Since the assumption that the number of faulty components is less than n in an n-dimensional hypercube might limit the usefulness of the above scheme, we also introduce a routing scheme based on depth-first search which works in the presence of an arbitrary number of faulty components. Due to the insufficient information on faulty components, however, the paths chosen by the above scheme may not always be the shortest. To guarantee all messages to be routed via shortest paths, we propose to equip every node with more information than that on its own links. The effects of this additional information on routing efficiency are analyzed, and the additional information to be kept at each node for the shortest path routing is determined. Several examples and remarks are also given to illustrate our results.
Index Terms: Injured and regular hypercubes, distributed adaptive fault-tolerant routing, depth-first search, looping effects, network delay tables, failure information.
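The first scheme's key idea, forwarding over any fault-free link toward the destination using only locally known link status, can be sketched as follows. This is a simplified greedy illustration; the paper's actual schemes add guarantees such as the less-than-n fault bound and depth-first backtracking:

```python
def route(src, dst, n, faulty_links):
    """Greedy sketch of adaptive routing in an injured n-cube.
    Each hop uses only the current node's local link status: prefer a
    fault-free link along a dimension where the current address still
    differs from the destination, else detour on a spare dimension.
    faulty_links is a set of frozenset({u, v}) pairs."""
    path, cur, visited = [src], src, {src}
    while cur != dst:
        diff = cur ^ dst
        preferred = [d for d in range(n) if (diff >> d) & 1]
        spare = [d for d in range(n) if not (diff >> d) & 1]
        for d in preferred + spare:
            nxt = cur ^ (1 << d)
            if frozenset((cur, nxt)) not in faulty_links and nxt not in visited:
                path.append(nxt)
                visited.add(nxt)
                cur = nxt
                break
        else:
            return None  # blocked; a full scheme would backtrack
    return path
```

In a fault-free 3-cube the sketch follows a shortest path; with one faulty link it detours through another preferred dimension and still arrives in the minimum number of hops.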
Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines
 Journal of Parallel and Distributed Computing
, 1990
Abstract

Cited by 30 (0 self)
Hypercube algorithms are developed for a variety of communication-intensive tasks such as transposing a matrix, histogramming, one node sending a (long) message to another, broadcasting a message from one node to all others, each node broadcasting a message to all others, and nodes exchanging messages via a fixed permutation. The algorithm for exchanging via a fixed permutation can be viewed as a deterministic analogue of Valiant's randomized routing. The algorithms are for link-bound hypercubes in which local processing time is ignored, communication time predominates, message headers are not needed because all nodes know the task being performed, and all nodes can use all communication links simultaneously. Through systematic use of techniques such as pipelining, batching, variable packet sizes, symmetrizing, and completing, for all problems algorithms are obtained which achieve a time with an optimal highest-order term.
1 Introduction
This paper gives efficient hypercube algorith...
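For the one-node-to-all broadcast, the baseline structure is the classic dimension-by-dimension binomial tree, which finishes in n rounds on an n-cube. The sketch below shows only that textbook schedule, not the pipelined, all-links-busy variants the paper develops:

```python
def hypercube_broadcast_schedule(n):
    """Rounds of (sender, receiver) pairs for a one-to-all broadcast
    from node 0 on an n-cube: in round d, every node that already
    holds the message forwards it across dimension d."""
    rounds = []
    have = {0}  # node 0 starts with the message
    for d in range(n):
        sends = [(u, u ^ (1 << d)) for u in sorted(have)]
        rounds.append(sends)
        have |= {v for _, v in sends}
    return rounds
```

The number of informed nodes doubles each round (1, 2, 4, ...), so all 2^n nodes are reached after n rounds; the paper's pipelined algorithms improve on this for long messages by keeping every link busy.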
Packet Routing In Fixed-Connection Networks: A Survey
, 1998
Abstract

Cited by 29 (3 self)
We survey routing problems on fixed-connection networks. We consider many aspects of the routing problem and provide known theoretical results for various communication models. We focus on (partial) permutation, k-relation routing, routing to random destinations, dynamic routing, isotonic routing, fault-tolerant routing, and related sorting results. We also provide a list of unsolved problems and numerous references.
A Cost Analysis for a Higher-order Parallel Programming Model
, 1996
Abstract

Cited by 17 (1 self)
Programming parallel computers remains a difficult task. An ideal programming environment should enable the user to concentrate on the problem solving activity at a convenient level of abstraction, while managing the intricate low-level details without sacrificing performance. This thesis investigates a model of parallel programming based on the Bird-Meertens Formalism (BMF). This is a set of higher-order functions, many of which are implicitly parallel. Programs are expressed in terms of functions borrowed from BMF. A parallel implementation is defined for each of these functions for a particular topology, and the associated execution costs are derived. The topologies which have been considered include the hypercube, 2D torus, tree and the linear array. An analyser estimates the costs associated with different implementations of a given program and selects a cost-effective one for a given topology. All the analysis is performed at compile-time which has the advantage of reducing run...
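The analyser's job, comparing the predicted cost of the same operation on different topologies and keeping the cheapest, can be illustrated with a toy cost model for a reduction. The formulas and constants here are assumptions for illustration, not the thesis's actual cost equations:

```python
import math

def reduce_cost(p, topology, t_comm=1.0, t_op=0.1):
    """Toy compile-time cost estimate for a p-processor reduction:
    each step costs one communication plus one combine operation."""
    if topology == "hypercube":
        steps = math.ceil(math.log2(p))  # pairwise combine per dimension
    elif topology == "linear_array":
        steps = p - 1                    # values ripple down the chain
    else:
        raise ValueError(topology)
    return steps * (t_comm + t_op)

def pick_topology(p, candidates=("hypercube", "linear_array")):
    """Select the cheapest topology for a given processor count."""
    return min(candidates, key=lambda t: reduce_cost(p, t))
```

The point is the selection step: with the same program expressed once in BMF-style combinators, the cheapest implementation can be chosen per topology entirely at compile time.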
Optimal Communication Primitives On The Generalized Hypercube Network
 Journal of Parallel and Distributed Computing
, 1994
Abstract

Cited by 13 (2 self)
Efficient interprocessor communication is crucial to increasing the performance of parallel computers. In this paper, a special framework is developed on the generalized hypercube, a network that is currently receiving considerable attention. Using this framework as the basic tool, a number of spanning graphs with special properties to fit various communication needs are constructed on the network. The importance of these spanning graphs is demonstrated with the development of optimal algorithms for four fundamental communication problems, namely, the single node and multi-node broadcasting and the single node and multi-node scattering, on the generalized hypercube network. Broadcasting is the distribution of the same group of messages from a source processor to all other processors, and scattering is the distribution of distinct groups of messages from a source processor to each other processor. We consider broadcasting and scattering from a single processor of the network (single nod...
An Efficient Delay-Optimal Distributed Termination Detection Algorithm
 Journal of Parallel and Distributed Computing (JPDC)
, 2001
Abstract

Cited by 13 (2 self)
One of the important issues to be addressed when solving problems on parallel machines or distributed systems is that of efficient termination detection. Numerous schemes with different performance characteristics have been proposed in the past for this purpose. These schemes, while being efficient with regard to one performance metric, prove to be inefficient in terms of other metrics. A significant drawback shared by all previous methods is that they may take as long as Ω(P) time to detect and signal termination after its actual occurrence, where P is the total number of processing elements. Detection delay is arguably the most important metric to optimize, since it is directly related to the amount of idling of computing resources and to the delay in the utilization of results of the underlying computation. In this paper, we present a novel termination detection algorithm that is simultaneously optimal or near-optimal with respect to all relevant performance measures on any topology. In particular, our algorithm has a best-case detection delay of Θ(1) and a finite optimal worst-case detection delay on any topology equal in order terms to the time for an optimal one-to-all broadcast on that topology; we derive a general expression for an optimal one-to-all broadcast on an arbitrary topology, which is an interesting new result in itself. On k-ary n-cube tori and meshes, the worst-case delay is Θ(D), where D is the diameter of the architecture. Further, our algorithm has message and computational complexities of O(max(MD, P)) (Θ(max(M, P)) on the average for most applications, the same as other message-efficient algorithms) and an optimal space complexity of Θ(P), where M is the total number of messages used by the underlying computation. We also give a scheme using...
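To make the termination detection problem concrete, here is the classic credit-based (weight-throwing) scheme, deliberately a different and much simpler algorithm than the delay-optimal one this paper proposes: the computation starts with total credit 1.0, credit travels with spawned work, idle tasks return theirs, and termination is detected exactly when all credit is back at the controller.

```python
class Controller:
    """Credit-based termination detector: termination has occurred
    exactly when the full unit of credit has been recovered."""
    def __init__(self):
        self.credit = 0.0
    def recover(self, c):
        self.credit += c
    def terminated(self):
        return self.credit == 1.0  # exact: credits are powers of 1/2

def simulate(spawn_budget=3):
    """Toy run: tasks repeatedly spawn children, handing each half of
    their credit; tasks that finish return their credit."""
    ctrl = Controller()
    active = [("root", 1.0)]  # (task name, credit held)
    spawned = 0
    while active:
        name, credit = active.pop()
        if spawned < spawn_budget:
            spawned += 1
            active.append((name, credit / 2))
            active.append((name + ".child", credit / 2))
        else:
            ctrl.recover(credit)  # task goes idle, returns its credit
    return ctrl
```

This scheme illustrates why detection delay matters: the last credit must physically travel back to the controller, and bounding that trip on an arbitrary topology is exactly the broadcast-time question the paper analyzes.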