Results 1  10
of
30
Accounting for memory bank contention and delay in highbandwidth multiprocessors
 In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1997
"... Abstract—For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several sharedmemory multiprocessors consist of more memory banks than processors. The object ..."
Abstract

Cited by 32 (5 self)
 Add to MetaCart
Abstract—For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several sharedmemory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant’s bulksynchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)BSP. We show experimentally that the (d, x)BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machinespecific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)BSP as a bridging model for emulating a very highlevel abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)BSP.
BSP vs LogP
, 1996
"... A quantitative comparison of the BSP and LogP models of parallel computation is developed. We concentrate on a variant of LogP that disallows the socalled stalling behavior, although issues surrounding the stalling phenomenon are also explored. Very efficient cross simulations between the two model ..."
Abstract

Cited by 30 (4 self)
 Add to MetaCart
A quantitative comparison of the BSP and LogP models of parallel computation is developed. We concentrate on a variant of LogP that disallows the socalled stalling behavior, although issues surrounding the stalling phenomenon are also explored. Very efficient cross simulations between the two models are derived, showing their substantial equivalence for algorithmic design guided by asymptotic analysis. It is also shown that the two models can be implemented with similar performance on most pointtopoint networks. In conclusion, within the limits of our analysis that is mainly of an asymptotic nature, BSP and (stallfree) LogP can be viewed as closely related variants within the bandwidthlatency framework for modeling parallel computation. BSP seems somewhat preferable due to its greater simplicity and portability, and slightly greater power. LogP lends itself more naturally to multiuser mode.
Cacheefficient Dynamic Programming Algorithms for Multicores
, 2008
"... We present cacheefficient chip multiprocessor (CMP) algorithms with good speedup for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: DCMP with a private cache for each core, SCMP with a single cache shared by all cores, and Multicore, which h ..."
Abstract

Cited by 23 (5 self)
 Add to MetaCart
We present cacheefficient chip multiprocessor (CMP) algorithms with good speedup for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: DCMP with a private cache for each core, SCMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. We derive results for three classes of problems: local dependency dynamic programming (LDDP), Gaussian Elimination Paradigm (GEP), and parenthesis problem. For each class of problems, we develop a generic CMP algorithm with an associated tiling sequence. We then tailor this tiling sequence to each caching model and provide a parallel schedule that results in a cacheefficient parallel execution up to the critical path length of the underlying dynamic programming algorithm. We present experimental results on an 8core Opteron for two sequence alignment problems that are important examples of LDDP. Our experimental results show good speedups for simple versions of our algorithms.
Fast SharedMemory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs
, 2006
"... ..."
Using PRAM Algorithms on a UniformMemoryAccess SharedMemory Architecture
 Proc. 5th Int’l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science
, 2001
"... The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the res ..."
Abstract

Cited by 20 (11 self)
 Add to MetaCart
The ability to provide uniform sharedmemory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform sharedmemory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of sharedmemory parallel algorithms.
Concurrent Threads and Optimal Parallel Minimum Spanning Trees Algorithm
 J. ACM
, 2001
"... This paper resolves a longstanding open problem on whether the concurrent write capability of parallel random access machine (PRAM) is essential for solving fundamental graph problems like connected components and minimum spanning trees in O(log n) time. Specically, we present a new algorithm to so ..."
Abstract

Cited by 16 (1 self)
 Add to MetaCart
This paper resolves a longstanding open problem on whether the concurrent write capability of parallel random access machine (PRAM) is essential for solving fundamental graph problems like connected components and minimum spanning trees in O(log n) time. Specically, we present a new algorithm to solve these problems in O(log n) time using a linear number of processors on the exclusiveread exclusivewrite PRAM. The logarithmic time bound is actually optimal since it is well known that even computing the \OR" of n bits
Modeling parallel bandwidth: Local vs. global restrictions
"... Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., bsp and logp) account for bandwidth limitations using a perprocessor parameter g> 1, such that eachpro cessor can send/receive at ..."
Abstract

Cited by 15 (4 self)
 Add to MetaCart
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., bsp and logp) account for bandwidth limitations using a perprocessor parameter g> 1, such that eachpro cessor can send/receive at most h messages in g h time. Other models (e.g., pram(m)) account for bandwidth limitations as an aggregate parameter m<p, such thatthe p processors can send at most m messages in total at each step. This paper provides the rst detailed study of the algorithmic implications of modeling parallel bandwidth as a perprocessor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems
CommunicationOptimal Parallel Minimum Spanning Tree Algorithms
, 1998
"... Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first nontrivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of ..."
Abstract

Cited by 13 (1 self)
 Add to MetaCart
Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first nontrivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of the input graph, and m the number of edges of the input graph. We show that in the worst case, a total of \Omega\Gamma \Delta min(m; pn)) bits need to be communicated in order to solve the MST problem, where is the number of bits required to represent a single edge weight. This implies that if each message communicates at most bits, any BSP algorithm for finding an MST requires communication time \Omega\Gamma g \Delta min(m=p; n)), where g is the gap parameter of the BSP model. In addition, we present two algorithms with communication requirements that match our lower bound in different situations. Both algorithms perform linear work for appropriate values of n, m and p, and use a numbe...
A randomized timework optimal parallel algorithm for finding a minimum spanning forest
 SIAM J. COMPUT
, 1999
"... We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r.t. both work and parallel time, and is the fi ..."
Abstract

Cited by 12 (3 self)
 Add to MetaCart
We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r.t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures. We also give a simple, general processor allocation scheme for treelike computations.
PRO: a model for Parallel ResourceOptimal computation
 IN 16TH ANNUAL INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS. IEEE, THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS
, 2002
"... We present a new parallel computation model that enables the design of resourceoptimal scalable parallel algorithms and simplifies their analysis. The model rests on the novel idea of incorporating relative optimality as an integral part and measuring the quality of a parallel algorithm in terms of ..."
Abstract

Cited by 11 (4 self)
 Add to MetaCart
We present a new parallel computation model that enables the design of resourceoptimal scalable parallel algorithms and simplifies their analysis. The model rests on the novel idea of incorporating relative optimality as an integral part and measuring the quality of a parallel algorithm in terms of granularity.