## Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture (2001)

Venue: Proc. 5th Int’l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science

Citations: 21 (11 self)

### BibTeX

@INPROCEEDINGS{Bader01usingpram,
  author    = {David A. Bader and Ajith Illendula and Bernard M. E. Moret and Nina R. Weisse-Bernstein},
  title     = {Using PRAM Algorithms on a Uniform-Memory-Access Shared-Memory Architecture},
  booktitle = {Proc. 5th Int'l Workshop on Algorithm Engineering (WAE 2001), volume 2141 of Lecture Notes in Computer Science},
  year      = {2001},
  pages     = {129--144},
  publisher = {Springer-Verlag}
}

### Abstract

The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors (from 1 to 64) and across the entire range of instance sizes tested. This linear speedup with the number of processors is, to our knowledge, the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is a graph decomposition algorithm that also requires the computation of a spanning tree; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.

### Citations

635 | An Introduction to Parallel Algorithms
- JaJa
- 1992
Citation Context: ...shared-memory for a significant number of processors brings us much closer to the ideal parallel computer envisioned over 20 years ago by theoreticians, the Parallel Random Access Machine (PRAM) (see [22,41]) and thus may enable us at long last to take advantage of 20 years of research in PRAM algorithms for various irregular computations. Moreover, as supercomputers increasingly use SMP clusters, SMP co...

537 | The input/output complexity of sorting and related problems
- Aggarwal, Vitter
- 1988
Citation Context: ...c information that can replace the heavy barrier with the light-weight one whenever the architecture permits it. 3.2 Complexity Model for Shared-Memory Various cost models have been proposed for SMPs [1,2,3,4,6,16,18,38,45]; we chose the Helman and JáJá model [18] because it gave us the best match between our analyses and our experimental results. Since the number of processors used in our experiments is relatively small ...

403 | Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator
- Shewchuk
- 1996
Citation Context: ...ey cannot be dense in the usual sense of the word, but GD graphs are generally fully triangulated. The last graph class generates the constrained Delaunay triangulation (CD) on a set of random points [43]. For the random graphs GA, GB, GC, and GD, the input graph on n = |V| vertices is generated as follows. Random coordinates are picked in the unit square according to a uniform distribution; a Euclide...

234 | Algorithms for parallel memory I: Two level memories
- Vitter, Shriver
- 1994
Citation Context: ...c information that can replace the heavy barrier with the light-weight one whenever the architecture permits it. 3.2 Complexity Model for Shared-Memory Various cost models have been proposed for SMPs [1,2,3,4,6,16,18,38,45]; we chose the Helman and JáJá model [18] because it gave us the best match between our analyses and our experimental results. Since the number of processors used in our experiments is relatively small ...

147 | Synthesis of Parallel Algorithms
- Reif
- 1993
Citation Context: ...shared-memory for a significant number of processors brings us much closer to the ideal parallel computer envisioned over 20 years ago by theoreticians, the Parallel Random Access Machine (PRAM) (see [22,41]) and thus may enable us at long last to take advantage of 20 years of research in PRAM algorithms for various irregular computations. Moreover, as supercomputers increasingly use SMP clusters, SMP co...

128 | A model for hierarchical memory
- Aggarwal, Alpern, et al.
- 1987
Citation Context: ...c information that can replace the heavy barrier with the light-weight one whenever the architecture permits it. 3.2 Complexity Model for Shared-Memory Various cost models have been proposed for SMPs [1,2,3,4,6,16,18,38,45]; we chose the Helman and JáJá model [18] because it gave us the best match between our analyses and our experimental results. Since the number of processors used in our experiments is relatively small ...

112 | The uniform memory hierarchy model of computation. Algorithmica
- Alpern, Carter, et al.
- 1994

72 | Starfire: Extending the SMP Envelope
- Charlesworth
- 1998
Citation Context: ...other words, message-based architectures are two orders of magnitude slower than the largest SMPs in terms of their worst-case memory access times. The largest SMP architecture to date, the Sun E10K [8], uses a combination of data crossbar switches, multiple snooping buses, and sophisticated cache handling to achieve UMA across the entire memory. Of course, there remains a large difference between t...

69 | Non-separable and planar graphs - Whitney - 1932

68 | The Influence of Caches on the Performance of Heaps
- LaMarca, Ladner
- 1996
Citation Context: ...ork spent in synchronization (barrier calls). The first two are closely related: good data and task partitioning will ensure good locality; coupling such partitioning with cache-sensitive coding (see [27,28,29,34] for discussions) provides programs that take best advantage of the architecture. Minimizing the work done in synchronization barriers is a fairly s...

53 | SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs)
- Bader, JáJá
- 1999
Citation Context: ...ect-mapped data cache and 8MB of external cache for each processor [8]. Our practical programming environment for SMPs is based upon the SMP Node Library component of SIMPLE [5], which provides a portable framework for describing SMP algorithms using the single-program multiple-data (SPMD) program style. This framework is a software layer built from POSIX threads that allows...

42 | Can a shared-memory model serve as a bridging model for parallel computation?
- Gibbons, Matias, et al.
- 1999
Citation Context: ...ontiguous memory accesses is the first step towards capturing the effects of the memory hierarchy, since contiguous memory accesses are much more likely to be cache-friendly. The Queuing Shared Memory model [15,38] takes into account both the number of memory accesses and contention at the memory, but does not distinguish between contiguous and non-contiguous accesses. In contrast, the complexity...

41 | Parallel ear decomposition search (EDS) and st-numbering in graphs. Theoret
- Maon, Schieber, et al.
- 1986
Citation Context: ...9]). The sequential algorithm: Ramachandran [37] gave a linear-time algorithm for ear decomposition based on depth-first search. Another sequential algorithm that lends itself to parallelization (see [22,33,37,42]) finds the labels for each edge as follows. First, a spanning tree is found for the graph; the tree is then arbitrarily rooted and each vertex is a...

36 | Towards a discipline of experimental algorithmics
- Moret
- 2002
Citation Context: ...tion on the NPACI Sun E10K with 1 to 32 processors a) on varying problem sizes (top) and b) different sparse graph models with n = 8192 (bottom) make use of the best precepts of algorithm engineering [34] to ensure that our implementations are as efficient as possible. Converting a PRAM algorithm to a parallel program requires us to address three problems: (i) how to partition the tasks (and data) amo...

32 | Accounting for Memory Bank Contentions and Delay in High-bandwidth Multiprocessors
- Blelloch, Gibbons, et al.
- 1997

31 | List ranking and list scan on the Cray C-90
- Reid-Miller
- 1994
Citation Context: ...example of experimental performance analysis for a nontrivial parallel implementation. 2 Related Work Several groups have conducted experimental studies of graph algorithms on parallel architectures [19,20,26,39,40,44]. Their approach to producing a parallel program is similar to ours (especially that of Ramachandran et al. [17]), but their test platforms have not provided them with a true, scalable, UMA shared-memo...

25 | Parallel open ear decomposition with applications to graph biconnectivity and triconnectivity
- Ramachandran
- 1993
Citation Context: ...d that the problem of computing an open ear decomposition in parallel is in NC [30]. Ear decomposition has also been used in designing efficient sequential and parallel algorithms for triconnectivity [37] and 4-connectivity [23]. In addition to graph connectivity, ear decomposition has been used in graph embeddings (see [9]). The sequential algorithm: Ramachandran [37] gave a linear-time algorithm for...

23 | The queue-read queue-write PRAM model: Accounting for contention in parallel algorithms
- Gibbons, Matias, et al.
- 1999

22 | Improved algorithms for graph four-connectivity
- Kanevsky, Ramachandran
- 1991
Citation Context: ...mputing an open ear decomposition in parallel is in NC [30]. Ear decomposition has also been used in designing efficient sequential and parallel algorithms for triconnectivity [37] and 4-connectivity [23]. In addition to graph connectivity, ear decomposition has been used in graph embeddings (see [9]). The sequential algorithm: Ramachandran [37] gave a linear-time algorithm for ear decomposition based...

20 | Designing Practical Efficient Algorithms for Symmetric Multiprocessors
- Helman, JáJá
- 1999

20 | Efficient parallel ear decomposition with applications (manuscript)
- Miller, Ramachandran
- 1986
Citation Context: ...ned ear labels by choosing the smallest label of any non-tree edge whose cycle contains it. This algorithm runs in O((m + n) log n) time. 4.2 The PRAM Algorithm The PRAM algorithm for ear decomposition [32,33] is based on the second sequential algorithm. The first step computes a spanning tree in O(log n) time, using O(n + m) processors. The tree can then be rooted and levels and parents assigned to nodes ...
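The labeling scheme quoted in this context (build and root a spanning tree, then give each tree edge the smallest label among non-tree edges whose fundamental cycle contains it) can be sketched as follows. This is an illustrative reconstruction, not code from the paper: the edge-list input, the BFS spanning tree, and labeling a non-tree edge by the pair (level of its LCA, edge id) are all our assumptions.

```python
from collections import deque

def ear_labels(n, edges):
    """Sketch of the sequential ear-labeling scheme described above, for a
    connected, biconnected graph on n vertices given as an edge list."""
    adj = [[] for _ in range(n)]
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))

    # Build a BFS spanning tree rooted at vertex 0, recording levels/parents.
    parent = [-1] * n
    parent_edge = [-1] * n
    level = [-1] * n
    level[0] = 0
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v, i in adj[u]:
            if level[v] == -1:
                level[v], parent[v], parent_edge[v] = level[u] + 1, u, i
                queue.append(v)

    tree_edges = set(parent_edge[1:])
    INF = (n, len(edges))              # larger than any real label
    label = [INF] * len(edges)

    for i, (u, v) in enumerate(edges):
        if i in tree_edges:
            continue
        # Find the LCA of u and v by walking up the tree.
        a, b = u, v
        while a != b:
            if level[a] < level[b]:
                a, b = b, a
            a = parent[a]
        label[i] = (level[a], i)       # the non-tree edge's own ear label
        # Every tree edge on the fundamental cycle of edge i takes the
        # smallest such label seen so far.
        for x in (u, v):
            while x != a:
                e = parent_edge[x]
                label[e] = min(label[e], (level[a], i))
                x = parent[x]
    return label
```

On a triangle all three edges receive the same label (one ear); on a 4-cycle with a chord the edges split into two ears, the expected shape of an ear decomposition.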

15 | Computing ears and branchings in parallel
- Lovász
- 1985
Citation Context: ...r decomposition and showed that a graph has an open ear decomposition if and only if it is biconnected [46]. Lovász showed that the problem of computing an open ear decomposition in parallel is in NC [30]. Ear decomposition has also been used in designing efficient sequential and parallel algorithms for triconnectivity [37] and 4-connectivity [23]. In addition to graph connectivity, ear decomposition ...

12 | Implementation of parallel graph algorithms on a massively parallel SIMD computer with virtual processing
- Hsu, Ramachandran, et al.
- 1995
Citation Context: ...example of experimental performance analysis for a nontrivial parallel implementation. 2 Related Work Several groups have conducted experimental studies of graph algorithms on parallel architectures [19,20,26,39,40,44]. Their approach to producing a parallel program is similar to ours (especially that of Ramachandran et al. [17]), but their test platforms have not provided them with a true, scalable, UMA shared-memo...

11 | The cache performance of traversals and random accesses
- Ladner, Fix, et al.
- 1999
Citation Context: ...ork spent in synchronization (barrier calls). The first two are closely related: good data and task partitioning will ensure good locality; coupling such partitioning with cache-sensitive coding (see [27,28,29,34] for discussions) provides programs that take best advantage of the architecture. Minimizing the work done in synchronization barriers is a fairly s...

10 | Leonardo: a software visualization system
- Crescenzi, Demetrescu, et al.
- 1997
Citation Context: ...n massively parallel implementations (using a 65,536-processor CM2). Finally, ear decomposition is interesting in its own right, as it is used in a variety of applications from computational geometry [7,10,11,24,25] and structural engineering [12,13] to material physics and molecular biology [14]. The efficient parallel solution of many computational problems often requires approaches that depart completely from t...

9 | Predicting performance on SMPs. A case study: The SGI Power Challenge
- Amato, Perdue, et al.
- 1999

9 | Efficient massively parallel implementation of some combinatorial algorithms, Theoretical Computer Science
- Hsu, Ramachandran
- 1996
Citation Context: ...example of experimental performance analysis for a nontrivial parallel implementation. 2 Related Work Several groups have conducted experimental studies of graph algorithms on parallel architectures [19,20,26,39,40,44]. Their approach to producing a parallel program is similar to ours (especially that of Ramachandran et al. [17]), but their test platforms have not provided them with a true, scalable, UMA shared-memo...

8 | Experimental evaluation of QSM: a simple shared-memory model
- Grayson, Dahlin, et al.
- 1998
Citation Context: ...ucted experimental studies of graph algorithms on parallel architectures [19,20,26,39,40,44]. Their approach to producing a parallel program is similar to ours (especially that of Ramachandran et al. [17]), but their test platforms have not provided them with a true, scalable, UMA shared-memory environment or have relied on ad hoc hardware [26]. Thus ours is the first study of speedup over a significan...

7 | Hammock-on-Ears Decomposition: A Technique for the Efficient Parallel Solution
- Kavvadias, Pantziou, et al.
- 1996
Citation Context: ...n massively parallel implementations (using a 65,536-processor CM2). Finally, ear decomposition is interesting in its own right, as it is used in a variety of applications from computational geometry [7,10,11,24,25] and structural engineering [12,13] to material physics and molecular biology [14]. The efficient parallel solution of many computational problems often requires approaches that depart completely from t...

7 | List Ranking and List Scan on the Cray C-90
- Reid-Miller
- 1994

6 | Generic rigidity of molecular graphs via ear decomposition
- Franzblau
Citation Context: ...tion is interesting in its own right, as it is used in a variety of applications from computational geometry [7,10,11,24,25] and structural engineering [12,13] to material physics and molecular biology [14]. The efficient parallel solution of many computational problems often requires approaches that depart completely from those used for sequential solutions. In the area of graph algorithms, for instanc...

6 | An optimal distributed ear decomposition algorithm with applications to biconnectivity and outerplanarity testing
- Kazmierczak, Radhakrishnan
- 2000
Citation Context: ...n massively parallel implementations (using a 65,536-processor CM2). Finally, ear decomposition is interesting in its own right, as it is used in a variety of applications from computational geometry [7,10,11,24,25] and structural engineering [12,13] to material physics and molecular biology [14]. The efficient parallel solution of many computational problems often requires approaches that depart completely from t...

5 | Combinatorial algorithm for a lower bound on frame rigidity
- Franzblau
- 1995
Citation Context: ...ing a 65,536-processor CM2). Finally, ear decomposition is interesting in its own right, as it is used in a variety of applications from computational geometry [7,10,11,24,25] and structural engineering [12,13] to material physics and molecular biology [14]. The efficient parallel solution of many computational problems often requires approaches that depart completely from those used for sequential solutio...

4 | Parallel Recognition of Series Parallel Graphs
- Eppstein
- 1992

4 | Fast, Efficient Parallel Algorithms for Some Graph Problems
- Savage, JáJá
- 1981
Citation Context: ...9]). The sequential algorithm: Ramachandran [37] gave a linear-time algorithm for ear decomposition based on depth-first search. Another sequential algorithm that lends itself to parallelization (see [22,33,37,42]) finds the labels for each edge as follows. First, a spanning tree is found for the graph; the tree is then arbitrarily rooted and each vertex is a...

3 | Ear Decompositions of Matching Covered Graphs
- Carvalho, Lucchesi, et al.
- 1999

3 | A General-Purpose Shared-Memory Model for Parallel Computation
- Ramachandran
- 1999

3 | Better trade-offs for parallel list ranking
- Sibeyn
- 1997

2 | Efficient Parallel Graph Algorithms Based on Open Ear Decomposition
- Ibarra, Richards
- 1993
Citation Context: ...of Qi, i>0, is contained in some Qj, j<i. – No internal vertex of Qi, i>0, is contained in any Qj, for j<i. Thus a vertex may belong to more than one ear, but an edge is contained in exactly one ear [21]. If the endpoints of the ear do not coincide, then the ear is open; otherwise, the ear is closed. An open ear decomposition is an ear decomposition in which every ear is open. Figure 5 in the appendix...
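The definitional properties listed in this context translate directly into a validity check. The following is a hypothetical sketch, not code from the paper; representing each ear as its vertex sequence is our assumption.

```python
def is_open_ear_decomposition(ears):
    """Check the properties quoted above, with each ear given as a vertex
    sequence: Q0 is a cycle, and every later ear is an open path whose
    endpoints lie in earlier ears and whose internal vertices are all new."""
    if not ears or len(ears[0]) < 2 or ears[0][0] != ears[0][-1]:
        return False                    # Q0 must be a cycle
    seen = set(ears[0])
    for ear in ears[1:]:
        if ear[0] == ear[-1]:
            return False                # closed ear: decomposition not open
        if ear[0] not in seen or ear[-1] not in seen:
            return False                # endpoints must touch earlier ears
        if any(v in seen for v in ear[1:-1]):
            return False                # internal vertices must be new
        seen.update(ear)
    return True
```

For example, the cycle 0-1-2-0 followed by the path 2-3-0 is a valid open ear decomposition, while appending the closed ear 3-4-3 is not.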

1 | Graph Ear Decompositions and Graph Embeddings
- Chen, Kanchi
- 1999
Citation Context: ...in designing efficient sequential and parallel algorithms for triconnectivity [37] and 4-connectivity [23]. In addition to graph connectivity, ear decomposition has been used in graph embeddings (see [9]). The sequential algorithm: Ramachandran [37] gave a linear-time algorithm for ear decomposition based on depth-first search. Another sequential algorithm that lends itself to parallelization (see [2...

1 | Ear Decomposition with Bounds on Ear Length
- Franzblau
- 1999
Citation Context: ...ing a 65,536-processor CM2). Finally, ear decomposition is interesting in its own right, as it is used in a variety of applications from computational geometry [7,10,11,24,25] and structural engineering [12,13] to material physics and molecular biology [14]. The efficient parallel solution of many computational problems often requires approaches that depart completely from those used for sequential solutio...

1 | Non-Separable and Planar Graphs. Transactions of the American Mathematical Society, 34:339–362
- Whitney
- 1932
Citation Context: ...y ear is open. Figure 5 in the appendix illustrates these concepts. Whitney first studied open ear decomposition and showed that a graph has an open ear decomposition if and only if it is biconnected [46]. Lovász showed that the problem of computing an open ear decomposition in parallel is in NC [30]. Ear decomposition has also been used in designing efficient sequential and parallel algorithms for tr...