## Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs) (2002)

Venue: Proc. 9th Int’l Conf. on High Performance Computing (HiPC 2002), volume 2552 of Lecture Notes in Computer Science

Citations: 22 (7 self-citations)

### BibTeX

@INPROCEEDINGS{Bader02evaluatingarithmetic,

author = {David A. Bader and Sukanya Sreshta and Nina R. Weisse-Bernstein},

title = {Evaluating Arithmetic Expressions Using Tree Contraction: A Fast and Scalable Parallel Implementation for Symmetric Multiprocessors (SMPs)},

booktitle = {Proc. 9th Int’l Conf. on High Performance Computing (HiPC 2002), volume 2552 of Lecture Notes in Computer Science},

year = {2002},

pages = {63--75},

publisher = {Springer-Verlag}

}

### Abstract

The ability to provide uniform shared-memory access to a significant number of processors in a single SMP node brings us much closer to the ideal PRAM parallel computer. In this paper, we develop new techniques for designing a uniform shared-memory algorithm from a PRAM algorithm and present the results of an extensive experimental study demonstrating that the resulting programs scale nearly linearly across a significant range of processors and across the entire range of instance sizes tested. This linear speedup with the number of processors is one of the first ever attained in practice for intricate combinatorial problems. The example we present in detail here is for evaluating arithmetic expression trees using the algorithmic techniques of list ranking and tree contraction; this problem is not only of interest in its own right, but is representative of a large class of irregular combinatorial problems that have simple and efficient sequential implementations and fast PRAM algorithms, but have no known efficient parallel implementations. Our results thus offer promise for bridging the gap between the theory and practice of shared-memory parallel algorithms.
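
For contrast with the parallel approach, the "simple and efficient sequential implementation" of expression evaluation can be a single post-order pass. The sketch below is illustrative only and assumes a hypothetical representation in which a leaf is a number and an internal node is a tuple `(op, left, right)` with `op` either `'+'` or `'*'`; the name `evaluate` is introduced here and does not come from the paper. An explicit stack replaces recursion so that deep, unbalanced trees do not hit Python's recursion limit.

```python
def evaluate(expr):
    """Evaluate an expression tree in O(n) time with an explicit stack (post-order)."""
    values = []                       # operand stack of already-evaluated subtrees
    stack = [(expr, False)]           # (node, children_already_evaluated)
    while stack:
        node, done = stack.pop()
        if not isinstance(node, tuple):
            values.append(node)       # leaf: its value is the number itself
        elif done:
            right = values.pop()      # right subtree finished last, so it is on top
            left = values.pop()
            values.append(left + right if node[0] == '+' else left * right)
        else:
            _, left, right = node
            stack.append((node, True))    # combine after both children are done
            stack.append((right, False))
            stack.append((left, False))   # popped first, so the left child is evaluated first
    return values[0]
```

For example, `evaluate(('*', ('+', 2, 3), ('+', 4, 5)))` returns 45. The pass is inherently sequential, since each combine waits on both subtrees, which is what tree contraction is designed to get around.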

### Citations

673 | An Introduction to Parallel Algorithms
- JáJá
- 1992

Citation Context: ...for a significant number of processors brings us much closer to the ideal parallel computer envisioned over 20 years ago by theoreticians, the Parallel Random Access Machine (PRAM) (see [13, 19]), and thus may enable us at last to take advantage of 20 years of research in PRAM algorithms for various irregular computations. Moreover, as supercomputers increasingly use SMP clusters, SMP computat...

149 | Synthesis of Parallel Algorithms
- Reif
- 1993

Citation Context: ...for a significant number of processors brings us much closer to the ideal parallel computer envisioned over 20 years ago by theoreticians, the Parallel Random Access Machine (PRAM) (see [13, 19]), and thus may enable us at last to take advantage of 20 years of research in PRAM algorithms for various irregular computations. Moreover, as supercomputers increasingly use SMP clusters, SMP computat...

122 | Parallel tree contraction and its application
- Miller, Reif
- 1985

Citation Context: ...f the expression at the root of the tree. Hence, AEE is a direct application of the well-studied tree contraction technique, a systematic way of shrinking a tree into a single vertex. Miller and Reif [16] designed an exclusive-read exclusive-write (EREW) PRAM algorithm for evaluating any arithmetic expression of size n, which runs in O(log n) time using O(n) processors (with O(n log n) work). Subsequentl...
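
Tree contraction for AEE is usually built from one primitive, rake, which deletes a leaf together with its parent and folds the parent's operation into a linear label (a, b) carried by the surviving sibling, so that every node hands a·x + b to its parent. The following is a sequential simulation under stated assumptions, not the paper's implementation: only `+` and `*` are supported (composing either with a linear label yields another linear label), the names `Node`, `internal`, `leaf`, `rake`, and `contract_and_evaluate` are hypothetical, and leaves are numbered by a plain left-to-right traversal rather than by list ranking. Each round rakes the odd-numbered leaves that are left children, then the odd-numbered leaves that are right children, and keeps the even-numbered leaves for the next round, which is a variant of the standard textbook rake schedule.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)
class Node:
    op: Optional[str] = None          # '+' or '*' for internal nodes, None for leaves
    value: int = 0                    # leaf value
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    parent: Optional["Node"] = None
    a: int = 1                        # label: node passes a * (its value) + b to its parent
    b: int = 0

def internal(op, left, right):
    node = Node(op=op, left=left, right=right)
    left.parent = right.parent = node
    return node

def leaf(x):
    return Node(value=x)

def leaves_left_to_right(root):
    out, stack = [], [root]
    while stack:
        v = stack.pop()
        if v.op is None:
            out.append(v)
        else:
            stack.append(v.right)     # push right first so the left subtree is visited first
            stack.append(v.left)
    return out

def rake(u):
    """Delete leaf u and its parent p; u's sibling s takes p's place. Returns s if p was the root."""
    p = u.parent
    s = p.right if p.left is u else p.left
    c = u.a * u.value + u.b           # constant that u passes up
    if p.op == '+':                   # p computes c + (s.a*x + s.b)
        a, b = s.a, s.b + c
    else:                             # p computes c * (s.a*x + s.b)
        a, b = c * s.a, c * s.b
    s.a, s.b = p.a * a, p.a * b + p.b # fold in p's own label
    g = p.parent
    s.parent = g
    u.parent = p.parent = None        # detach the removed nodes
    if g is None:
        return s
    if g.left is p:
        g.left = s
    else:
        g.right = s
    return None

def contract_and_evaluate(root):
    leaves = leaves_left_to_right(root)
    while root.op is not None:        # until the tree is a single leaf
        odd = leaves[0::2]            # leaves numbered 1, 3, 5, ... (1-based)
        for want_left in (True, False):
            for u in odd:             # in the parallel algorithm these rakes run concurrently
                p = u.parent
                if p is not None and (p.left is u) == want_left:
                    new_root = rake(u)
                    if new_root is not None:
                        root = new_root
        leaves = leaves[1::2]         # survivors, renumbered 1, 2, 3, ...
    return root.a * root.value + root.b
```

On `(2 + 3) * (4 + 5)`, the first round rakes leaves 2 and 4, the second rakes leaf 3, and the tree ends as leaf 5 with label (5, 20), giving 5·5 + 20 = 45. Because the number of leaves halves every round, the simulation performs O(log n) rounds.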

75 | Starfire: Extending the SMP Envelope
- Charlesworth
- 1998

Citation Context: ...wer than the largest SMPs in terms of their worst-case memory access times. The largest SMP architecture to date, the Sun E15K [5] (a system three to five times faster than its predecessor, the E10K [4]), uses a combination of data crossbar switches, multiple snooping buses, and sophisticated cache handling to achieve UMA across the entire memory. Of course, there remains a large difference between ...

64 | Efficient parallel graph algorithms for coarse grained multicomputers and BSP
- Caceres, Dehne, et al.
- 1997

Citation Context: ...oped a simplified version of the algorithm in [8], which runs in the same time-processor bounds (O(log n) time, O(n/log n) processors, and O(n) work on the EREW PRAM). Recently, several researchers [3, 7] have presented the theoretical observation that this classical PRAM algorithm for tree contraction on a tree T with n vertices can run on the Coarse-Grained Multicomputer (CGM) parallel machine model with p pro...

55 | SIMPLE: A methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs)
- Bader, JáJá
- 1999

Citation Context: ...odel used for analyzing the SMP algorithm and a thorough analysis of the algorithm. SMP Libraries. Our practical programming environment for SMPs is based upon the SMP Node Library component of SIMPLE [1], which provides a portable framework for describing SMP algorithms using the single-program multiple-data (SPMD) program style. This framework is a software layer built from POSIX threads that allows th...
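
In the SPMD style described here, every thread runs the same function, distinguished only by its thread id, and threads synchronize at barriers. The sketch below shows that program shape with Python's `threading` module standing in for POSIX threads; it does not use SIMPLE's actual API, and `spmd_run` and `parallel_sum` are hypothetical names. Because of CPython's global interpreter lock, it illustrates structure only and will not show speedup.

```python
import threading

def spmd_run(p, body):
    """Run body(tid, p, barrier) on p threads: same program, different data (SPMD)."""
    barrier = threading.Barrier(p)
    threads = [threading.Thread(target=body, args=(tid, p, barrier)) for tid in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def parallel_sum(data, p):
    partial = [0] * p                 # shared array: one slot per thread, no locking needed
    total = [0]

    def body(tid, p, barrier):
        n = len(data)
        lo, hi = tid * n // p, (tid + 1) * n // p   # block partition owned by this thread
        partial[tid] = sum(data[lo:hi])
        barrier.wait()                # every partial sum is written before anyone reads them
        if tid == 0:
            total[0] = sum(partial)

    spmd_run(p, body)
    return total[0]
```

For example, `parallel_sum(list(range(1000)), 4)` returns 499500. The block partition `tid * n // p` gives each thread a contiguous slice, which is the usual SMP choice for cache locality.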

49 | Optimal parallel evaluation of tree-structured computations by raking
- Kosaraju, Delcher
- 1988

Citation Context: ...th O(n log n) work). Subsequently, Cole and Vishkin [6] and Gibbons and Rytter [8] independently developed O(log n)-time, O(n/log n)-processor EREW PRAM algorithms (with O(n) work). Kosaraju and Delcher [15] developed a simplified version of the algorithm in [8], which runs in the same time-processor bounds (O(log n) time, O(n/log n) processors, and O(n) work on the EREW PRAM). Recently, several researche...

41 | The accelerated centroid decomposition technique for optimal parallel tree evaluation in logarithmic time (Algorithmica)
- Cole, Vishkin
- 1988

Citation Context: ...ive-read exclusive-write (EREW) PRAM algorithm for evaluating any arithmetic expression of size n, which runs in O(log n) time using O(n) processors (with O(n log n) work). Subsequently, Cole and Vishkin [6] and Gibbons and Rytter [8] independently developed O(log n)-time, O(n/log n)-processor EREW PRAM algorithms (with O(n) work). Kosaraju and Delcher [15] developed a simplified version of the algorithm ...

40 | The Sun Fireplane System Interconnect
- Charlesworth
- 2001

Citation Context: ... other words, message-based architectures are two orders of magnitude slower than the largest SMPs in terms of their worst-case memory access times. The largest SMP architecture to date, the Sun E15K [5] (a system three to five times faster than its predecessor, the E10K [4]), uses a combination of data crossbar switches, multiple snooping buses, and sophisticated cache handling to achieve UMA acros...

33 | List ranking and list scan on the Cray C-90
- Reid-Miller
- 1994

Citation Context: ...g p) communication rounds with O(n/p) local computation per round. 2 Related Experimental Work. Several groups have conducted experimental studies of graph algorithms on parallel architectures (for example, [11, 12, 14, 18, 20, 9]). However, none of these related works use test platforms that provide a true, scalable, UMA shared-memory environment, and still other studies have relied on ad hoc hardware [14]. Thus ours is the fir...

22 | Optimal parallel algorithm for dynamic expression evaluation and context-free recognition
- Gibbons, Rytter
- 1989

Citation Context: ...REW) PRAM algorithm for evaluating any arithmetic expression of size n, which runs in O(log n) time using O(n) processors (with O(n log n) work). Subsequently, Cole and Vishkin [6] and Gibbons and Rytter [8] independently developed O(log n)-time, O(n/log n)-processor EREW PRAM algorithms (with O(n) work). Kosaraju and Delcher [15] developed a simplified version of the algorithm in [8], which runs in the sa...

20 | Using PRAM algorithms on a uniform-memory-access shared-memory architecture
- Bader, Illendula, et al.
- 2001

Citation Context: ... [14]. Thus ours is the first study of speedup for over tens of processors (with promise to scale over a significant range of processors) on a commercially available platform. In a recent work of ours [2], we study the problem of decomposing graphs with the ear decomposition using similar shared-memory platforms. Ou...

20 | Designing Practical Efficient Algorithms for Symmetric Multiprocessors
- Helman, JáJá
- 1999

Citation Context: ...its successor. This is followed by list ranking to obtain a consecutive labeling of the leaves. Our implementation uses the SMP list ranking algorithm and implementation developed by Helman and JáJá [10], which performs the following main steps: 1. Finding the head h of the list, which is given by h = n(n−1)/2 − Z, where Z is the sum of successor indices of all the nodes in the list. 2. Partitioning the in...
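
The head-finding step quoted above works because every node except the head appears exactly once as some node's successor, so subtracting Z from 0 + 1 + ... + (n − 1) = n(n−1)/2 leaves h. The sketch below checks that identity and then ranks the list with a sublist-based approach; since the quoted step list is cut off after step 2, the remaining steps (walk each sublist, rank the sublists, add offsets) are an assumption, as is the evenly spaced choice of splitters. Nodes are numbered 0 to n − 1 and the tail's successor is a hypothetical sentinel `NIL = -1`, which is left out of Z; `find_head` and `list_rank` are illustrative names. The per-splitter walks are independent, so an SMP version would hand them to different threads.

```python
NIL = -1  # hypothetical sentinel: successor of the tail

def find_head(succ):
    """h = n(n-1)/2 - Z, where Z sums the successor indices (the tail's NIL is excluded)."""
    n = len(succ)
    z = sum(s for s in succ if s != NIL)
    return n * (n - 1) // 2 - z

def list_rank(succ, s):
    """Sequential simulation of sublist-based list ranking; rank[v] = distance from the head."""
    n = len(succ)
    head = find_head(succ)
    # Splitters: the head plus s evenly spaced nodes (assumed choice).
    splitters = sorted({head} | {i * n // s for i in range(s)})
    sub_of = [-1] * n                 # sublist id if v is a splitter, else -1
    for k, v in enumerate(splitters):
        sub_of[v] = k
    owner, local = [0] * n, [0] * n   # sublist of each node, rank inside its sublist
    length = [0] * len(splitters)
    nxt = [-1] * len(splitters)       # sublist that follows sublist k (-1 for the last)
    for k, start in enumerate(splitters):  # independent walks: one per thread on an SMP
        v, r = start, 0
        while True:
            owner[v], local[v] = k, r
            r += 1
            w = succ[v]
            if w == NIL or sub_of[w] != -1:   # stop at the tail or the next splitter
                nxt[k] = -1 if w == NIL else sub_of[w]
                break
            v = w
        length[k] = r
    # Rank the (few) sublists sequentially, starting from the head's sublist.
    offset = [0] * len(splitters)
    k, acc = sub_of[head], 0
    while k != -1:
        offset[k], acc = acc, acc + length[k]
        k = nxt[k]
    return [offset[owner[v]] + local[v] for v in range(n)]
```

For the list 3 → 0 → 4 → 1 → 2, encoded as `succ = [4, 2, NIL, 0, 1]`, `find_head(succ)` returns 3 and `list_rank(succ, 2)` returns `[1, 3, 4, 0, 2]`.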

12 | Implementation of parallel graph algorithms on a massively parallel SIMD computer with virtual processing
- Hsu, Ramachandran, et al.
- 1995

Citation Context: ...g p) communication rounds with O(n/p) local computation per round. 2 Related Experimental Work. Several groups have conducted experimental studies of graph algorithms on parallel architectures (for example, [11, 12, 14, 18, 20, 9]). However, none of these related works use test platforms that provide a true, scalable, UMA shared-memory environment, and still other studies have relied on ad hoc hardware [14]. Thus ours is the fir...

9 | Experimental evaluation of QSM: A simple shared-memory model
- Grayson, Dahlin, et al.
- 1998

Citation Context: ...g p) communication rounds with O(n/p) local computation per round. 2 Related Experimental Work. Several groups have conducted experimental studies of graph algorithms on parallel architectures (for example, [11, 12, 14, 18, 20, 9]). However, none of these related works use test platforms that provide a true, scalable, UMA shared-memory environment, and still other studies have relied on ad hoc hardware [14]. Thus ours is the fir...

9 | Efficient massively parallel implementation of some combinatorial algorithms (Theoretical Computer Science)
- Hsu, Ramachandran
- 1996

4 | Better trade-offs for parallel list ranking
- Sibeyn
- 1997