Results 1 - 10 of 44
Scan primitives as parallel operations
 IEEE Transactions on Computers
, 1989
Prefix Sums and Their Applications
Cited by 102 (2 self)
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel ...
Optimal Doubly Logarithmic Parallel Algorithms Based On Finding All Nearest Smaller Values
, 1993
Cited by 37 (7 self)
The all nearest smaller values problem is defined as follows. Let A = (a_1, a_2, ..., a_n) be n elements drawn from a totally ordered domain. For each a_i, 1 ≤ i ≤ n, find the two nearest elements in A that are smaller than a_i (if such exist): the left nearest smaller element a_j (with j < i) and the right nearest smaller element a_k (with k > i). We give an O(log log n) time optimal parallel algorithm for the problem on a CRCW PRAM. We apply this algorithm to achieve optimal O(log log n) time parallel algorithms for four problems: (i) Triangulating a monotone polygon, (ii) Preprocessing for answering range minimum queries in constant time, (iii) Reconstructing a binary tree from its inorder and either preorder or postorder numberings, (iv) Matching a legal sequence of parentheses. We also show that any optimal CRCW PRAM algorithm for the triangulation problem requires Ω(log log n) time. Dept. of Computing, King's College London, The Strand, London WC2R 2LS, England. ...
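To make the problem statement above concrete, here is a minimal sequential sketch of all nearest smaller values using a stack. It is not the O(log log n) CRCW PRAM algorithm from the paper; it only illustrates the input and output. The function name `all_nearest_smaller_values` and the convention of returning indices (with `None` when no smaller element exists) are assumptions made for this example.

```python
def all_nearest_smaller_values(a):
    """Return (left, right): indices of the nearest strictly smaller
    element on each side of every position, or None if there is none.

    Sequential O(n) stack-based sketch, for illustration only.
    """
    n = len(a)
    left = [None] * n
    right = [None] * n

    # Left-to-right pass: the stack holds indices whose values are
    # strictly increasing from bottom to top.
    stack = []
    for i in range(n):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        left[i] = stack[-1] if stack else None
        stack.append(i)

    # Right-to-left pass: same idea, mirrored.
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        right[i] = stack[-1] if stack else None
        stack.append(i)

    return left, right


left, right = all_nearest_smaller_values([3, 1, 4, 1, 5, 9, 2, 6])
print(left)
print(right)
```

Each index is pushed and popped at most once per pass, so the sketch does linear work in total; that is the work bound an optimal parallel algorithm has to match.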
Efficient Low-Contention Parallel Algorithms
, 1996
Cited by 32 (13 self)
The queue-read, queue-write (QRQW) parallel random access machine (PRAM) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type non-combining networks. This paper describes fast, low-contention, work-optimal, randomized QRQW PRAM algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known EREW PRAM algorithms for these problems, while avoiding the high-contention steps typical of CRCW PRAM algorithms. An illustrative experiment demonstrates the performance advantage of a new QRQW random permutation algorithm when compared with the popular EREW algorithm. Finally, this paper presents new randomized algorithms for integer sorting and general sorting.
Parallel Algorithms with Optimal Speedup for Bounded Treewidth
 Proceedings 22nd International Colloquium on Automata, Languages and Programming
, 1995
Cited by 31 (9 self)
We describe the first parallel algorithm with optimal speedup for constructing minimum-width tree decompositions of graphs of bounded treewidth. On n-vertex input graphs, the algorithm works in O((log n)^2) time using O(n) operations on the EREW PRAM. We also give faster parallel algorithms with optimal speedup for the problem of deciding whether the treewidth of an input graph is bounded by a given constant and for a variety of problems on graphs of bounded treewidth, including all decision problems expressible in monadic second-order logic. On n-vertex input graphs, the algorithms use O(n) operations together with O(log n log* n) time on the EREW PRAM, or O(log n) time on the CRCW PRAM.
On Parallel Hashing and Integer Sorting
, 1991
Cited by 25 (8 self)
The problem of sorting n integers from a restricted range [1..m], where m is superpolynomial in n, is considered. An o(n log n) randomized algorithm is given. Our algorithm takes O(n log log m) expected time and O(n) space. (Thus, for m = n^polylog(n) we have an O(n log log n) algorithm.) The algorithm is parallelizable. The resulting parallel algorithm achieves optimal speedup. Some features of the algorithm make us believe that it is relevant for practical applications. A result of independent interest is a parallel hashing technique. The expected construction time is logarithmic using an optimal number of processors, and searching for a value takes O(1) time in the worst case. This technique enables drastic reduction of space requirements for the price of using randomness. Applicability of the technique is demonstrated for the parallel sorting algorithm, and for some parallel string matching algorithms. The parallel sorting algorithm is designed for a strong and non-standard mo...
Oblivious algorithms for multicores and network of processors
, 2009
Cited by 23 (6 self)
We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multi-level caching model for multicores, and we propose a multicore-oblivious approach to algorithms and schedulers for HM. We instantiate this approach with provably efficient multicore-oblivious algorithms for matrix and prefix sum computations, FFT, the Gaussian Elimination paradigm (which represents an important class of computations including Floyd-Warshall's all-pairs shortest paths, Gaussian Elimination and LU decomposition without pivoting), sorting, list ranking, Euler tours and connected components. We then use the network-oblivious framework proposed earlier as an oblivious framework for a network of processors, and we present provably efficient network-oblivious algorithms for sorting, the Gaussian Elimination paradigm, list ranking, Euler tours and connected components. Many of these network-oblivious algorithms perform efficiently also when executed on the Decomposable-BSP.
Parallel Implementation of Tree Skeletons
 Journal of Parallel and Distributed Computing
, 1996
Cited by 18 (2 self)
Trees are a useful data type, but they are not routinely included in parallel programming systems because their irregular structure makes them seem hard to compute with efficiently. We present a method for constructing implementations of skeletons, high-level homomorphic operations on trees, that execute in parallel. In particular, we consider the case where the size of the tree is much larger than the number of processors available, so that tree data must be partitioned. The approach uses the theory of categorical data types to derive implementation templates based on tree contraction. Many useful tree operations can be computed in time logarithmic in the size of their argument, on a wide range of parallel systems. 1 Contribution One common approach to general-purpose parallel computation is based on packaging complex operations as templates, or skeletons [3, 12]. Skeletons encapsulate the control and data flow necessary to compute useful operations. This permits software to be written in a way that is independent of particular architectures, and indeed of underlying parallelism at all, while freeing implementations
The Owner Concept for PRAMs
, 1991
Cited by 17 (5 self)
We analyze the owner concept for PRAMs. In OROW-PRAMs each memory cell has one distinct processor that is the only one allowed to write into this memory cell and one distinct processor that is the only one allowed to read from it. By symmetric pointer doubling, a new proof technique for OROW-PRAMs, it is shown that list ranking can be done in O(log n) time by an OROW-PRAM and that LOGSPACE ⊆ OROW-TIME(log n). Then we prove that OROW-PRAMs are a fairly robust model and recognize the same class of languages when the model is modified in several ways and that all kinds of PRAMs intertwine with the NC hierarchy without time loss. Finally it is shown that EREW-PRAMs can be simulated by OREW-PRAMs and ERCW-PRAMs by ORCW-PRAMs. This research was partially supported by the Deutsche Forschungsgemeinschaft, SFB 342, Teilprojekt A4 "Klassifikation und Parallelisierung durch Reduktionsanalyse". Email: rossmani@lan.informatik.tumuenchen.dbp.de Introduction Fortune and Wyllie introduced in...
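As background for the list ranking result mentioned above, the following is a minimal sketch of classic pointer jumping (pointer doubling), simulated sequentially. It is not the symmetric pointer doubling technique of the paper and it ignores the owner restrictions of the OROW model; the function name `list_rank` and the convention that the tail points to itself are assumptions made for this example.

```python
def list_rank(succ):
    """Return rank[i] = number of links from node i to the tail.

    succ[i] is the index of the node after i; the tail points to itself.
    Each loop iteration models one synchronous parallel round in which
    every node doubles the distance its pointer spans.
    """
    n = len(succ)
    nxt = list(succ)
    # Initially every non-tail node is one link away from its successor.
    rank = [0 if nxt[i] == i else 1 for i in range(n)]

    # About log2(n) rounds suffice because the spanned distance doubles.
    rounds = max(1, (n - 1).bit_length())
    for _ in range(rounds):
        # Build new arrays from the old ones so that all nodes update
        # "simultaneously", as they would on a PRAM.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# The list 2 -> 0 -> 3 -> 1, where node 1 is the tail.
print(list_rank([3, 1, 0, 1]))
```

Note that in this plain version many nodes end up reading the tail's cell in the same round, which is the kind of concurrent access that restricted models such as the one above have to avoid.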
Optimal Deterministic Approximate Parallel Prefix Sums and Their Applications
 In Proc. Israel Symp. on Theory and Computing Systems (ISTCS'95)
, 1995
Cited by 14 (0 self)
We show that extremely accurate approximation to the prefix sums of a sequence of n integers can be computed deterministically in O(log log n) time using O(n / log log n) processors in the Common CRCW PRAM model. This complements randomized approximation methods obtained recently by Goodrich, Matias and Vishkin and improves previous deterministic results obtained by Hagerup and Raman. Furthermore, our results completely match a lower bound obtained recently by Chaudhuri. Our results have many applications. Using them we improve upon the best known time bounds for deterministic approximate selection and for deterministic padded sorting. 1 Introduction The computation of prefix sums is one of the most basic tools in the design of fast parallel algorithms (see Blelloch [9] and JáJá [33]). Prefix sums can be computed in O(log n) time and linear work in the EREW PRAM model (Ladner and Fischer [34]) and in O(log n / log log n) time and linear work in the Common CRCW PRAM model (Cole and Vishkin...
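For readers unfamiliar with exact prefix sums, here is a minimal sketch of a logarithmic-depth, linear-work scan, simulated sequentially. It uses the common up-sweep/down-sweep formulation on a balanced tree, which is not necessarily the Ladner and Fischer construction cited above, and it is unrelated to the approximate algorithm of the paper. The function name `prefix_sums` and the padding to a power of two are assumptions made for this example.

```python
def prefix_sums(values):
    """Return inclusive prefix sums of `values` via a two-phase tree scan.

    Every inner `for` loop touches disjoint cells, so on a PRAM each one
    could run as a single parallel step; there are O(log n) such steps.
    """
    n = len(values)
    if n == 0:
        return []

    # Pad to a power of two so the implicit tree is complete.
    size = 1
    while size < n:
        size *= 2
    tree = list(values) + [0] * (size - n)

    # Up-sweep: each internal node accumulates the sum of its subtree.
    step = 1
    while step < size:
        for i in range(2 * step - 1, size, 2 * step):
            tree[i] += tree[i - step]
        step *= 2

    # Down-sweep: push partial sums back down to get exclusive sums.
    tree[size - 1] = 0
    step = size // 2
    while step >= 1:
        for i in range(2 * step - 1, size, 2 * step):
            left = tree[i - step]
            tree[i - step] = tree[i]
            tree[i] += left
        step //= 2

    # Convert exclusive sums to inclusive sums for the original inputs.
    return [tree[i] + values[i] for i in range(n)]


print(prefix_sums([3, 1, 4, 1, 5]))
```

The total number of additions across both sweeps is proportional to n, which is what "linear work" refers to in the bounds quoted above.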