Provably efficient scheduling for languages with fine-grained parallelism
 IN PROC. SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1995
Cited by 92 (27 self)
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or larger more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any
Improved Parallel Integer Sorting without Concurrent Writing
, 1992
Cited by 48 (5 self)
We show that n integers in the range 1..n can be sorted stably on an EREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ operations, for arbitrary given $t \ge \log n \log\log n$, and on a CREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n} + \log n/2^{t/\log n}))$ operations, for arbitrary given $t \ge \log n$. In addition, we are able to sort n arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost $\Theta(\sqrt{\log n})$ closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best previous sequential algorithm. We also show that n integers in the range 1..m can be sorted in $O((\log n)^2)$ time with $O(n)$ operations on an EREW PRAM using a nonstandard word length of $O(\log n \log\log n \log m)$ bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear t...
Doubly Logarithmic Communication Algorithms for Optical Communication Parallel Computers
 In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1994
Cited by 41 (5 self)
In this paper we consider the problem of interprocessor communication on parallel computers that have optical communication networks. We consider the Completely Connected Optical Communication Parallel Computer (OCPC), which has a completely connected optical network, and also the Mesh of Optical Buses Parallel Computer (MOBPC), which has a mesh of optical buses as its communication network. The particular communication problem that we study is that of realizing an h-relation. In this problem, each processor has at most h messages to send and at most h messages to receive. It is clear that any 1-relation can be realized in one communication step on an OCPC. However, the best previously known p-processor OCPC algorithm for realizing an arbitrary h-relation for $h > 1$ requires $\Theta(h + \log p)$ expected communication steps. (This algorithm is due to Valiant and is based on earlier work of Anderson and Miller.) Valiant's algorithm is optimal only for $h = \Omega(\log p)$ and it is an op...
An optical simulation of shared memory
, 1994
Cited by 35 (3 self)
We present a work-optimal randomized algorithm for simulating a shared memory machine (pram) on an optical communication parallel computer (ocpc). The ocpc model is motivated by the potential of optical communication for parallel computation. The memory of an ocpc is divided into modules, one module per processor. Each memory module only services a request on a timestep if it receives exactly one memory request. Our algorithm simulates each step of an $n \lg\lg n$-processor erew pram on an $n$-processor ocpc in $O(\lg\lg n)$ expected delay. (The probability that the delay is longer than this is at most $n^{-\alpha}$ for any constant $\alpha$.) The best previous simulation, due to Valiant, required $\Theta(\lg n)$ expected delay.
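The service rule stated in this abstract (a module answers only when it receives exactly one request in a timestep) can be illustrated with a minimal sketch. The function name `ocpc_step` and the dictionary representation of requests are assumptions made for illustration; this is not the simulation algorithm from the paper.

```python
from collections import Counter


def ocpc_step(requests):
    """Simulate one OCPC timestep (illustrative sketch only).

    `requests` maps each sending processor id to the memory module it
    targets in this step. A module services a request only if it
    receives exactly one; colliding requests all fail and would have
    to be retried in a later step.

    Returns the set of processors whose request was serviced.
    """
    load = Counter(requests.values())  # number of requests per module
    return {proc for proc, module in requests.items() if load[module] == 1}
```

For example, if processors 0 and 1 both target module 5 while processor 2 targets module 7, only processor 2 is serviced in that step.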
Efficient Low-Contention Parallel Algorithms
, 1996
Cited by 34 (14 self)
The queue-read, queue-write (qrqw) parallel random access machine (pram) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type noncombining networks. This paper describes fast, low-contention, work-optimal, randomized qrqw pram algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting. These logarithmic or sublogarithmic time algorithms considerably improve upon the best known erew pram algorithms for these problems, while avoiding the high-contention steps typical of crcw pram algorithms. An illustrative experiment demonstrates the performance advantage of a new qrqw random permutation algorithm when compared with the popular erew algorithm. Finally, this paper presents new randomized algorithms for integer sorting and general sorting.
The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
 Proc. 5th ACM-SIAM Symp. on Discrete Algorithms
, 1997
Cited by 28 (11 self)
This paper introduces the queue-read queue-write (qrqw) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The qrqw pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied crcw pram or erew pram models: the crcw model does not adequately penalize algorithms with high contention to shared-memory locations, while the erew model is too strict in its insistence on zero contention at each step. The qrqw pram is strictly more powerful than the erew pram. This paper shows a separation of log n between the two models, and presents faster and more efficient qrqw algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the qrqw pram with only logarithmic slowdown on Valiant’s bsp model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the best-known emulation result for the erew pram, and considerably improves upon the best-known efficient emulation for the crcw pram on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
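The contrast the abstract draws between the three models can be made concrete with a small cost function. This is a simplified sketch: it charges a step only for memory contention and ignores local computation, and the names `max_contention` and `step_cost` are hypothetical, not taken from the paper.

```python
from collections import Counter


def max_contention(addresses):
    """Largest number of accesses to any single memory location in one step."""
    return max(Counter(addresses).values(), default=0)


def step_cost(addresses, model):
    """Cost charged for one step of shared-memory accesses (simplified).

    `addresses` lists the location touched by each processor in the step.
    - "crcw": concurrent access is free, so every step costs 1.
    - "qrqw": the step costs the maximum contention (at least 1).
    - "erew": any contention above 1 is disallowed.
    """
    k = max_contention(addresses)
    if model == "crcw":
        return 1
    if model == "qrqw":
        return max(1, k)
    if model == "erew":
        if k > 1:
            raise ValueError("EREW forbids concurrent access to a location")
        return 1
    raise ValueError(f"unknown model: {model}")
```

With three processors accessing location 3 and one accessing location 8, the step costs 1 under crcw and 3 under qrqw, and is illegal under erew.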
Real-time parallel hashing on the GPU
 In ACM SIGGRAPH Asia 2009 papers, SIGGRAPH ’09
, 2009
Cited by 26 (6 self)
Figure 1: Overview of our construction for a voxelized Lucy model, colored by mapping x, y, and z coordinates to red, green, and blue respectively (far left). The 3.5 million voxels (left) are input as 32-bit keys and placed into buckets of ≤ 512 items, averaging 409 each (center). Each bucket then builds a cuckoo hash with three subtables and stores them in a larger structure with 5 million entries (right). Close-ups follow the progress of a single bucket, showing the keys allocated to it (center; the bucket is linear and wraps around left to right) and each of its completed cuckoo subtables (right). Finding any key requires checking only three possible locations.
We demonstrate an efficient data-parallel algorithm for building large hash tables of millions of elements in real time. We consider two parallel algorithms for the construction: a classical sparse perfect hashing approach, and cuckoo hashing, which packs elements densely by allowing an element to be stored in one of multiple possible locations. Our construction is a hybrid approach that uses both algorithms. We measure the construction time, access time, and memory usage of our implementations and demonstrate real-time performance on large datasets: for 5 million key-value pairs, we construct a hash table in 35.7 ms using 1.42 times as much memory as the input data itself, and we can access all the elements in that hash table in 15.3 ms. For comparison, sorting the same data requires 36.6 ms, but accessing all the elements via binary search requires 79.5 ms. Furthermore, we show how our hashing methods can be applied to two graphics applications: 3D surface intersection for moving data and geometric hashing for image matching.
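The idea of a cuckoo hash with three subtables, where a lookup probes exactly three locations, can be sketched sequentially as follows. This is an illustration under assumptions, not the paper's data-parallel GPU construction: the class name `CuckooTable`, the hash constants, and the eviction limit are all hypothetical.

```python
class CuckooTable:
    """Sequential sketch of a cuckoo hash set with three subtables."""

    PRIME = 4294967291  # a prime just below 2**32
    # Hypothetical (a, b) constants for h(k) = ((a*k + b) mod PRIME) mod size,
    # one pair per subtable.
    CONSTANTS = [(1000003, 12345), (2000029, 67891), (3000017, 24680)]

    def __init__(self, subtable_size, max_evictions=100):
        self.size = subtable_size
        self.max_evictions = max_evictions
        self.tables = [[None] * subtable_size for _ in self.CONSTANTS]

    def _slot(self, t, key):
        a, b = self.CONSTANTS[t]
        return ((a * key + b) % self.PRIME) % self.size

    def contains(self, key):
        """A lookup checks exactly one slot in each of the three subtables."""
        return any(
            self.tables[t][self._slot(t, key)] == key
            for t in range(len(self.tables))
        )

    def insert(self, key):
        """Insert `key`, evicting occupants round-robin across subtables.

        Returns False if the eviction chain exceeds `max_evictions`; the key
        left in hand is then dropped, and a real implementation would rebuild
        the table with fresh hash functions.
        """
        if self.contains(key):
            return True
        t = 0
        for _ in range(self.max_evictions):
            s = self._slot(t, key)
            evicted = self.tables[t][s]
            self.tables[t][s] = key
            if evicted is None:
                return True
            key = evicted  # re-insert the evicted key into the next subtable
            t = (t + 1) % len(self.tables)
        return False
```

Insertion may move existing keys between subtables, but every stored key always sits at one of its three candidate slots, which is why `contains` never needs more than three probes.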
Delayed path coupling and generating random permutations via distributed stochastic processes
, 1999
Cited by 20 (3 self)
We analyze various stochastic processes for generating permutations almost uniformly at random in distributed and parallel systems. All our protocols are simple, elegant and are based on performing disjoint transpositions executed in parallel. The main challenge is to prove that the output configurations of our processes reach an almost uniform probability distribution very rapidly, i.e., in (low) polylogarithmic time. For the analysis of these protocols we develop a novel technique, called delayed path coupling, for proving rapid mixing of Markov chains. Our approach is an extension of the path coupling method of Bubley and Dyer. We apply delayed path coupling to three stochastic processes for generating random permutations. For one
Efficient Randomized Dictionary Matching Algorithms (Extended Abstract)
, 1992
Cited by 18 (5 self)
The standard string matching problem involves finding all occurrences of a single pattern in a single text. While this approach works well in many application areas, there are some domains in which it is more appropriate to deal with dictionaries of patterns. A dictionary is a set of patterns; the goal of dictionary matching is to find all dictionary patterns in a given text, simultaneously. In string matching, randomized algorithms have primarily made use of randomized hashing functions which convert strings into "signatures" or "fingerprints". We explore the use of fingerprints in conjunction with other randomized and deterministic techniques and data structures. We present several new algorithms for dictionary matching, along with parallel algorithms which are simpler or more efficient than previously known algorithms.
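The fingerprint idea mentioned in this abstract can be illustrated with a rolling hash in the style of Karp and Rabin, applied to a dictionary by grouping patterns of equal length. This is a minimal sequential sketch under assumptions, not one of the paper's algorithms: the function names, the base, and the modulus are illustrative choices, and each fingerprint hit is verified by direct comparison so that hash collisions cannot produce false matches.

```python
def fingerprint(s, base=256, mod=(1 << 61) - 1):
    """Polynomial fingerprint of string `s` (illustrative parameters)."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def find_dictionary_matches(text, patterns, base=256, mod=(1 << 61) - 1):
    """Return sorted (position, pattern) pairs for every dictionary hit."""
    by_length = {}
    for p in patterns:
        if p:  # ignore empty patterns
            by_length.setdefault(len(p), set()).add(p)

    matches = []
    for m, group in by_length.items():
        if m > len(text):
            continue
        # Map each fingerprint to the patterns of length m that produce it.
        table = {}
        for p in group:
            table.setdefault(fingerprint(p, base, mod), set()).add(p)
        high = pow(base, m - 1, mod)  # weight of the window's first character
        fp = fingerprint(text[:m], base, mod)
        for i in range(len(text) - m + 1):
            if fp in table:
                window = text[i:i + m]
                if window in table[fp]:  # verify to rule out collisions
                    matches.append((i, window))
            if i + m < len(text):
                # Slide the window: drop text[i], append text[i + m].
                fp = ((fp - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return sorted(matches)
```

Calling `find_dictionary_matches("abracadabra", ["abra", "cad", "ra"])` reports "abra" at positions 0 and 7, "ra" at 2 and 9, and "cad" at 4.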
Tight Bounds for Parallel Randomized Load Balancing
 Computing Research Repository
, 1992
Cited by 18 (7 self)
We explore the fundamental limits of distributed balls-into-bins algorithms, i.e., algorithms where balls act in parallel, as separate agents. This problem was introduced by Adler et al., who showed that non-adaptive and symmetric algorithms cannot reliably perform better than a maximum bin load of $\Theta(\log\log n/\log\log\log n)$ within the same number of rounds. We present an adaptive symmetric algorithm that achieves a bin load of two in $\log^* n + O(1)$ communication rounds using $O(n)$ messages in total. Moreover, larger bin loads can be traded in for smaller time complexities. We prove a matching lower bound of $(1-o(1))\log^* n$ on the time complexity of symmetric algorithms that guarantee small bin loads at an asymptotically optimal message complexity of $O(n)$. The essential preconditions of the proof are (i) a limit of $O(n)$ on the total number of messages sent by the algorithm and (ii) anonymity of bins, i.e., the port numberings of balls are not globally consistent. In order to show that our technique yields indeed tight bounds, we provide for each assumption an algorithm violating it, in turn achieving a constant maximum bin load in constant time. As an application, we consider the following problem. Given a fully connected graph of n nodes, where each node needs to send and receive up to n messages, and in each round each node may send one message over each link, deliver all messages as quickly as possible to their destinations. We give a simple and robust algorithm of time complexity $O(\log^* n)$ for this task and provide a generalization to the case where all nodes initially hold arbitrary sets of messages. Completing the picture, we give a less practical, but asymptotically optimal algorithm terminating within $O(1)$ rounds. All these bounds hold with high probability.
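To make the setting concrete, the following sketch simulates a naive round-based balls-into-bins protocol with a bin capacity of two. It is explicitly not the adaptive algorithm from the paper and makes no claim about its round complexity: each unplaced ball simply contacts one uniformly random bin per round, and a bin accepts requests until it holds two balls. The function name `naive_balls_into_bins` and its parameters are assumptions for illustration.

```python
import random
from collections import Counter


def naive_balls_into_bins(n, capacity=2, seed=0):
    """Place n balls into n bins in synchronous rounds (naive baseline).

    In each round every unplaced ball contacts one uniformly random bin.
    A bin accepts as many of the round's requests as its remaining
    capacity allows and rejects the rest, which retry next round.

    Returns (rounds_used, load) where load[b] is the final load of bin b.
    """
    rng = random.Random(seed)
    load = [0] * n
    unplaced = n
    rounds = 0
    while unplaced > 0:
        rounds += 1
        # Count how many balls contacted each bin in this round.
        requests = Counter(rng.randrange(n) for _ in range(unplaced))
        for b, count in requests.items():
            accepted = min(count, capacity - load[b])
            load[b] += accepted
            unplaced -= accepted
    return rounds, load
```

Running it for different values of `n` gives a baseline round count against which bounds such as those in the abstract can be compared.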