Results 1–10 of 13
Scans as Primitive Parallel Operations
IEEE Transactions on Computers, 1987
"... In most parallel randomaccess machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can executed in no more time than these parallel memory references. This paper outline an extensive study of ..."
Abstract

Cited by 157 (12 self)
 Add to MetaCart
In most parallel random-access machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can be executed in no more time than these parallel memory references. This paper outlines an extensive study of the effect of including such scan operations as unit-time primitives in the PRAM models. The study concludes that the primitives improve the asymptotic running time of many algorithms by an O(lg n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. We therefore argue that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. This paper describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a mergi...
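The scan-based radix sort mentioned in this abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation (the function names are ours): an exclusive prefix sum over per-bit flags gives each key its destination index, which is the "split" step a scan-based radix sort is built from.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1]."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out

def split_radix_sort(keys, bits=8):
    """Stable LSD radix sort built from the scan-based 'split' step."""
    for b in range(bits):
        flags = [(k >> b) & 1 for k in keys]      # 1 if bit b is set
        zeros = [1 - f for f in flags]
        idx_zero = exclusive_scan(zeros)          # slots for 0-bit keys
        n_zeros = sum(zeros)
        idx_one = [n_zeros + i for i in exclusive_scan(flags)]
        out = [0] * len(keys)
        for k, f, i0, i1 in zip(keys, flags, idx_zero, idx_one):
            out[i1 if f else i0] = k              # scatter to final slot
        keys = out
    return keys

print(split_radix_sort([170, 45, 75, 90, 2, 24]))  # → [2, 24, 45, 75, 90, 170]
```

On a PRAM with unit-time scan primitives, each `exclusive_scan` call above is a single primitive step, which is exactly where the O(lg n) factor is saved.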
Prefix Sums and Their Applications
"... Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel ..."
Abstract

Cited by 95 (2 self)
 Add to MetaCart
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel ...
The Owner Concept for PRAMs
1991
"... We analyze the owner concept for PRAMs. In OROWPRAMs each memory cell has one distinct processor that is the only one allowed to write into this memory cell and one distinct processor that is the only one allowed to read from it. By symmetric pointer doubling, a new proof technique for OROWPRAMs, ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
We analyze the owner concept for PRAMs. In OROW-PRAMs each memory cell has one distinct processor that is the only one allowed to write into this memory cell and one distinct processor that is the only one allowed to read from it. By symmetric pointer doubling, a new proof technique for OROW-PRAMs, it is shown that list ranking can be done in O(log n) time by an OROW-PRAM and that LOGSPACE ⊆ OROW-TIME(log n). Then we prove that OROW-PRAMs are a fairly robust model: they recognize the same class of languages when the model is modified in several ways, and all kinds of PRAMs intertwine with the NC hierarchy without time loss. Finally it is shown that EREW-PRAMs can be simulated by OREW-PRAMs and ERCW-PRAMs by ORCW-PRAMs. This research was partially supported by the Deutsche Forschungsgemeinschaft, SFB 342, Teilprojekt A4 "Klassifikation und Parallelisierung durch Reduktionsanalyse". Email: rossmani@lan.informatik.tu-muenchen.dbp.de
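The list-ranking result above can be illustrated with standard pointer jumping; note this is the classical technique, not the paper's symmetric pointer doubling for OROW-PRAMs, and the sequential simulation below only mimics the O(log n) rounds a PRAM would run in parallel.

```python
def list_rank(succ):
    """List ranking by standard pointer jumping (sequential simulation).
    succ[i] is the next node, or i itself at the tail.
    Returns dist[i] = number of links from node i to the tail."""
    n = len(succ)
    dist = [0 if succ[i] == i else 1 for i in range(n)]
    succ = succ[:]
    for _ in range(max(1, n).bit_length()):    # O(log n) jumping rounds
        # Each round halves the remaining distance: add the successor's
        # count, then jump the pointer past it (all reads before writes).
        dist = [dist[i] + dist[succ[i]] for i in range(n)]
        succ = [succ[succ[i]] for i in range(n)]
    return dist

# A chain 0 -> 1 -> 2 -> 3 -> 4 (node 4 is the tail):
print(list_rank([1, 2, 3, 4, 4]))  # → [4, 3, 2, 1, 0]
```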
Feasible Time-Optimal Algorithms for Boolean Functions on Exclusive-Write PRAMs
1994
"... It was shown some years ago that the computation time for many important Boolean functions of n arguments on concurrentread exclusivewrite parallel randomaccess machines (CREW PRAMs) of unlimited size is at least '(n) 0:72 log 2 n. On the other hand, it is known that every Boolean function of n ..."
Abstract

Cited by 13 (3 self)
 Add to MetaCart
It was shown some years ago that the computation time for many important Boolean functions of n arguments on concurrent-read exclusive-write parallel random-access machines (CREW PRAMs) of unlimited size is at least φ(n) ≈ 0.72 log₂ n. On the other hand, it is known that every Boolean function of n arguments can be computed in φ(n) + 1 steps on a CREW PRAM with n · 2^(n−1) processors and memory cells. In the case of the OR of n bits, n processors and cells are sufficient. In this paper it is shown that for many important functions there are CREW PRAM algorithms that almost meet the lower bound in that they take φ(n) + o(log n) steps, but use only a small number of processors and memory cells (in most cases, n). In addition, the cells only have to store binary words of bounded length (in most cases, length 1). We call such algorithms "feasible". The functions concerned include: the PARITY function and, more generally, all symmetric functions; a large class of Boolean formulas...
The Strict Time Lower Bound and Optimal Schedules for Parallel Prefix with Resource Constraints
IEEE Trans. Comput., 1996
"... Parallel prefix is a fundamental common operation at the core of many important applications, e.g., the Grand Challenge problems, circuit design, digital signal processing, graph optimizations, and computational geometry. Given x 1 ; . . . ; xN , parallel prefix computes x 1 ffi x 2 ffi . . . ffi x ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
Parallel prefix is a fundamental operation at the core of many important applications, e.g., the Grand Challenge problems, circuit design, digital signal processing, graph optimizations, and computational geometry. Given x_1, ..., x_N, parallel prefix computes x_1 ∘ x_2 ∘ ... ∘ x_k, for 1 ≤ k ≤ N, with associative operation ∘. For prefix of N elements on p processors with N > p(p+1)/2, we derive Harmonic Schedules and show that the Harmonic Schedules achieve the strict optimal time (steps), ⌈2(N−1)/(p+1)⌉. We also derive Pipelined Schedules, optimal schedules with ⌈2(N−1)/(p+1)⌉ + ⌈(p−1)/2⌉ − 1 time, which take a constant overhead of ⌈(p−1)/2⌉ time steps more than the strict optimal time but have the smallest loop body. Both the Harmonic Schedules and the Pipelined Schedules are simple and concise, with nice patterns of computation organization, and easy to program. For prefix of N elements on p processors with N ≤ p(p+1)/2, we use an algorithm to constru...
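The strict lower bound from this abstract is easy to evaluate numerically; the sketch below (the function name is ours) shows that for p = 1 it collapses to the sequential optimum of N − 1 steps.

```python
import math

def strict_prefix_time(N, p):
    """Strict time lower bound ceil(2(N-1)/(p+1)) for prefix of N
    elements on p processors, stated for the case N > p(p+1)/2."""
    assert N > p * (p + 1) // 2, "bound is stated for N > p(p+1)/2"
    return math.ceil(2 * (N - 1) / (p + 1))

# p = 1 recovers the sequential optimum of N - 1 steps:
print(strict_prefix_time(100, 1))   # → 99
# More processors shrink the step count by roughly a factor of (p+1)/2:
print(strict_prefix_time(100, 10))  # → 18
```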
An Algebra of Scans
In Mathematics of Program Construction, 2004
"... A parallel prefix circuit takes n inputs x1 , x2 , . . . , xn and produces the n outputs x1 , x1 x2 , . . . , x1 x2 xn , where `#' is an arbitrary associative binary operation. Parallel prefix circuits and their counterparts in software, parallel prefix computations or scans, have numerous app ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
A parallel prefix circuit takes n inputs x1, x2, ..., xn and produces the n outputs x1, x1 # x2, ..., x1 # x2 # ... # xn, where '#' is an arbitrary associative binary operation. Parallel prefix circuits and their counterparts in software, parallel prefix computations or scans, have numerous applications ranging from fast integer addition over parallel sorting to convex hull problems. A parallel prefix circuit can be implemented in a variety of ways taking into account constraints on size, depth, or fan-out. Traditionally, implementations are either defined graphically or by enumerating the underlying graph. Both approaches have their pros and cons. A figure, if well drawn, conveys the possibly recursive structure of the scan but is not amenable to formal manipulation. A description in the form of a graph, while rigorous, obscures the structure of a scan and is equally hard to manipulate. In this paper we show that parallel prefix circuits enjoy a very pleasant algebra. Using only two basic building blocks and four combinators, all standard designs can be described succinctly and rigorously. The rules of the algebra allow us to prove the circuits correct and to derive circuit designs in a systematic manner.
Constructing zerodeficiency parallel prefix circuits of minimum depth
ACM Trans. on Design Automation of Electronic Systems, 2006
"... A parallel prefix circuit has n inputs x1, x2,..., xn, and computes the n outputs yi = xi • xi−1 •···• x1,1 ≤ i ≤ n, in parallel, where • is an arbitrary binary associative operator. Snir proved that the depth t and size s of any parallel prefix circuit satisfy the inequality t +s ≥ 2n−2. Hence, a p ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
A parallel prefix circuit has n inputs x1, x2, ..., xn, and computes the n outputs yi = xi • xi−1 • ··· • x1, 1 ≤ i ≤ n, in parallel, where • is an arbitrary binary associative operator. Snir proved that the depth t and size s of any parallel prefix circuit satisfy the inequality t + s ≥ 2n − 2. Hence, a parallel prefix circuit is said to be of zero deficiency if equality holds. In this article, we provide a different proof for Snir's theorem by capturing the structural information of zero-deficiency prefix circuits. Following our proof, we propose a new kind of zero-deficiency prefix circuit Z(d) by constructing a prefix circuit as wide as possible for a given depth d. It is proved that the Z(d) circuit has the minimal depth among all possible zero-deficiency prefix circuits.
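Snir's inequality can be checked on concrete designs. The sketch below (our own illustration, not from the article) shows that the serial prefix chain meets the bound with equality, while a minimum-depth Sklansky-style circuit for n = 8 (depth ⌈lg n⌉ = 3, size 12 operators) does not.

```python
import math

def deficiency(depth, size, n):
    """Slack in Snir's bound t + s >= 2n - 2 for an n-input prefix
    circuit of depth t and size s (number of • operators)."""
    return depth + size - (2 * n - 2)

n = 8
# The serial chain y_i = x_i • y_{i-1} has depth n-1 and size n-1,
# so it is a zero-deficiency circuit:
print(deficiency(n - 1, n - 1, n))  # → 0

# Sklansky's minimum-depth design for n = 8 uses 12 operators at
# depth 3; it trades size for depth and is not zero-deficiency:
print(deficiency(math.ceil(math.log2(n)), 12, n))  # → 1
```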
Cell-Based Multilevel Carry-Increment Adders with Minimal AT- and PT-Products
1996
"... Carryselect addition techniques imply the computation of double sum and carry bits with subsequent selection of the correct values, resulting in significant area overheads. This overhead increases massively when the selection scheme is applied to multiple levels in order to further reduce computati ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
Carry-select addition techniques imply the computation of double sum and carry bits with subsequent selection of the correct values, resulting in significant area overheads. This overhead increases massively when the selection scheme is applied to multiple levels in order to further reduce computation time. A recently proposed reduced-area scheme for carry-select adders lowers this overhead by computing the carry and sum bits for a block-carry-in value of 0 only and by incrementing them afterwards depending on the final block-carry-in. The resulting carry-increment adder cuts circuit size down by 23% with no change in performance. This paper extends this increment scheme hierarchically to build multilevel carry-increment adders. It is shown that such adders are considerably faster while maintaining the same area complexity. The implemented 2-level carry-increment adder has roughly the same size as the 1-level version, but is up to 29% faster. For large word lengths (up to 128 bits), it...
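The carry-increment idea can be sketched in software; this is an illustrative bit-level model (parameters and names are ours, not the paper's cell-based design): each block is summed exactly once with block-carry-in 0, then incremented if the incoming block carry turns out to be 1.

```python
def carry_increment_add(a, b, width=16, block=4):
    """One-level carry-increment addition sketch for width-bit words
    split into fixed-size blocks; the result wraps modulo 2**width."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for i in range(0, width, block):
        s0 = ((a >> i) & mask) + ((b >> i) & mask)  # block sum, carry-in 0
        s = s0 + carry                              # increment step
        result |= (s & mask) << i
        carry = s >> block                          # block carry-out
    return result

print(carry_increment_add(40000, 23000))  # → 63000
```

In hardware the per-block sums are computed in parallel and only the cheap increment depends on the rippling block carries, which is where the area saving over carry-select comes from.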
Computing Programs Containing Band Linear Recurrences on Vector Supercomputers
IEEE Trans. on Parallel and Distributed Systems, 1992
"... Many largescale scientific and engineering computations, e.g., some of the Grand Challenge problems [1], spend a major portion of execution time in their core loops computing band linear recurrences (BLR's). Conventional compiler parallelization techniques [4] cannot generate scalable parallel code ..."
Abstract

Cited by 4 (2 self)
 Add to MetaCart
Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems [1], spend a major portion of execution time in their core loops computing band linear recurrences (BLRs). Conventional compiler parallelization techniques [4] cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCDs) in programs, and there is a limited amount of parallelism in a BLR with respect to LCDs. For many applications, using library routines to replace the core BLR requires separating the BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm, called the Regular Schedule, for parallel evaluation of BLRs. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput in implementing the schedule on vector supercomputers. We also illustrate our approach, based on our Regular Schedule, to parallel...
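Why a linear recurrence is parallelizable at all, despite its loop-carried dependence, can be seen in the first-order case: each step x_i = a_i·x_{i-1} + b_i is an affine map, and affine maps compose associatively, so a scan over the maps works. The sketch below (our own illustration of this standard idea, not the paper's Regular Schedule, which handles general band recurrences) evaluates such a recurrence that way; the loop is sequential here, but associativity of `compose` is what lets a parallel schedule replace it.

```python
def compose(f, g):
    """Compose affine maps given as (a, b) meaning x -> a*x + b:
    applying f then g yields x -> g_a*(f_a*x + f_b) + g_b."""
    (fa, fb), (ga, gb) = f, g
    return (ga * fa, ga * fb + gb)

def first_order_recurrence(a, b, x0):
    """Evaluate x_i = a[i]*x_{i-1} + b[i] via a scan of affine maps."""
    maps, acc = [], (1, 0)               # start from the identity map
    for ai, bi in zip(a, b):
        acc = compose(acc, (ai, bi))     # prefix composition (a scan)
        maps.append(acc)
    return [ma * x0 + mb for ma, mb in maps]

print(first_order_recurrence([2, 2, 2], [1, 1, 1], 0))  # → [1, 3, 7]
```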
Online adaptive parallel prefix computation
"... Abstract. We consider parallel prefix computation on processors of different and possibly changing speeds. Extending previous works on identical processors, we provide a lower bound for this problem. We introduce a new adaptive algorithm which is based on the online recursive coupling of an optimal ..."
Abstract

Cited by 3 (2 self)
 Add to MetaCart
We consider parallel prefix computation on processors of different and possibly changing speeds. Extending previous work on identical processors, we provide a lower bound for this problem. We introduce a new adaptive algorithm based on the online recursive coupling of an optimal sequential algorithm and a parallel one that is non-optimal but recursive and fine-grained. The coupling relies on work-stealing scheduling. Its theoretical performance is analysed on p processors of different and changing speeds, and is close to the lower bound both on identical processors and on processors of changing speeds. Experiments performed on an eight-processor machine confirm this theoretical result.