Prefix Sums and Their Applications
Cited by 128 (2 self)
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel
Fast parallel circuits for the quantum Fourier transform
 PROCEEDINGS 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS’00)
, 2000
Cited by 70 (1 self)
We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2^n with error bounded by ε. Thus, even for exponentially small error, our circuits have depth O(log n). The best previous depth bound was O(n), even for approximations with constant error. Moreover, our circuits have size O(n log(n/ε)). We also give an upper bound of O(n (log n)^2 log log n) on the circuit size of the exact QFT modulo 2^n, for which the best previous bound was O(n^2). As an application of the above depth bound, we show that Shor’s factoring algorithm may be based on quantum circuits with depth only O(log n) and polynomial size, in combination with classical polynomial-time pre- and post-processing. In the language of computational complexity, this implies that factoring is in the complexity class ZPP^BQNC, where BQNC is the class of problems computable with bounded-error probability by quantum circuits with polylogarithmic depth and polynomial size. Finally, we prove an Ω(log n) lower bound on the depth complexity of approximations of the
A Family of Adders
 In Proceedings of 14th IEEE Symposium on Computer Arithmetic
, 1999
Cited by 58 (0 self)
Binary carry-propagating addition can be efficiently expressed as a prefix computation. Several examples of adders based on such a formulation have been published, and efficient implementations are numerous. Chief among the known constructions are those of Kogge & Stone and Ladner & Fischer. In this work we show that these are end cases of a large family of addition structures, all of which share the attractive property of minimum logical depth. The intermediate structures allow trade-offs between the amount of internal wiring and the fanout of intermediate nodes, and can thus usually achieve a more attractive combination of speed and area/power cost than either of the known end cases. Rules for the construction of such adders are given, as are examples of realistic 32-bit designs implemented in an industrial 0.25 µm CMOS process. 1. Introduction There are many ways of formulating the process of binary addition. Each different way provides different insight and thus suggests different impl...
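As an illustration of the prefix formulation the abstract mentions, the following is a minimal software sketch of addition through generate/propagate signals combined in a Kogge-Stone-style doubling pattern. It is not taken from the paper: the function name `prefix_add`, the `width` parameter, and the list-based bit representation are assumptions made for this example, and it models only the logic, not wiring or fanout.

```python
def prefix_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned integers modulo 2**width via a prefix carry network.

    Hypothetical helper for illustration only (name and signature are assumed).
    """
    # Per-bit generate (both inputs 1) and propagate (exactly one input 1).
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Prefix combine with the associative operator
    #   (G, P) o (G', P') = (G | (P & G'), P & P')
    # applied at distances 1, 2, 4, ... (Kogge-Stone-style doubling).
    G, P = g[:], p[:]
    d = 1
    while d < width:
        new_G, new_P = G[:], P[:]
        for i in range(d, width):
            new_G[i] = G[i] | (P[i] & G[i - d])
            new_P[i] = P[i] & P[i - d]
        G, P = new_G, new_P
        d *= 2

    # G[i] is now the carry out of bit i (carry-in to bit 0 is 0),
    # so the carry into bit i is G[i - 1].
    result = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        result |= (p[i] ^ carry_in) << i
    return result
```

The loop runs about log2(width) rounds, which corresponds to the logarithmic logical depth of prefix adders; a hardware design would evaluate each round's combinations in parallel.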
Scan Primitives for Vector Computers
 In Proceedings Supercomputing '90
, 1990
Cited by 43 (9 self)
This paper describes an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n_1/2) for each set. The paper describes a radix sorting routine based on the scans that is 13 times faster ...
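To make the terms concrete, here is a small sequential reference sketch of a plus-scan and its segmented variant. It only illustrates what the primitives compute and does not reproduce the vectorized implementation the abstract describes. The function names, the exclusive-scan convention (each output excludes its own input), and the use of start-of-segment flags are assumptions chosen for this example.

```python
from typing import List, Sequence


def plus_scan(xs: Sequence[int]) -> List[int]:
    """Exclusive plus-scan: out[i] is the sum of xs[0..i-1] (assumed convention)."""
    out = []
    total = 0
    for x in xs:
        out.append(total)
        total += x
    return out


def segmented_plus_scan(xs: Sequence[int], flags: Sequence[int]) -> List[int]:
    """Exclusive plus-scan restarted wherever flags[i] is truthy.

    flags[i] marks the first element of a segment (assumed representation).
    """
    out = []
    total = 0
    for x, f in zip(xs, flags):
        if f:
            total = 0  # a new segment begins: reset the running sum
        out.append(total)
        total += x
    return out
```

For instance, `segmented_plus_scan([5, 1, 3, 4, 3, 9, 2, 6], [1, 0, 1, 0, 0, 0, 1, 0])` scans three independent segments in one pass, which is the property that lets several data sets share a single scan.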
Communication-Efficient Parallel Algorithms for Distributed Random-Access Machines
 Algorithmica
, 1988
Cited by 38 (2 self)
This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to "shortcut" pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n-node trees, such as computing tree-walk numberings, finding the separator of a tree, and evaluating all subexpressions ...
K.Y.S.: The strict time lower bound and optimal schedules for parallel prefix with resource constraints
 IEEE Trans. Comput
, 1996
Optimal Carry Save Networks
Cited by 11 (0 self)
A general theory is developed for constructing the asymptotically shallowest networks and the asymptotically smallest networks (with respect to formula size) for the carry save addition of n numbers using any given basic carry save adder as a building block. Using these optimal carry save addition networks the shallowest known multiplication circuits and the shortest formulae for the majority function (and many other symmetric Boolean functions) are obtained. In this paper, simple basic carry save adders are described, using which multiplication circuits of depth 3.71 log n (the result of which is given as the sum of two numbers) and majority formulae of size O(n^3.21) are constructed. Using more complicated basic carry save adders, not described here, these results could be further improved. Our best bounds are currently 3.57 log n for depth and O(n^3.13) for formula size. 1. Introduction The question `How fast can we multiply?' is one of the fundamental questions...
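For readers unfamiliar with the building block, the sketch below shows a basic 3-to-2 carry save adder on non-negative integers and a naive layered reduction of n numbers to two. This is the textbook layering, not the optimal networks constructed in the paper; the names `carry_save_add` and `csa_reduce` are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def carry_save_add(x: int, y: int, z: int) -> Tuple[int, int]:
    """3-to-2 carry save adder: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                                # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1       # majority bits, shifted into place
    return s, c


def csa_reduce(nums: Iterable[int]) -> List[int]:
    """Reduce non-negative integers to at most two numbers with the same total.

    Naive layer-by-layer grouping in threes (an assumed, non-optimal schedule).
    """
    nums = list(nums)
    while len(nums) > 2:
        nxt: List[int] = []
        # Feed each complete group of three through one carry save adder.
        for i in range(0, len(nums) - 2, 3):
            nxt.extend(carry_save_add(nums[i], nums[i + 1], nums[i + 2]))
        # Pass through the one or two numbers that did not fill a group.
        rem = len(nums) % 3
        if rem:
            nxt.extend(nums[-rem:])
        nums = nxt
    return nums
```

The result is left as two numbers, matching the abstract's remark that the multiplication circuit's output is given as the sum of two numbers; a final carry-propagating addition would produce the single total.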
Maximally and Arbitrarily Fast Implementation of Linear and Feedback Linear Computations
, 2000
Cited by 6 (2 self)
By establishing a relationship between the basic properties of linear computations and eight optimizing transformations (distributivity, associativity, commutativity, inverse and zero element law, common subexpression replication and elimination, constant propagation), a computer-aided design platform is developed to optimally speed up an arbitrary instance from this large class of computations with respect to those transformations. Furthermore, arbitrarily fast implementation of an arbitrary linear computation is obtained by adding loop unrolling to the transformation set. During this process, a novel Horner pipelining scheme is used so that the area-time (AT) product is maintained constant, regardless of achieved speedup. We also present a generalization of the new approach so that an important subclass of nonlinear computations, named feedback linear computations, is efficiently, maximally, and arbitrarily sped up.