Results 11  20
of
208
CollectionOriented Languages
 PROCEEDINGS OF THE IEEE
, 1991
"... Several programming languages arising from widely diverse practical and theoretical considerations share a common highlevel feature: their basic data type is an aggregate of other more primitive data types and their primitive functions operate on these aggregates. Examples of such languages (and th ..."
Abstract

Cited by 52 (5 self)
 Add to MetaCart
Several programming languages arising from widely diverse practical and theoretical considerations share a common highlevel feature: their basic data type is an aggregate of other more primitive data types and their primitive functions operate on these aggregates. Examples of such languages (and the collections they support) are FORTRAN 90 (arrays), APL (arrays), Connection Machine LISP (xectors), PARALATION LISP (paralations), and SETL (sets). Acting on large collections of data with a single operation is the hallmark of dataparallel programming and massively parallel computers. These languages  which we call collectionoriented  are thus ideal for use with massively parallel machines, even though many of them were developed before parallelism and associated considerations became important. This paper examines collections and the operations that can be performed on them in a languageindependent manner. It also critically reviews and compares a variety of collectionoriented languages...
Randomized Routing on FatTrees
 Advances in Computing Research
, 1996
"... Fattrees are a class of routing networks for hardwareefficient parallel computation. This paper presents a randomized algorithm for routing messages on a fattree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the ..."
Abstract

Cited by 51 (11 self)
 Add to MetaCart
Fattrees are a class of routing networks for hardwareefficient parallel computation. This paper presents a randomized algorithm for routing messages on a fattree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the time required to deliver the messages. We show that if a set of messages has load factor on a fattree with n processors, the number of delivery cycles (routing attempts) that the algorithm requires is O(+lg n lg lg n) with probability 1 \Gamma O(1=n). The best previous bound was O( lg n) for the offline problem in which the set of messages is known in advance. In the context of a VLSI model that equates hardware cost with physical volume, the routing algorithm can be used to demonstrate that fattrees are universal routing networks. Specifically, we prove that any routing network can be efficiently simulated by a fattree of comparable hardware cost. 1 Introduction Fattrees constitute...
Fast parallel circuits for the quantum Fourier transform
 PROCEEDINGS 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS’00)
, 2000
"... We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2 n with error bounded by ε. Thus, even for exponentially small error, our ..."
Abstract

Cited by 51 (2 self)
 Add to MetaCart
We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2 n with error bounded by ε. Thus, even for exponentially small error, our circuits have depth O(log n). The best previous depth bound was O(n), even for approximations with constant error. Moreover, our circuits have size O(n log(n/ε)). We also give an upper bound of O(n(log n) 2 log log n) on the circuit size of the exact QFT modulo 2 n, for which the best previous bound was O(n 2). As an application of the above depth bound, we show that Shor’s factoring algorithm may be based on quantum circuits with depth only O(log n) and polynomialsize, in combination with classical polynomialtime pre and postprocessing. In the language of computational complexity, this implies that factoring is in the complexity class ZPP BQNC, where BQNC is the class of problems computable with boundederror probability by quantum circuits with polylogarithmic depth and polynomial size. Finally, we prove an Ω(log n) lower bound on the depth complexity of approximations of the
A new notation for arrows
 In International Conference on Functional Programming (ICFP ’01
, 2001
"... The categorical notion of monad, used by Moggi to structure denotational descriptions, has proved to be a powerful tool for structuring combinator libraries. Moreover, the monadic programming style provides a convenient syntax for many kinds of computation, so that each library defines a new sublang ..."
Abstract

Cited by 48 (1 self)
 Add to MetaCart
The categorical notion of monad, used by Moggi to structure denotational descriptions, has proved to be a powerful tool for structuring combinator libraries. Moreover, the monadic programming style provides a convenient syntax for many kinds of computation, so that each library defines a new sublanguage. Recently, several workers have proposed a generalization of monads, called variously “arrows ” or Freydcategories. The extra generality promises to increase the power, expressiveness and efficiency of the embedded approach, but does not mesh as well with the native abstraction and application. Definitions are typically given in a pointfree style, which is useful for proving general properties, but can be awkward for programming specific instances. In this paper we define a simple extension to the functional language Haskell that makes these new notions of computation more convenient to use. Our language is similar to the monadic style, and has similar reasoning properties. Moreover, it is extensible, in the sense that new combining forms can be defined as expressions in the host language. 1.
Removing Randomness in Parallel Computation Without a Processor Penalty
 Journal of Computer and System Sciences
, 1988
"... We develop some general techniques for converting randomized parallel algorithms into deterministic parallel algorithms without a blowup in the number of processors. One of the requirements for the application of these techniques is that the analysis of the randomized algorithm uses only pairwise in ..."
Abstract

Cited by 48 (1 self)
 Add to MetaCart
We develop some general techniques for converting randomized parallel algorithms into deterministic parallel algorithms without a blowup in the number of processors. One of the requirements for the application of these techniques is that the analysis of the randomized algorithm uses only pairwise independence. Our main new result is a parallel algorithm for coloring the vertices of an undirected graph using at most \Delta + 1 distinct colors in such a way that no two adjacent vertices receive the same color, where \Delta is the maximum degree of any vertex in the graph. The running time of the algorithm is O(log 3 n log log n) using a linear number of processors on a concurrent read, exclusive write (CREW) parallel random access machine (PRAM). 1 Our techniques also apply to several other problems, including the maximal independent set problem and the maximal matching problem. The application of the general technique to these last two problems is mostly of academic interest because...
Flattened butterfly: A costefficient topology for highradix networks
 in Proc. of the Intl. Symp. on Computer Architecture
, 2007
"... Increasing integratedcircuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a costefficient topology for highradix networks. On benign (loadbalanced) traffic, the flattened b ..."
Abstract

Cited by 46 (8 self)
 Add to MetaCart
Increasing integratedcircuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a costefficient topology for highradix networks. On benign (loadbalanced) traffic, the flattened butterfly approaches the cost/performance of a butterfly network and has roughly half the cost of a comparable performance Clos network. The advantage over the Clos is achieved by eliminating redundant hops when they are not needed for load balance. On adversarial traffic, the flattened butterfly matches the cost/performance of a foldedClos network and provides an order of magnitude better performance than a conventional butterfly. In this case, global adaptive routing is used to switch the flattened butterfly from minimal to nonminimal routing — using redundant hops only when they are needed. Minimal and nonminimal, oblivious and adaptive routing algorithms are evaluated on the flattened butterfly. We show that loadbalancing adversarial traffic requires nonminimal globallyadaptive routing and show that sequential allocators are required to avoid transient load imbalance when using adaptive routing algorithms. We also compare the cost of the flattened butterfly to foldedClos, hypercube, and butterfly networks with identical capacity and show that the flattened butterfly is more costefficient than foldedClos and hypercube topologies.
Radix Sort For Vector Multiprocessors
 In Proceedings Supercomputing '91
, 1991
"... We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY YMP. On one processor of the YMP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve a ..."
Abstract

Cited by 43 (6 self)
 Add to MetaCart
We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY YMP. On one processor of the YMP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64bit keys per second. Our sorting algorithm is adapted from a dataparallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM2. To develop our version we introduce three general techniques for mapping dataparallel algorithms ontovector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the YMP. These equations are then used t...
Parallel Ear Decomposition Search (EDS) And STNumbering In Graphs
, 1986
"... [LEC67] linear time serial algorithm for testing planarity of graphs uses the linear time serial algorithm of [ET76] for stnumbering. This stnumbering algorithm is based on depthfirst search (DFS). A known conjecture states that DFS, which is a key technique in designing serial algorithms, is n ..."
Abstract

Cited by 41 (2 self)
 Add to MetaCart
[LEC67] linear time serial algorithm for testing planarity of graphs uses the linear time serial algorithm of [ET76] for stnumbering. This stnumbering algorithm is based on depthfirst search (DFS). A known conjecture states that DFS, which is a key technique in designing serial algorithms, is not amenable to polylog time parallelism using "around linearly" (or even polynomially) many processors. The first contribution of this paper is a general method for searching efficiently in parallel undirected graphs, called eardecomposition search (EDS). The second contribution demonstrates the applicability of this search method. We present an efficient parallel algorithm for stnumbering in a biconnected graph. The algorithm runs in logarithmic time using a linear number of processors on a concurrentread concurrentwrite (CRCW) PRAM. An efficient parallel algorithm for the problem did not exist before. The problem was not even known to be in NC. 1. Introduction We define the problems ...
Scan Primitives for Vector Computers
 In Proceedings Supercomputing '90
, 1990
"... This paper describes an optimized implementation of a set of scan (also called allprefix sums) primitives on a single processor of a CRAY YMP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. ..."
Abstract

Cited by 39 (9 self)
 Add to MetaCart
This paper describes an optimized implementation of a set of scan (also called allprefix sums) primitives on a single processor of a CRAY YMP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any registerbased vector computer. On the CRAY YMP, the asymptotic running time of the plusscan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n 1=2 ) for each set. The paper describes a radix sorting routine based on the scans that is 13 times faster ...
CommunicationEfficient Parallel Algorithms for Distributed RandomAccess Machines
 Algorithmica
, 1988
"... This paper introduces a model for parallel computation, called the distributed randomaccess machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages ..."
Abstract

Cited by 38 (2 self)
 Add to MetaCart
This paper introduces a model for parallel computation, called the distributed randomaccess machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to "shortcut" pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)step, linearprocessor, linearspace, conservative algorithms for a variety of problems on n node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions ...