Results 11 - 20
of
191
Collection-Oriented Languages
- PROCEEDINGS OF THE IEEE
, 1991
"... Several programming languages arising from widely diverse practical and theoretical considerations share a common high-level feature: their basic data type is an aggregate of other more primitive data types and their primitive functions operate on these aggregates. Examples of such languages (and th ..."
Abstract
-
Cited by 49 (5 self)
- Add to MetaCart
Several programming languages arising from widely diverse practical and theoretical considerations share a common high-level feature: their basic data type is an aggregate of other more primitive data types and their primitive functions operate on these aggregates. Examples of such languages (and the collections they support) are FORTRAN 90 (arrays), APL (arrays), Connection Machine LISP (xectors), PARALATION LISP (paralations), and SETL (sets). Acting on large collections of data with a single operation is the hallmark of data-parallel programming and massively parallel computers. These languages --- which we call collection-oriented --- are thus ideal for use with massively parallel machines, even though many of them were developed before parallelism and associated considerations became important. This paper examines collections and the operations that can be performed on them in a language-independent manner. It also critically reviews and compares a variety of collection-oriented languages...
Randomized Routing on Fat-Trees
- Advances in Computing Research
, 1996
"... Fat-trees are a class of routing networks for hardware-efficient parallel computation. This paper presents a randomized algorithm for routing messages on a fat-tree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the ..."
Abstract
-
Cited by 47 (10 self)
- Add to MetaCart
Fat-trees are a class of routing networks for hardware-efficient parallel computation. This paper presents a randomized algorithm for routing messages on a fat-tree. The quality of the algorithm is measured in terms of the load factor of a set of messages to be routed, which is a lower bound on the time required to deliver the messages. We show that if a set of messages has load factor on a fat-tree with n processors, the number of delivery cycles (routing attempts) that the algorithm requires is O(+lg n lg lg n) with probability 1 \Gamma O(1=n). The best previous bound was O( lg n) for the off-line problem in which the set of messages is known in advance. In the context of a VLSI model that equates hardware cost with physical volume, the routing algorithm can be used to demonstrate that fat-trees are universal routing networks. Specifically, we prove that any routing network can be efficiently simulated by a fat-tree of comparable hardware cost. 1 Introduction Fat-trees constitute...
Planar Separators and Parallel Polygon Triangulation
, 1992
"... We show how to construct an O( p n)-separator decomposition of a planar graph G in O(n) time. Such a decomposition defines a binary tree where each node corresponds to a subgraph of G and stores an O( p n)-separator of that subgraph. We also show how to construct an O(n ffl )-way decomposition tree ..."
Abstract
-
Cited by 46 (7 self)
- Add to MetaCart
We show how to construct an O( p n)-separator decomposition of a planar graph G in O(n) time. Such a decomposition defines a binary tree where each node corresponds to a subgraph of G and stores an O( p n)-separator of that subgraph. We also show how to construct an O(n ffl )-way decomposition tree in parallel in O(log n) time so that each node corresponds to a subgraph of G and stores an O(n 1=2+ffl )-separator of that subgraph. We demonstrate the utility of such a separator decomposition by showing how it can be used in the design of a parallel algorithm for triangulating a simple polygon deterministically in O(log n) time using O(n= log n) processors on a CRCW PRAM. Keywords: Computational geometry, algorithmic graph theory, planar graphs, planar separators, polygon triangulation, parallel algorithms, PRAM model. 1 Introduction Let G = (V; E) be an n-node graph. An f(n)-separator is an f(n)-sized subset of V whose removal disconnects G into two subgraphs G 1 and G 2 each...
Fast parallel circuits for the quantum Fourier transform
- PROCEEDINGS 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE (FOCS’00)
, 2000
"... We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2 n with error bounded by ε. Thus, even for exponentially small error, our ..."
Abstract
-
Cited by 45 (2 self)
- Add to MetaCart
We give new bounds on the circuit complexity of the quantum Fourier transform (QFT). We give an upper bound of O(log n + log log(1/ε)) on the circuit depth for computing an approximation of the QFT with respect to the modulus 2 n with error bounded by ε. Thus, even for exponentially small error, our circuits have depth O(log n). The best previous depth bound was O(n), even for approximations with constant error. Moreover, our circuits have size O(n log(n/ε)). We also give an upper bound of O(n(log n) 2 log log n) on the circuit size of the exact QFT modulo 2 n, for which the best previous bound was O(n 2). As an application of the above depth bound, we show that Shor’s factoring algorithm may be based on quantum circuits with depth only O(log n) and polynomial-size, in combination with classical polynomial-time pre- and post-processing. In the language of computational complexity, this implies that factoring is in the complexity class ZPP BQNC, where BQNC is the class of problems computable with bounded-error probability by quantum circuits with polylogarithmic depth and polynomial size. Finally, we prove an Ω(log n) lower bound on the depth complexity of approximations of the
A new notation for arrows
- In International Conference on Functional Programming (ICFP ’01
, 2001
"... The categorical notion of monad, used by Moggi to structure denotational descriptions, has proved to be a powerful tool for structuring combinator libraries. Moreover, the monadic programming style provides a convenient syntax for many kinds of computation, so that each library defines a new sublang ..."
Abstract
-
Cited by 40 (1 self)
- Add to MetaCart
The categorical notion of monad, used by Moggi to structure denotational descriptions, has proved to be a powerful tool for structuring combinator libraries. Moreover, the monadic programming style provides a convenient syntax for many kinds of computation, so that each library defines a new sublanguage. Recently, several workers have proposed a generalization of monads, called variously “arrows ” or Freyd-categories. The extra generality promises to increase the power, expressiveness and efficiency of the embedded approach, but does not mesh as well with the native abstraction and application. Definitions are typically given in a point-free style, which is useful for proving general properties, but can be awkward for programming specific instances. In this paper we define a simple extension to the functional language Haskell that makes these new notions of computation more convenient to use. Our language is similar to the monadic style, and has similar reasoning properties. Moreover, it is extensible, in the sense that new combining forms can be defined as expressions in the host language. 1.
Radix Sort For Vector Multiprocessors
- In Proceedings Supercomputing '91
, 1991
"... We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve a ..."
Abstract
-
Cited by 39 (6 self)
- Add to MetaCart
We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second. Our sorting algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2. To develop our version we introduce three general techniques for mapping data-parallel algorithms ontovector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the Y-MP. These equations are then used t...
Parallel Ear Decomposition Search (EDS) And ST-Numbering In Graphs
, 1986
"... [LEC-67] linear time serial algorithm for testing planarity of graphs uses the linear time serial algorithm of [ET-76] for st-numbering. This st-numbering algorithm is based on depth-first search (DFS). A known conjecture states that DFS, which is a key technique in designing serial algorithms, is n ..."
Abstract
-
Cited by 38 (2 self)
- Add to MetaCart
[LEC-67] linear time serial algorithm for testing planarity of graphs uses the linear time serial algorithm of [ET-76] for st-numbering. This st-numbering algorithm is based on depth-first search (DFS). A known conjecture states that DFS, which is a key technique in designing serial algorithms, is not amenable to poly-log time parallelism using "around linearly" (or even polynomially) many processors. The first contribution of this paper is a general method for searching efficiently in parallel undirected graphs, called ear-decomposition search (EDS). The second contribution demonstrates the applicability of this search method. We present an efficient parallel algorithm for st-numbering in a biconnected graph. The algorithm runs in logarithmic time using a linear number of processors on a concurrentread concurrent-write (CRCW) PRAM. An efficient parallel algorithm for the problem did not exist before. The problem was not even known to be in NC. 1. Introduction We define the problems ...
Optimal Doubly Logarithmic Parallel Algorithms Based On Finding All Nearest Smaller Values
, 1993
"... The all nearest smaller values problem is defined as follows. Let A = (a 1 ; a 2 ; : : : ; an ) be n elements drawn from a totally ordered domain. For each a i , 1 i n, find the two nearest elements in A that are smaller than a i (if such exist): the left nearest smaller element a j (with j ! i) a ..."
Abstract
-
Cited by 36 (7 self)
- Add to MetaCart
The all nearest smaller values problem is defined as follows. Let A = (a 1 ; a 2 ; : : : ; an ) be n elements drawn from a totally ordered domain. For each a i , 1 i n, find the two nearest elements in A that are smaller than a i (if such exist): the left nearest smaller element a j (with j ! i) and the right nearest smaller element a k (with k ? i). We give an O(log log n) time optimal parallel algorithm for the problem on a CRCW PRAM. We apply this algorithm to achieve optimal O(log log n) time parallel algorithms for four problems: (i) Triangulating a monotone polygon, (ii) Preprocessing for answering range minimum queries in constant time, (iii) Reconstructing a binary tree from its inorder and either preorder or postorder numberings, (vi) Matching a legal sequence of parentheses. We also show that any optimal CRCW PRAM algorithm for the triangulation problem requires \Omega\Gammauir log n) time. Dept. of Computing, King's College London, The Strand, London WC2R 2LS, England. ...
Communication-Efficient Parallel Algorithms for Distributed Random-Access Machines
- Algorithmica
, 1988
"... This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages ..."
Abstract
-
Cited by 34 (1 self)
- Add to MetaCart
This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to "shortcut" pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n- node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions ...
Scan Primitives for Vector Computers
- In Proceedings Supercomputing '90
, 1990
"... This paper describes an optimized implementation of a set of scan (also called allprefix -sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. ..."
Abstract
-
Cited by 34 (9 self)
- Add to MetaCart
This paper describes an optimized implementation of a set of scan (also called allprefix -sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that a set of segmented versions of these scans are only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n 1=2 ) for each set. The paper describes a radix sorting routine based on the scans that is 13 times faster ...

