Programming Parallel Algorithms
, 1996
Cited by 193 (9 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
Simple linear work suffix array construction
, 2003
Cited by 152 (6 self)
Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space–time tradeoff. For any v ∈ [1, √n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
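To make the index structure itself concrete, here is a minimal Python sketch of what a suffix array contains. It is a naive construction that sorts all suffixes directly, which is far slower than linear time; it is not the DC3 algorithm from the abstract, and the function name `naive_suffix_array` is only an illustrative choice.

```python
def naive_suffix_array(text):
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive illustration only: comparing whole suffixes costs up to O(n) per
    comparison, so this is O(n^2 log n) in the worst case, not linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order:
    # a(5), ana(3), anana(1), banana(0), na(4), nana(2)
    print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

A linear-time algorithm such as DC3 would produce the same array for the same input; only the construction cost differs.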
Geometric Pattern Matching under Euclidean Motion
, 1993
Cited by 72 (2 self)
Given two planar sets A and B, we examine the problem of determining the smallest ε such that there is a Euclidean motion (rotation and translation) of A that brings each member of A within distance ε of some member of B. We establish upper bounds on the combinatorial complexity of this subproblem in model-based computer vision, when the sets A and B contain points, line segments, or (filled-in) polygons. We also show how to use our methods to substantially improve on existing algorithms for finding the minimum Hausdorff distance under Euclidean motion.
1 Author's address: Department of Computer Science, Cornell University, Ithaca, NY 14853. This work was supported by the Advanced Research Projects Agency of the Department of Defense under ONR Contract N0001492J1989, and by ONR Contract N0001492J1839, NSF Contract IRI9006137, and AFOSR Contract AFOSR910328.
2 Author's address: Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218. This work was suppo...
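For a fixed placement of A (no rotation or translation applied), the quantity described above is the directed Hausdorff distance from A to B. The following Python sketch computes it by brute force for finite point sets; it does not search over Euclidean motions, which is the hard part addressed by the paper, and the function name is an assumption made for illustration.

```python
import math


def directed_hausdorff(a_points, b_points):
    """Smallest eps such that every point of A is within eps of some point of B.

    Brute force over all pairs: O(|A| * |B|) distance evaluations.
    Both arguments are non-empty sequences of (x, y) tuples.
    """
    return max(
        min(math.dist(a, b) for b in b_points)  # nearest neighbour of a in B
        for a in a_points                       # worst case over all of A
    )


if __name__ == "__main__":
    A = [(0.0, 0.0), (1.0, 0.0)]
    B = [(0.0, 1.0)]
    print(directed_hausdorff(A, B))  # sqrt(2), from the point (1, 0)
```

Minimizing this value over all rotations and translations of A gives the matching problem studied in the abstract.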
Parallel Symmetry-Breaking in Sparse Graphs
 SIAM J. Disc. Math
, 1987
Cited by 71 (2 self)
We describe efficient deterministic techniques for breaking symmetry in parallel. These techniques work well on rooted trees and graphs of constant degree or genus. Our primary technique allows us to 3-color a rooted tree in O(lg* n) time on an EREW PRAM using a linear number of processors. We use these techniques to construct fast linear processor algorithms for several problems, including (Δ + 1)-coloring constant-degree graphs and 5-coloring planar graphs. We also prove lower bounds for 2-coloring directed lists and for finding maximal independent sets in arbitrary graphs.
1 Introduction
Some problems for which trivial sequential algorithms exist appear to be much harder to solve in a parallel framework. When converting a sequential algorithm to a parallel one, at each step of the parallel algorithm we have to choose a set of operations which may be executed in parallel. Often, we have to choose these operations from a large set
A preliminary version of this paper appear...
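One standard way to break symmetry deterministically on a rooted tree is bit-comparison color reduction (often called deterministic coin tossing). The Python sketch below simulates the rounds sequentially and stops at a valid coloring with at most 6 colors; the further reduction to 3 colors is omitted. This is an illustrative sketch and not necessarily the exact procedure of the paper. The input convention (`parent[v] == v` marks the root) and the function name are assumptions.

```python
def six_color_rooted_tree(parent):
    """Color a rooted tree with colors 0..5 so that no node matches its parent.

    parent[v] is the parent of node v; the root satisfies parent[v] == v.
    Each round, every node compares its color with its parent's color, finds
    the lowest bit position i where they differ, and takes the new color
    2 * i + (its own bit at position i). All nodes could do this in parallel.
    """
    colors = list(range(len(parent)))  # distinct ids form a valid start
    while colors and max(colors) >= 6:
        new_colors = []
        for v, c in enumerate(colors):
            p = parent[v]
            if p == v:
                # Root has no parent: use bit position 0.
                new_colors.append(c & 1)
            else:
                diff = c ^ colors[p]                    # nonzero: colors differ
                i = (diff & -diff).bit_length() - 1     # lowest differing bit
                new_colors.append(2 * i + ((c >> i) & 1))
        colors = new_colors
    return colors


if __name__ == "__main__":
    path = [0] + list(range(0, 19))  # a path: parent[v] = v - 1
    print(six_color_rooted_tree(path))
```

Each round shrinks the number of bits needed per color roughly logarithmically, which is where the very slowly growing lg* n round count comes from.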
Communication-Efficient Parallel Sorting
, 1996
Cited by 64 (2 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1−1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
Designing Efficient Sorting Algorithms for Manycore GPUs
, 2009
Cited by 63 (4 self)
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
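The role of scan in a radix sort can be illustrated with a one-bit "split" step: an exclusive prefix sum over the zero-bit flags gives every key its output position. The Python sketch below runs sequentially and is only meant to show the data flow; it is not the CUDA implementation described in the abstract, it assumes non-negative integer keys, and all function names are hypothetical.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def split_by_bit(keys, bit):
    """Stable partition: keys with a 0 in `bit` first, then keys with a 1."""
    is_zero = [1 - ((k >> bit) & 1) for k in keys]
    zero_pos = exclusive_scan(is_zero)   # destination of each zero-bit key
    total_zeros = sum(is_zero)
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        if is_zero[i]:
            out[zero_pos[i]] = k
        else:
            # i - zero_pos[i] counts the one-bit keys that precede position i
            out[total_zeros + (i - zero_pos[i])] = k
    return out


def radix_sort(keys, bits=32):
    """Least-significant-bit-first radix sort built from repeated splits."""
    for bit in range(bits):
        keys = split_by_bit(keys, bit)
    return keys


if __name__ == "__main__":
    print(radix_sort([5, 3, 8, 1, 0, 7], bits=4))  # [0, 1, 3, 5, 7, 8]
```

On a GPU the scan and the scatter loop would each run in parallel across threads, and practical implementations typically process several bits per pass rather than one.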
Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms
 SIAM JOURNAL ON COMPUTING
, 1989
Cited by 62 (12 self)
We assume a parallel RAM model which allows both concurrent reads and concurrent writes of a global memory. Our main result is an optimal randomized parallel algorithm for INTEGER SORT (i.e., for sorting n integers in the range [1, n]). Our algorithm costs only logarithmic time and is the first known that is optimal: the product of its time and processor bounds is upper bounded by a linear function of the input size. We also give a deterministic sublogarithmic time algorithm for prefix sum. In addition we present a sublogarithmic time algorithm for obtaining a random permutation of n elements in parallel. And finally, we present sublogarithmic time algorithms for GENERAL SORT and INTEGER SORT. Our sublogarithmic GENERAL SORT algorithm is also optimal.
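As a baseline for what prefix sum computes, the sketch below simulates the textbook doubling scheme, which needs about log2(n) synchronous rounds. It is not the sublogarithmic algorithm claimed in the abstract; it only shows the problem and the simpler logarithmic-round approach that such results improve on. The function name is an assumption.

```python
def inclusive_scan_by_doubling(values):
    """Inclusive prefix sums computed in ceil(log2(n)) simulated rounds.

    In each round every position i >= step adds the value held at i - step.
    All positions read the previous round's array, so a round could be
    executed by n processors in parallel.
    """
    out = list(values)
    step = 1
    while step < len(out):
        out = [
            out[i] + out[i - step] if i >= step else out[i]
            for i in range(len(out))
        ]
        step *= 2
    return out


if __name__ == "__main__":
    print(inclusive_scan_by_doubling([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```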
Parallel Construction of Quadtrees and Quality Triangulations
, 1999
Cited by 60 (7 self)
We describe efficient PRAM algorithms for constructing unbalanced quadtrees, balanced quadtrees, and quadtree-based finite element meshes. Our algorithms take time O(log n) for point set input and O(log n log k) time for planar straight-line graphs, using O(n + k/ log n) processors, where n measures input size and k output size.
1. Introduction
A crucial preprocessing step for the finite element method is mesh generation, and the most general and versatile type of two-dimensional mesh is an unstructured triangular mesh. Such a mesh is simply a triangulation of the input domain (e.g., a polygon), along with some extra vertices, called Steiner points. Not all triangulations, however, serve equally well; numerical and discretization error depend on the quality of the triangulation, meaning the shapes and sizes of triangles. A typical quality guarantee gives a lower bound on the minimum angle in the triangulation. Baker et al. 1 first proved the existence of quality triangulations fo...
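For readers unfamiliar with the data structure, the following Python sketch builds an unbalanced point quadtree sequentially by recursive subdivision. It is a plain top-down construction, not the PRAM algorithm of the paper, and it performs no balancing or mesh generation. The dictionary layout, the `max_depth` cutoff (added so that duplicate points cannot recurse forever), and the function names are all assumptions made for illustration.

```python
def build_quadtree(points, x0, y0, size, max_depth=16):
    """Subdivide the square [x0, x0+size) x [y0, y0+size) until each leaf
    holds at most one point (or max_depth is exhausted)."""
    if len(points) <= 1 or max_depth == 0:
        return {"points": list(points)}          # leaf cell
    half = size / 2
    buckets = [[], [], [], []]                   # SW, SE, NW, NE
    for x, y in points:
        index = (1 if x >= x0 + half else 0) + (2 if y >= y0 + half else 0)
        buckets[index].append((x, y))
    origins = [(x0, y0), (x0 + half, y0), (x0, y0 + half), (x0 + half, y0 + half)]
    return {
        "children": [
            build_quadtree(bucket, ox, oy, half, max_depth - 1)
            for bucket, (ox, oy) in zip(buckets, origins)
        ]
    }


def count_leaves(node):
    """Number of leaf cells in the tree."""
    if "children" not in node:
        return 1
    return sum(count_leaves(child) for child in node["children"])


if __name__ == "__main__":
    tree = build_quadtree([(0.1, 0.1), (0.9, 0.9)], 0.0, 0.0, 1.0)
    print(count_leaves(tree))  # 4: one split separates the two points
```

Two points that lie close together force many levels of subdivision, which is why the tree is called unbalanced and why output size k appears alongside input size n in the bounds above.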