Programming Parallel Algorithms, 1996
Cited by 163 (7 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
NESL: A nested data-parallel language (version 2.6), 1993
Cited by 87 (7 self)
The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Wright Laboratory or the U. S. Government. Keywords: Data-parallel, parallel algorithms, supercomputers, nested parallelism. This report describes Nesl, a strongly-typed, applicative, data-parallel language. Nesl is intended to be used as a portable interface for programming a variety of parallel and vector computers, and as a basis for teaching parallel algorithms. Parallelism is supplied through a simple set of data-parallel constructs based on sequences, including a mechanism for applying any function over the elements of a sequence in parallel and a rich set of parallel functions that manipulate sequences. Nesl fully supports nested sequences and nested parallelism, that is, the ability to take a parallel function and apply it over multiple instances in parallel. Nested parallelism is important for implementing algorithms with irregular nested loops (where the inner loop lengths depend on the outer iteration) and for divide-and-conquer algorithms. Nesl also provides a performance model for calculating the asymptotic performance of a program on
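
The two uses of nested parallelism named in the abstract above (irregular nested loops and divide-and-conquer) can be illustrated with a rough Python analogue. This is only a sketch: it is not Nesl syntax, Python comprehensions run sequentially, and the function names `row_sums` and `quicksort` are hypothetical names chosen for illustration. Each comprehension marks a place where a nested data-parallel language could apply the operation to all elements at once.

```python
def row_sums(nested):
    """Irregular nested loop: the inner length depends on the outer element."""
    # Outer apply-to-each over rows; each sum is itself a parallelizable reduction.
    return [sum(row) for row in nested]


def quicksort(seq):
    """Divide-and-conquer written with apply-to-each style comprehensions."""
    if len(seq) <= 1:
        return list(seq)
    pivot = seq[len(seq) // 2]
    # Three element-wise filters; each one is a flat data-parallel operation.
    lesser = [x for x in seq if x < pivot]
    equal = [x for x in seq if x == pivot]
    greater = [x for x in seq if x > pivot]
    # Nested parallelism: the same (parallel) function is applied over a
    # sequence of subproblems, here the two partitions.
    sorted_parts = [quicksort(part) for part in (lesser, greater)]
    return sorted_parts[0] + equal + sorted_parts[1]
```

For example, `row_sums([[1, 2], [], [3, 4, 5]])` sums rows of different lengths, and `quicksort` recurses on partitions whose sizes are only known at run time.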
Sublinear-Time Parallel Algorithms for Matching and Related Problems, 1988
Cited by 33 (6 self)
This paper presents the first sublinear-time deterministic parallel algorithms for bipartite matching and several related problems, including maximal node-disjoint paths, depth-first search, and flows in zero-one networks. Our results are based on a better understanding of the combinatorial structure of the above problems, which leads to new algorithmic techniques. In particular, we show how to use maximal matching to extend, in parallel, a current set of node-disjoint paths and how to take advantage of the parallelism that arises when a large number of nodes are "active" during an execution of a push-relabel network flow algorithm. We also show how to apply our techniques to design parallel algorithms for the weighted versions of the above problems. In particular, we present sublinear-time deterministic parallel algorithms for finding a minimum-weight bipartite matching and for finding a minimum-cost flow in a network with zero-one capacities, if the weights are polynomially ...
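
For readers unfamiliar with the notion of "active" nodes in a push-relabel algorithm, the following is a minimal sequential sketch of the generic FIFO push-relabel method for maximum flow. It is not the parallel algorithm of the paper above; the function name `max_flow_push_relabel` and the dense-matrix representation are assumptions made for brevity. A node is active when it holds excess flow, and the queue of active nodes is where a parallel variant would look for work to do simultaneously.

```python
from collections import deque


def max_flow_push_relabel(n, arcs, s, t):
    """Generic FIFO push-relabel; arcs is a list of (u, v, capacity) triples."""
    cap = [[0] * n for _ in range(n)]
    for u, v, c in arcs:
        cap[u][v] += c
    flow = [[0] * n for _ in range(n)]   # skew-symmetric net flow
    height = [0] * n
    excess = [0] * n
    height[s] = n
    active = deque()                     # nodes other than s, t with excess > 0

    # Saturate every arc leaving the source.
    for v in range(n):
        if v != s and cap[s][v] > 0:
            flow[s][v] = cap[s][v]
            flow[v][s] = -cap[s][v]
            excess[v] += cap[s][v]
            if v != t:
                active.append(v)

    while active:
        u = active.popleft()
        while excess[u] > 0:             # discharge u completely
            pushed = False
            for v in range(n):
                residual = cap[u][v] - flow[u][v]
                if residual > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual)
                    flow[u][v] += delta
                    flow[v][u] -= delta
                    excess[u] -= delta
                    if v != s and v != t and excess[v] == 0:
                        active.append(v)  # v becomes active
                    excess[v] += delta
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Relabel: lift u just above its lowest residual neighbor.
                height[u] = 1 + min(
                    height[v] for v in range(n) if cap[u][v] - flow[u][v] > 0
                )
    return excess[t]
```

The returned value is the excess accumulated at the sink, which equals the maximum flow value once no active nodes remain.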
A Fast and Simple Algorithm for the Maximum Flow Problem - OPERATIONS RESEARCH, 1989
Cited by 26 (4 self)
We present a simple sequential algorithm for the maximum flow problem on a network with n nodes, m arcs, and integer arc capacities bounded by U. Under the practical assumption that U is polynomially bounded in n, our algorithm runs in time O(nm + n^2 log n). This result improves the previous best bound of O(nm log(n^2/m)), obtained by Goldberg and Tarjan, by a factor of log n for networks that are both nonsparse and nondense without using any complex data structures. We also describe a parallel implementation of the algorithm that runs in O(n^2 log U log p) time in the PRAM model with EREW and uses only p processors where p = [m/n
On Parallel Hashing and Integer Sorting, 1991
Cited by 24 (9 self)
The problem of sorting n integers from a restricted range [1..m], where m is superpolynomial in n, is considered. An o(n log n) randomized algorithm is given. Our algorithm takes O(n log log m) expected time and O(n) space. (Thus, for m = n^polylog(n) we have an O(n log log n) algorithm.) The algorithm is parallelizable. The resulting parallel algorithm achieves optimal speed up. Some features of the algorithm make us believe that it is relevant for practical applications. A result of independent interest is a parallel hashing technique. The expected construction time is logarithmic using an optimal number of processors, and searching for a value takes O(1) time in the worst case. This technique enables drastic reduction of space requirements for the price of using randomness. Applicability of the technique is demonstrated for the parallel sorting algorithm, and for some parallel string matching algorithms. The parallel sorting algorithm is designed for a strong and nonstandard mo...
Trade-offs Between Communication Throughput and Parallel Time, 1994
Cited by 18 (1 self)
We study the effect of limited communication throughput on parallel computation in a setting where the number of processors is much smaller than the length of the input. Our model has p processors that communicate through a shared memory of size m. The input has size n, and can be read directly by all the processors. We will be primarily interested in studying cases where n ≫ p ≫ m. As a test case we study the list reversal problem. For this problem we prove a time lower bound of Ω(n/√(mp)). (A similar lower bound holds also for the problems of sorting, finding all unique elements, convolution, and universal hashing.) This result shows that limiting the communication (i.e., small m) has significant effect on parallel computation. We show an almost matching upper bound of O((n/√(mp)) log^O(1) n). The upper bound requires the development of a few interesting techniques which can alleviate the limited communication in some
Lower bounds in a parallel model without bit operations - TO APPEAR IN THE SIAM JOURNAL ON COMPUTING, 1997
Structural Parallel Algorithmics, 1991
Cited by 11 (4 self)
The first half of the paper is a general introduction which emphasizes the central role that the PRAM model of parallel computation plays in algorithmic studies for parallel computers. Some of the collective knowledge-base on non-numerical parallel algorithms can be characterized in a structural way. Each structure relates a few problems and techniques to one another from the basic to the more involved. The second half of the paper provides a bird's-eye view of such structures for: (1) list, tree and graph parallel algorithms; (2) very fast deterministic parallel algorithms; and (3) very fast randomized parallel algorithms.
1 Introduction
Parallelism is a concern that is missing from "traditional" algorithmic design. Unfortunately, it turns out that most efficient serial algorithms become rather inefficient parallel algorithms. The experience is that the design of parallel algorithms requires new paradigms and techniques, offering an exciting intellectual challenge. We note that it had...
Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques - College Park, MD, 1993
Cited by 6 (1 self)
PRAM-On-Chip Explicit Multi-Threading (XMT) platform is provided through the XMT home page www.umiacs.umd.edu/users/vishkin/XMT and the class home page. Comments are welcome: please write to me using my last name at umd.edu
On Implementing Graph Cuts on CUDA
Cited by 3 (0 self)
has enabled graphics processors to be explicitly programmed as general-purpose shared-memory multi-core processors with a high level of parallelism. In this paper, we present our preliminary results of implementing the Graph Cuts algorithm on CUDA. Our primary focus is on implementing Graph Cuts on grid graphs, which are extensively used in imaging applications. We first explain our implementation of breadth first search (BFS) graph traversal on CUDA, which is extensively used in our Graph Cuts implementation. We then present a basic implementation of Graph Cuts that achieves absolute and relative speedups when used for foreground-background segmentation on synthesized images. Finally, we introduce two optimizations that utilize the special structure of grid graphs. The first one is lockstep BFS, which is used to reduce the overhead of BFS traversals. The second is cache emulation, which is a general technique to regularize memory access patterns and hence enhance memory access throughput. We experimentally show how each of the two optimizations can enhance the performance of the basic implementation on the image segmentation application.
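
As background for the BFS traversal mentioned in the abstract above, here is a small sequential Python sketch of level-synchronous BFS on a 4-connected grid graph. It is not the paper's CUDA implementation and does not reproduce its lockstep BFS or cache emulation optimizations; the function name `grid_bfs_levels` and the `passable` grid encoding are assumptions for illustration. The frontier-at-a-time structure is the part that maps naturally onto data-parallel hardware, where each frontier cell could be handled by its own thread.

```python
def grid_bfs_levels(passable, source):
    """Return BFS distances on a 4-connected grid; -1 marks unreachable cells.

    passable: list of rows, truthy where a cell can be visited.
    source:   (row, col) of the start cell, assumed passable.
    """
    rows, cols = len(passable), len(passable[0])
    dist = [[-1] * cols for _ in range(rows)]
    sr, sc = source
    dist[sr][sc] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # All cells in the current frontier are independent of each other,
        # so this loop is the candidate for one-thread-per-cell execution.
        for r, c in frontier:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and passable[nr][nc] and dist[nr][nc] == -1):
                    dist[nr][nc] = level
                    next_frontier.append((nr, nc))
        frontier = next_frontier
    return dist
```

In a real parallel version, two frontier cells may try to claim the same neighbor in the same step; since both would write the same level value, that race is benign for distances, though building the next frontier then needs deduplication.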

