Results 11  20
of
163
Counting Networks and MultiProcessor Coordination (Extended Abstract)
 In Proceedings of the 23rd Annual Symposium on Theory of Computing
, 1991
"... ) James Aspnes Maurice Herlihy y Nir Shavit z Digital Equipment Corporation Cambridge Research Lab CRL 90/11 September 18, 1991 Abstract Many fundamental multiprocessor coordination problems can be expressed as counting problems: processes must cooperate to assign successive values from a g ..."
Abstract

Cited by 43 (7 self)
 Add to MetaCart
) James Aspnes Maurice Herlihy y Nir Shavit z Digital Equipment Corporation Cambridge Research Lab CRL 90/11 September 18, 1991 Abstract Many fundamental multiprocessor coordination problems can be expressed as counting problems: processes must cooperate to assign successive values from a given range, such as addresses in memory or destinations on an interconnection network. Conventional solutions to these problems perform poorly because of synchronization bottlenecks and high memory contention. Motivated by observations on the behavior of sorting networks, we offer a completely new approach to solving such problems. We introduce a new class of networks called counting networks, i.e., networks that can be used to count. We give a counting network construction of depth log 2 n using n log 2 n "gates," avoiding the sequential bottlenecks inherent to former solutions, and having a provably lower contention factor on its gates. Finally, to show that counting networks are not ...
SmallDepth Counting Networks
, 1992
"... Generalizing the notion of a sorting network, Aspnes, Herlihy, and Shavit recently introduced a class of socalled "counting" networks, and established an O(lg 2 n) upper bound on the depth complexity of such networks. Their work was motivated by a number of practical applications arising in the dom ..."
Abstract

Cited by 41 (2 self)
 Add to MetaCart
Generalizing the notion of a sorting network, Aspnes, Herlihy, and Shavit recently introduced a class of socalled "counting" networks, and established an O(lg 2 n) upper bound on the depth complexity of such networks. Their work was motivated by a number of practical applications arising in the domain of asynchronous shared memory machines. This paper continues the analysis of counting networks, providing a number of new upper bounds. In particular, we present an explicit construction of an O(c lg* lg n) depth counting network, a randomized construction of an O(lgn)depth network (that works with extremely high probability), and using the random con struction we present an existential proof of a de terministic O(lgn)depth network. The latter result matches the trivial (lgn)depth lower bound to within a constant factor.
Improved Parallel Integer Sorting without Concurrent Writing
, 1992
"... We show that n integers in the range 1 : : n can be sorted stably on an EREW PRAM using O(t) time and O(n( p log n log log n + (log n) 2 =t)) operations, for arbitrary given t log n log log n, and on a CREW PRAM using O(t) time and O(n( p log n + log n=2 t=logn )) operations, for arbitrary ..."
Abstract

Cited by 41 (4 self)
 Add to MetaCart
We show that n integers in the range 1 : : n can be sorted stably on an EREW PRAM using O(t) time and O(n( p log n log log n + (log n) 2 =t)) operations, for arbitrary given t log n log log n, and on a CREW PRAM using O(t) time and O(n( p log n + log n=2 t=logn )) operations, for arbitrary given t log n. In addition, we are able to sort n arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost \Theta( p log n) closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best previous sequential algorithm. We also show that n integers in the range 1 : : m can be sorted in O((log n) 2 ) time with O(n) operations on an EREW PRAM using a nonstandard word length of O(log n log log n log m) bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear t...
Efficient Computation of Recurrence Diameters
 4th International Conference on Verification, Model Checking, and Abstract Interpretation, volume 2575 of Lecture Notes in Computer Science
, 2003
"... SAT based Bounded Model Checking (BMC) is an efficient method for detecting logical errors in finitestate transition systems. Given a transition system, an LTL property, and a user defined bound k, a bounded model checker generates a propositional formula that is satisfiable if and only if a counte ..."
Abstract

Cited by 36 (14 self)
 Add to MetaCart
SAT based Bounded Model Checking (BMC) is an efficient method for detecting logical errors in finitestate transition systems. Given a transition system, an LTL property, and a user defined bound k, a bounded model checker generates a propositional formula that is satisfiable if and only if a counterexample to the property of length up to k exists. Standard SAT checkers can be used to check this formula. BMC is complete if k is larger than some precomputed threshold. It is still unknown how to compute this threshold for general properties. We show that the longest initialized loopfree path in the state graph, also known as the recurrence diameter, is sufficient for Fp properties. The recurrence diameter is also a known overapproximation for the threshold of simple safety properties (Gp). We discuss various techniques to compute the recurrence diameter efficiently and provide experimental results that demonstrate the benefits of using the new approach.
GPUABiSort: Optimal parallel sorting on stream architectures
 IN PROCEEDINGS OF THE 20TH IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS ’06) (APR
, 2006
"... In this paper, we present a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting n values utilizing p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). While this makes our approach compe ..."
Abstract

Cited by 34 (0 self)
 Add to MetaCart
In this paper, we present a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting n values utilizing p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). While this makes our approach competitive with common sequential sorting algorithms not only from a theoretical viewpoint, it is also very fast from a practical viewpoint. This is achieved by using efficient linear stream memory accesses and by combining the optimal time approach with algorithms optimized for small input sequences. We present an implementation on modern programmable graphics hardware (GPUs). On recent GPUs, our optimal parallel sorting approach has shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous nonoptimal sorting approaches on the GPU for sufficiently large input sequences. Because of the excellent scalability of our algorithm with the number of stream processor units p (up to n / log 2 n or even n / log n units, depending on the stream architecture), our approach profits heavily from the trend of increasing number of fragment processor units on GPUs, so that we can expect further speed improvement with upcoming GPU generations.
On the CostEffectiveness of PRAMs
, 1991
"... We introduce a formalism which allows to treat computer architecture as a formal optimization problem. We apply this to the design of shared memory parallel machines. Present computers of this type support the programming model of a shared memory. But simultaneous access to the shared memory by seve ..."
Abstract

Cited by 33 (12 self)
 Add to MetaCart
We introduce a formalism which allows to treat computer architecture as a formal optimization problem. We apply this to the design of shared memory parallel machines. Present computers of this type support the programming model of a shared memory. But simultaneous access to the shared memory by several processors is in many situations processed sequentially. Asymptotically good solutions for this problem are offered by theoretical computer science. We modify these constructions under engineering aspects and improve the price/performance ratio by roughly a factor of 6. The resulting machine has surprisingly good price/performance ratio even if compared with distributed memory machines. For almost all access patterns of all processors into the shared memory, access is as fast as the access of only a single processor. 1 Introduction Commercially available parallel machines can be classified as distributed memory machines or shared memory machines. Exchange of data between different proce...
Efficient Parallel Evaluation of Straightline Code and Arithmetic Circuits
 SIAM J. Comput
, 1988
"... A new parallel algorithm is given to evaluate a straight line program. The algorithm evaluates a program over a commutative semiring R of degree d and size n in time O(log n(log nd)) using M(n) processors, where M(n) is the number of processors required for multiplying n \Theta n matrices over the ..."
Abstract

Cited by 31 (5 self)
 Add to MetaCart
A new parallel algorithm is given to evaluate a straight line program. The algorithm evaluates a program over a commutative semiring R of degree d and size n in time O(log n(log nd)) using M(n) processors, where M(n) is the number of processors required for multiplying n \Theta n matrices over the semiring R in O(log n) time. Appears in SIAM J. Comput., 17/4, pp. 687695 (1988). Preliminary version of this paper appeared in [6]. y Research supported in part by National Science Foundation Grant MCS800756 A01. z Research supported by NSF under ECS8404866, the Semiconductor Research Corporation under RSCH 84060496, and by an IBM Faculty Development Award. x Research Supported in part by NSF Grant DCR8504391 and by an IBM Faculty Development Award. 1 INTRODUCTION 1 1 Introduction In this paper we consider the problem of dynamic evaluation of a straight line program in parallel. This is a generalization of the result of Valiant et al [10]. They consider the problem of ta...
Further Algorithmic Aspects of the Local Lemma
, 2001
"... We provide a method to produce an efficient algorithm to find an object whose existence is guaranteed by the Lov'asz Local Lemma. We feel that this method will apply to the vast majority of applications of the Local Lemma, unless the application has one of four problematic traits. However, proving ..."
Abstract

Cited by 29 (5 self)
 Add to MetaCart
We provide a method to produce an efficient algorithm to find an object whose existence is guaranteed by the Lov'asz Local Lemma. We feel that this method will apply to the vast majority of applications of the Local Lemma, unless the application has one of four problematic traits. However, proving that the method applies to a particular application may require proving two (possibly difficult) concentrationlike properties.
Parallel Algorithmic Techniques for Combinatorial Computation
 Ann. Rev. Comput. Sci
, 1988
"... this paper and supplied many helpful comments. This research was supported in part by NSF grants DCR8511713, CCR8605353, and CCR8814977, and by DARPA contract N0003984C0165. ..."
Abstract

Cited by 29 (3 self)
 Add to MetaCart
this paper and supplied many helpful comments. This research was supported in part by NSF grants DCR8511713, CCR8605353, and CCR8814977, and by DARPA contract N0003984C0165.
Parallel Sorting With Limited Bandwidth
 in Proc. 7th ACM Symp. on Parallel Algorithms and Architectures
, 1995
"... We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the tradeoff between the amount of local computation an ..."
Abstract

Cited by 26 (5 self)
 Add to MetaCart
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the tradeoff between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. We prove a lower bound of \Omega\Gamma n log m m ) on the time to sort n numbers in an exclusiveread variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that bot...