Analyses of Load Stealing Models Based on Differential Equations
- In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1998
Cited by 17 (0 self)
In this paper we develop models for and analyze several randomized work stealing algorithms in a dynamic setting. Our models represent the limiting behavior of systems as the number of processors grows to infinity using differential equations. The advantages of this approach include the ability to model a large variety of systems and to provide accurate numerical approximations of system behavior even when the number of processors is relatively small. We show how this approach can yield significant intuition about the behavior of work stealing algorithms in realistic settings.
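The differential-equation view described in this abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the listing: tasks arrive at each processor at rate `lam`, service times are exponential with rate 1, and a processor that finishes its last task tries to steal one task from a single randomly chosen victim holding at least two. The state `s[i]` is the fraction of processors with at least `i` tasks, and the function names are hypothetical.

```python
def work_stealing_derivs(s, lam):
    """Right-hand side of an assumed mean-field model of simple work stealing.

    s[i] is the fraction of processors holding at least i tasks (s[0] == 1).
    Levels beyond the end of the list are treated as empty (truncation).
    """
    n = len(s)

    def at(i):
        return s[i] if i < n else 0.0

    ds = [0.0] * n
    # Rate at which processors complete their last task and become thieves.
    thief_rate = s[1] - at(2)
    # A thief stays busy only if its random victim has at least two tasks.
    ds[1] = lam * (s[0] - s[1]) - thief_rate * (1.0 - at(2))
    for i in range(2, n):
        exactly_i = s[i] - at(i + 1)
        # Arrivals raise the load; service and being stolen from lower it.
        ds[i] = lam * (s[i - 1] - s[i]) - exactly_i - exactly_i * thief_rate
    return ds


def simulate(lam=0.7, levels=30, dt=0.05, steps=10000):
    """Integrate the model with forward Euler, starting from an empty system."""
    s = [1.0] + [0.0] * (levels - 1)
    for _ in range(steps):
        ds = work_stealing_derivs(s, lam)
        s = [1.0] + [x + dt * d for x, d in zip(s[1:], ds[1:])]
    return s


if __name__ == "__main__":
    tails = simulate()
    for i in range(1, 6):
        print(i, round(tails[i], 4))
```

Under these assumptions the busy fraction `s[1]` settles near `lam`, because in steady state tasks must complete at the rate they arrive. The tail `s[2]` falls below the `lam ** 2` that independent queues without stealing would give.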
Scalable Load Balancing Strategies for Parallel A* Algorithms
- Journal of Parallel and Distributed Computing
, 1994
Cited by 16 (5 self)
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in $\Theta(\log P)$ time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel ...
Parallelism and Locality in Priority Queues
- In Sixth IEEE Symposium on Parallel and Distributed Processing
, 1994
Cited by 15 (0 self)
We explore two ways of incorporating parallelism into priority queues. The first is to speed up the execution of individual priority operations so that they can be performed one operation per time step, unlike sequential implementations which require $O(\log N)$ time steps per operation for an $N$-element heap. We give an optimal parallel implementation that uses a linear array of $O(\log N)$ processors. Second, we consider parallel operations on the priority queue. We show that using a $d$-dimensional array (constant $d$) of $P$ processors we can insert or delete the smallest $P$ elements from a heap in time $O(P^{1/d} \log^{1-1/d} P)$, where the number of elements in the heap is assumed to be polynomial in $P$. We also show a matching lower bound, based on communication complexity arguments, for a range of deterministic implementations. Finally, using randomization, we show that the time can be reduced to the optimal $O(P^{1/d})$ time with high probability. 1 Introduction Much of the theoret...
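As a quick sanity check, the deterministic array bound in this abstract, read here as $O(P^{1/d} \log^{1-1/d} P)$, can be instantiated for the two smallest dimensions. This is only arithmetic on the stated expression, not a result quoted from the paper:

```latex
\begin{align*}
d = 1:&\quad P^{1/1}\,\log^{1-1/1} P = P \\
d = 2:&\quad P^{1/2}\,\log^{1-1/2} P = \sqrt{P \log P}
\end{align*}
```

In the same two cases the randomized bound $O(P^{1/d})$ evaluates to $P$ and $\sqrt{P}$, so under this reading randomization removes the logarithmic factor only for $d \ge 2$.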
Performance Evaluation of Load Distribution Strategies in Parallel Branch and Bound Computations
- Proc. 7th Symposium on Parallel and Distributed Processing (SPDP'95)
, 1995
Cited by 12 (1 self)
Load distribution is essential for efficient use of processors in parallel branch-and-bound computations because the computation generates and consumes non-uniform subproblems at runtime. This paper presents six decentralized load distribution strategies. They are incorporated in a runtime support system, and evaluated in the solution of set partitioning problems on two parallel computer systems. It is observed that local averaging strategies significantly outperform the randomized allocation and the Acwn algorithm in large-scale systems. They lead to an almost linear speedup in a PowerPC-based system with up to 32 nodes and to a speedup of 146.8 in a Transputer-based system with 256 nodes. It is also observed that the randomized allocation and the Acwn algorithm can be improved by 10% to 15% when the subproblem bound information is used in the decision making. 1 Introduction Branch-and-bound is a well-known technique for solving combinatorial search problems [4]. Its basic scheme is t...
Distributed data structures and algorithms for Gröbner basis computation
- Lisp and Symbolic Computation
, 1994
Cited by 11 (4 self)
We present the design and implementation of a parallel algorithm for computing Gröbner bases on distributed memory multiprocessors. The parallel algorithm is irregular both in space and time: the data structures are dynamic pointer-based structures and the computations on the structures have unpredictable duration. The algorithm is presented as a series of refinements on a transition rule program, in which computation proceeds by nondeterministic invocations of guarded commands. Two key data structures, a set and a priority queue, are distributed across processors in the parallel algorithm. The data structures are designed for high throughput and latency tolerance, as appropriate for distributed memory machines. The programming style represents a compromise between shared-memory and message-passing models. The distributed nature of the data structures shows through their interface in that the semantics are weaker than with shared atomic objects, but they still provide a shared abstraction that can be used for reasoning about program correctness. In the data structure design there is a classic trade-off between locality and load balance. We argue that this is best solved by designing scheduling structures in tandem with the state data structures, since the decision to replicate or partition state affects the overhead of dynamically moving tasks.
The Parallel Implementation of N-body Algorithms
, 1994
Cited by 11 (1 self)
This dissertation studies issues critical to efficient N-body simulations on parallel computers. The N-body problem poses several challenges for distributed-memory implementation: adaptive distributed data structures, irregular data access patterns, and irregular and adaptive communication patterns. We introduce new techniques to maintain dynamic irregular data structures, to vectorize irregular computational structures, and to communicate efficiently. We report results from experiments on the Connection Machine CM-5. The results demonstrate the performance advantages of design simplicity; the code provides generality of use on various message-passing architectures. Our methods have been used as the basis of a C++ library that provides abstractions for tree computations to ease the development of different N-body codes. This dissertation also presents the atomic message model to capture the important factors of efficient communication in message-passing systems. The atomic model was m...
Concurrent Heaps on the BSP Model
, 1996
Cited by 11 (11 self)
In this paper we present a new randomized selection algorithm on the Bulk-Synchronous Parallel (BSP) model of computation along with an application of this algorithm to dynamic data structures, namely Parallel Priority Queues (PPQs). We show that our algorithms improve upon previous results in both the communication requirements and the amount of parallel slack required to achieve optimal performance. We also establish that optimality to within small multiplicative constant factors can be achieved for a wide range of parallel machines. While these algorithms are fairly simple themselves, descriptions of their performance in terms of the BSP parameters are somewhat involved. The main reward of quantifying these complications is that it allows transportable software to be written for parallel machines that fit the model. We also present experimental results for the selection algorithm that reinforce our claims.
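For readers unfamiliar with the selection problem this abstract addresses, the following is a minimal sequential sketch of randomized selection (quickselect). It is not the BSP algorithm from the paper, whose details are not given in this listing. The function name `randomized_select` and its parameters are hypothetical.

```python
import random


def randomized_select(items, k, rng=None):
    """Return the k-th smallest element of items (k is 0-based).

    Sequential illustration only: a random pivot splits the candidates,
    and the search continues in the part that must contain rank k.
    """
    rng = rng or random.Random()
    if not 0 <= k < len(items):
        raise IndexError("k is out of range")
    pool = list(items)
    while True:
        pivot = rng.choice(pool)
        lower = [x for x in pool if x < pivot]
        equal_count = sum(1 for x in pool if x == pivot)
        if k < len(lower):
            pool = lower
        elif k < len(lower) + equal_count:
            return pivot
        else:
            # Skip everything at or below the pivot and adjust the rank.
            k -= len(lower) + equal_count
            pool = [x for x in pool if x > pivot]


if __name__ == "__main__":
    data = [9, 1, 8, 2, 7, 3, 7]
    print(randomized_select(data, 3, random.Random(0)))
```

A parallel priority queue that removes its smallest P elements in one step needs exactly this kind of rank query to find the cut-off key, which is one plausible reason the two topics appear together.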
Distributed mining of molecular fragments
- Proc. of IEEE DMGrid, Workshop on Data Mining and Grid of IEEE ICDM
, 2004
Cited by 9 (3 self)
In real-world applications, sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context it is necessary to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.
Parallel A* Algorithms and their Performance on Hypercube Multiprocessors
, 1993
Cited by 9 (3 self)
In this paper we develop parallel A* algorithms suitable for distributed-memory machines. In parallel A* algorithms, inefficiencies grow with the number of processors P used, causing performance to drop significantly at lower and intermediate work densities (the ratio of the problem size to P). To alleviate this effect, we propose a novel parallel startup phase and efficient dynamic work distribution strategies, and thus improve the scalability of parallel A* search. We also tackle the problem of duplicate searching by different processors, by using work transfer as a means of partial duplicate pruning. The parallel startup scheme proposed requires only $\Theta(\log P)$ time compared to $\Theta(P)$ time for sequential startup methods used in the past. Using the Traveling Salesman Problem (TSP) as our test case, we see that our work distribution strategies yield speedup improvements of more than 30% and 15% at lower and intermediate work densities, respectively, while requiring 20% to 45%...
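The gap between logarithmic and linear startup time can be made concrete with a small counting sketch. The doubling scheme below is an assumption chosen for illustration, not the paper's startup scheme: in each round, every processor that already holds work hands part of it to one idle processor. The sequential baseline assumes a single root that seeds the other processors one at a time. Both function names are hypothetical.

```python
def doubling_startup_rounds(num_procs):
    """Rounds until every processor holds work under the assumed doubling scheme."""
    if num_procs < 1:
        raise ValueError("need at least one processor")
    busy, rounds = 1, 0
    while busy < num_procs:
        # Each busy processor seeds one idle processor in this round.
        busy = min(num_procs, 2 * busy)
        rounds += 1
    return rounds


def sequential_startup_steps(num_procs):
    """Steps for a single root to seed the remaining processors one by one."""
    if num_procs < 1:
        raise ValueError("need at least one processor")
    return num_procs - 1


if __name__ == "__main__":
    for p in (8, 64, 1024):
        print(p, doubling_startup_rounds(p), sequential_startup_steps(p))
```

The round count grows as the ceiling of log2(P), while the sequential count grows linearly in P, which matches the asymptotic contrast stated in the abstract.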
Tight Bounds For On-Line Tree Embeddings
- In Proceedings of the 2nd ACM-SIAM Symposium On Discrete Algorithms
, 1991
Cited by 8 (0 self)
Tree-structured computations are relatively easy to process in parallel. As leaf processes are recursively spawned they can be assigned to independent processors in a multicomputer network. However, to achieve good performance the on-line mapping algorithm must maintain load balance, i.e., distribute processes equitably among processors. Additionally, the algorithm itself must be distributed in nature, and process allocation must be completed via message-passing with minimal communication overhead. This paper investigates bounds on the performance of deterministic and randomized algorithms for on-line tree embeddings. In particular, we study trade-offs between computation overhead (load imbalance) and communication overhead (message congestion). We give a simple technique to derive lower bounds on the congestion that any on-line allocation algorithm must incur in order to guarantee load balance. This technique works for both randomized and deterministic algorithms. We prove that the a...

