Results 11 - 20 of 45
Analyses of Load Stealing Models Based on Differential Equations
 In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1998
Abstract

Cited by 19 (0 self)
In this paper we develop models for and analyze several randomized work stealing algorithms in a dynamic setting. Our models represent the limiting behavior of systems as the number of processors grows to infinity using differential equations. The advantages of this approach include the ability to model a large variety of systems and to provide accurate numerical approximations of system behavior even when the number of processors is relatively small. We show how this approach can yield significant intuition about the behavior of work stealing algorithms in realistic settings.
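The flavor of such a differential-equation model can be sketched numerically. The following is an illustrative toy only (the rates, the steal term, and the queue-length cutoff are assumptions for the sketch, not the paper's actual equations): s[i] tracks the fraction of processors holding at least i tasks, and Euler steps integrate the mean-field dynamics.

```python
# Toy mean-field model of per-processor load with stealing (illustrative).
# s[i] = fraction of processors with at least i tasks; s[0] is always 1.
def simulate(lam=0.5, mu=1.0, steal=0.5, max_q=20, dt=0.001, t_end=20.0):
    s = [1.0] + [0.0] * max_q
    for _ in range(int(t_end / dt)):
        ds = [0.0] * (max_q + 1)
        for i in range(1, max_q):
            arrive = lam * (s[i - 1] - s[i])   # a length-(i-1) queue gains a task
            serve = mu * (s[i] - s[i + 1])     # a length-i queue finishes a task
            ds[i] = arrive - serve
        # Toy steal term: an idle processor takes a task from a length-2 queue,
        # moving one queue 2 -> 1 and one queue 0 -> 1.
        rate = steal * (1.0 - s[1]) * (s[2] - s[3])
        ds[1] += rate
        ds[2] -= rate
        for i in range(1, max_q):
            s[i] += dt * ds[i]
    return s

s = simulate()
```

The appeal the abstract mentions shows up even in this toy: one cheap integration approximates the behavior of arbitrarily many processors, with no per-processor simulation.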
Scheduling Threads for Low Space Requirement and Good Locality
 In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA)
, 1999
Abstract

Cited by 19 (1 self)
The running time and memory requirement of a parallel program with dynamic, lightweight threads depends heavily on the underlying thread scheduler. In this paper, we present a simple, asynchronous, space-efficient scheduling algorithm for shared memory machines that combines the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. For a nested-parallel program with depth D and serial space requirement S1, we show that the expected space requirement is S1 + O(K·p·D) on p processors. Here, K is a user-adjustable runtime parameter, which provides a tradeoff between running time and space requirement. Our algorithm achieves good locality and low scheduling overheads by automatically increasing the granularity of the work scheduled on each processor. We have implemented the new scheduling algorithm in the context of a native, user-level implementation of POSIX standard threads, or Pthreads, and evaluated its p...
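For context, the baseline work-stealing discipline this abstract builds on can be shown as a toy single-threaded simulation. This is only the generic owner-pops-bottom, thief-steals-top rule, not the paper's space-bounded algorithm; the random victim selection and the worker count are illustrative choices.

```python
import random
from collections import deque

def work_steal_sim(tasks, p=4, seed=0):
    """Toy round-based simulation of p workers; returns execution order."""
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].extend(tasks)                # all work starts on worker 0
    order = []
    while any(deques):
        for w in range(p):
            if deques[w]:
                order.append(deques[w].pop())    # owner: newest task (bottom)
            else:
                victim = rng.randrange(p)        # thief: random victim
                if victim != w and deques[victim]:
                    # steal the oldest task (top of the victim's deque)
                    deques[w].append(deques[victim].popleft())
    return order

done = work_steal_sim(list(range(16)))
```

Stealing from the top of a deque tends to transfer old, shallow tasks representing large chunks of work, which is the granularity and locality effect the abstract refers to.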
Scalable Load Balancing Strategies for Parallel A* Algorithms
 Journal of Parallel and Distributed Computing
, 1994
Abstract

Cited by 16 (5 self)
In this paper, we develop load balancing strategies for scalable high-performance parallel A* algorithms suitable for distributed-memory machines. In parallel A* search, inefficiencies such as processor starvation and search of nonessential spaces (search spaces not explored by the sequential algorithm) grow with the number of processors P used, thus restricting its scalability. To alleviate this effect, we propose a novel parallel startup phase and an efficient dynamic load balancing strategy called the quality equalizing (QE) strategy. Our new parallel startup scheme executes optimally in Θ(log P) time and, in addition, achieves good initial load balance. The QE strategy possesses certain unique quantitative and qualitative load balancing properties that enable it to significantly reduce starvation and nonessential work. Consequently, we obtain a highly scalable parallel A* algorithm with an almost-linear speedup. The startup and load balancing schemes were employed in parallel ...
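A Θ(log P) startup phase of the general kind described here can be sketched as recursive halving: split the processor group in two, hand one freshly generated subproblem to each half, and recurse, so every processor holds seed work after about log2(P) levels. The function names and the interval-halving branch rule below are illustrative assumptions, not the paper's actual scheme.

```python
def startup(problem, procs, branch):
    """Seed every processor id in `procs` with one subproblem in ~log2(P) levels."""
    if len(procs) == 1:
        return {procs[0]: problem}
    left, right = branch(problem)          # expand into two subproblems
    mid = len(procs) // 2
    seeds = startup(left, procs[:mid], branch)
    seeds.update(startup(right, procs[mid:], branch))
    return seeds

# Toy branch rule: halve an integer interval [lo, hi).
def halve(iv):
    lo, hi = iv
    return (lo, (lo + hi) // 2), ((lo + hi) // 2, hi)

seeds = startup((0, 64), list(range(8)), halve)
```

Every processor ends up with a disjoint piece of the search space, which is the "good initial load balance" the abstract claims for its scheme.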
Parallelism and Locality in Priority Queues
 In Sixth IEEE Symposium on Parallel and Distributed Processing
, 1994
Abstract

Cited by 15 (0 self)
We explore two ways of incorporating parallelism into priority queues. The first is to speed up the execution of individual priority operations so that they can be performed one operation per time step, unlike sequential implementations which require O(log N) time steps per operation for an N-element heap. We give an optimal parallel implementation that uses a linear array of O(log N) processors. Second, we consider parallel operations on the priority queue. We show that using a d-dimensional array (constant d) of P processors we can insert or delete the smallest P elements from a heap in time O(P^(1/d) log^(1-1/d) P), where the number of elements in the heap is assumed to be polynomial in P. We also show a matching lower bound, based on communication complexity arguments, for a range of deterministic implementations. Finally, using randomization, we show that the time can be reduced to the optimal O(P^(1/d)) time with high probability.
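The semantics of the bulk operations discussed here, inserting a batch and deleting the P smallest elements, can be shown sequentially with an ordinary binary heap. This is a sketch of the interface only; the paper's contribution is performing these operations in parallel on processor arrays, which is not reproduced here.

```python
import heapq

def multi_insert(heap, items):
    """Insert a batch of items into a binary heap."""
    for x in items:
        heapq.heappush(heap, x)

def multi_delete_min(heap, k):
    """Remove and return the k smallest items, in increasing order."""
    return [heapq.heappop(heap) for _ in range(min(k, len(heap)))]

h = []
multi_insert(h, [5, 1, 9, 3, 7, 2])
smallest = multi_delete_min(h, 3)   # the three smallest: [1, 2, 3]
```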
Adaptive Work Stealing with Parallelism Feedback
Abstract

Cited by 14 (3 self)
We present an adaptive work-stealing thread scheduler, A-STEAL, for fork-join multithreaded jobs, like those written using the Cilk multithreaded language or the Hood work-stealing library. The A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. A-STEAL provides continual parallelism feedback to a job scheduler in the form of processor requests, and the job must adapt its execution to the processors allotted to it. Assuming that the job scheduler never allots any job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. Our analysis models the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the system environment and the job scheduler's administrative policies. We analyze the performance of A-STEAL using "trim analysis," which allows us to prove that our thread scheduler performs poorly on at most a small number of time steps, while exhibiting near-optimal behavior on the vast majority. To be precise, suppose that a job has work T1 and critical-path length T∞. On a machine with P processors, A-STEAL completes the job in expected O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all but
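The flavor of a quantum-by-quantum parallelism-feedback rule can be sketched as follows. The threshold delta, the growth factor rho, and the function name are illustrative assumptions for this sketch, not A-STEAL's published constants; see the paper for the actual rule.

```python
# Toy desire-adjustment rule in the spirit of parallelism feedback:
# after each scheduling quantum, compare useful work done (`usage`)
# against the processors allotted, and adjust next quantum's request.
def next_desire(desire, allotted, usage, rho=2.0, delta=0.8):
    efficient = usage >= delta * allotted   # little time wasted idle/stealing
    deprived = allotted < desire            # got fewer processors than asked
    if not efficient:
        return max(1, desire / rho)         # wasting processors: back off
    if deprived:
        return desire                       # efficient but starved: hold steady
    return desire * rho                     # efficient and satisfied: grow
```

For example, a quantum that fully used all 4 allotted processors doubles the request, while a quantum that used only 1 of 4 halves it.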
Performance Evaluation of Load Distribution Strategies in Parallel Branch and Bound Computations
 In Proc. 7th Symposium on Parallel and Distributed Processing (SPDP'95)
, 1995
Abstract

Cited by 12 (1 self)
Load distribution is essential for efficient use of processors in parallel branch-and-bound computations because the computation generates and consumes non-uniform subproblems at runtime. This paper presents six decentralized load distribution strategies. They are incorporated in a runtime support system, and evaluated in the solution of set partitioning problems on two parallel computer systems. It is observed that local averaging strategies outperform the randomized allocation and the ACWN algorithm significantly in large-scale systems. They lead to an almost linear speedup in a PowerPC-based system with up to 32 nodes and to a speedup of 146.8 in a Transputer-based system with 256 nodes. It is also observed that the randomized allocation and the ACWN algorithm can be improved by 10% to 15% when the subproblem bound information is used in the decision-making.
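A local averaging step of the general kind the abstract evaluates can be sketched on a ring of nodes: each node repeatedly averages its load with its immediate neighbors, diffusing work toward idle nodes. The ring topology and the three-way averaging rule are illustrative assumptions, not the paper's specific strategies.

```python
# One local-averaging round on a ring: each node takes the mean of its own
# load and its two neighbors' loads. Total load is conserved exactly.
def average_step(load):
    n = len(load)
    return [(load[(i - 1) % n] + load[i] + load[(i + 1) % n]) / 3.0
            for i in range(n)]

loads = [96.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for _ in range(50):
    loads = average_step(loads)
# loads approach the uniform value 12.0 on every node
```

Because each node consults only its neighbors, the strategy is decentralized and scales without any global coordination, which matches the setting the abstract describes.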
Distributed data structures and algorithms for Gröbner basis computation
 Lisp and Symbolic Computation
, 1994
Abstract

Cited by 11 (4 self)
We present the design and implementation of a parallel algorithm for computing Gröbner bases on distributed memory multiprocessors. The parallel algorithm is irregular both in space and time: the data structures are dynamic pointer-based structures and the computations on the structures have unpredictable duration. The algorithm is presented as a series of refinements on a transition rule program, in which computation proceeds by nondeterministic invocations of guarded commands. Two key data structures, a set and a priority queue, are distributed across processors in the parallel algorithm. The data structures are designed for high throughput and latency tolerance, as appropriate for distributed memory machines. The programming style represents a compromise between shared-memory and message-passing models. The distributed nature of the data structures shows through their interface in that the semantics are weaker than with shared atomic objects, but they still provide a shared abstraction that can be used for reasoning about program correctness. In the data structure design there is a classic tradeoff between locality and load balance. We argue that this is best solved by designing scheduling structures in tandem with the state data structures, since the decision to replicate or partition state affects the overhead of dynamically moving tasks.
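The transition-rule style the abstract describes can be sketched as repeated nondeterministic firing of guarded commands over shared state, stopping when no guard is enabled. The driver and the toy rule set below are illustrative, not the paper's program: closing a set of positive integers under positive differences is a one-variable shadow of completing a basis (the closure of {a, b} contains gcd(a, b)), not Buchberger's algorithm itself.

```python
import random

def run(state, rules, seed=0):
    """Fire enabled guarded commands nondeterministically until quiescence."""
    rng = random.Random(seed)
    while True:
        enabled = [(g, a) for g, a in rules if g(state)]
        if not enabled:
            return state                 # no guard holds: computation is done
        g, a = rng.choice(enabled)       # nondeterministic choice of command
        a(state)

# Toy rule: if some positive difference a - b is missing from the set, add it.
def missing_diffs(s):
    return [a - b for a in s for b in s if a > b and (a - b) not in s]

rules = [(lambda s: bool(missing_diffs(s)),
          lambda s: s.add(missing_diffs(s)[0]))]

basis = run({12, 18}, rules)   # closure contains gcd(12, 18) = 6
```

Presenting the computation this way separates correctness (any fair firing order reaches the same closure) from scheduling, which is the refinement strategy the abstract mentions.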
The Parallel Implementation of N-body Algorithms
, 1994
Abstract

Cited by 11 (1 self)
This dissertation studies issues critical to efficient N-body simulations on parallel computers. The N-body problem poses several challenges for distributed-memory implementation: adaptive distributed data structures, irregular data access patterns, and irregular and adaptive communication patterns. We introduce new techniques to maintain dynamic irregular data structures, to vectorize irregular computational structures, and for efficient communication. We report results from experiments on the Connection Machine CM-5. The results demonstrate the performance advantages of design simplicity; the code provides generality of use on various message-passing architectures. Our methods have been used as the basis of a C++ library that provides abstractions for tree computations to ease the development of different N-body codes. This dissertation also presents the atomic message model to capture the important factors of efficient communication in message-passing systems. The atomic model was m...
Concurrent Heaps on the BSP Model
, 1996
Abstract

Cited by 11 (11 self)
In this paper we present a new randomized selection algorithm on the Bulk-Synchronous Parallel (BSP) model of computation, along with an application of this algorithm to dynamic data structures, namely parallel priority queues (PPQs). We show that our algorithms improve upon previous results in both the communication requirements and the amount of parallel slack required to achieve optimal performance. We also establish that optimality to within small multiplicative constant factors can be achieved for a wide range of parallel machines. While these algorithms are fairly simple themselves, descriptions of their performance in terms of the BSP parameters are somewhat involved. The main reward of quantifying these complications is that it allows transportable software to be written for parallel machines that fit the model. We also present experimental results for the selection algorithm that reinforce our claims.
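The sequential kernel of randomized selection (quickselect) is shown below for reference; BSP algorithms of the kind the abstract describes parallelize it by having each processor partition its local data around shared pivots across supersteps, which is not reproduced in this sketch.

```python
import random

def select(xs, k, seed=0):
    """Return the k-th smallest element of xs (0-indexed), by quickselect."""
    rng = random.Random(seed)
    xs = list(xs)
    while True:
        pivot = rng.choice(xs)
        lo = [x for x in xs if x < pivot]      # strictly below the pivot
        hi = [x for x in xs if x > pivot]      # strictly above the pivot
        if k < len(lo):
            xs = lo                            # answer lies below the pivot
        elif k >= len(xs) - len(hi):
            k -= len(xs) - len(hi)             # skip lo and the pivot run
            xs = hi                            # answer lies above the pivot
        else:
            return pivot                       # k falls among pivot copies
```

Each round discards a constant fraction of candidates in expectation, which is why randomization yields the optimal bounds the abstract claims.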
Distributed mining of molecular fragments
 Proc. of IEEE DMGrid, Workshop on Data Mining and Grid of IEEE ICDM
, 2004
Abstract

Cited by 9 (3 self)
In real-world applications, sequential algorithms for data mining and data exploration are often unsuitable for datasets of enormous size, high dimensionality, and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context, it is necessary to develop high-performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large-scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute's HIV-screening dataset. We present experimental results on a small-scale computing environment.