Results 11–20 of 175
Runtime support and compilation methods for user-specified irregular data distributions
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1995
Abstract

Cited by 59 (11 self)
This paper describes two new ideas by which a High Performance Fortran compiler can deal with irregular computations effectively. The first mechanism invokes a user-specified mapping procedure via a set of proposed compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a conservative method for compiling irregular loops in which dependence arises only due to reduction operations. This mechanism in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g., communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). This paper also presents performance results for these mechanisms from a Fortran 90D compiler implementation.
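The inspector reuse idea above can be illustrated with a small, hypothetical sketch (not the Fortran 90D interface): an inspector scans the indirection array once to build a communication schedule mapping off-processor indices to buffer slots, and an executor reuses that schedule on every pass while the indirection array is unchanged. The function names, the `fetch` callback, and the ownership range are illustrative assumptions.

```python
def build_schedule(indirection, my_lo, my_hi):
    """Inspector: record which referenced global indices live off-processor,
    and assign each one a slot in a local receive buffer."""
    off_proc = sorted({g for g in indirection if not (my_lo <= g < my_hi)})
    return {g: slot for slot, g in enumerate(off_proc)}

def executor(x, indirection, schedule, my_lo, my_hi, fetch):
    """Executor: gather off-processor operands per the schedule, then reduce."""
    buffer = [fetch(g) for g in schedule]              # one "message" per entry
    total = 0.0
    for g in indirection:
        if my_lo <= g < my_hi:
            total += x[g - my_lo]                      # on-processor access
        else:
            total += buffer[schedule[g]]               # buffered off-processor copy
    return total

# Toy run: this "processor" owns global indices [0, 4); others are fetched.
x_local = [1.0, 2.0, 3.0, 4.0]
remote = {5: 50.0, 7: 70.0}                            # stand-in for other processors
ind = [0, 5, 2, 7, 5]
sched = build_schedule(ind, 0, 4)                      # built once ...
s1 = executor(x_local, ind, sched, 0, 4, remote.get)
s2 = executor(x_local, ind, sched, 0, 4, remote.get)   # ... and reused here
```

As long as `ind` is unmodified, the schedule (and hence the inspector cost) is amortized over every executor invocation, which is the reuse opportunity the paper's conservative analysis detects.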
TIGHT ANALYSES OF TWO LOCAL LOAD BALANCING ALGORITHMS
 SIAM J. COMPUT.
, 1999
Abstract

Cited by 52 (5 self)
This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d + 1 fewer tokens, where d is the maximum degree of any node in the network. We show that within O(∆/α) steps, the algorithm reduces the maximum difference in tokens between any two nodes to at most O((d^2 log n)/α), where ∆ is the global imbalance in tokens (i.e., the maximum difference between the number of tokens at any node initially and the average number of tokens), n is the number of nodes in the network, and α is the edge expansion of the network. The time bound is tight in the sense that for any graph with edge expansion α, and for any value ∆, there exists an initial distribution of tokens with imbalance ∆ for which the time to reduce the imbalance to even ∆/2 is at least Ω(∆/α). The bound on the final imbalance is tight in the sense that there exists a class of networks that can be locally balanced everywhere (i.e., the maximum difference in tokens between any two neighbors is at most 2d), while the global imbalance remains Ω((d^2 log n)/α). Furthermore, we show that upon reaching a state with a global imbalance of O((d^2 log n)/α), the time for this algorithm to locally balance the network can be as large as Ω(n^(1/2)). We extend our analysis to a variant of this algorithm for dynamic and asynchronous ...
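The rule analysed above is simple to simulate. The sketch below (the synchronous update and all names are our assumptions, not the paper's code) runs it on a 4-node path, where d = 2 and the send threshold is 2d + 1 = 5; the run settles into a locally balanced but globally uneven state, the phenomenon the paper's lower bound on the final imbalance describes.

```python
def balance_step(tokens, adj, d):
    """One synchronous step: each node sends one token to every neighbour
    holding at least 2d + 1 fewer tokens than it does."""
    delta = [0] * len(tokens)
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            if tokens[u] - tokens[v] >= 2 * d + 1:
                delta[u] -= 1      # u ships one token ...
                delta[v] += 1      # ... which v receives
    return [t + x for t, x in zip(tokens, delta)]

# Path graph 0-1-2-3; maximum degree d = 2, so the send threshold is 5.
adj = [[1], [0, 2], [1, 3], [2]]
tokens = [20, 0, 0, 0]
for _ in range(30):
    tokens = balance_step(tokens, adj, 2)
# The run reaches [10, 7, 3, 0]: every neighbour gap is at most 2d = 4
# (locally balanced), yet the global spread is still 10.
```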
Graph Partitioning Algorithms With Applications To Scientific Computing
 Parallel Numerical Algorithms
, 1997
Abstract

Cited by 50 (0 self)
Identifying the parallelism in a problem by partitioning its data and tasks among the processors of a parallel computer is a fundamental issue in parallel computing. This problem can be modeled as a graph partitioning problem in which the vertices of a graph are divided into a specified number of subsets such that few edges join two vertices in different subsets. Several new graph partitioning algorithms have been developed in the past few years, and we survey some of this activity. We describe the terminology associated with graph partitioning, the complexity of computing good separators, and graphs that have good separators. We then discuss early algorithms for graph partitioning, followed by three new algorithms based on geometric, algebraic, and multilevel ideas. The algebraic algorithm relies on an eigenvector of a Laplacian matrix associated with the graph to compute the partition. The algebraic algorithm is justified by formulating graph partitioning as a quadratic assignment p...
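The algebraic (spectral) algorithm surveyed above can be sketched in a few lines with dense linear algebra, assuming NumPy; a production partitioner would use a sparse eigensolver instead. The split is by the median of the Fiedler vector, the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian.

```python
import numpy as np

def spectral_bisection(adj_matrix):
    """Partition vertices by the median of the Fiedler vector of the
    Laplacian L = D - A; returns a boolean side mask."""
    degrees = adj_matrix.sum(axis=1)
    laplacian = np.diag(degrees) - adj_matrix
    _, eigvecs = np.linalg.eigh(laplacian)     # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                    # second-smallest eigenvalue
    return fiedler <= np.median(fiedler)

# Two 3-cliques joined by a single bridge edge (2, 3): a good bisection
# should cut exactly that bridge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
side = spectral_bisection(A)                   # cliques end up on opposite sides
```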
Runtime Compilation Techniques for Data Partitioning and Communication Schedule Reuse
 PROCEEDINGS OF THE 1993 ACM/IEEE CONFERENCE ON SUPERCOMPUTING
, 1993
Abstract

Cited by 48 (2 self)
In this paper, we describe two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user-specified mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors (e.g., communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
A New Paradigm for Parallel Adaptive Meshing Algorithms
 SIAM J. Sci. Comput
, 2003
Abstract

Cited by 46 (9 self)
We present a new approach to the use of parallel computers with adaptive finite element methods. This approach addresses the load balancing problem in a new way, requiring far less communication than current approaches. It also allows existing sequential adaptive PDE codes such as PLTMG and MC to run in a parallel environment without a large investment in recoding. In this new approach, the load balancing problem is reduced to the numerical solution of a small elliptic problem on a single processor, using a sequential adaptive solver, without requiring any modifications to the sequential solver. The small elliptic problem is used to produce a posteriori error estimates to predict future element densities in the mesh, which are then used in a weighted recursive spectral bisection of the initial mesh. The bulk of the calculation then takes place independently on each processor, with no communication, using possibly the same sequential adaptive solver. Each processor adapts its region of the mesh independently, and a nearly load-balanced mesh distribution is usually obtained as a result of the initial weighted spectral bisection. Only the initial fanout of the mesh decomposition to the processors requires communication. Two additional steps requiring boundary exchange communication may be employed after the individual processors reach an adapted solution, namely, the construction of a global conforming mesh from the independent subproblems, followed by a final smoothing phase using the subdomain solutions as an initial guess. We present a series of convincing numerical experiments that illustrate the effectiveness of this approach. The justification of the initial refinement prediction step, as well as the justification of skipping the two communication-intensive steps, ...
An Optimal Dynamic Load Balancing Algorithm
 Daresbury Laboratory
, 1995
Abstract

Cited by 45 (0 self)
The problem of redistributing work load on parallel computers is considered. An optimal redistribution algorithm, which minimises the Euclidean norm of the migrating load, is derived. The problem is further studied by modelling with the unsteady heat conduction equation. The relationship between this algorithm and other dynamic load balancing algorithms is discussed. Convergence of the algorithm for special graphs is studied. Finally, numerical results on randomly generated graphs are given to demonstrate the effectiveness of the algorithm.
1. Introduction
To achieve good performance on a parallel computer, it is essential to maintain a balanced work load among all the processors of the computer. Sometimes the load can be balanced statically. However, in many cases the load on each processor cannot be predicted a priori. One example that demonstrates the need for both static and dynamic load balancing strategies, which is also the main motivation for this paper, is in the parallel finite e...
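The heat-conduction view suggests the classical first-order diffusion iteration, sketched below with our own choice of edge weight `alpha` (this is the generic diffusion scheme, not the paper's optimal variant): each sweep moves load across every edge in proportion to the load difference, conserving the total, and converges to the uniform distribution for a sufficiently small `alpha`.

```python
def diffusion_step(load, adj, alpha):
    """One diffusion sweep: every edge carries flow proportional to the
    load difference across it, a discrete heat-conduction step."""
    new = load[:]
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            new[u] += alpha * (load[v] - load[u])
    return new

# 4-cycle; alpha = 0.25 is below 1/deg_max, a common safe stability choice.
adj = [[1, 3], [0, 2], [1, 3], [0, 2]]
load = [12.0, 0.0, 0.0, 0.0]
for _ in range(200):
    load = diffusion_step(load, adj, 0.25)
# load has converged to the average [3.0, 3.0, 3.0, 3.0]
```

Because every transfer appears with opposite signs at its two endpoints, the total load is preserved exactly at each sweep; the convergence rate is governed by the eigenvalues of the graph Laplacian, which is where the heat-equation analysis in the paper enters.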
Runtime and language support for compiling adaptive irregular programs on distributed memory machines
 SOFTWARE—PRACTICE AND EXPERIENCE
, 1995
Abstract

Cited by 44 (3 self)
In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. This paper presents a library called CHAOS, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5, and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient index translation mechanisms and provides users with high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of computation. To demonstrate the efficacy of CHAOS, two challenging real-life adaptive applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC. Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper demonstrates how CHAOS can be effectively used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been automatically parallelized.
KEY WORDS: distributed-memory multiprocessors; runtime compilation; adaptive irregular programs; parallelizing compilers
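One way to picture the index-translation primitive mentioned above is the following hypothetical sketch (not the CHAOS API): a global indirection array is rewritten into local indices, with one ghost slot appended per distinct off-processor element, so that after the ghost copies are gathered the computation loop touches only a contiguous local array.

```python
def localize(indirection, owned):
    """Hypothetical analogue of an index-translation step: rewrite a global
    indirection array into local indices over [owned data | ghost slots]."""
    local_of = {g: i for i, g in enumerate(owned)}
    ghosts = []                                   # distinct off-processor elements
    local_ind = []
    for g in indirection:
        if g not in local_of:                     # first sighting: new ghost slot
            local_of[g] = len(owned) + len(ghosts)
            ghosts.append(g)
        local_ind.append(local_of[g])
    return local_ind, ghosts

owned = [10, 11, 12]                 # global indices owned by this processor
ind = [10, 40, 12, 40, 11]           # global references; 40 is off-processor
local_ind, ghosts = localize(ind, owned)
# After gathering one copy of element 40, the local data array is laid out
# as [x10, x11, x12, x40] and local_ind addresses it directly.
```

For adaptive programs the translation is simply redone whenever the indirection array changes, which is the case the library's support for adaptivity targets.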
Massively Parallel Methods for Engineering and Science Problems
 COMM. ACM
, 1994
Abstract

Cited by 43 (0 self)
Massively parallel computers promise unique power for engineering and scientific simulations. Development of parallel software tools and algorithms that take full advantage of this power is a continuing challenge to computational researchers. In this paper we discuss the advantages a message-passing multiple-instruction/multiple-data (MIMD) programming approach has for parallel simulations and two domain-decomposition and load-balancing methods designed to optimize data locality and minimize communication costs on large parallel machines. The methods have proven useful for parallel computations in a broad range of technical fields including structural and fluid dynamics and chemical, biological, and materials science.
A Parallel Dynamic Load Balancing Algorithm for 3D Adaptive Unstructured Grids
 AIAA Journal
, 1993
Abstract

Cited by 42 (5 self)
Adaptive local grid refinement/coarsening results in unequal distribution of workload among the processors of a parallel system. A novel method for balancing the load in cases of dynamically changing tetrahedral grids is developed. The approach employs local exchange of cells among processors in order to redistribute the load equally. An important part of the load balancing algorithm is the method employed by a processor to determine which cells within its subdomain are to be exchanged. Two such methods are presented and compared. The strategy for load balancing is based on the Divide-and-Conquer approach, which leads to an efficient parallel algorithm. This method is implemented on a distributed-memory MIMD system.
1 Introduction
Computational fluid dynamics (CFD) has advanced rapidly over the last two decades and it is recognized as a...
Parallel Decomposition of Unstructured FEM-Meshes
 Concurrency: Practice & Experience
, 1995
Abstract

Cited by 42 (14 self)
We present a massively parallel algorithm for static and dynamic partitioning of unstructured FEM-meshes. The method consists of two parts. First, a fast but inaccurate sequential clustering is determined, which is used, together with a simple mapping heuristic, to map the mesh initially onto the processors of a massively parallel system. The second part of the method uses a massively parallel algorithm to remap and optimize the mesh decomposition, taking several cost functions into account. It first calculates the number of nodes that have to be migrated between pairs of clusters in order to obtain an optimal load balance. In a second step, nodes to be migrated are chosen according to cost functions optimizing the amount of necessary communication and other measures which are important for the numerical solution method (for example, the aspect ratio of the resulting domains). The parallel parts of the method are implemented in C under Parix to run on the Parsytec GCel systems. R...