## Dynamic Load Balancing in Computational Mechanics

Venue: Computer Methods in Applied Mechanics and Engineering

Citations: 34 (2 self)

### BibTeX

@INPROCEEDINGS{Hendrickson_dynamicload,

author = {Bruce Hendrickson and Karen Devine},

title = {Dynamic Load Balancing in Computational Mechanics},

booktitle = {Computer Methods in Applied Mechanics and Engineering},

year = {},

pages = {485--500}

}

### Abstract

In many important computational mechanics applications, the computation adapts dynamically during the simulation. Examples include adaptive mesh refinement, particle simulations and transient dynamics calculations. When running these kinds of simulations on a parallel computer, the work must be assigned to processors in a dynamic fashion to keep the computational load balanced. A number of approaches have been proposed for this dynamic load balancing problem. This paper reviews the major classes of algorithms, and discusses their relative merits on problems from computational mechanics. Shortcomings in the state-of-the-art are identified and suggestions are made for future research directions.

Key words. dynamic load balancing, parallel computer, adaptive mesh refinement

1. Introduction. The efficient use of a parallel computer requires two, often competing, objectives to be achieved. First, the processors must be kept busy doing useful work. And second, the amount of interprocess...

### Citations

1050 | An efficient heuristic procedure for partitioning graphs
- Kernighan, Lin
- 1970
Citation Context ...ing processors, optimizing the shapes of the subdomains, or combinations of these goals. Many load-balancing algorithms use a version of the gain criteria from the algorithm by Kernighan and Lin (KL) [33] to select objects to transfer (e.g., [11, 24, 56, 61, 63]). For each of a processor's objects, the gain of transferring the objects to another processor is computed. For example, to minimize edge-cut...
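The gain computation mentioned in this context can be illustrated with a minimal sketch. It assumes the edge-cut objective: the gain of moving a vertex is the weight of its edges into the destination part minus the weight of its edges inside its current part. The function names `kl_gain` and `best_move` and the dictionary-based graph layout are hypothetical choices for illustration, not taken from the cited papers.

```python
def kl_gain(adj, part, v, dest):
    """Edge-cut reduction if vertex v moves from part[v] to dest.

    adj:  dict vertex -> dict neighbor -> edge weight (assumed layout)
    part: dict vertex -> current part id
    """
    src = part[v]
    if dest == src:
        return 0
    # Edges to the destination stop being cut after the move.
    external = sum(w for u, w in adj[v].items() if part[u] == dest)
    # Edges to the current part become cut after the move.
    internal = sum(w for u, w in adj[v].items() if part[u] == src)
    return external - internal


def best_move(adj, part, src, dest):
    """Return the vertex in src whose transfer to dest has the largest gain."""
    candidates = [v for v in adj if part[v] == src]
    if not candidates:
        return None
    return max(candidates, key=lambda v: kl_gain(adj, part, v, dest))
```

A positive gain means the move lowers the edge cut; a load balancer would typically combine this with a balance constraint before transferring anything.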

794 | A fast and high quality multilevel scheme for partitioning irregular graphs
- Karypis, Kumar
- 1998
Citation Context ...nsight is generalized to apply to a class of graph partitioning algorithms, including the multilevel methods described below. The most popular static partitioning algorithms are multilevel techniques [6, 24, 30]. These methods construct a sequence of smaller and smaller approximations to the graph. The smallest graph in this sequence is partitioned. Then this partition is propagated back through the intermed...

488 | Partitioning sparse matrices with eigenvectors of graphs
- Pothen, Simon, et al.
- 1990
Citation Context ...titions. In situations where they are applied infrequently, they can be very useful partitioners. One of the more popular static partitioning algorithms is known as Recursive Spectral Bisection (RSB) [43, 48]. This approach uses an eigenvector of a matrix associated with the graph to partition the vertices. Although it usually produces high quality partitions, the eigenvector calculation is very expensive...

339 | Dynamic load balancing for distributed memory multiprocessors
- Cybenko
- 1989
Citation Context ... many methods have been suggested. 3.4.1. Determining Work Flow. To determine the flow of work between processors, a diffusive algorithm is often used. These algorithms were first proposed by Cybenko [9]. In their simplest form, they model the processor work loads by the heat equation $\partial u / \partial t = \alpha \nabla^2 u$ (3.1) where $u$ is the work load and $\alpha$ is a diffusion coefficient. Using the processors' hardware conn...
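As a rough illustration of the diffusive idea in this context, the sketch below applies a first-order explicit discretization of the heat equation on a processor graph: in each sweep, every processor exchanges a fraction `alpha` of the load difference with each neighbor. The names `diffusion_step` and `diffuse`, the uniform coefficient, the symmetric neighbor lists, and the stopping tolerance are all assumptions made for the example; `alpha` is assumed small enough that each processor keeps a non-negative share of its own load.

```python
def diffusion_step(load, neighbors, alpha):
    """One explicit diffusion sweep over the processor graph."""
    new_load = dict(load)
    for i, nbrs in neighbors.items():
        for j in nbrs:
            # Work flows from the more loaded to the less loaded processor.
            new_load[i] += alpha * (load[j] - load[i])
    return new_load


def diffuse(load, neighbors, alpha=0.25, tol=1e-6, max_iters=10_000):
    """Repeat sweeps until the load spread falls below tol.

    Returns the final loads and the number of sweeps performed.
    """
    for it in range(max_iters):
        if max(load.values()) - min(load.values()) <= tol:
            return load, it
        load = diffusion_step(load, neighbors, alpha)
    return load, max_iters
```

Because every exchange is symmetric, the total work is conserved; only its distribution changes from sweep to sweep.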

289 | Partitioning of unstructured problems for parallel processing
- Simon
- 1991
Citation Context ...tion. The approach runs quite quickly, and is fairly simple to implement. An alternative approach which also uses cutting planes is the Recursive Inertial Bisection (RIB) algorithm described by Simon [48]. This algorithm doesn't confine its cutting planes to be orthogonal to an axis. Instead, it tries to find a naturally long direction in the object distribution automatically, using a mechanical princ...

233 | A partitioning strategy for nonuniform problems on multiprocessors
- Berger
- 1987
Citation Context ...dividing the domain using lines or planes. The simplest such method is known as Recursive Coordinate Bisection (RCB), and was first proposed as a static load balancing algorithm by Berger and Bokhari [5]. The name of the algorithm is due to the use of cutting planes that are orthogonal to one of the coordinate axes. The algorithm takes the geometric locations of all the objects, determines in which c...
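A minimal sketch of Recursive Coordinate Bisection follows. Since the excerpt is truncated before it states how the cutting direction is chosen, the sketch assumes the common choice of cutting orthogonal to the coordinate direction with the largest extent, and it assumes unit-weight objects. The function name `rcb` and the proportional split for part counts that are not powers of two are illustrative assumptions.

```python
def rcb(points, num_parts):
    """Split points (tuples of coordinates) into num_parts lists by recursive cuts."""
    if num_parts == 1:
        return [list(points)]
    if not points:
        return [[] for _ in range(num_parts)]
    dims = len(points[0])

    def extent(d):
        return max(p[d] for p in points) - min(p[d] for p in points)

    # Cut orthogonal to the coordinate axis along which the set is longest.
    axis = max(range(dims), key=extent)
    ordered = sorted(points, key=lambda p: p[axis])
    # Give each side a share of objects proportional to its number of parts.
    left_parts = num_parts // 2
    cut = len(ordered) * left_parts // num_parts
    return rcb(ordered[:cut], left_parts) + rcb(ordered[cut:], num_parts - left_parts)
```

For example, `rcb([(x, y) for x in range(8) for y in range(2)], 4)` yields four parts of four points each, every part covering two adjacent columns of the grid.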

189 | The Chaco user’s guide: Version 2.0
- Hendrickson, Leland
- 1995
Citation Context ...ence is partitioned. Then this partition is propagated back through the intermediate graphs, periodically being refined. Although good sequential implementations of this algorithm have been developed [23, 30], parallel implementations have proved to be quite difficult. Both the construction of the smaller approximations and the refinement operation are challenging to parallelize. These challenges have bee...

165 | Automatic three dimensional mesh generation by the modified-octree technique
- Yerry, Shephard
- 1984
Citation Context ...hen a geometric region is divided, each of the eight octants becomes a child of the vertex representing the region. This data structure is widely used for mesh generation and adaptive mesh refinement [2, 11, 36, 47]. Octrees can also be used for partitioning. A traversal of the tree defines a global ordering on the tree leaves, which correspond to individual objects. This ordered list can then be sliced to gener...

155 | Strategies for dynamic load balancing on highly parallel computers
- Willebeek-LeMair, Reeves
- 1993
Citation Context ...equire operating system support and be less portable than synchronous implementations. Other implementations include message checking within the application, complicating the logic of the application [18, 62]. Since many parallel scientific applications have natural synchronization points, synchronous algorithms can often be used successfully, without the added complication of asynchronous algorithms. Man...

147 | A parallel hashed oct-tree n-body algorithm
- Warren, Salmon
- 1993

117 | ParMETIS: Parallel graph partitioning and sparse matrix ordering library. Version 2
- Karypis, Schloegel, et al.
- 2003
Citation Context ... quite difficult. Both the construction of the smaller approximations and the refinement operation are challenging to parallelize. These challenges have been addressed in two recent efforts: ParMETIS [31, 32, 46] and JOSTLE [54, 55]. These tools essentially perform a local improvement like those described in §3.4, but they use a multilevel approach to select which objects to move. This makes them more powerfu...

102 | Geometric mesh partitioning: Implementation and experiments
- Gilbert, Miller, et al.
Citation Context ...nd for all well-shaped meshes. Unlike the result for cutting with planes, this proof does not require any additional constraints on the sizes of mesh elements. Experiments by Gilbert, Miller and Teng [28] show that this approach can generate partitions of quality comparable to those produced by graph partitioning algorithms. However, this algorithm is considerably more complex and expensive than the s...

92 | A unified geometric approach to graph separators
- Miller, Teng, et al.
- 1991
Citation Context ...ion to applications which possess geometric locality. A more sophisticated partitioning approach developed by Miller, Teng and Vavasis uses circles or spheres instead of planes to divide the geometry [35]. The interior of the sphere is one partition while the exterior is the other. This algorithm has very attractive theoretical properties -- it comes within a constant factor of the best possible bound...

77 | Quality mesh generation in three dimensions
- Mitchell, Vavasis
- 1992
Citation Context ...hen a geometric region is divided, each of the eight octants becomes a child of the vertex representing the region. This data structure is widely used for mesh generation and adaptive mesh refinement [2, 11, 36, 47]. Octrees can also be used for partitioning. A traversal of the tree defines a global ordering on the tree leaves, which correspond to individual objects. This ordered list can then be sliced to gener...

70 | A heuristic for reducing fill in sparse matrix factorization
- Bui, Jones
- 1993
Citation Context ...nsight is generalized to apply to a class of graph partitioning algorithms, including the multilevel methods described below. The most popular static partitioning algorithms are multilevel techniques [6, 24, 30]. These methods construct a sequence of smaller and smaller approximations to the graph. The smallest graph in this sequence is partitioned. Then this partition is propagated back through the intermed...

69 | A Practical Approach to Dynamic Load Balancing
- Watts, Taylor
- 1998
Citation Context ... been proposed to accelerate the convergence of diffusion methods. For example, diffusion methods have been used with parallel multilevel methods [26, 46, 54, 55], as described in §3.5. Watts, et al. [58, 59] propose using a second-order implicit finite discretization of Eq. (3.1) to compute work transfers. This scheme converges to global balance in fewer iterations, but requires a bit more work and commu...

49 | A multi-level diffusion method for dynamic load balancing
- Horton
- 1993
Citation Context ...pattern; see [9, 14] for details. Several methods have been proposed to accelerate the convergence of diffusion methods. For example, diffusion methods have been used with parallel multilevel methods [26, 46, 54, 55], as described in §3.5. Watts, et al. [58, 59] propose using a second-order implicit finite discretization of Eq. (3.1) to compute work transfers. This scheme converges to global balance in fewer iter...

46 | Mapping unstructured grid computations to massively parallel computers
- Hammond
- 1992
Citation Context ...the number of collisions. However, thrashing may occur over several iterations; additional stopping criteria are needed to end the iterations when the cost of the partition has not decreased. Hammond [22] performs pairwise exchanges of objects between pairs of processors to improve an existing decomposition. The processor graph is edge-colored to allow parallel computation between pairs of adjacent pr...

45 | Adaptive local refinement with octree load balancing for the parallel solution of three-dimensional conservation laws
- Flaherty, Loy, et al.
- 1997
Citation Context ...obal ordering on the tree leaves, which correspond to individual objects. This ordered list can then be sliced to generate any number of partitions. This basic algorithm is called Octree Partitioning [10, 19], or Space-Filling Curve (SFC) Partitioning. This approach was first used by Warren and Salmon for gravitational simulations [57]. Patra and Oden were the first to apply it to adaptive mesh refinement...
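The "order the leaves, then slice the list" idea in this context can be sketched as follows. The example assumes integer cell coordinates and uses a Morton (Z-order) key, which matches a depth-first octree traversal with a fixed child order; other curves, such as a Hilbert curve, could be substituted. The function names, the `bits` parameter, and the unit-weight slicing are assumptions for illustration.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three integer cell coordinates (Z-order)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def sfc_partition(cells, num_parts, bits=10):
    """Sort cells along the curve, then slice the list into contiguous chunks."""
    ordered = sorted(cells, key=lambda c: morton_key(*c, bits=bits))
    n = len(ordered)
    # Chunk boundaries are spread as evenly as integer division allows.
    return [ordered[n * k // num_parts: n * (k + 1) // num_parts]
            for k in range(num_parts)]
```

On a 4 x 4 x 4 grid split into eight parts, each part comes out as one 2 x 2 x 2 octant, which shows how contiguous stretches of the curve stay geometrically compact.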

42 | Parallel decomposition of unstructured FEM-meshes
- Diekmann, Meyer, et al.
- 1998
Citation Context ...ver, criteria other than subdomain interface size become important. For domain decomposition linear solvers, for example, the aspect ratio of the subdomains affects the convergence of the solvers. In [12, 53], the cost function to be minimized is a weighted combination of the load imbalance and the subdomain aspect ratio. Thus, objects whose coordinates are farthest from the average coordinates of all the...

40 | An optimal dynamic load balancing algorithm
- Hu, Blake
- 1995
Citation Context ...etization of Eq. (3.1) to compute work transfers. This scheme converges to global balance in fewer iterations, but requires a bit more work and communication per iteration. The method of Hu and Blake [27] is used in several parallel decomposition packages [46, 56]. They compute a diffusion solution while minimizing the flow of work over the edges of the processor graph, enforcing incrementality. To co...

38 | Load balancing for the parallel adaptive solution of partial differential equations
- deCougny, Devine, et al.
- 1994
Citation Context ...obal ordering on the tree leaves, which correspond to individual objects. This ordered list can then be sliced to generate any number of partitions. This basic algorithm is called Octree Partitioning [10, 19], or Space-Filling Curve (SFC) Partitioning. This approach was first used by Warren and Salmon for gravitational simulations [57]. Patra and Oden were the first to apply it to adaptive mesh refinement...

36 | PMRSB: Parallel multilevel recursive spectral bisection
- Barnard
- 1995
Citation Context .... Although it usually produces high quality partitions, the eigenvector calculation is very expensive. Barnard tried to circumvent this problem via a parallel implementation [3]. The result is primarily useful for static partitioning, but it can also be used in a dynamic setting. However, the eigenvector calculation is very expensive relative to the geometric methods and the...

32 | A Localised Algorithm for Optimising Unstructured Mesh Partitions
- Walshaw, Cross, et al.
- 1995
Citation Context ...pattern; see [9, 14] for details. Several methods have been proposed to accelerate the convergence of diffusion methods. For example, diffusion methods have been used with parallel multilevel methods [26, 46, 54, 55], as described in §3.5. Watts, et al. [58, 59] propose using a second-order implicit finite discretization of Eq. (3.1) to compute work transfers. This scheme converges to global balance in fewer iter...

30 | Computational results for parallel unstructured mesh computations
- Jones, Plassmann
- 1994
Citation Context ... equal. By adjusting the partition sizes appropriately, any number of equally-sized sets can be created. This freedom to adjust set sizes was exploited by Jones and Plassmann [29] to improve the basic algorithm in their work on adaptive mesh refinement. They call their approach Unbalanced Recursive Bisection, or URB. The basic RCB algorithm divides the set of objects into halv...

29 | Problem decomposition for adaptive hp finite element methods
- Patra, Oden
- 1995
Citation Context ... or Space-Filling Curve (SFC) Partitioning. This approach was first used by Warren and Salmon for gravitational simulations [57]. Patra and Oden were the first to apply it to adaptive mesh refinement [38, 40]. Pilkington and Baden used this approach for smoothed particle hydrodynamics and reported results similar to using Recursive Coordinate Bisection [41]. For applications that don't already have an oct...

28 | Load balancing strategies for distributed memory machines
- Diekmann, Monien, et al.
- 1997
Citation Context ... its own work, thus using up-to-date information in the transfer and reducing the effects of aging. Another diffusion-like algorithm is dimensional exchange, introduced in [9] and analyzed further in [13, 62, 63]. A hypercube architecture is assumed to describe the algorithm. In a loop over hypercube dimensions i, a processor performs load balancing with its neighbor in that dimension, i.e., with the processo...
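The loop over hypercube dimensions described in this context can be simulated sequentially, as in the sketch below. It assumes that the partner of processor `p` in dimension `i` is `p XOR 2**i`, that work is infinitely divisible, and that each pair simply averages its two loads; the excerpt is truncated before it gives these details, so they are labeled assumptions, and the function name `dimension_exchange` is hypothetical.

```python
def dimension_exchange(load):
    """Balance loads on a simulated hypercube of len(load) == 2**d processors."""
    n = len(load)
    if n < 1 or n & (n - 1):
        raise ValueError("number of processors must be a power of two")
    d = n.bit_length() - 1
    load = list(load)
    for i in range(d):
        for p in range(n):
            q = p ^ (1 << i)  # neighbor across dimension i (assumed pairing)
            if p < q:         # handle each pair once
                avg = (load[p] + load[q]) / 2
                load[p] = load[q] = avg
    return load
```

Under these idealized assumptions, one pass over all d dimensions leaves every processor with the global average, for example `dimension_exchange([8, 0, 0, 0, 0, 0, 0, 0])` gives eight loads of 1.0.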

23 | A retrofit based methodology for the fast generation and optimization of large-scale mesh partitions: beyond the minimum interface size criterion
- Vanderstraeten, Farhat, et al.
- 1994
Citation Context ...ver, criteria other than subdomain interface size become important. For domain decomposition linear solvers, for example, the aspect ratio of the subdomains affects the convergence of the solvers. In [12, 53], the cost function to be minimized is a weighted combination of the load imbalance and the subdomain aspect ratio. Thus, objects whose coordinates are farthest from the average coordinates of all the...

22 | A common data management infrastructure for adaptive algorithms for PDE solutions
- Parashar, Browne, et al.
- 1997
Citation Context ...rformance in the computation. A global numbering also simplifies tools which automate a translation from a global numbering scheme to a per-processor scheme. This can simplify application development [17, 39]. To summarize, the runtime and quality of Space Filling Curve Partitioning are roughly comparable to the simple geometric approaches described earlier. SFC is perhaps a bit faster, but a bit lower qu...

21 | Parallel adaptive hp-refinement techniques for conservation laws
- Devine, Flaherty
- 1996
Citation Context ...hen a geometric region is divided, each of the eight octants becomes a child of the vertex representing the region. This data structure is widely used for mesh generation and adaptive mesh refinement [2, 11, 36, 47]. Octrees can also be used for partitioning. A traversal of the tree defines a global ordering on the tree leaves, which correspond to individual objects. This ordered list can then be sliced to gener...

20 | Transient Dynamics Simulations: Parallel Algorithms for Contact Detection and Smoothed Particle Hydrodynamics
- Plimpton, Attaway, et al.
- 1996
Citation Context ...g advantages relative to RIB. First, with RCB and URB the geometric regions owned by a processor are simple rectangular parallelepipeds. This geometric simplicity can be very useful. For instance, in [42] it is used to speed up the determination of which processors' regions intersect an extended object. Second, and more universally, RCB and URB partitions are incremental. If the objects move a small a...

20 | Multilevel diffusion algorithms for repartitioning of adaptive meshes
- Schloegel, Karypis, et al.
- 1997
Citation Context ...pattern; see [9, 14] for details. Several methods have been proposed to accelerate the convergence of diffusion methods. For example, diffusion methods have been used with parallel multilevel methods [26, 46, 54, 55], as described in §3.5. Watts, et al. [58, 59] propose using a second-order implicit finite discretization of Eq. (3.1) to compute work transfers. This scheme converges to global balance in fewer iter...

19 | Distributed load balancing: design and performance analysis
- Leiss, Reddy
- 1989
Citation Context ...e diffusion model. It can increase the size of the maximum work transfer, but can reduce the total number of work transfers per iteration. There are several implementations of the demand-driven model [18, 34, 60, 61, 62]. For example, Leiss and Reddy [34] use a demand-driven model in neighborhoods that follow the hardware connectivity of the parallel machine. Wheat, et al. [61] extend their definition of a neighborho...

17 | Robust geometrically based, automatic two-dimensional mesh generation
- Baehmann, Wittchen, et al.
- 1987

15 | Skewed graph partitioning
- Hendrickson, Leland, Van Driessche
- 1997
Citation Context ...ithm has no mechanism for encouraging a new partition to be an incremental modification of the current one. In [52], Van Driessche and Roose show how RSB can be modified to include incrementality. In [25], this insight is generalized to apply to a class of graph partitioning algorithms, including the multilevel methods described below. The most popular static partitioning algorithms are multilevel tec...

14 | A Parallel Infrastructure for Scalable Adaptive Finite Element Methods and Its Application to Least Squares C∞ Collocation
- Edwards
- 1997
Citation Context ...rformance in the computation. A global numbering also simplifies tools which automate a translation from a global numbering scheme to a per-processor scheme. This can simplify application development [17, 39]. To summarize, the runtime and quality of Space Filling Curve Partitioning are roughly comparable to the simple geometric approaches described earlier. SFC is perhaps a bit faster, but a bit lower qu...

14 | Parallel dynamic graph partitioning for unstructured meshes
- Walshaw, Cross, et al.
- 1997
Citation Context ... more heavily loaded processors to complete their computation before performing the load balancing. Local methods can also be executed synchronously, following the same model as global methods (e.g., [10, 11, 56, 61]). Some local methods, however, can be performed asynchronously. Processors can initiate load balancing when they become idle, requesting work as they need it. Single neighborhoods may perform load ba...

13 | Multithreaded model for the dynamic load-balancing of parallel adaptive pde computations
- Chrisochoides
- 1996
Citation Context ...ifficult to program. Logic must be included to handle interrupts or check for load-balancing messages during the computation. Some implementations use threads to implement asynchronous load-balancing [8, 59, 63]; these implementations may require operating system support and be less portable than synchronous implementations. Other implementations include message checking within the application, complicating ...

12 | Engineering diffusive load balancing algorithms using experiments
- Diekmann, Muthukrishnan, et al.
- 1997
Citation Context ...mma $\sum_j \alpha_{ij} \geq 0$ for every $i$. The choice of $\alpha_{ij}$ affects the convergence rate of the method. It depends on the processor connectivity due to the architecture or application communication pattern; see [9, 14] for details. Several methods have been proposed to accelerate the convergence of diffusion methods. For example, diffusion methods have been used with parallel multilevel methods [26, 46, 54, 55], as...

11 | Parallel branch-and-bound methods for mixed-integer programming on the CM-5
- Eckstein
- 1994
Citation Context ...ighly effective. Examples include Monte Carlo calculations [1], ray tracing [21], visualization [51], and parameter studies in which a sequential code needs to be run repeatedly with different inputs [15]. 3.2. Simple Geometric. Most mechanics calculations have an underlying geometry. And for many physical simulations, objects (e.g., mesh points, particles, etc.) interact only if they are geometricall...

11 | A load balancing technique for multiphase computations
- Watts, Rieffel, et al.
- 1997
Citation Context ...ifficult to program. Logic must be included to handle interrupts or check for load-balancing messages during the computation. Some implementations use threads to implement asynchronous load-balancing [8, 59, 63]; these implementations may require operating system support and be less portable than synchronous implementations. Other implementations include message checking within the application, complicating ...

11 | Experience with automatic, dynamic load balancing and Adaptive Finite Element Computation
- Wheat, Devine, et al.
- 1994
Citation Context ... more heavily loaded processors to complete their computation before performing the load balancing. Local methods can also be executed synchronously, following the same model as global methods (e.g., [10, 11, 56, 61]). Some local methods, however, can be performed asynchronously. Processors can initiate load balancing when they become idle, requesting work as they need it. Single neighborhoods may perform load ba...

10 | A fast spectral partitioner
- Simon, Sohn, et al.
- 1995
Citation Context .... 3.6. Hybrid Methods. Several dynamic load balancing algorithms don't conveniently fall into any of the previous sections. One such method is the dynamic spectral algorithm of Simon, Sohn and Biswas [49]. This approach is a hybrid of spectral methods and simple geometric techniques. As a preprocessing step, a few eigenvectors of a matrix associated with the graph are computed. These k eigenvectors pr...

8 | Partitioning meshes with lines and planes
- Cao, Gilbert, et al.
- 1996
Citation Context ...tively, the use of a line or plane to cut the mesh should help keep the amount of communication small, at least for well shaped meshes. This intuition has been proved correct by Cao, Gilbert and Teng [7], as long as the ratio between the sizes of the largest and smallest mesh elements is bounded. Several partitioning algorithms have been proposed that exploit this idea of recursively dividing the dom...

8 | Dynamic load balancing with a spectral bisection algorithm for the constrained graph partitioning problem
- Driessche, Roose
- 1995
Citation Context ...tric methods and the local improvement schemes discussed above. Also, the basic RSB algorithm has no mechanism for encouraging a new partition to be an incremental modification of the current one. In [52], Van Driessche and Roose show how RSB can be modified to include incrementality. In [25], this insight is generalized to apply to a class of graph partitioning algorithms, including the multilevel me...

7 | A radar simulation program for a 1024-processor hypercube
- Gustafson, Benner, et al.
- 1989
Citation Context ...small. These assumptions are not satisfied by most scientific computations. But when they are, master/slave approaches are highly effective. Examples include Monte Carlo calculations [1], ray tracing [21], visualization [51], and parameter studies in which a sequential code needs to be run repeatedly with different inputs [15]. 3.2. Simple Geometric. Most mechanics calculations have an underlying geom...

6 | A Fine Grained Data Migration Approach to Application Load Balancing on MP MIMD Machines
- Wheat
- 1992
Citation Context ...e diffusion model. It can increase the size of the maximum work transfer, but can reduce the total number of work transfers per iteration. There are several implementations of the demand-driven model [18, 34, 60, 61, 62]. For example, Leiss and Reddy [34] use a demand-driven model in neighborhoods that follow the hardware connectivity of the parallel machine. Wheat, et al. [61] extend their definition of a neighborho...

6 | Decentralized Remapping of Data Parallel Applications
- Xu, Lau, et al.
- 1997
Citation Context ...ectures such as meshes. However, communication for non-hypercube architectures is non-local, as logical neighbors will not necessarily be physical neighbors. The generalized dimension exchange method [63] suggests using an edge-coloring to maintain nearest-neighbor exchanges in non-hypercube architectures; however, it requires more iterations to reach convergence. More importantly, dimensional exchang...

5 | Partitioning with spacefilling curves, CSE
- Pilkington, Baden
- 1994
Citation Context ...irst to apply it to adaptive mesh refinement [38, 40]. Pilkington and Baden used this approach for smoothed particle hydrodynamics and reported results similar to using Recursive Coordinate Bisection [41]. For applications that don't already have an octree, a binning algorithm based on coordinate information can be used to build an octree for the load balancer. Each processor stores a part of the glob...

4 | Dynamic load balancing
- Enbody, Purdy, et al.
- 1995
Citation Context ...equire operating system support and be less portable than synchronous implementations. Other implementations include message checking within the application, complicating the logic of the application [18, 62]. Since many parallel scientific applications have natural synchronization points, synchronous algorithms can often be used successfully, without the added complication of asynchronous algorithms. Man...