Results 1–10 of 23
Spectral Partitioning Works: Planar graphs and finite element meshes
 In IEEE Symposium on Foundations of Computer Science
, 1996
"... Spectral partitioning methods use the Fiedler vectorthe eigenvector of the secondsmallest eigenvalue of the Laplacian matrixto find a small separator of a graph. These methods are important components of many scientific numerical algorithms and have been demonstrated by experiment to work extr ..."
Abstract

Cited by 144 (8 self)
 Add to MetaCart
Spectral partitioning methods use the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the Laplacian matrix) to find a small separator of a graph. These methods are important components of many scientific numerical algorithms and have been demonstrated by experiment to work extremely well. In this paper, we show that spectral partitioning methods work well on bounded-degree planar graphs and finite element meshes, the classes of graphs to which they are usually applied. While naive spectral bisection does not necessarily work, we prove that spectral partitioning techniques can be used to produce separators whose ratio of vertices removed to edges cut is O(√n) for bounded-degree planar graphs and two-dimensional meshes and O(n^(1/d)) for well-shaped d-dimensional meshes. The heart of our analysis is an upper bound on the second-smallest eigenvalues of the Laplacian matrices of these graphs.
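As a concrete (and deliberately naive) illustration of the bisection step described above, the following sketch computes the Fiedler vector with a dense eigensolver and splits at its median. This is a generic textbook variant, not the paper's analyzed method; real codes use sparse iterative eigensolvers.

```python
import numpy as np

def spectral_bisection(adj):
    """Split a graph in two along the median of its Fiedler vector.

    adj: symmetric 0/1 adjacency matrix (dense, for illustration only).
    Returns a boolean mask selecting one side of the bisection.
    """
    laplacian = np.diag(adj.sum(axis=1)) - adj
    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector of the second-smallest eigenvalue: the Fiedler vector.
    _, vecs = np.linalg.eigh(laplacian)
    fiedler = vecs[:, 1]
    # Splitting at the median keeps the two sides (nearly) balanced.
    return fiedler >= np.median(fiedler)

# Example: on a 6-cycle the split removes only 2 edges.
n = 6
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
side = spectral_bisection(adj)
```

On the cycle, any Fiedler vector follows a sampled cosine, so thresholding it selects a contiguous arc and cuts exactly two edges.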
A New Parallel Kernel-Independent Fast Multipole Method
 in SC2003
"... We present a new adaptive fast multipole algorithm and its parallel implementation. The algorithm is kernelindependent in the sense that the evaluation of pairwise interactions does not rely on any analytic expansions, but only utilizes kernel evaluations. The new method provides the enabling techn ..."
Abstract

Cited by 19 (9 self)
 Add to MetaCart
We present a new adaptive fast multipole algorithm and its parallel implementation. The algorithm is kernel-independent in the sense that the evaluation of pairwise interactions does not rely on any analytic expansions, but only utilizes kernel evaluations. The new method provides the enabling technology for many important problems in computational science and engineering. Examples include viscous flows, fracture mechanics and screened Coulombic interactions. Our MPI-based parallel implementation logically separates the computation and communication phases to avoid synchronization in the upward and downward computation passes, and thus allows us to fully exploit computation and communication overlapping. We measure isogranular and fixed-size scalability for a variety of kernels on the Pittsburgh Supercomputing Center's TCS-1 AlphaServer on up to 3000 processors. We have solved viscous flow problems with up to 2.1 billion unknowns and we have achieved 1.6 Tflop/s peak performance and 1.13 Tflop/s sustained performance.
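The paper's kernel-independent FMM uses equivalent-density representations; as a much simpler stand-in that still relies only on kernel evaluations, here is a hypothetical one-dimensional Barnes-Hut-style treecode that approximates a well-separated source cluster by its total charge placed at the cluster center. The kernel and all names are illustrative, not taken from the paper.

```python
import numpy as np

def kernel(x, y):
    # Any smooth radial kernel works; the scheme only evaluates it.
    return 1.0 / (1.0 + abs(x - y))

def treecode(target, xs, qs, theta=0.5):
    """Approximate sum_j qs[j] * kernel(target, xs[j]) hierarchically."""
    if len(xs) <= 2:
        return sum(q * kernel(target, x) for x, q in zip(xs, qs))
    lo, hi = xs.min(), xs.max()
    center = 0.5 * (lo + hi)
    dist = abs(target - center)
    if dist > 0 and (hi - lo) / dist < theta:
        # Cluster is well separated from the target: one kernel
        # evaluation stands in for the whole cluster.
        return qs.sum() * kernel(target, center)
    left = xs <= center
    if left.all() or not left.any():   # all points coincide: go direct
        return sum(q * kernel(target, x) for x, q in zip(xs, qs))
    return (treecode(target, xs[left], qs[left], theta)
            + treecode(target, xs[~left], qs[~left], theta))

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, 500)
qs = rng.uniform(0.5, 1.0, 500)   # positive charges for a stable check
exact = sum(q * kernel(2.0, x) for x, q in zip(xs, qs))
approx = treecode(2.0, xs, qs)    # far target: few kernel evaluations
```

Shrinking `theta` trades kernel evaluations for accuracy, the same knob that controls the accuracy/cost balance in full FMM codes.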
Graph partitioning and continuous quadratic programming
 SIAM J. Discrete Math
, 1999
"... Abstract. A continuous quadratic programming formulation is given for mincut graph partitioning problems. In these problems, we partition the vertices of a graph into a collection of disjoint sets satisfying specified size constraints, while minimizing the sum of weights of edges connecting vertice ..."
Abstract

Cited by 18 (8 self)
 Add to MetaCart
Abstract. A continuous quadratic programming formulation is given for min-cut graph partitioning problems. In these problems, we partition the vertices of a graph into a collection of disjoint sets satisfying specified size constraints, while minimizing the sum of weights of edges connecting vertices in different sets. An optimal solution is related to an eigenvector (Fiedler vector) corresponding to the second smallest eigenvalue of the graph’s Laplacian. Necessary and sufficient conditions characterizing local minima of the quadratic program are given. The effect of diagonal perturbations on the number of local minimizers is investigated using a test problem from the literature.
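The link between the Laplacian and cut weight that underlies such quadratic formulations is the identity x^T L x = Σ_{(i,j)∈E} w_ij (x_i − x_j)², which for a ±1 partition indicator equals four times the weight of the cut. A quick numerical check (the graph and weights are made up for illustration):

```python
import numpy as np

# Weighted adjacency of a small 4-vertex graph (symmetric, zero diagonal).
W = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 3.],
              [1., 0., 0., 1.],
              [0., 3., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W           # graph Laplacian

# Encode the partition {0, 1} vs {2, 3} as a +/-1 indicator vector.
x = np.array([1., 1., -1., -1.])

quad = x @ L @ x                          # Laplacian quadratic form
cut_weight = sum(W[i, j] for i in range(4) for j in range(i)
                 if x[i] != x[j])         # edges crossing the partition
# quad == 4 * cut_weight; relaxing x to real values under size
# constraints leads to the Fiedler-vector connection mentioned above.
```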
Fast Multipole Methods on Graphical Processors
 Journal of Computational Physics
"... The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O (N) compared to the direct method with complexity O(N 2), whic ..."
Abstract

Cited by 17 (5 self)
 Add to MetaCart
The Fast Multipole Method allows the rapid evaluation of sums of radial basis functions centered at points distributed inside a computational domain at a large number of evaluation points to a specified accuracy ɛ. The method scales as O(N), compared to the direct method with complexity O(N²), which allows one to solve larger-scale problems. Graphical processing units (GPUs) are now increasingly viewed as data-parallel compute coprocessors that can provide significant computational performance at low price. We describe acceleration of the FMM using the data-parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We describe strategies for parallelization of all components of the FMM, develop a model to explain the performance of the algorithm on GPU architectures, and determine optimal settings for the FMM on the GPU, which are different from those on usual CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtained accelerations of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU in the range 30–60 compared to a serial CPU implementation for benchmark cases of size up to a million. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving the same problem at a 24–43 Teraflop rate if we used straightforward summation.
Dynamic compressed hyperoctrees with application to the N-body problem
 In Proc. 19th Conf
, 1999
"... Abstract. Hyperoctree is a popular data structure for organizing multidimensional point data. The main drawback of this data structure is that its size and the runtime of operations supported by it are dependent upon the distribution of the points. Clarkson rectified the distributiondependency in t ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
Abstract. The hyperoctree is a popular data structure for organizing multidimensional point data. The main drawback of this data structure is that its size and the runtime of the operations it supports depend upon the distribution of the points. Clarkson rectified the distribution dependency in the size of hyperoctrees by introducing compressed hyperoctrees. He presents an O(n log n) expected-time randomized algorithm to construct a compressed hyperoctree. In this paper, we give three deterministic algorithms to construct a compressed hyperoctree in O(n log n) time, for any fixed dimension d. We present O(log n) algorithms for point and cubic region searches, and for point insertions and deletions. We propose a solution to the N-body problem in O(n) time, given the tree. Our algorithms also reduce the runtime dependency on the number of dimensions.
On the Quality of Partitions Based on Space-Filling Curves
, 2002
"... This paper presents bounds on the quality of partitions induced by spacefilling curves. We compare the surface that surrounds an arbitrary index range with the optimal partition in the grid, i. e. the square. It is shown that partitions induced by Lebesgue and Hilbert curves behave about 1.85 times ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
This paper presents bounds on the quality of partitions induced by space-filling curves. We compare the surface that surrounds an arbitrary index range with the optimal partition in the grid, i.e., the square. It is shown that partitions induced by the Lebesgue and Hilbert curves behave about 1.85 times worse with respect to the length of the surface. The Lebesgue indexing gives better results than the Hilbert indexing in a worst-case analysis. Furthermore, the surfaces of partitions based on the Lebesgue indexing are at most 3 times larger than the optimum in the average case.
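The Lebesgue (Z-order, or Morton) indexing studied here amounts to interleaving the bits of the cell coordinates; a minimal sketch (the function name is ours, not the paper's):

```python
def lebesgue_index(x, y, bits=16):
    """Morton/Z-order index of a 2-D grid cell: the Lebesgue curve
    visits cells in increasing order of this interleaved-bit value."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)        # x bits -> even positions
        z |= ((y >> b) & 1) << (2 * b + 1)    # y bits -> odd positions
    return z

# Order the cells of a 4x4 grid along the curve; a partition in the
# paper's sense is any contiguous range of these indices.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: lebesgue_index(*c))
```

The first four cells visited are (0,0), (1,0), (0,1), (1,1): the characteristic Z shape that the curve repeats recursively.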
Min-Max-Boundary Domain Decomposition
 Theor. Comput. Sci
, 1998
"... Domain decomposition is one of the most effective and popular parallel computing techniques for solving large scale numerical systems. In the special case when the amount of computation in a subdomain is proportional to the volume of the subdomain, domain decomposition amounts to minimizing the surf ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
Domain decomposition is one of the most effective and popular parallel computing techniques for solving large-scale numerical systems. In the special case when the amount of computation in a subdomain is proportional to the volume of the subdomain, domain decomposition amounts to minimizing the surface area of each subdomain while dividing the volume evenly. Motivated by this fact, we study the following min-max boundary multiway partitioning problem: Given a graph G and an integer k > 1, we would like to divide G into k subgraphs G_1, …, G_k (by removing edges) such that (i) |G_i| = Θ(|G|/k) for all i ∈ {1, …, k}; and (ii) the maximum boundary size of any subgraph (the set of edges connecting it with other subgraphs) is minimized. We provide an algorithm that, given G, a well-shaped mesh in d dimensions, finds a partition of G into k subgraphs G_1, …, G_k such that, for all i, G_i has Θ(|G|/k) vertices and the number of edges connecting G_i with the ot...
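A toy check of the kind of bound this problem targets (not the paper's algorithm): cutting the n×n grid graph into square blocks gives every subgraph Θ(|G|/k) vertices and boundary O((|G|/k)^(1/2)), i.e. O((|G|/k)^(1−1/d)) with d = 2. All names here are illustrative.

```python
def block_partition_max_boundary(n, s):
    """Partition the n x n grid graph into s x s blocks (s divides n)
    and return the largest boundary size over all k = (n/s)^2 blocks."""
    assert n % s == 0
    b = n // s                       # blocks per grid side
    worst = 0
    for bx in range(b):
        for by in range(b):
            # Each side shared with a neighboring block contributes s
            # cut edges; interior blocks have four such sides.
            sides = (bx > 0) + (bx < b - 1) + (by > 0) + (by < b - 1)
            worst = max(worst, sides * s)
    return worst

# 16x16 grid split into k = 16 blocks of 4x4: |G|/k = 16 vertices per
# block, and the worst boundary is 4 * 4 = 16 = 4 * (|G|/k)^(1/2) edges.
max_boundary = block_partition_max_boundary(16, 4)
```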
Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures
"... We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPUGPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divideandconquer algorithm that per ..."
Abstract

Cited by 4 (3 self)
 Add to MetaCart
We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single-node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters. ACM computing classification: C.1.2 [Multiple Data Stream Architectures]: Parallel processors; C.1.m [Miscellaneous]:
Average Case Quality of Partitions Induced by the Lebesgue Indexing
, 2001
"... This paper presents the quality of partitions induced by the Lebesgue curve in average case. The surface that surrounds an arbitrary index range is compared with the optimal partition in the grid, i. e. the square. The upper bound on the surface is asymptotically 3 times the optimal size. ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
This paper presents the quality of partitions induced by the Lebesgue curve in the average case. The surface that surrounds an arbitrary index range is compared with the optimal partition in the grid, i.e., the square. The upper bound on the surface is asymptotically 3 times the optimal size.
Parallel Software for Inductance Extraction
"... The next generation VLSI circuits will be designed with millions of densely packed interconnect segments on a single chip. Inductive effects between these segments begin to dominate signal delay as the clock frequency is increased. Modern parasitic extraction tools to estimate the onchip inductive ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
The next generation of VLSI circuits will be designed with millions of densely packed interconnect segments on a single chip. Inductive effects between these segments begin to dominate signal delay as the clock frequency is increased. Modern parasitic extraction tools to estimate the on-chip inductive effects with high accuracy have had limited impact due to large computational and storage requirements. This paper describes a parallel software package for inductance extraction called ParIS, which is capable of analyzing interconnect configurations involving several conductors within reasonable time. The main component of the software is a novel preconditioned iterative method that is used to solve a dense complex linear system of equations. The linear system represents the inductive coupling between filaments that are used to discretize the conductors. A variant of the Fast Multipole Method is used to compute dense matrix-vector products with the coefficient matrix. ParIS uses a two-tier parallel formulation that allows mixed-mode parallelization using both MPI and OpenMP. An MPI process is associated with each conductor. The computation within a conductor is parallelized using OpenMP. The parallel efficiency and scalability of the software is demonstrated through experiments on the IBM p690 and Intel and AMD Linux clusters. These experiments highlight the portability and efficiency of the software on multiprocessors with shared, distributed, and distributed-shared memory architectures.