Parallel Software for Inductance Extraction
The next generation VLSI circuits will be designed with millions of densely packed interconnect segments on a single chip. Inductive effects between these segments begin to dominate signal delay as the clock frequency is increased. Modern parasitic extraction tools to estimate the on-chip inductive effects with high accuracy have had limited impact due to large computational and storage requirements. This paper describes a parallel software package for inductance extraction called ParIS, which is capable of analyzing interconnect configurations involving several conductors within reasonable time. The main component of the software is a novel preconditioned iterative method that is used to solve a dense complex linear system of equations. The linear system represents the inductive coupling between filaments that are used to discretize the conductors. A variant of the Fast Multipole Method is used to compute dense matrix-vector products with the coefficient matrix. ParIS uses a two-tier parallel formulation that allows mixed mode parallelization using both MPI and OpenMP. An MPI process is associated with each conductor. The computation within a conductor is parallelized using OpenMP. The parallel efficiency and scalability of the software are demonstrated through experiments on the IBM p690 and Intel and AMD Linux clusters. These experiments highlight the portability and efficiency of the software on multiprocessors with shared, distributed, and distributed-shared memory architectures.
Definition of a New Circular Space-Filling Curve βΩ-Indexing
, 2002
This technical report presents the definition of a circular Hilbert-like space-filling curve. Preliminary evaluations in a simulation environment have shown good locality preserving properties. The results are compared with known bounds for other indexing schemes: Hilbert-, Lebesgue-, and H-Indexing. We evaluated partitions induced by the indexing schemes and used the diameter and the surface as measures. For both we present worst case and average case results.
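As a rough illustration of how an indexing scheme induces partitions that can be scored by diameter and surface, the sketch below uses the standard Hilbert index on a small grid, since the βΩ curve itself is not defined in the abstract. The function names, the Euclidean diameter, and the edge-count surface are assumptions made for this example, not the report's definitions.

```python
from itertools import combinations
from math import dist


def hilbert_index(n, x, y):
    """Hilbert index of cell (x, y) in an n x n grid, n a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def partition_by_curve(n, parts):
    """Split the grid into `parts` contiguous index ranges of equal size.

    Assumes `parts` divides n * n (hypothetical simplification).
    """
    cells = sorted(((x, y) for x in range(n) for y in range(n)),
                   key=lambda c: hilbert_index(n, c[0], c[1]))
    size = len(cells) // parts
    return [cells[i * size:(i + 1) * size] for i in range(parts)]


def diameter(part):
    """Largest Euclidean distance between two cells of a partition."""
    return max((dist(a, b) for a, b in combinations(part, 2)), default=0.0)


def surface(part):
    """Number of cell sides that face a cell outside the partition."""
    inside = set(part)
    return sum((x + dx, y + dy) not in inside
               for x, y in part
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))


parts = partition_by_curve(4, 4)
for p in parts:
    print(p, diameter(p), surface(p))
```

On a 4 x 4 grid split into four parts, each part is one 2 x 2 quadrant, which is the kind of compact partition a locality-preserving curve is expected to produce.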
Bottom-Up Construction and 2:1 Balance Refinement of Linear Octrees in Parallel
SIAM J. Sci. Comput., Vol. 30, No. 5, pp. 2675–2708, © 2008 Society for Industrial and Applied Mathematics
Abstract. In this article, we propose new parallel algorithms for the construction and 2:1 balance refinement of large linear octrees on distributed memory machines. Such octrees are used in many problems in computational science and engineering, e.g., object representation, image analysis, unstructured meshing, finite elements, adaptive mesh refinement, and N-body simulations. Fixed-size scalability and isogranular analysis of the algorithms using an MPI-based parallel implementation was performed on a variety of input data and demonstrated good scalability for different processor counts (1 to 1024 processors) on the Pittsburgh Supercomputing Center’s TCS-1 AlphaServer. The results are consistent for different data distributions. Octrees with over a billion octants were constructed and balanced in less than a minute on 1024 processors. Like other existing algorithms for constructing and balancing octrees, our algorithms have O(N log N) work and O(N) storage complexity. Under reasonable assumptions on the distribution of octants and the work per octant, the parallel time complexity is $O\left(\frac{N}{n_p}\log\frac{N}{n_p} + n_p \log n_p\right)$, where $N$ is the size of the final linear octree and $n_p$ is the number of processors.
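For readers unfamiliar with the term, a linear octree is commonly stored as a list of leaf octants sorted by a space-filling-curve key. The sketch below assumes the Morton (Z-order) key and a fixed refinement level, neither of which is stated in the abstract; it is a serial illustration only and does not implement the paper's parallel construction or 2:1 balancing. The names `morton3d` and `linear_octree_leaves` are hypothetical.

```python
def morton3d(x, y, z, bits):
    """Interleave the low `bits` bits of x, y, z into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def linear_octree_leaves(points, level):
    """Sorted Morton keys of the non-empty leaves at a fixed level.

    Points are assumed to lie in the unit cube [0, 1]^3.
    """
    n = 1 << level  # cells per axis at this level
    keys = set()
    for px, py, pz in points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        ix = min(int(px * n), n - 1)
        iy = min(int(py * n), n - 1)
        iz = min(int(pz * n), n - 1)
        keys.add(morton3d(ix, iy, iz, level))
    return sorted(keys)


points = [(0.1, 0.1, 0.1), (0.9, 0.1, 0.1), (0.1, 0.9, 0.9), (0.12, 0.11, 0.1)]
leaves = linear_octree_leaves(points, level=1)
print(leaves)
```

Because the leaves are kept in sorted key order, a distributed version can assign each processor a contiguous key range, which is one reason sorted representations are popular for parallel octrees.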
Mathematical and Numerical Aspects of the Adaptive Fast Multipole Poisson-Boltzmann Solver
Abstract. This paper summarizes the mathematical and numerical theories and computational elements of the adaptive fast multipole Poisson-Boltzmann (AFMPB) solver. We introduce and discuss the following components in order: the Poisson-Boltzmann model, boundary integral equation reformulation, surface mesh generation, the node-patch discretization approach, Krylov iterative methods, the new version of fast multipole methods (FMMs), and a dynamic prioritization technique for scheduling parallel operations. For each component, we also remark on feasible approaches for further improvements in efficiency, accuracy and applicability of the AFMPB solver to large-scale long-time molecular dynamics simulations. The potential of the solver is demonstrated with preliminary numerical results.
Parallel Performance of Hierarchical Multipole Algorithms for Inductance Extraction
Abstract. Parasitic extraction techniques are used to estimate signal delay in VLSI chips. Inductance extraction is a critical component of the parasitic extraction process in which on-chip inductive effects are estimated with high accuracy. In earlier work [1], we described a parallel software package for inductance extraction called ParIS, which uses a novel preconditioned iterative method to solve the dense, complex linear system of equations arising in these problems. The most computationally challenging task in ParIS involves computing dense matrix-vector products efficiently via hierarchical multipole-based approximation techniques. This paper presents a comparative study of two such techniques: a hierarchical algorithm called Hierarchical Multipole Method (HMM) and the well-known Fast Multipole Method (FMM). We investigate the performance of parallel MPI-based implementations of these algorithms on a Linux cluster. We analyze the impact of various algorithmic parameters and identify regimes where HMM is expected to outperform FMM on uniprocessor as well as multiprocessor platforms.
Scalable Fast Multipole Methods on Distributed Heterogeneous Architectures
We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a timestepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters. ACM computing classification: C.1.2 [Multiple Data Stream Architectures]: Parallel processors; C.1.m [Miscellaneous]:
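To make the term "N-body sum" concrete, the sketch below shows the direct O(N^2) evaluation that an FMM is designed to approximate at lower cost. The 1/r kernel and the name `direct_potentials` are assumptions for illustration; the abstract does not state which kernel the authors use, and nothing here reflects their CPU-GPU implementation.

```python
from math import dist


def direct_potentials(positions, charges):
    """Potential at each particle due to all others, using a 1/r kernel.

    This is the brute-force reference: every pair is visited once per
    target, so the cost grows quadratically with the particle count.
    """
    n = len(positions)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:  # skip the singular self-interaction
                phi[i] += charges[j] / dist(positions[i], positions[j])
    return phi


positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
charges = [1.0, 3.0]
print(direct_potentials(positions, charges))
```

A direct sum like this is typically kept as a correctness check for small inputs, while the hierarchical method handles the large particle counts quoted above.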

