On Reducing TLB Misses in Matrix Multiplication
, 2002
Cited by 6 (2 self)
During the last decade, a number of projects have pursued the high-performance implementation of matrix multiplication. Typically, these projects organize the computation around an "inner kernel," C = A^T B + C, that keeps one of the operands in the L1 cache, while streaming parts of the other operands through that cache. Variants include approaches that extend this principle to multiple levels of cache or that apply the same principle to the L2 cache while essentially ignoring the L1 cache. The intent is to optimally amortize the cost of moving data between memory layers.
The approach proposed in this paper is fundamentally different. We start by observing that for current generation architectures, much of the overhead comes from Translation Look-aside Buffer (TLB) table misses. While the importance of caches is also taken into consideration, it is the minimization of such TLB misses that drives the approach. The result is a novel approach that achieves highly competitive performance on a broad spectrum of current high-performance architectures.
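The conventional inner kernel mentioned in the abstract, C = A^T B + C with one operand block kept resident while the others stream past it, can be sketched roughly as follows. This is a minimal pure-Python illustration, not the paper's TLB-aware method; the function name `gemm_at_b` and the `block` parameter are assumptions made for the example.

```python
def gemm_at_b(A, B, C, block=2):
    """Update C in place with C = A^T B + C, one block of A at a time.

    A is k x m, B is k x n, C is m x n (lists of lists).
    `block` is a hypothetical tile size standing in for "fits in cache".
    """
    k, m, n = len(A), len(A[0]), len(B[0])
    for i0 in range(0, m, block):        # rows of C, i.e. columns of A
        for p0 in range(0, k, block):    # shared (summation) dimension
            # The (p0, i0) block of A is the operand that stays resident
            # while all columns of B and C stream past it.
            for j in range(n):
                for i in range(i0, min(i0 + block, m)):
                    acc = C[i][j]
                    for p in range(p0, min(p0 + block, k)):
                        acc += A[p][i] * B[p][j]
                    C[i][j] = acc
    return C
```

In a real kernel the block size would be chosen from cache and TLB parameters; here it only controls the loop structure.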
A Web Computing Environment for the SLICOT Library
, 2001
Cited by 5 (3 self)
A prototype web computing environment for computations related to the design and analysis of control systems using the SLICOT software library is presented. The web interface can be accessed from a standard World Wide Web browser with no need for additional software installations on the local machine. The environment provides user-friendly access to SLICOT routines, where run-time options are specified by mouse clicks on appropriate buttons. Input data can be entered directly into the web interface by the user or uploaded from a local computer in a standard text format or in Matlab binary format. Output data is presented in the web browser window and can be downloaded in a number of different formats, including Matlab binary. The environment is ideal for testing the SLICOT software before performing a software installation or for performing a limited number of computations. It is also highly recommended for education, as it is easy to use and largely self-explanatory, with the users' guide integrated in the user interface.
Parallel Solvers for Sylvester-type Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms
, 2007
Cited by 5 (3 self)
Parallel ScaLAPACK-style algorithms for solving eight common standard and generalized Sylvester-type matrix equations and various sign and transposed variants are presented. All algorithms are blocked variants based on the Bartels–Stewart method and involve four major steps: reduction to triangular form, updating the right-hand side with respect to the reduction, computing the solution to the reduced triangular problem, and transforming the solution back to the original coordinate system. Novel parallel algorithms for solving reduced triangular matrix equations based on wavefront-like traversal of the right-hand side matrices are presented together with a generic scalability analysis. These algorithms are used in condition estimation, and new robust parallel sep^{-1}-estimators are developed. Experimental results from three parallel platforms are presented and analyzed using several performance and accuracy metrics. The analysis includes results regarding general and triangular parallel solvers as well as parallel condition estimators.
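The wavefront-like traversal of the right-hand side can be pictured with a small sketch. It assumes, for illustration only, that block (i, j) of the right-hand side depends on the blocks below it in the same column and to its left in the same row, as would be the case for upper triangular coefficient matrices. The function name `wavefronts` is hypothetical and is not taken from the paper.

```python
def wavefronts(num_block_rows, num_block_cols):
    """Group blocks (i, j) into anti-diagonal wavefronts.

    Assumption: block (i, j) depends on blocks below it in column j and
    to its left in row i, so the traversal starts at the bottom-left block.
    Blocks within one wavefront do not depend on each other.
    """
    fronts = []
    for w in range(num_block_rows + num_block_cols - 1):
        front = [(i, j)
                 for i in range(num_block_rows - 1, -1, -1)
                 for j in range(num_block_cols)
                 if (num_block_rows - 1 - i) + j == w]
        fronts.append(front)
    return fronts
```

Each inner list holds blocks that could, under the stated assumption, be processed concurrently before the next wavefront begins.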
A family of high-performance matrix multiplication algorithms
- INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCES
, 2001
Cited by 5 (2 self)
During the last half-decade, a number of research efforts have centered around developing software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software end products of both projects employ brute force to search a parameter space for blockings that accommodate multiple levels of memory hierarchy. We take a different approach: using a simple model of hierarchical memories, we employ mathematics to determine a locally optimal strategy for blocking matrices. The theoretical results show that, depending on the shape of the matrices involved, different strategies are locally optimal. Rather than determining a blocking strategy at library generation time, the theoretical results show that, ideally, one should pursue a heuristic that allows the blocking strategy to be determined dynamically at run-time as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner kernel for a small matrix multiplication, the approach yields performance that is superior to that of methods that automatically tune such kernels. Preliminary results for the Intel Pentium(R) III processor support the theoretical insights.
A unified model for multicore architectures
- In Proc. 1st International Forum on Next-Generation Multicore/Manycore Technologies
, 2008
Cited by 5 (1 self)
With the advent of multicore and manycore architectures, we are facing a problem that is new to parallel computing, namely, the management of hierarchical parallel caches. One major limitation of all earlier models is their inability to model multicore processors with varying degrees of sharing of caches at different levels. We propose a unified memory hierarchy model that addresses these limitations and is an extension of the MHG model developed for a single processor with multi-memory hierarchy. We demonstrate that our unified framework can be applied to a number of multicore architectures for a variety of applications. In particular, we derive lower bounds on memory traffic between different levels in the hierarchy for financial and scientific computations. We also give multicore algorithms for a financial
A Super-Programming Technique for Large Sparse Matrix Multiplication on PC Clusters
- IEICE Trans. Info. Systems E87-D
, 2004
Cited by 4 (3 self)
The multiplication of large sparse matrices is a basic operation for many scientific and engineering applications. There exist some high-performance library routines for this operation. They are often optimized based on the target architecture. The PC cluster computing paradigm has recently emerged as a viable alternative for high-performance, low-cost computing. In this paper, we apply our super-programming approach [24] to study the load balance and runtime management overhead for implementing parallel large matrix multiplication on PC clusters. For a parallel environment, it is essential to partition the entire operation into tasks and assign them to individual processing elements. Most of the existing approaches partition the given sub-matrices based on some kind of workload estimation. For dense matrices on some architectures, the estimates may be accurate. For sparse matrices on PCs, however, the workloads of block operations may not necessarily depend on the size of the data, so they may not be well estimated in advance. Any approach other than run-time dynamic partitioning may degrade performance. Moreover, in a heterogeneous environment, static partitioning is NP-complete. For embedded problems, it also introduces management overhead. We adopt our super-programming approach, which partitions the entire task into medium-grain tasks that are implemented using super-instructions; the workload of super-instructions is easy to estimate. These tasks are dynamically assigned to member computer nodes. A node may execute more than one super-instruction. Our results prove the viability of our approach.
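The contrast between static partitioning and dynamic assignment of tasks to nodes can be illustrated with a toy simulation. This is only a sketch under simple assumptions (identical nodes, no communication cost, known task durations); the function names are hypothetical and it does not reproduce the paper's super-instruction runtime.

```python
import heapq

def static_makespan(workloads, num_nodes):
    """Contiguous equal-count split, fixed before workloads are considered."""
    chunk = -(-len(workloads) // num_nodes)  # ceiling division
    return max(sum(workloads[s:s + chunk])
               for s in range(0, len(workloads), chunk))

def dynamic_makespan(workloads, num_nodes):
    """Each task is handed to whichever node becomes idle first."""
    finish_times = [0] * num_nodes
    heapq.heapify(finish_times)
    for w in workloads:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + w)
    return max(finish_times)
```

With uneven workloads such as `[8, 7, 1, 1, 1, 1, 1, 1]` on two nodes, the static split leaves one node with most of the work, while the dynamic scheme balances the load more evenly.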
Combining Explicit and Recursive Blocking for Solving Triangular Sylvester-Type Matrix Equations on Distributed Memory Platforms
- In M. Danelutto, D. Laforenza, M. Vanneschi (Eds.): Euro-Par 2004, Lecture Notes in Computer Science
, 2004
Cited by 4 (2 self)
Parallel ScaLAPACK-style hybrid algorithms for solving the triangular continuous-time Sylvester (SYCT) equation AX − XB = C using recursive blocked node solvers from the novel high-performance library RECSY are presented. We compare our new hybrid algorithms with parallel implementations based on the SYCT solver DTRSYL from LAPACK. Experiments show that the RECSY solvers can significantly improve on the serial as well as on the parallel performance if the problem data is partitioned and distributed in an appropriate way. Examples include cutting down the execution time by 47% and 34% when solving large-scale problems using two different communication schemes in the parallel algorithm and distributing the matrices with blocking factors four times larger than normal. The recursive blocking is automatic for solving subsystems of the global explicit blocked algorithm on the nodes. Keywords: Sylvester matrix equation, continuous-time, Bartels–Stewart
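For orientation, the triangular equation AX − XB = C can be solved element by element with plain substitution when A and B are strictly upper triangular (no 2×2 diagonal blocks). The sketch below makes that simplifying assumption and is neither the recursive RECSY algorithm nor LAPACK's DTRSYL; the function name is hypothetical.

```python
def solve_triangular_sylvester(A, B, C):
    """Solve A X - X B = C for X, with A (m x m) and B (n x n) upper triangular.

    Assumes A[i][i] != B[j][j] for all i, j; otherwise the equation is
    singular and the division below fails.
    """
    m, n = len(A), len(B)
    X = [[0.0] * n for _ in range(m)]
    for i in range(m - 1, -1, -1):       # bottom row first
        for j in range(n):               # leftmost column first
            rhs = C[i][j]
            # Contributions from already computed entries below X[i][j].
            rhs -= sum(A[i][k] * X[k][j] for k in range(i + 1, m))
            # Contributions from already computed entries left of X[i][j].
            rhs += sum(X[i][l] * B[l][j] for l in range(j))
            X[i][j] = rhs / (A[i][i] - B[j][j])
    return X
```

The traversal order (bottom row upward, left column rightward) follows directly from which entries of X each equation depends on.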
Evaluating Parallel Algorithms for Solving Sylvester-Type Matrix Equations: Direct Transformation-Based versus Iterative Matrix-Sign-Function-Based Methods
- PARA 2004 - Applied Parallel Computing. State of the Art in Scientific Computing
, 2004
Cited by 4 (4 self)
Recent ScaLAPACK-style implementations of the Bartels-Stewart method and the iterative matrix-sign-function-based method for solving continuous-time Sylvester matrix equations are evaluated with respect to generality of use, execution time and accuracy of computed results. The test problems include well-conditioned as well as ill-conditioned Sylvester equations. A method is considered more general if it can effectively solve a larger set of problems. Ill-conditioning is measured with respect to the separation of the two matrices in the Sylvester operator. Experiments carried out on two different distributed memory machines show that the parallel explicitly blocked Bartels-Stewart algorithm can solve more general problems and delivers far more accuracy for ill-conditioned problems. It is also up to four times faster for large enough problems on the most balanced parallel platform (IBM SP), while the parallel iterative algorithm is almost always the faster of the two on the less balanced platform (HPC2N Linux Super Cluster).
Towards an Accurate Performance Modeling of Parallel Sparse LU Factorization
- Applicable Algebra in Engineering, Communication, and Computing
, 2006
Cited by 4 (1 self)
We present a simulation-based performance model to analyze a parallel sparse LU factorization algorithm on modern cache-based, high-end parallel architectures. We consider supernodal right-looking parallel factorization on a bi-dimensional grid of processors that uses static pivoting. Our model characterizes the algorithmic behavior by taking into account the underlying processor speed, memory system performance, as well as the interconnect speed. The model is validated using the implementation in the SuperLU_DIST linear system solver, sparse matrices from real applications, and an IBM POWER3 parallel machine. Our modeling methodology can be adapted to study performance of other types of sparse factorizations, such as Cholesky or QR, and on different parallel machines.
Representing dense linear algebra algorithms: A farewell to indices
- FLAME Working Note #17
, 2006
Cited by 3 (2 self)
We present a notation that allows a dense linear algebra algorithm to be represented in a way that is visually recognizable. The primary value of the notation is that it exposes subvectors and submatrices allowing the details of the algorithm to be the focus while hiding the intricate indices related to the arrays in which the vectors and matrices are stored. The applicability of the notation is illustrated through a succession of progressively complex case studies ranging from matrix-vector operations to the chasing of the bulge of the symmetric QR iteration. The notation facilitates comparing and contrasting different algorithms for the same operation as well as similar algorithms for different operations. Finally, we point out how algorithms represented with this notation can be directly translated into high-performance code.

