Results 11  20
of
104
The complex step approximation to the Fréchet derivative of a matrix function
 NUMER ALGOR (2010) 53:133–148
, 2010
"... We show that the Fréchet derivative of a matrix function f at A in the direction E, whereA and E are real matrices, can be approximated by Im f (A + ihE)/h for some suitably small h. This approximation, requiring a single function evaluation at a complex argument, generalizes the complex step appr ..."
Abstract

Cited by 15 (7 self)
 Add to MetaCart
We show that the Fréchet derivative of a matrix function f at A in the direction E, whereA and E are real matrices, can be approximated by Im f (A + ihE)/h for some suitably small h. This approximation, requiring a single function evaluation at a complex argument, generalizes the complex step approximation known in the scalar case. The approximation is proved to be of second order in h for analytic functions f and also for the matrix sign function. It is shown that it does not suffer the inherent cancellation that limits the accuracy of finite difference approximations in floating point arithmetic. However, cancellation does nevertheless vitiate the approximation when the underlying method for evaluating f employs complex arithmetic. The ease of implementation of the approximation, and its superiority over finite differences, make it attractive when specialized methods for evaluating the Fréchet derivative are not available, and in particular for condition number estimation when used in conjunction with a block 1norm estimation algorithm.
A unified model for multicore architectures
 In Proc. 1st International Forum on NextGeneration Multicore/Manycore Technologies
, 2008
"... With the advent of multicore and many core architectures, we are facing a problem that is new to parallel computing, namely, the management of hierarchical parallel caches. One major limitation of all earlier models is their inability to model multicore processors with varying degrees of sharing of ..."
Abstract

Cited by 14 (1 self)
 Add to MetaCart
(Show Context)
With the advent of multicore and many core architectures, we are facing a problem that is new to parallel computing, namely, the management of hierarchical parallel caches. One major limitation of all earlier models is their inability to model multicore processors with varying degrees of sharing of caches at different levels. We propose a unified memory hierarchy model that addresses these limitations and is an extension of the MHG model developed for a single processor with multimemory hierarchy. We demonstrate that our unified framework can be applied to a number of multicore architectures for a variety of applications. In particular, we derive lower bounds on memory traffic between different levels in the hierarchy for financial and scientific computations. We also give a multicore algorithms for a financial
On Reducing TLB Misses in Matrix Multiplication
, 2002
"... During the last decade, a number of projects have pursued the highperformance implementation of matrix multiplication. Typically, these projects organize the computation around an "inner kernel," C = A^T B + C, that keeps one of the operands in the L1 cache, while streaming parts of the o ..."
Abstract

Cited by 12 (2 self)
 Add to MetaCart
During the last decade, a number of projects have pursued the highperformance implementation of matrix multiplication. Typically, these projects organize the computation around an "inner kernel," C = A^T B + C, that keeps one of the operands in the L1 cache, while streaming parts of the other operands through that cache. Variants include approaches that extend this principle to multiple levels of cache or that apply the same principle to the L2 cache while essentially ignoring the L1 cache. The intent is to optimally amortize the cost of moving data between memory layers.
The approach proposed in this paper is fundamentally different. We start by observing that for current generation architectures, much of the overhead comes from Translation Lookaside Buffer (TLB) table misses. While the importance of caches is also taken into consideration, it is the minimization of such TLB misses that drives the approach. The result is a novel approach that achieves highly competitive performance on a broad spectrum of current highperformance architectures.
Parallel Solvers for Sylvestertype Matrix Equations with Applications in Condition Estimation, Part I: Theory and Algorithms
, 2007
"... Parallel ScaLAPACKstyle algorithms for solving eight common standard and generalized Sylvestertype matrix equations and various sign and transposed variants are presented. All algorithms are blocked variants based on the Bartels–Stewart method and involve four major steps: reduction to triangular ..."
Abstract

Cited by 10 (5 self)
 Add to MetaCart
Parallel ScaLAPACKstyle algorithms for solving eight common standard and generalized Sylvestertype matrix equations and various sign and transposed variants are presented. All algorithms are blocked variants based on the Bartels–Stewart method and involve four major steps: reduction to triangular form, updating the right hand side with respect to the reduction, computing the solution to the reduced triangular problem and transforming the solution back to the original coordinate system. Novel parallel algorithms for solving reduced triangular matrix equations based on wavefrontlike traversal of the right hand side matrices are presented together with a generic scalability analysis. These algorithms are used in condition estimation and new robust parallel sep−1estimators are developed. Experimental results from three parallel platforms are presented and analyzed using several performance and accuracy metrics. The analysis includes results regarding general and triangular parallel solvers as well as parallel condition estimators.
Adaptive Winograd’s Matrix Multiplications
, 2008
"... Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of cou ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of counting cost of simple CPU operations. We present a novel, hybrid, and adaptive recursive StrassenWinograd’s matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to any size and shape matrices stored in either row or column major layout (in doubleprecision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into equivalent incomplexity subMMs and does not require any extra computation to combine the intermediary subMM results. We achieve up to 22 % executiontime reduction versus GotoBLAS/ATLAS alone for a single core system and up to 19 % for a 2 dualcore processor system. Most importantly, even for small matrices such as 1500×1500, our approach attains already 10 % executiontime reduction and, for MM of matrices larger than 3000×3000, it delivers performance that would correspond, for a classic O(n3) algorithm, to fasterthanprocessor peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS performance on a system with 4.4 GFLOPS peak performance and where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithms could ever be for matrices of this size. Furthermore, we present experimental evidence based on established methodologies found in the literature that our algorithm is, for a family of matrices, as accurate as the classic algorithms.
Cacheoptimal algorithms for option pricing
, 2008
"... Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial model ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have nearoptimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4level of memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70 % of peak performance.
Parallel ScaLAPACKstyle Algorithms for Solving ContinuousTime Sylvester Equations
 In EuroPar 2003 Parallel Processing, H. Kosch and et al, Eds. Lecture Notes in Computer Science
, 2003
"... Abstract. An implementation of a parallel ScaLAPACKstyle solver for the general Sylvester equation, op(A)X − Xop(B) = C, where op(A) denotes A or its transpose A T, is presented. The parallel algorithm is based on explicit blocking of the BartelsStewart method. An initial transformation of the co ..."
Abstract

Cited by 8 (7 self)
 Add to MetaCart
(Show Context)
Abstract. An implementation of a parallel ScaLAPACKstyle solver for the general Sylvester equation, op(A)X − Xop(B) = C, where op(A) denotes A or its transpose A T, is presented. The parallel algorithm is based on explicit blocking of the BartelsStewart method. An initial transformation of the coefficient matrices A and B to Schur form leads to a reduced triangular matrix equation. We use different matrix traversing strategies to handle the transposes in the problem to solve, leading to different new parallel wavefront algorithms. We also present a strategy to handle the problem when 2 x 2 diagonal blocks of the matrices in Schur form, corresponding to complex conjugate pairs of eigenvalues, are split between several blocks in the block partitioned matrices. Finally, the solution of the reduced matrix equation is transformed back to the originally coordinate system. The implementation acts in a ScaLAPACK environment using 2dimensional block cyclic mapping of the matrices onto a rectangular grid of processes. Real performance results are presented which verify that our parallel algorithms are reliable and scalable. Keywords: Sylvester matrix equation, continuoustime, Bartels–Stewart
Hybrid mpi/openmp parallel linear support vector machine training
 JMLR
"... Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and Open ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Largescale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI.
A novel parallel QR algorithm for hybrid distributed memory HPC systems, Technical report 200915, Seminar for applied mathematics
, 2009
"... Abstract. A novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing (HPC) systems is presented. For this purpose, we introduce the concept of multiwindow bulge chain chasing and parallelize aggressive early defla ..."
Abstract

Cited by 8 (3 self)
 Add to MetaCart
Abstract. A novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing (HPC) systems is presented. For this purpose, we introduce the concept of multiwindow bulge chain chasing and parallelize aggressive early deflation. The multiwindow approach ensures that most computations when chasing chains of bulges are performed in level 3 BLAS operations, while the aim of aggressive early deflation is to speed up the convergence of the QR algorithm. Mixed MPIOpenMP coding techniques are utilized for porting the codes to distributed memory platforms with multithreaded nodes, such as multicore processors. Numerous numerical experiments confirm the superior performance of our parallel QR algorithm in comparison with the existing ScaLAPACK code, leading to an implementation that is one to two orders of magnitude faster for sufficiently large problems, including a number of examples from applications.