A unified model for multicore architectures
 In Proc. 1st International Forum on Next-Generation Multicore/Manycore Technologies
, 2008
Abstract

Cited by 9 (1 self)
With the advent of multicore and many-core architectures, we are facing a problem that is new to parallel computing, namely, the management of hierarchical parallel caches. One major limitation of all earlier models is their inability to model multicore processors with varying degrees of sharing of caches at different levels. We propose a unified memory hierarchy model that addresses these limitations and is an extension of the MHG model developed for a single processor with a multi-memory hierarchy. We demonstrate that our unified framework can be applied to a number of multicore architectures for a variety of applications. In particular, we derive lower bounds on memory traffic between different levels in the hierarchy for financial and scientific computations. We also give multicore algorithms for a financial …
Parallel ScaLAPACK-style Algorithms for Solving Continuous-Time Sylvester Equations
 In Euro-Par 2003 Parallel Processing, H. Kosch et al., Eds., Lecture Notes in Computer Science
, 2003
Abstract

Cited by 8 (7 self)
Abstract. An implementation of a parallel ScaLAPACK-style solver for the general Sylvester equation, op(A)X − Xop(B) = C, where op(A) denotes A or its transpose A^T, is presented. The parallel algorithm is based on explicit blocking of the Bartels-Stewart method. An initial transformation of the coefficient matrices A and B to Schur form leads to a reduced triangular matrix equation. We use different matrix traversing strategies to handle the transposes in the problem to solve, leading to different new parallel wavefront algorithms. We also present a strategy to handle the problem when 2 × 2 diagonal blocks of the matrices in Schur form, corresponding to complex conjugate pairs of eigenvalues, are split between several blocks in the block-partitioned matrices. Finally, the solution of the reduced matrix equation is transformed back to the original coordinate system. The implementation acts in a ScaLAPACK environment using 2-dimensional block-cyclic mapping of the matrices onto a rectangular grid of processes. Real performance results are presented which verify that our parallel algorithms are reliable and scalable. Keywords: Sylvester matrix equation, continuous-time, Bartels-Stewart
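As a point of reference for the method this solver parallelizes, SciPy's `solve_sylvester` implements the sequential Bartels-Stewart algorithm. It solves AX + XB = Q, so the equation form above, op(A)X − Xop(B) = C (taking op as the identity), maps to `solve_sylvester(A, -B, C)`. A minimal sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(0)
n, m = 5, 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, m))

# solve_sylvester solves A X + X B = Q via Bartels-Stewart,
# so A X - X B = C corresponds to solve_sylvester(A, -B, C).
# (Unique solvability requires A and B to share no eigenvalue.)
X = solve_sylvester(A, -B, C)
residual = np.linalg.norm(A @ X - X @ B - C)
print(residual)  # on the order of machine precision
```

The parallel algorithm in the paper performs the same Schur reduction and triangular solve, but blocks the triangular sweep explicitly across a process grid.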
On Reducing TLB Misses in Matrix Multiplication
, 2002
Abstract

Cited by 8 (2 self)
During the last decade, a number of projects have pursued the high-performance implementation of matrix multiplication. Typically, these projects organize the computation around an "inner kernel," C = A^T B + C, that keeps one of the operands in the L1 cache while streaming parts of the other operands through that cache. Variants include approaches that extend this principle to multiple levels of cache or that apply the same principle to the L2 cache while essentially ignoring the L1 cache. The intent is to optimally amortize the cost of moving data between memory layers.
The approach proposed in this paper is fundamentally different. We start by observing that for current generation architectures, much of the overhead comes from Translation Lookaside Buffer (TLB) table misses. While the importance of caches is also taken into consideration, it is the minimization of such TLB misses that drives the approach. The result is a novel approach that achieves highly competitive performance on a broad spectrum of current highperformance architectures.
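The tile-by-tile blocking that both the cache-driven and TLB-driven approaches build on can be sketched as follows; the tile sizes `mb`, `nb`, `kb` are illustrative placeholders, not the paper's TLB-derived values:

```python
import numpy as np

def blocked_matmul(A, B, mb=64, nb=64, kb=64):
    """Blocked C = A @ B. Each inner-kernel call touches only small
    tiles, so the pages (and cache lines) in the working set stay
    resident while the tile pair is consumed."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, mb):
        for j in range(0, n, nb):
            for p in range(0, k, kb):
                # "inner kernel": multiply one tile pair, accumulate into C
                C[i:i+mb, j:j+nb] += A[i:i+mb, p:p+kb] @ B[p:p+kb, j:j+nb]
    return C

A = np.random.rand(200, 150)
B = np.random.rand(150, 170)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Choosing the tile sizes so that the number of distinct pages touched fits within the TLB, rather than fitting bytes into L1, is the shift in emphasis the paper proposes.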
A novel parallel QR algorithm for hybrid distributed memory HPC systems, Technical Report 2009-15, Seminar for Applied Mathematics
, 2009
Abstract

Cited by 8 (3 self)
Abstract. A novel variant of the parallel QR algorithm for solving dense nonsymmetric eigenvalue problems on hybrid distributed high performance computing (HPC) systems is presented. For this purpose, we introduce the concept of multiwindow bulge chain chasing and parallelize aggressive early deflation. The multiwindow approach ensures that most computations when chasing chains of bulges are performed in level 3 BLAS operations, while the aim of aggressive early deflation is to speed up the convergence of the QR algorithm. Mixed MPI-OpenMP coding techniques are utilized for porting the codes to distributed memory platforms with multithreaded nodes, such as multicore processors. Numerous numerical experiments confirm the superior performance of our parallel QR algorithm in comparison with the existing ScaLAPACK code, leading to an implementation that is one to two orders of magnitude faster for sufficiently large problems, including a number of examples from applications.
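For context on what the QR algorithm computes: the output is the real Schur form, a quasi-upper-triangular matrix whose 2×2 diagonal blocks hold complex conjugate eigenvalue pairs. SciPy exposes the sequential LAPACK version (a sketch for orientation, not the paper's parallel code):

```python
import numpy as np
from scipy.linalg import schur, eigvals

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))

# Real Schur form: A = Z @ T @ Z.T with T quasi-upper-triangular
# (1x1 and 2x2 diagonal blocks; 2x2 blocks hold conjugate pairs).
T, Z = schur(A, output='real')
assert np.allclose(Z @ T @ Z.T, A)

# Eigenvalues read off T's diagonal blocks match direct computation.
ev_schur = np.sort_complex(eigvals(T))
ev_direct = np.sort_complex(eigvals(A))
assert np.allclose(ev_schur, ev_direct)
```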
A family of high-performance matrix multiplication algorithms
 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCES
, 2001
Abstract

Cited by 6 (3 self)
During the last half-decade, a number of research efforts have centered around developing software for generating automatically tuned matrix multiplication kernels. These include the PHiPAC project and the ATLAS project. The software end products of both projects employ brute force to search a parameter space for blockings that accommodate multiple levels of memory hierarchy. We take a different approach: using a simple model of hierarchical memories, we employ mathematics to determine a locally optimal strategy for blocking matrices. The theoretical results show that, depending on the shape of the matrices involved, different strategies are locally optimal. Rather than determining a blocking strategy at library generation time, the theoretical results show that, ideally, one should pursue a heuristic that allows the blocking strategy to be determined dynamically at runtime as a function of the shapes of the operands. When the resulting family of algorithms is combined with a highly optimized inner kernel for a small matrix multiplication, the approach yields performance that is superior to that of methods that automatically tune such kernels. Preliminary results, for the Intel Pentium III processor, support the theoretical insights.
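A toy version of such a shape-dependent runtime choice might look like the following; the rule here is hypothetical and far cruder than the paper's analysis, but it illustrates dispatching on operand shapes rather than fixing one blocking at library-build time:

```python
import numpy as np

def choose_strategy(m, n, k):
    """Toy runtime heuristic (illustrative only): keep the smallest
    operand resident in cache and stream the other two past it."""
    sizes = {'A': m * k, 'B': k * n, 'C': m * n}
    return min(sizes, key=sizes.get)  # operand to keep resident

# Tall-skinny A times small B: B is the cheapest operand to keep cached.
assert choose_strategy(10000, 64, 64) == 'B'
# Small A times short-wide B: keep A resident instead.
assert choose_strategy(64, 10000, 64) == 'A'
```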
Adaptive Winograd’s Matrix Multiplications
, 2008
Abstract

Cited by 6 (3 self)
Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex trade-off, not just a simple matter of counting the cost of simple CPU operations. We present a novel, hybrid, and adaptive recursive Strassen-Winograd matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to matrices of any size and shape, stored in either row- or column-major layout (in double precision in this work), and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into sub-MMs of equivalent complexity and does not require any extra computation to combine the intermediate sub-MM results. We achieve up to 22% execution-time reduction versus GotoBLAS/ATLAS alone for a single-core system and up to 19% for a system with two dual-core processors. Most importantly, even for small matrices such as 1500×1500, our approach already attains a 10% execution-time reduction and, for MM of matrices larger than 3000×3000, it delivers performance that would correspond, for a classic O(n^3) algorithm, to faster-than-processor-peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS on a system with 4.4 GFLOPS peak performance, where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithm could ever be for matrices of this size. Furthermore, we present experimental evidence, based on established methodologies found in the literature, that our algorithm is, for a family of matrices, as accurate as the classic algorithms.
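For context, the recursion underlying such algorithms is Strassen's 7-multiplication scheme (the Winograd variant reorders the additions to use 15 instead of 18). The sketch below is not the paper's adaptive algorithm: it assumes square power-of-two matrices and a fixed cutoff, where the paper handles arbitrary shapes and tunes the fallback:

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Recursive Strassen multiply with a cutoff to the classic kernel."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # fall back to the tuned classic multiply
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```

The operation savings (7 multiplies per level instead of 8) is what lets such algorithms exceed the FLOP rate a classic O(n^3) method could achieve at the same wall-clock time.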
A Web Computing Environment for the SLICOT Library
, 2001
Abstract

Cited by 5 (3 self)
A prototype web computing environment for computations related to the design and analysis of control systems using the SLICOT software library is presented. The web interface can be accessed from a standard world wide web browser with no need for additional software installations on the local machine. The environment provides user-friendly access to SLICOT routines, where runtime options are specified by mouse clicks on appropriate buttons. Input data can be entered directly into the web interface by the user or uploaded from a local computer in a standard text format or in Matlab binary format. Output data is presented in the web browser window and can be downloaded in a number of different formats, including Matlab binary. The environment is ideal for testing the SLICOT software before performing a software installation or for performing a limited number of computations. It is also highly recommended for education, as it is easy to use and basically self-explanatory, with the users' guide integrated in the user interface.
Cache-optimal algorithms for option pricing
, 2008
Abstract

Cited by 5 (4 self)
Today's computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of the hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have near-optimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4-level memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70% of peak performance.
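The binomial computation the paper blocks for cache is a backward-induction sweep over a lattice. Below is a plain, unblocked reference version (Cox-Ross-Rubinstein parameters; not the paper's cache-optimal algorithm) to show the sweep whose memory traffic the blocked variants reorganize:

```python
import math
import numpy as np

def binomial_call(S0, K, r, sigma, T, N):
    """European call via the Cox-Ross-Rubinstein binomial lattice:
    build terminal payoffs, then sweep backward through the tree.
    Each backward step reads one level and writes the next, which is
    exactly the traffic pattern cache blocking reorganizes."""
    dt = T / N
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    disc = math.exp(-r * dt)
    j = np.arange(N + 1)
    V = np.maximum(S0 * u**j * d**(N - j) - K, 0.0)  # payoffs at maturity
    for _ in range(N):
        V = disc * (p * V[1:] + (1 - p) * V[:-1])    # one backward level
    return V[0]

price = binomial_call(100, 100, 0.05, 0.2, 1.0, 1000)
print(price)  # close to the Black-Scholes value (about 10.45) for these inputs
```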
Floating-point matrix multiplication in a polymorphic processor
 In Proc. IEEE International Conference on Field-Programmable Technology (ICFPT'07)
, 2007
Abstract

Cited by 5 (3 self)
Abstract—We consider 64-bit floating-point matrix multiplication in the context of polymorphic processor architectures. Our proposal provides a complete and performance-efficient solution of the matrix multiplication problem, including hardware design and software interface. We adopt previous ideas, originally proposed for loosely coupled processors and message-passing communications, and employ them in a tightly coupled custom computing unit (CCU) in the Molen polymorphic processor. Furthermore, we introduce a controller, which facilitates the efficient operation of the multiplier processing elements (PEs) in a polymorphic environment. The design is evaluated theoretically and through real hardware experiments. More precisely, we fit 9 processing elements in an XC2VP30-6 device; this configuration suggests a theoretical peak performance of 1.80 GFLOPS. In practice, we measured sustained performance of up to 1.79 GFLOPS for the matrix multiplication on real hardware, including the software overhead. Theoretical analysis and experimental results suggest that the design efficiency scales better for large problem sizes. Index Terms—Floating-point arithmetic, Matrix multiplication, Polymorphic processors, Reconfigurable hardware.
A Super-Programming Technique for Large Sparse Matrix Multiplication on PC Clusters
 IEICE Trans. Info. Systems E87-D
, 2004
Abstract

Cited by 4 (3 self)
The multiplication of large sparse matrices is a basic operation for many scientific and engineering applications. There exist some high-performance library routines for this operation. They are often optimized for the target architecture. The PC cluster computing paradigm has recently emerged as a viable alternative for high-performance, low-cost computing. In this paper, we apply our super-programming approach [24] to study the load balance and runtime management overhead of implementing parallel large matrix multiplication on PC clusters. For a parallel environment, it is essential to partition the entire operation into tasks and assign them to individual processing elements. Most of the existing approaches partition the given submatrices based on some kind of workload estimation. For dense matrices on some architectures such estimations may be accurate. For sparse matrices on PCs, however, the workloads of block operations do not necessarily depend on the size of the data and may not be well estimated in advance. Any approach other than runtime dynamic partitioning may degrade performance. Moreover, in a heterogeneous environment, static partitioning is NP-complete. For embedded problems, it also introduces management overhead. In this paper, we adopt our super-programming approach, which partitions the entire task into medium-grain tasks implemented using super-instructions; the workload of super-instructions is easy to estimate. These tasks are dynamically assigned to member computer nodes. A node may execute more than one super-instruction. Our results demonstrate the viability of our approach.
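A minimal sketch of the medium-grain dynamic-assignment idea, with threads standing in for cluster nodes and a row block of a hand-rolled CSR matrix playing the role of a "super-instruction" (all sizes and names illustrative, not the paper's):

```python
import numpy as np
import queue
import threading

# Sparse A in CSR form (hand-rolled to stay self-contained).
rng = np.random.default_rng(2)
n = 400
dense = rng.random((n, n)) * (rng.random((n, n)) < 0.02)  # ~2% nonzeros
indptr = np.zeros(n + 1, dtype=int)
indices, data = [], []
for i in range(n):
    nz = np.nonzero(dense[i])[0]
    indices.extend(nz); data.extend(dense[i, nz])
    indptr[i + 1] = indptr[i] + len(nz)
indices = np.array(indices, dtype=int)
data = np.array(data)
B = rng.random((n, 8))
C = np.zeros((n, 8))

# Medium-grain "super-instructions": one task = one block of rows.
tasks = queue.Queue()
BLOCK = 50
for start in range(0, n, BLOCK):
    tasks.put((start, min(start + BLOCK, n)))

def worker():
    # Nodes pull tasks dynamically, so uneven per-block nonzero counts
    # balance out without a static workload estimate.
    while True:
        try:
            lo, hi = tasks.get_nowait()
        except queue.Empty:
            return
        for i in range(lo, hi):   # CSR row-times-dense kernel
            s, e = indptr[i], indptr[i + 1]
            C[i] = data[s:e] @ B[indices[s:e]]

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert np.allclose(C, dense @ B)
```

Row blocks write disjoint slices of C, so workers need no locking; in the cluster setting the queue would be a master node and each super-instruction a message-passing task.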