Results 1 - 10
of
10
An early performance analysis of cloud computing services for scientific computing
- TU Delft, Tech. Rep., Dec 2008, [Online] Available
"... Abstract—Cloud computing is an emerging commercial infrastructure paradigm that promises to eliminate the need for maintaining expensive computing facilities by companies and institutes alike.Throughtheuseofvirtualizationandresourcetime-sharing, clouds serve with a single set of physical resources a ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
Abstract—Cloud computing is an emerging commercial infrastructure paradigm that promises to eliminate the need for maintaining expensive computing facilities by companies and institutes alike.Throughtheuseofvirtualizationandresourcetime-sharing, clouds serve with a single set of physical resources a large user base withdifferentneeds.Thus,cloudshavethepotentialtoprovide to their owners the benefits of an economy of scale and, at the same time, becomeanalternativeforscientiststoclusters,grids,and parallel production environments. However, the current commercial clouds have been built to support web and small database workloads, which are very different from typical scientific computing workloads. Moreover, the use of virtualization and resource time-sharing may introduce significant performance penalties for the demanding scientific computing workloads. In this work we analyze the performance of cloud computing services for scientific computing workloads. We quantify the presence in real scientific computing workloads of Many-Task Computing (MTC) users, that is, of users who employ looselycoupledapplicationscomprisingmanytaskstoachieve their scientific goals. Then, we perform an empirical evaluation of theperformanceoffourcommercialcloudcomputingservices including Amazon EC2, which is currently the largest commercial cloud. Last,wecomparethroughtrace-basedsimulationtheperformance characteristics and cost models of clouds and other scientific computing platforms, for general and MTC-based scientific computing workloads. Our results indicate that the current clouds need an order of magnitude in performance improvement to be useful tothe scientific community, and show which improvements should be considered first to address this discrepancy between offer and demand.
Programming matrix algorithms-by-blocks for thread-level parallelism
- ACM Transactions on Mathematical Software
"... With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution ..."
Abstract
-
Cited by 17 (15 self)
- Add to MetaCart
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest
Updating an LU Factorization with Pivoting
, 2006
"... We show how to compute an LU factorization of a matrix when the factors of a leading principle submatrix are already known. The approach incorporates pivoting akin to partial pivoting, a strategy we call incremental pivoting. An implementation using the Formal Linear Algebra Methods Environment (FLA ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
We show how to compute an LU factorization of a matrix when the factors of a leading principle submatrix are already known. The approach incorporates pivoting akin to partial pivoting, a strategy we call incremental pivoting. An implementation using the Formal Linear Algebra Methods Environment (FLAME) Application Programming Interface (API) is described. Experimental results demonstrate practical numerical stability and high performance on an Intel Itanium2 processor based server.
Adaptive Winograd’s Matrix Multiplications
, 2008
"... Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex trade-off, not just a simple matter of cou ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex trade-off, not just a simple matter of counting cost of simple CPU operations. We present a novel, hybrid, and adaptive recursive Strassen-Winograd’s matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to any size and shape matrices stored in either row or column major layout (in double-precision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into equivalent in-complexity sub-MMs and does not require any extra computation to combine the intermediary sub-MM results. We achieve up to 22 % execution-time reduction versus GotoBLAS/ATLAS alone for a single core system and up to 19 % for a 2 dual-core processor system. Most importantly, even for small matrices such as 1500×1500, our approach attains already 10 % execution-time reduction and, for MM of matrices larger than 3000×3000, it delivers performance that would correspond, for a classic O(n3) algorithm, to faster-than-processor peak performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS performance on a system with 4.4 GFLOPS peak performance and where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithms could ever be for matrices of this size. Furthermore, we present experimental evidence based on established methodologies found in the literature that our algorithm is, for a family of matrices, as accurate as the classic algorithms.
Cache-optimal algorithms for option pricing
, 2008
"... Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial model ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have near-optimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4-level of memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70 % of peak performance.
Scalable Parallelization of FLAME Code via the Workqueuing Model
"... We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use loop and array indices, and represents these algorithms as they are formal ..."
Abstract
- Add to MetaCart
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
Exploiting Parallelism in Matrix-Computation Kernels for Symmetric Multiprocessor Systems -- Matrix-Multiplication and Matrix-Addition Algorithm Optimizations by Software Pipelining and Threads Allocation
, 2011
"... We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen’s and Winograd’s fast matrix multiply or their combination with the 3M algorithm for complex matrices (i.e., hybrid: a recursive algorithm as Strassen’s until ..."
Abstract
- Add to MetaCart
We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen’s and Winograd’s fast matrix multiply or their combination with the 3M algorithm for complex matrices (i.e., hybrid: a recursive algorithm as Strassen’s until a highly tuned BLAS matrix multiplication allows performance advantages). We investigate how modern symmetric multiprocessor (SMP) architectures present old and new challenges that can be addressed by the combination of an algorithm design with careful and natural parallelism exploitation at the function level (optimizations) such as function-call parallelism, function percolation, and function software pipelining. We have three contributions: first, we present a performance overview for double and double complex precision matrices for state-of-the-art SMP systems; second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new different schedules of Winograd’s matrix multiplication (achieving up to 20 % speed up w.r.t. regular matrix multiplication). About the latter Winograd’s algorithms: one is designed to minimize the number of matrix additions and the other to minimize the computation latency of matrix additions; third, we apply software pipelining and threads allocation to all the algorithms and we
Restructuring the QR Algorithm for Performance
"... We show how the QR algorithm can be restructured so that it becomes rich in operations that can achieve near-peak performance on a modern processor. The key is a novel algorithm for applying multiple sets of Givens rotations. We demonstrate the merits of this new QR algorithm for computing the Hermi ..."
Abstract
- Add to MetaCart
We show how the QR algorithm can be restructured so that it becomes rich in operations that can achieve near-peak performance on a modern processor. The key is a novel algorithm for applying multiple sets of Givens rotations. We demonstrate the merits of this new QR algorithm for computing the Hermitian (symmetric) eigenvalue decomposition and singular value decomposition of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to the traditional QR algorithm and is competitive with two commonly used alternatives—Cuppen’s Divide and Conquer algorithm and the Method of Multiple Relatively Robust Representations—while inheriting the more modest O(n) workspace requirements of the original QR algorithm. Since the computations performed by the restructured algorithm remain essentially identical to those performed by the original method, robust numerical properties are preserved.
In submission. Please do not redistribute. Haskell Beats C Using Generalized Stream Fusion Geoffrey Mainland
"... Stream fusion [3] is a powerful technique for automatically transforming high-level sequence-processing functions into efficient implementations. It has been used to great effect in Haskell libraries for manipulating byte arrays, Unicode text, and unboxed vectors. However, some operations, like vect ..."
Abstract
- Add to MetaCart
Stream fusion [3] is a powerful technique for automatically transforming high-level sequence-processing functions into efficient implementations. It has been used to great effect in Haskell libraries for manipulating byte arrays, Unicode text, and unboxed vectors. However, some operations, like vector append, still do not perform well within the standard stream fusion framework. Others, like SIMD computation using the SSE and AVX instructions available on modern x86 chips, do not seem to fit in the framework at all. In this paper we introduce generalized stream fusion, which solves these issues. The key insight is to bundle together multiple stream representations, each tuned for a particular class of stream consumer. We also describe a stream representation suited for efficient computation with SSE instructions. Our ideas are implemented in modified versions of the GHC compiler and vector library. Benchmarks show that high-level Haskell code written using our compiler and libraries can produce code that is competitive with hand-tuned assembly. 1.

