Results 1 - 10 of 17
An early performance analysis of cloud computing services for scientific computing
 TU Delft, Tech. Rep., Dec. 2008.
"... Abstract—Cloud computing is an emerging commercial infrastructure paradigm that promises to eliminate the need for maintaining expensive computing facilities by companies and institutes alike.Throughtheuseofvirtualizationandresourcetimesharing, clouds serve with a single set of physical resources a ..."
Abstract

Cited by 49 (6 self)
Abstract—Cloud computing is an emerging commercial infrastructure paradigm that promises to eliminate the need for maintaining expensive computing facilities by companies and institutes alike. Through the use of virtualization and resource time sharing, clouds serve with a single set of physical resources a large user base with different needs. Thus, clouds have the potential to provide to their owners the benefits of an economy of scale and, at the same time, become an alternative for scientists to clusters, grids, and parallel production environments. However, the current commercial clouds have been built to support web and small database workloads, which are very different from typical scientific computing workloads. Moreover, the use of virtualization and resource time sharing may introduce significant performance penalties for the demanding scientific computing workloads. In this work we analyze the performance of cloud computing services for scientific computing workloads. We quantify the presence in real scientific computing workloads of Many-Task Computing (MTC) users, that is, of users who employ loosely coupled applications comprising many tasks to achieve their scientific goals. Then, we perform an empirical evaluation of the performance of four commercial cloud computing services including Amazon EC2, which is currently the largest commercial cloud. Last, we compare through trace-based simulation the performance characteristics and cost models of clouds and other scientific computing platforms, for general and MTC-based scientific computing workloads. Our results indicate that the current clouds need an order of magnitude in performance improvement to be useful to the scientific community, and show which improvements should be considered first to address this discrepancy between offer and demand.
Programming matrix algorithms-by-blocks for thread-level parallelism
 ACM Transactions on Mathematical Software
"... With the emergence of threadlevel parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution ..."
Abstract

Cited by 29 (18 self)
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has re-emerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out of order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity, while experimental results suggest that high performance is attainable.
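The algorithm-by-blocks idea above can be illustrated without the FLASH/SuperMatrix machinery: store the matrix as a grid of contiguous submatrix blocks and express the computation as operations on whole blocks, which a dependency-aware runtime could then schedule out of order. A minimal Python/NumPy sketch of a matrix multiply written in this style (block size and helper names are illustrative, not the FLAME API):

```python
import numpy as np

def to_blocks(A, nb):
    """Partition A into a grid of contiguous nb-by-nb blocks (assumes nb divides both dims)."""
    m, n = A.shape
    return [[np.ascontiguousarray(A[i:i+nb, j:j+nb])
             for j in range(0, n, nb)] for i in range(0, m, nb)]

def gemm_by_blocks(Ab, Bb, Cb):
    """C += A * B expressed over blocks; each (i, j, p) update is a task whose only
    dependency is the accumulation into Cb[i][j], which a runtime like the one
    described above could track and schedule out of order."""
    for i in range(len(Ab)):
        for j in range(len(Bb[0])):
            for p in range(len(Bb)):
                Cb[i][j] += Ab[i][p] @ Bb[p][j]

nb = 2
A, B = np.random.rand(4, 4), np.random.rand(4, 4)
Ab, Bb = to_blocks(A, nb), to_blocks(B, nb)
Cb = to_blocks(np.zeros((4, 4)), nb)
gemm_by_blocks(Ab, Bb, Cb)
assert np.allclose(np.block(Cb), A @ B)
```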
Adaptive Winograd’s Matrix Multiplications
, 2008
"... Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex tradeoff, not just a simple matter of cou ..."
Abstract

Cited by 6 (3 self)
Modern architectures have complex memory hierarchies and increasing parallelism (e.g., multicores). These features make achieving and maintaining good performance across rapidly changing architectures increasingly difficult. Performance has become a complex trade-off, not just a simple matter of counting the cost of simple CPU operations. We present a novel, hybrid, and adaptive recursive Strassen-Winograd's matrix multiplication (MM) that uses automatically tuned linear algebra software (ATLAS) or GotoBLAS. Our algorithm applies to matrices of any size and shape stored in either row- or column-major layout (in double precision in this work) and thus is efficiently applicable to both C and FORTRAN implementations. In addition, our algorithm divides the computation into sub-MMs of equivalent complexity and does not require any extra computation to combine the intermediary sub-MM results. We achieve up to 22% execution-time reduction versus GotoBLAS/ATLAS alone on a single-core system and up to 19% on a system with two dual-core processors. Most importantly, even for small matrices such as 1500×1500, our approach already attains a 10% execution-time reduction and, for MM of matrices larger than 3000×3000, it delivers performance that would correspond, for a classic O(n^3) algorithm, to faster-than-peak processor performance (i.e., our algorithm delivers the equivalent of 5 GFLOPS on a system with 4.4 GFLOPS peak performance, where GotoBLAS achieves only 4 GFLOPS). This is a result of the savings in operations (and thus FLOPS). Therefore, our algorithm is faster than any classic MM algorithm could ever be for matrices of this size. Furthermore, we present experimental evidence, based on established methodologies found in the literature, that our algorithm is, for a family of matrices, as accurate as the classic algorithms.
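For reference, one level of the Strassen-Winograd recursion mentioned above performs seven half-size multiplications and fifteen additions/subtractions; a minimal Python/NumPy sketch for square matrices of even order (the adaptive scheme in the paper additionally decides when to stop recursing and call a tuned BLAS, which is omitted here):

```python
import numpy as np

def winograd_step(A, B):
    """One level of the Strassen-Winograd recursion: 7 sub-multiplications and 15
    additions/subtractions on half-size blocks (square, even-order matrices assumed)."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21

    P1 = A11 @ B11; P2 = A12 @ B21; P3 = S4 @ B22; P4 = A22 @ T4
    P5 = S1 @ T1;   P6 = S2 @ T2;   P7 = S3 @ T3

    U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5
    C = np.empty_like(A)
    C[:n, :n] = P1 + P2      # C11
    C[:n, n:] = U4 + P3      # C12
    C[n:, :n] = U3 - P4      # C21
    C[n:, n:] = U3 + P5      # C22
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(winograd_step(A, B), A @ B)
```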
Cache-optimal algorithms for option pricing
, 2008
"... Today computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to slower memories in the hierarchy. In this paper, we study the computation of option pricing using the binomial and trinomial model ..."
Abstract

Cited by 5 (4 self)
Today's computers have several levels of memory hierarchy. To obtain good performance on these processors, it is necessary to design algorithms that minimize I/O traffic to the slower memories in the hierarchy. In this paper, we study the computation of option prices using the binomial and trinomial models on processors with a multi-level memory hierarchy. We derive lower bounds on memory traffic between different levels of the hierarchy for these two models. We also develop algorithms for the binomial and trinomial models whose memory traffic between levels is near-optimal. We have implemented these algorithms on an UltraSparc IIIi processor with a 4-level memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70% of peak performance.
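The binomial model referenced above prices an option by backward induction over a recombining lattice; a straightforward, non-blocked Python sketch for a European call under the Cox-Ross-Rubinstein parameterization (parameter values are illustrative, and the cache-blocked traversal the paper develops is not shown):

```python
import math

def binomial_call(S0, K, r, sigma, T, n):
    """Price a European call on a recombining binomial lattice (CRR parameters).
    Each backward sweep touches a whole level of the lattice, which is the memory
    traffic that cache-blocked variants reorganize."""
    dt = T / n
    u = math.exp(sigma * math.sqrt(dt))   # up factor
    d = 1.0 / u                           # down factor
    p = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability
    disc = math.exp(-r * dt)

    # Option values at expiry; values[j] corresponds to j up-moves out of n.
    values = [max(S0 * u**j * d**(n - j) - K, 0.0) for j in range(n + 1)]

    # Backward induction, one level at a time.
    for level in range(n, 0, -1):
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(level)]
    return values[0]

print(binomial_call(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, n=500))
```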
High Throughput FPGA-based Floating Point Conjugate Gradient Implementation
 Proc. Applied Reconfigurable Computing
"... (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and ..."
Abstract

Cited by 5 (1 self)
Field Programmable Gate Arrays (FPGAs) have significantly expanded their applications. One such field is the acceleration of scientific computation, and one type of calculation that is commonplace in scientific computation is the solution of systems of linear equations. A method that has proven in software to be very efficient and robust for finding such solutions is the Conjugate Gradient (CG) algorithm. In this paper we present a widely-parallel and deeply-pipelined hardware CG implementation, targeted at modern FPGA architectures. This implementation is particularly suited for accelerating multiple small-to-medium-sized dense systems of linear equations and can be used as a stand-alone solver or as a building block to solve higher-order systems. In this paper it is shown that through parallelization it is possible to reduce the computation time per iteration for an order-n matrix from Θ(n^2) clock cycles on a microprocessor to Θ(n) on an FPGA. Through deep pipelining it is also possible to solve several problems in parallel and maximize both performance and efficiency. I/O requirements are shown to be scalable and to converge to a constant value as the matrix order increases. Post place-and-route results on a readily available Virtex-II 6000 demonstrate sustained performance of 5 GFLOPS, and results on a Virtex-5 330 indicate sustained performance of 35 GFLOPS. A comparison with an optimized software implementation running on a high-end CPU demonstrates that this FPGA implementation represents a significant speedup of at least an order of magnitude.
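For context, the Conjugate Gradient method accelerated above is dominated by one matrix-vector product per iteration, the Θ(n^2) step that the FPGA design parallelizes; a minimal software sketch in Python/NumPy, assuming a symmetric positive-definite system:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A. Each iteration is dominated
    by the matrix-vector product A @ p plus a few dot products and vector updates."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Small SPD test system.
M = np.random.rand(50, 50)
A = M @ M.T + 50 * np.eye(50)
b = np.random.rand(50)
assert np.allclose(A @ conjugate_gradient(A, b), b, atol=1e-6)
```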
Updating an LU Factorization with Pivoting
, 2006
"... We show how to compute an LU factorization of a matrix when the factors of a leading principle submatrix are already known. The approach incorporates pivoting akin to partial pivoting, a strategy we call incremental pivoting. An implementation using the Formal Linear Algebra Methods Environment (FLA ..."
Abstract

Cited by 4 (3 self)
We show how to compute an LU factorization of a matrix when the factors of a leading principal submatrix are already known. The approach incorporates pivoting akin to partial pivoting, a strategy we call incremental pivoting. An implementation using the Formal Linear Algebra Methods Environment (FLAME) Application Programming Interface (API) is described. Experimental results demonstrate practical numerical stability and high performance on an Intel Itanium2 processor-based server.
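As a point of reference for the incremental-pivoting idea above, classical LU factorization with partial pivoting looks as follows; this unblocked Python/NumPy sketch is the standard algorithm, not the paper's update procedure:

```python
import numpy as np

def lu_partial_pivoting(A):
    """Unblocked LU with partial pivoting: returns P, L, U with P @ A = L @ U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n - 1):
        piv = k + np.argmax(np.abs(A[k:, k]))               # row with largest pivot
        if piv != k:
            A[[k, piv]] = A[[piv, k]]
            perm[[k, piv]] = perm[[piv, k]]
        A[k+1:, k] /= A[k, k]                               # multipliers (column of L)
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # trailing update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    P = np.eye(n)[perm]
    return P, L, U

A = np.random.rand(6, 6)
P, L, U = lu_partial_pivoting(A)
assert np.allclose(P @ A, L @ U)
```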
Exploiting Parallelism in Matrix-Computation Kernels for Symmetric Multiprocessor Systems: Matrix-Multiplication and Matrix-Addition Algorithm Optimizations by Software Pipelining and Threads Allocation
, 2011
"... We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen’s and Winograd’s fast matrix multiply or their combination with the 3M algorithm for complex matrices (i.e., hybrid: a recursive algorithm as Strassen’s until ..."
Abstract

Cited by 2 (1 self)
We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen's and Winograd's fast matrix multiply or their combination with the 3M algorithm for complex matrices (i.e., hybrid: a recursive algorithm such as Strassen's until a highly tuned BLAS matrix multiplication allows performance advantages). We investigate how modern symmetric multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful and natural exploitation of parallelism at the function level (optimizations) such as function-call parallelism, function percolation, and function software pipelining. We make three contributions: first, we present a performance overview for double-precision and double-complex-precision matrices on state-of-the-art SMP systems; second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new schedules of Winograd's matrix multiplication (achieving up to a 20% speed-up w.r.t. regular matrix multiplication); regarding the latter Winograd algorithms, one is designed to minimize the number of matrix additions and the other to minimize the computation latency of matrix additions; third, we apply software pipelining and thread allocation to all the algorithms.
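The 3M algorithm mentioned above multiplies complex matrices with three real matrix multiplications instead of four, trading them for extra additions; a minimal Python/NumPy sketch (the paper's scheduling and threading are not shown):

```python
import numpy as np

def gemm_3m(A, B):
    """Complex matrix multiply via the 3M scheme: three real multiplications
    instead of four, at the cost of a few extra matrix additions."""
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    T1 = Ar @ Br
    T2 = Ai @ Bi
    T3 = (Ar + Ai) @ (Br + Bi)
    return (T1 - T2) + 1j * (T3 - T1 - T2)

A = np.random.rand(4, 4) + 1j * np.random.rand(4, 4)
B = np.random.rand(4, 4) + 1j * np.random.rand(4, 4)
assert np.allclose(gemm_3m(A, B), A @ B)
```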
Reducing the Worst Case Running Times of a Family of RNA and CFG Problems, Using Valiant’s Approach
"... Abstract. We study Valiant’s classical algorithm for Context Free Grammar recognition in subcubic time, and extract features that are common to problems on which Valiant’s approach can be applied. Based on this, we describe several problem templates, and formulate generic algorithms that use Valian ..."
Abstract

Cited by 2 (1 self)
Abstract. We study Valiant's classical algorithm for Context-Free Grammar recognition in sub-cubic time, and extract features that are common to problems to which Valiant's approach can be applied. Based on this, we describe several problem templates and formulate generic algorithms that use Valiant's technique and can be applied to all problems which abide by these templates. These algorithms obtain new worst-case running time bounds for a large family of important problems within the world of RNA secondary structures and Context-Free Grammars.
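For orientation, the baseline the above work speeds up is cubic-time CFG recognition; the cell-combination step in the CYK sketch below is the product-like operation that Valiant's technique batches into matrix multiplications. The grammar encoding here is illustrative:

```python
def cyk_recognize(word, grammar, start="S"):
    """Plain O(n^3) CYK recognition for a grammar in Chomsky normal form,
    given as a list of (lhs, rhs) rules with rhs a 1-tuple (terminal) or
    2-tuple (nonterminals)."""
    n = len(word)
    # table[i][j]: set of nonterminals deriving word[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, ch in enumerate(word):
        table[i][i + 1] = {A for A, rhs in grammar if rhs == (ch,)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                # The combination of table[i][k] and table[k][j] below is the
                # "product" that Valiant's approach performs as matrix multiplication.
                for A, rhs in grammar:
                    if len(rhs) == 2 and rhs[0] in table[i][k] and rhs[1] in table[k][j]:
                        table[i][j].add(A)
    return start in table[0][n]

# Tiny CNF grammar for the language { a^n b^n : n >= 1 } (illustrative encoding).
grammar = [("S", ("A", "T")), ("S", ("A", "B")), ("T", ("S", "B")),
           ("A", ("a",)), ("B", ("b",))]
assert cyk_recognize("aabb", grammar)
assert not cyk_recognize("abab", grammar)
```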
Restructuring the QR Algorithm for Performance
"... We show how the QR algorithm can be restructured so that it becomes rich in operations that can achieve nearpeak performance on a modern processor. The key is a novel algorithm for applying multiple sets of Givens rotations. We demonstrate the merits of this new QR algorithm for computing the Hermi ..."
Abstract

Cited by 1 (0 self)
We show how the QR algorithm can be restructured so that it becomes rich in operations that can achieve near-peak performance on a modern processor. The key is a novel algorithm for applying multiple sets of Givens rotations. We demonstrate the merits of this new QR algorithm for computing the Hermitian (symmetric) eigenvalue decomposition and singular value decomposition of dense matrices when all eigenvectors/singular vectors are computed. The approach yields vastly improved performance relative to the traditional QR algorithm and is competitive with two commonly used alternatives, Cuppen's Divide-and-Conquer algorithm and the Method of Multiple Relatively Robust Representations, while inheriting the more modest O(n) workspace requirements of the original QR algorithm. Since the computations performed by the restructured algorithm remain essentially identical to those performed by the original method, robust numerical properties are preserved.
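The building block of the restructured algorithm above is the Givens rotation; a minimal Python/NumPy sketch of computing one rotation and applying it to a pair of rows (the paper's contribution, applying many accumulated rotations together for better cache behavior, is not reproduced here):

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def apply_givens_rows(A, i, k, c, s):
    """Apply the rotation to rows i and k of A in place."""
    Ai, Ak = A[i, :].copy(), A[k, :].copy()
    A[i, :] = c * Ai + s * Ak
    A[k, :] = -s * Ai + c * Ak

A = np.random.rand(4, 4)
c, s = givens(A[2, 0], A[3, 0])
apply_givens_rows(A, 2, 3, c, s)
assert abs(A[3, 0]) < 1e-12   # the targeted entry has been zeroed
```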
Scalable Parallelization of FLAME Code via the Workqueuing Model
"... We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use loop and array indices, and represents these algorithms as they are formal ..."
Abstract
 Add to MetaCart
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands, much like a 2D data distribution is needed on distributed-memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update, and report impressive performance on an SGI system with 14 Itanium2 processors.
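The benefit of sorting tasks by computational cost before parallel execution, noted above, can be sketched outside OpenMP; the Python example below orders independent block products largest-first before handing them to a thread pool (the workload and pool size are illustrative, not the paper's FLAME/taskq implementation):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Illustrative workload: independent block products of different sizes, so the
# tasks have different computational costs (sizes are made up for the example).
sizes = [64, 512, 128, 448, 256, 320]
blocks = [(np.random.rand(n, n), np.random.rand(n, n)) for n in sizes]

def task(pair):
    A, B = pair
    return A @ B

# Sort tasks by estimated cost (flop count ~ n^3), largest first, before
# submitting them, so the most expensive work starts earliest and load
# imbalance at the tail of the schedule is reduced.
ordered = sorted(blocks, key=lambda ab: ab[0].shape[0] ** 3, reverse=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, ordered))
```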