Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors
 The 12th International Conference on High Performance Computing (HiPC 2005)
, 2005
Abstract

Cited by 22 (1 self)
Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multigraph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.
Fast Shared-Memory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs
, 2006
The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation
, 2009
Abstract

Cited by 21 (7 self)
We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cache-oblivious methods, I-GEP and C-GEP, both of which reduce the number of I/Os performed by the computation over that performed by standard GEP by a factor of √M, where M is the size of the cache. Cache-oblivious I-GEP computes in-place and solves most of the known applications of GEP, including Gaussian elimination and LU decomposition without pivoting and Floyd-Warshall all-pairs shortest paths. Cache-oblivious C-GEP uses a modest amount of additional space, but is completely general and applies to any code in GEP form. Both I-GEP and C-GEP produce system-independent cache-efficient code, and are potentially usable by optimizing compilers for loop transformation. We present parallel I-GEP and C-GEP that achieve good speedup and match the sequential caching performance cache-obliviously for both shared and distributed caches for sufficiently large inputs. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive trade-off between efficiency and portability.
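As a concrete illustration of the loop form the abstract refers to, the GEP computation can be sketched in a few lines of Python. The function name `gep`, the generic update `f`, and the Floyd-Warshall instantiation below are illustrative assumptions; the paper treats a more general formulation with index sets.

```python
def gep(c, f):
    """Generic Gaussian Elimination Paradigm: a triply-nested loop that
    updates c[i][j] from c[i][j], c[i][k], c[k][j] and c[k][k].  The
    update function f varies by application (elimination, paths, ...)."""
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])
    return c

# Floyd-Warshall all-pairs shortest paths is one instance of GEP:
def fw_update(cij, cik, ckj, ckk):
    return min(cij, cik + ckj)

INF = float("inf")
dist = [[0, 3, INF],
        [INF, 0, 1],
        [2, INF, 0]]
gep(dist, fw_update)   # dist[0][2] is now 4 (path 0 -> 1 -> 2)
```

The cache-oblivious versions compute the same updates in a recursive quadrant order rather than with these three loops.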
Minimizing Communication in Linear Algebra
, 2009
Abstract

Cited by 17 (8 self)
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and a large, slow memory) needed to perform dense n-by-n matrix multiplication using the conventional O(n³) algorithm, where the input matrices are too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDLᵀ factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth), we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower-bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or whether we can do better; we give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
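One way to see the Ω(#operations / √M) bound being attained is blocked matrix multiplication: with b×b tiles chosen so three tiles fit in a fast memory of size M (b ≈ √(M/3)), each of the (n/b)³ block products moves O(b²) words, for O(n³/√M) total traffic. A minimal Python sketch, where the block size `b` stands in for that choice:

```python
def blocked_matmul(A, B, b):
    """Multiply two n x n matrices tile by tile.  Each b x b block
    product touches only three tiles, so with b ~ sqrt(M/3) the total
    data movement is O(n^3 / sqrt(M)), matching the Hong-Kung lower
    bound up to a constant factor."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, b):          # loop over tiles of C
        for jj in range(0, n, b):
            for kk in range(0, n, b):  # accumulate one tile product
                for i in range(ii, min(ii + b, n)):
                    for j in range(jj, min(jj + b, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + b, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

The result is independent of `b`; only the memory-traffic pattern changes.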
Cache-oblivious dynamic programming
In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’06
, 2006
Abstract

Cited by 16 (5 self)
We present efficient cache-oblivious algorithms for several fundamental dynamic programs. These include new algorithms with improved cache performance for longest common subsequence (LCS), edit distance, gap (i.e., edit distance with gaps), and least-weight subsequence. We present a new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting that also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and ‘simple DP’, among other problems.
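For reference, the classic quadratic-time LCS recurrence these cache-oblivious algorithms improve on can be written as follows; the paper's versions compute the same table but traverse it recursively so that each subproblem fits in cache (this sketch is the textbook baseline, not the paper's algorithm):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y via the
    standard O(mn) dynamic program: d[i][j] is the LCS length of the
    prefixes x[:i] and y[:j]."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1    # extend a match
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n]
```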
Graph Expansion and Communication Costs of Fast Matrix Multiplication
Abstract

Cited by 13 (11 self)
The communication cost of algorithms (also known as I/O complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and hence optimal.
R-Kleene: A High-Performance Divide-and-Conquer Algorithm for the All-Pair Shortest Path for Densely Connected Networks
, 2007
Abstract

Cited by 10 (0 self)
We propose a novel divide-and-conquer algorithm for the solution of the all-pair shortest-path problem for directed and dense graphs with no negative cycles. We propose R-Kleene, a compact and in-place recursive algorithm inspired by Kleene’s algorithm. R-Kleene delivers better performance than previous algorithms for randomly generated graphs represented by highly dense adjacency matrices, in which the matrix components can have any integer value. We show that R-Kleene, unchanged and without any machine tuning, yields consistently between 1/7 and 1/2 of peak performance running on five very different uniprocessor systems.
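The divide-and-conquer structure behind a Kleene-style all-pairs computation can be sketched as below: split the distance matrix into quadrants, close each diagonal block, and combine with min-plus matrix products. This is only an illustrative sketch under stated assumptions (zero diagonal, no negative cycles, power-of-two size), and for clarity it copies quadrants rather than updating in place as R-Kleene does; all helper names are ours.

```python
INF = float("inf")

def sub(D, r, c, n):
    return [row[c:c + n] for row in D[r:r + n]]

def put(D, r, c, B):
    for i, row in enumerate(B):
        D[r + i][c:c + len(row)] = row

def mul(A, B):
    """Min-plus matrix product: result[i][j] = min_k A[i][k] + B[k][j]."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def madd(A, B):
    """Entrywise minimum."""
    return [[min(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def kleene(A):
    """All-pairs shortest paths of A (zero diagonal, no negative cycles,
    power-of-two size) by recursive quadrant closure."""
    n = len(A)
    if n == 1:
        return A
    h = n // 2
    A11, A12 = sub(A, 0, 0, h), sub(A, 0, h, h)
    A21, A22 = sub(A, h, 0, h), sub(A, h, h, h)
    A11 = kleene(A11)                       # close the top-left block
    A12 = mul(A11, A12)                     # paths 1 -> 2 via block 1
    A21 = mul(A21, A11)                     # paths 2 -> 1 via block 1
    A22 = kleene(madd(A22, mul(A21, A12)))  # close block 2 with shortcuts
    A21 = mul(A22, A21)                     # full 2 -> 1 paths
    A12 = mul(A12, A22)                     # full 1 -> 2 paths
    A11 = madd(A11, mul(A12, A21))          # 1 -> 1 paths through block 2
    out = [[0] * n for _ in range(n)]
    put(out, 0, 0, A11); put(out, 0, h, A12)
    put(out, h, 0, A21); put(out, h, h, A22)
    return out

# 4-cycle with unit weights: distance(i, j) = (j - i) mod 4
C4 = [[0, 1, INF, INF],
      [INF, 0, 1, INF],
      [INF, INF, 0, 1],
      [1, INF, INF, 0]]
closure = kleene(C4)
```

R-Kleene's contribution is performing this recursion compactly and in place over the adjacency matrix.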
Parallel Shortest Path Algorithms for Solving . . .
, 2006
Abstract

Cited by 9 (3 self)
We present an experimental study of the single-source shortest path problem with nonnegative edge weights (NSSP) on large-scale graphs using the ∆-stepping parallel algorithm. We report performance results on the Cray MTA-2, a multithreaded parallel computer. The MTA-2 is a high-end shared-memory system offering two unique features that aid the efficient parallel implementation of irregular algorithms: the ability to exploit fine-grained parallelism, and low-overhead synchronization primitives. Our implementation exhibits remarkable parallel speedup when compared with competitive sequential algorithms, for low-diameter sparse graphs. For instance, ∆-stepping on a directed scale-free graph of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the first performance results for a shortest path problem on realistic graph instances on the order of billions of vertices and edges.
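The core idea of ∆-stepping is to keep tentative distances in buckets of width ∆, relaxing light edges (weight ≤ ∆) repeatedly while the current bucket refills and heavy edges once per settled vertex. A single-threaded Python sketch of that bucket structure (the graph format and names are ours, and this is not the paper's multithreaded MTA-2 code):

```python
def delta_stepping(n, edges, source, delta):
    """Sequential sketch of Delta-stepping SSSP.  edges is a list of
    (u, v, w) triples with nonnegative weights; returns the distance
    array from source."""
    INF = float("inf")
    light, heavy = {}, {}
    for u, v, w in edges:
        (light if w <= delta else heavy).setdefault(u, []).append((v, w))
    dist = [INF] * n
    buckets = {}                      # bucket index -> set of vertices

    def relax(v, d):
        if d < dist[v]:
            if dist[v] < INF:         # move v out of its old bucket
                old = int(dist[v] // delta)
                if old in buckets:
                    buckets[old].discard(v)
                    if not buckets[old]:
                        del buckets[old]
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0.0)
    while buckets:
        i = min(buckets)              # smallest non-empty bucket
        settled = set()
        while i in buckets:           # light relaxations may refill bucket i
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in light.get(u, []):
                    relax(v, dist[u] + w)
        for u in settled:             # heavy edges only reach later buckets
            for v, w in heavy.get(u, []):
                relax(v, dist[u] + w)
    return dist
```

∆ trades work for parallelism: all vertices in a bucket can be relaxed concurrently, which is what the MTA-2's fine-grained threading exploits.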
The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation
 In SPAA ’07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
, 2007
Abstract

Cited by 8 (2 self)
The cache-oblivious Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to obtain efficient cache-oblivious algorithms for several important problems that have algorithms with triply-nested loops similar to those that occur in Gaussian elimination. These include Gaussian elimination and LU decomposition without pivoting, all-pairs shortest paths, and matrix multiplication. In this paper, we prove several important properties of the cache-oblivious framework for GEP given in [6], which we denote by I-GEP. We build on these results to obtain C-GEP, a completely general cache-oblivious implementation of GEP that applies to any code in GEP form, and which has the same time and I/O bounds as the earlier algorithm in [6], while using a modest amount of additional space. We present an experimental evaluation of the caching performance of I-GEP and C-GEP in relation to the traditional Gaussian elimination algorithm. Our experimental results indicate that I-GEP and C-GEP outperform GEP on inputs of reasonable size, with dramatic improvement in running time over GEP when the data is out of core. ‘Tiling’, an important loop transformation technique employed by optimizing compilers to improve temporal locality in nested loops, is a cache-aware method that does not adapt to all levels of a multi-level memory hierarchy. The cache-oblivious GEP framework (either I-GEP or C-GEP) produces system-independent I/O-efficient code for triply-nested loops of the form that appears in Gaussian elimination without pivoting, and is potentially usable by optimizing compilers for loop transformation.
Cache-optimal algorithms for option pricing
, 2008
Abstract

Cited by 5 (4 self)
Today's computers have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to the slower memories in the hierarchy. In this paper, we study the computation of option prices using the binomial and trinomial models on processors with a multi-level memory hierarchy. We derive lower bounds on memory traffic between different levels of the hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have near-optimal memory traffic between levels. We have implemented these algorithms on an UltraSparc IIIi processor with a 4-level memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70% of peak performance.
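The binomial model referred to above prices an option by building a lattice of terminal payoffs and folding it back one level at a time. A minimal unblocked sketch (Cox-Ross-Rubinstein parameterization; the paper's contribution is blocking this traversal to minimize traffic between memory-hierarchy levels, which this sketch does not do):

```python
from math import exp, sqrt

def binomial_call(S, K, r, sigma, T, n):
    """Price a European call with an n-step Cox-Ross-Rubinstein binomial
    tree: compute option values at the n+1 terminal nodes, then fold the
    lattice back toward the root, one level per pass.  Each pass streams
    through O(n) values, which is the memory traffic the paper blocks."""
    dt = T / n
    u = exp(sigma * sqrt(dt))          # up factor
    d = 1.0 / u                        # down factor
    p = (exp(r * dt) - d) / (u - d)    # risk-neutral up probability
    disc = exp(-r * dt)                # one-step discount factor
    # option values at expiry: j up-moves, n - j down-moves
    v = [max(S * u**j * d**(n - j) - K, 0.0) for j in range(n + 1)]
    for level in range(n, 0, -1):      # fold back toward the root
        v = [disc * (p * v[j + 1] + (1 - p) * v[j]) for j in range(level)]
    return v[0]
```

With blocking, several fold-back levels are applied to a cache-sized triangle of the lattice before moving on, cutting traffic to slow memory.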