Results 1 - 10
of
21
Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors
- The 12th International Conference on High Performance Computing (HiPC 2005)
, 2005
"... Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that su ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multi-graph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.
Fast Shared-Memory Algorithms for Computing the Minimum Spanning Forest of Sparse Graphs
, 2006
"... ..."
Cache-oblivious dynamic programming
- In Proc. of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’06
, 2006
"... We present efficient cache-oblivious algorithms for several fundamental dynamic programs. These include new algorithms with improved cache performance for longest common subsequence (LCS), edit distance, gap (i.e., edit distance with gaps), and least weight subsequence. We present a new cache-oblivi ..."
Abstract
-
Cited by 17 (5 self)
- Add to MetaCart
We present efficient cache-oblivious algorithms for several fundamental dynamic programs. These include new algorithms with improved cache performance for longest common subsequence (LCS), edit distance, gap (i.e., edit distance with gaps), and least weight subsequence. We present a new cache-oblivious framework called the Gaussian Elimination Paradigm (GEP) for Gaussian elimination without pivoting that also gives cache-oblivious algorithms for Floyd-Warshall all-pairs shortest paths in graphs and ‘simple DP’, among other problems. 1
Minimizing Communication in Linear Algebra
, 2009
"... In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in ..."
Abstract
-
Cited by 16 (9 self)
- Add to MetaCart
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √ M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL T factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. 1
The Cache-Oblivious Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and Experimental Evaluation
, 2009
"... We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cache-oblivious methods I-GEP and C-GEP, both of which reduce the number of I/Os performed by the computation o ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cache-oblivious methods I-GEP and C-GEP, both of which reduce the number of I/Os performed by the computation over that performed by standard GEP by a factor of √ M, where M is the size of the cache. Cache-oblivious I-GEP computes in-place and solves most of the known applications of GEP including Gaussian elimination and LU-decomposition without pivoting and Floyd-Warshall all-pairs shortest paths. Cache-oblivious C-GEP uses a modest amount of additional space, but is completely general and applies to any code in GEP form. Both I-GEP and C-GEP produce system-independent cache-efficient code, and are potentially applicable to being used by optimizing compilers for loop transformation. We present parallel I-GEP and C-GEP that achieve good speed-up and match the sequential caching performance cache-obliviously for both shared and distributed caches for sufficiently large inputs. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations, and compare them with finelytuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive trade-off between efficiency and portability.
permission. MINIMIZING COMMUNICATION IN NUMERICAL LINEAR ALGEBRA
"... personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires pri ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation
- In SPAA ’07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
, 2007
"... The cache-oblivious Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to obtain efficient cache-oblivious algorithms for several important problems that have algorithms with triply-nested loops similar to those that occur in Gaussian elimination. These include Gaussian elimina ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
The cache-oblivious Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to obtain efficient cache-oblivious algorithms for several important problems that have algorithms with triply-nested loops similar to those that occur in Gaussian elimination. These include Gaussian elimination and LU-decomposition without pivoting, all-pairs shortest paths and matrix multiplication. In this paper, we prove several important properties of the cache-oblivious framework for GEP given in [6], which we denote by I-GEP. We build on these results to obtain C-GEP, a completely general cache-oblivious implementation of GEP that applies to any code in GEP form, and which has the same time and I/O bounds as the earlier algorithm in [6], while using a modest amount of additional space. We present an experimental evaluation of the caching performance of I-GEP and C-GEP in relation to the traditional Gaussian elimination algorithm. Our experimental results indicate that I-GEP and C-GEP outperform GEP on inputs of reasonable size, with dramatic improvement in running time over GEP when the data is out of core. ‘Tiling’, an important loop transformation technique employed by optimizing compilers in order to improve temporal locality in nested loops, is a cache-aware method that does not adapt to all levels in a multi-level memory hierarchy. The cache-oblivious GEP framework (either I-GEP or C-GEP) produces system-independent I/O-efficient code for triply nested loops of the form that appears in Gaussian elimination without pivoting, and is potentially applicable to being used by optimizing compilers for loop transformation.
Parallel Shortest Path Algorithms for Solving . . .
, 2006
"... We present an experimental study of the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs using the ∆-stepping parallel algorithm. We report performance results on the Cray MTA-2, a multithreaded parallel computer. The MTA-2 is a high-end shared memory s ..."
Abstract
-
Cited by 7 (3 self)
- Add to MetaCart
We present an experimental study of the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs using the ∆-stepping parallel algorithm. We report performance results on the Cray MTA-2, a multithreaded parallel computer. The MTA-2 is a high-end shared memory system offering two unique features that aid the efficient parallel implementation of irregular algorithms: the ability to exploit fine-grained parallelism, and low-overhead synchronization primitives. Our implementation exhibits remarkable parallel speedup when compared with competitive sequential algorithms, for low-diameter sparse graphs. For instance, ∆-stepping on a directed scale-free graph of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the first performance results of a shortest path problem on realistic graph instances in the order of billions of vertices and edges.
Graph Expansion and Communication Costs of Fast Matrix Multiplication
"... The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communi ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal. 1.
R-Kleene: A High-Performance Divide-and-Conquer Algorithm for the All-Pair Shortest Path for Densely Connected Networks
, 2007
"... We propose a novel divide-and-conquer algorithm for the solution of the all-pair shortest-path problem for directed and dense graphs with no negative cycles. We propose R-Kleene, a compact and in-place recursive algorithm inspired by Kleene’s algorithm. R-Kleene delivers a better performance than p ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
We propose a novel divide-and-conquer algorithm for the solution of the all-pair shortest-path problem for directed and dense graphs with no negative cycles. We propose R-Kleene, a compact and in-place recursive algorithm inspired by Kleene’s algorithm. R-Kleene delivers a better performance than previous algorithms for randomly generated graphs represented by highly dense adjacency matrices, in which the matrix components can have any integer value. We show that R-Kleene, unchanged and without any machine tuning, yields consistently between 1/7 and 1/2 of the peak performance running on five very different uniprocessor systems.

