Results 1 - 10
of
13
Provably good multicore cache performance for divide-and-conquer algorithms
- In Proc. 19th ACM-SIAM Sympos. Discrete Algorithms
, 2008
"... This paper presents a multicore-cache model that reflects the reality that multicore processors have both per-processor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divide-andconquer algorithms and present a new on-line scheduler, controlled-pdf, t ..."
Abstract
-
Cited by 25 (9 self)
- Add to MetaCart
This paper presents a multicore-cache model that reflects the reality that multicore processors have both per-processor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divide-andconquer algorithms and present a new on-line scheduler, controlled-pdf, that is competitive with the standard sequential scheduler in the following sense. Given any dynamically unfolding computation DAG from this class of algorithms, the cache complexity on the multicore-cache model under our new scheduler is within a constant factor of the sequential cache complexity for both L1 and L2, while the time complexity is within a constant factor of the sequential time complexity divided by the number of processors p. These are the first such asymptoticallyoptimal results for any multicore model. Finally, we show that a separator-based algorithm for sparse-matrix-densevector-multiply achieves provably good cache performance in the multicore-cache model, as well as in the well-studied sequential cache-oblivious model.
Minimizing Communication in Linear Algebra
, 2009
"... In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in ..."
Abstract
-
Cited by 16 (9 self)
- Add to MetaCart
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, n-by-n matrix-multiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √ M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL T factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. 1
permission. MINIMIZING COMMUNICATION IN NUMERICAL LINEAR ALGEBRA
"... personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires pri ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific
Graph Expansion and Communication Costs of Fast Matrix Multiplication
"... The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communi ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal. 1.
Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods
- SIAM Journal on Scientific Computing
, 2009
"... Abstract. In this article, we introduce a cache-oblivious method for sparse matrix–vector multiplication. Our method attempts to permute the rows and columns of the input matrix using a recursive hypergraph-based sparse matrix partitioning scheme so that the resulting matrix induces cache-friendly b ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Abstract. In this article, we introduce a cache-oblivious method for sparse matrix–vector multiplication. Our method attempts to permute the rows and columns of the input matrix using a recursive hypergraph-based sparse matrix partitioning scheme so that the resulting matrix induces cache-friendly behavior during sparse matrix–vector multiplication. Matrices are assumed to be stored in row-major format, by means of the compressed row storage (CRS) or its variants incremental CRS and zig-zag CRS. The zig-zag CRS data structure is shown to fit well with the hypergraph metric used in partitioning sparse matrices for the purpose of parallel computation. The separated block-diagonal (SBD) form is shown to be the appropriate matrix structure for cache enhancement. We have implemented a run-time cache simulation library enabling us to analyze cache behavior for arbitrary matrices and arbitrary cache properties during matrix–vector multiplication within a k-way set-associative idealized cache model. The results of these simulations are then verified by actual experiments run on various cache architectures. In all these experiments, we use the Mondriaan sparse matrix partitioner in one-dimensional mode. The savings in computation time achieved by our matrix reorderings reach up to 50 percent, in the case of a large link matrix.
On the Representation and Multiplication of Hypersparse Matrices
, 2008
"... Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on ..."
Abstract
-
Cited by 4 (4 self)
- Add to MetaCart
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
The I/O complexity of sparse matrix dense matrix multiplication
- In Proceedings of LATIN’10
, 2010
"... We consider the multiplication of a sparse N × N matrix A with a dense N × N matrix B in the I/O model. We determine the worst-case non-uniform complexity of this task up to a constant factor for all meaningful choices of the parameters N (dimension of the matrices), k (average number of non-zero en ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
We consider the multiplication of a sparse N × N matrix A with a dense N × N matrix B in the I/O model. We determine the worst-case non-uniform complexity of this task up to a constant factor for all meaningful choices of the parameters N (dimension of the matrices), k (average number of non-zero entries per column or row in A, i.e., there are in total kN non-zero entries), M (main memory size), and B (block size), as long as M ≥ B 2 (tall cache assumption). For large and small k, the structure of the algorithm does not need to depend on the structure of the sparse matrix A, whereas for intermediate densities it is possible and necessary to find submatrices that fit in memory and are slightly denser than on average. The focus of this work is asymptotic worst-case complexity, i.e., the existence of matrices that require a certain number of I/Os and the existence of algorithms (sometimes depending on the shape of the sparse matrix) that use only a constant factor more I/Os. 1
Evaluating Non-Square Sparse Bilinear Forms on Multiple Vector Pairs in the I/O-Model
"... Abstract. We consider evaluating one bilinear form defined by a sparse Ny × Nx matrix A having h entries on w pairs of vectors The model of computation is the semiring I/O-model with main memory size M and block size B. For a range of low densities (small h), we determine the I/O-complexity of this ..."
Abstract
- Add to MetaCart
Abstract. We consider evaluating one bilinear form defined by a sparse Ny × Nx matrix A having h entries on w pairs of vectors The model of computation is the semiring I/O-model with main memory size M and block size B. For a range of low densities (small h), we determine the I/O-complexity of this task for all meaningful choices of Nx, Ny, w, M and B, as long as M ≥ B 2 (tall cache assumption). To this end, we present asymptotically optimal algorithms and matching lower bounds. Moreover, we show that multiplying the matrix A with w vectors has the same worst-case I/O-complexity. 1

