Results 1  10
of
15
Provably good multicore cache performance for divideandconquer algorithms
 In Proc. 19th ACMSIAM Sympos. Discrete Algorithms
, 2008
"... This paper presents a multicorecache model that reflects the reality that multicore processors have both perprocessor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divideandconquer algorithms and present a new online scheduler, controlledpdf, t ..."
Abstract

Cited by 35 (12 self)
 Add to MetaCart
This paper presents a multicorecache model that reflects the reality that multicore processors have both perprocessor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divideandconquer algorithms and present a new online scheduler, controlledpdf, that is competitive with the standard sequential scheduler in the following sense. Given any dynamically unfolding computation DAG from this class of algorithms, the cache complexity on the multicorecache model under our new scheduler is within a constant factor of the sequential cache complexity for both L1 and L2, while the time complexity is within a constant factor of the sequential time complexity divided by the number of processors p. These are the first such asymptoticallyoptimal results for any multicore model. Finally, we show that a separatorbased algorithm for sparsematrixdensevectormultiply achieves provably good cache performance in the multicorecache model, as well as in the wellstudied sequential cacheoblivious model.
Minimizing Communication in Linear Algebra
, 2009
"... In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, nbyn matrixmultiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in ..."
Abstract

Cited by 17 (8 self)
 Add to MetaCart
In 1981 Hong and Kung [HK81] proved a lower bound on the amount of communication (amount of data moved between a small, fast memory and large, slow memory) needed to perform dense, nbyn matrixmultiplication using the conventional O(n 3) algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin [ITT04] gave a new proof of this result and extended it to the parallel case (where communication means the amount of data moved between processors). In both cases the lower bound may be expressed as Ω(#arithmetic operations / √ M), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL T factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. 1
GraphChi: Largescale Graph Computation On just a PC
 In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation, OSDI’12
, 2012
"... Current systems for graph computation require a distributed computing cluster to handle very large realworld problems, such as analysis on social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains c ..."
Abstract

Cited by 17 (2 self)
 Add to MetaCart
Current systems for graph computation require a distributed computing cluster to handle very large realworld problems, such as analysis on social networks or the web graph. While distributed computational resources have become more accessible, developing distributed graph algorithms still remains challenging, especially to nonexperts. In this work, we present GraphChi, a diskbased system for computing efficiently on graphs with billions of edges. By using a wellknown method to break large graphs into small parts, and a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumerlevel computer. We further extend GraphChi to support graphs that evolve over time, and demonstrate that, on a single computer, GraphChi can process over one hundred thousand graph updates per second, while simultaneously performing computation. We show, through experiments and theoretical analysis, that GraphChi performs well on both SSDs and rotational hard drives. By repeating experiments reported for existing distributed systems, we show that, with only fraction of the resources, GraphChi can solve the same problems in very reasonable time. Our work makes largescale graph computation available to anyone with a modern PC. 1
Graph Expansion and Communication Costs of Fast Matrix Multiplication
"... The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communi ..."
Abstract

Cited by 13 (11 self)
 Add to MetaCart
The communication cost of algorithms (also known as I/Ocomplexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen’s and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. For sequential algorithms these bounds are attainable and so optimal. 1.
Cacheoblivious sparse matrixvector multiplication by using sparse matrix partitioning methods
 SIAM Journal on Scientific Computing
, 2009
"... Abstract. In this article, we introduce a cacheoblivious method for sparse matrix–vector multiplication. Our method attempts to permute the rows and columns of the input matrix using a recursive hypergraphbased sparse matrix partitioning scheme so that the resulting matrix induces cachefriendly b ..."
Abstract

Cited by 13 (4 self)
 Add to MetaCart
Abstract. In this article, we introduce a cacheoblivious method for sparse matrix–vector multiplication. Our method attempts to permute the rows and columns of the input matrix using a recursive hypergraphbased sparse matrix partitioning scheme so that the resulting matrix induces cachefriendly behavior during sparse matrix–vector multiplication. Matrices are assumed to be stored in rowmajor format, by means of the compressed row storage (CRS) or its variants incremental CRS and zigzag CRS. The zigzag CRS data structure is shown to fit well with the hypergraph metric used in partitioning sparse matrices for the purpose of parallel computation. The separated blockdiagonal (SBD) form is shown to be the appropriate matrix structure for cache enhancement. We have implemented a runtime cache simulation library enabling us to analyze cache behavior for arbitrary matrices and arbitrary cache properties during matrix–vector multiplication within a kway setassociative idealized cache model. The results of these simulations are then verified by actual experiments run on various cache architectures. In all these experiments, we use the Mondriaan sparse matrix partitioner in onedimensional mode. The savings in computation time achieved by our matrix reorderings reach up to 50 percent, in the case of a large link matrix.
On the Representation and Multiplication of Hypersparse Matrices
, 2008
"... Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on ..."
Abstract

Cited by 9 (7 self)
 Add to MetaCart
Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
Reducedbandwidth multithreaded algorithms for sparse matrixvector multiplication
 In Proc. IPDPS
, 2011
"... Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floatingpoint performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymme ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
Abstract—On multicore architectures, the ratio of peak memory bandwidth to peak floatingpoint performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional fillin zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue. I.
The I/O complexity of sparse matrix dense matrix multiplication
 In Proceedings of LATIN’10
, 2010
"... We consider the multiplication of a sparse N × N matrix A with a dense N × N matrix B in the I/O model. We determine the worstcase nonuniform complexity of this task up to a constant factor for all meaningful choices of the parameters N (dimension of the matrices), k (average number of nonzero en ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
We consider the multiplication of a sparse N × N matrix A with a dense N × N matrix B in the I/O model. We determine the worstcase nonuniform complexity of this task up to a constant factor for all meaningful choices of the parameters N (dimension of the matrices), k (average number of nonzero entries per column or row in A, i.e., there are in total kN nonzero entries), M (main memory size), and B (block size), as long as M ≥ B 2 (tall cache assumption). For large and small k, the structure of the algorithm does not need to depend on the structure of the sparse matrix A, whereas for intermediate densities it is possible and necessary to find submatrices that fit in memory and are slightly denser than on average. The focus of this work is asymptotic worstcase complexity, i.e., the existence of matrices that require a certain number of I/Os and the existence of algorithms (sometimes depending on the shape of the sparse matrix) that use only a constant factor more I/Os. 1