Results 1–10 of 89
Efficient Computation of PageRank
, 1999
Cited by 146 (6 self)

Abstract
This paper discusses efficient techniques for computing PageRank, a ranking metric for hypertext documents. We show that PageRank can be computed for very large subgraphs of the web (up to hundreds of millions of nodes) on machines with limited main memory. Running-time measurements on various memory configurations are presented for PageRank computation over the 24-million-page Stanford WebBase archive. We discuss several methods for analyzing the convergence of PageRank based on the induced ordering of the pages. We present convergence results helpful for determining the number of iterations necessary to achieve a useful PageRank assignment, both in the absence and presence of search queries.
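The iteration such papers accelerate is the standard PageRank power method; a minimal pure-Python sketch follows (the graph, damping factor, and tolerance are illustrative, not taken from the paper):

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iters=100):
    """links: dict mapping page -> list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iters):
        # every page receives the (1 - d)/n teleportation mass
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        # L1 change between successive iterates controls termination
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

With a tiny three-page graph, e.g. `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, the ranks sum to 1 and the most-linked page ranks highest.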
Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms
 In Proc. SC2007: High performance computing, networking, and storage conference
, 2007
Cited by 146 (23 self)

Abstract
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) – one of the most heavily used kernels in scientific computing – across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
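For reference, the baseline kernel these SpMV studies tune is plain compressed-sparse-row (CSR) matrix-vector multiply; a minimal pure-Python sketch (array names follow the common convention, not any one paper):

```python
def spmv_csr(rowptr, colidx, vals, x):
    """y = A @ x for A stored in compressed-sparse-row (CSR) form.

    rowptr[i]..rowptr[i+1] delimits row i's nonzeros in vals/colidx.
    """
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            # the load of x is indirect through colidx: the main
            # source of poor cache behavior these papers attack
            acc += vals[k] * x[colidx[k]]
        y[i] = acc
    return y
```

For the matrix [[1,0,2],[0,3,0],[4,0,5]] and x = [1,1,1], the arrays are rowptr = [0,2,3,5], colidx = [0,2,1,0,2], vals = [1,2,3,4,5], and the result is [3, 3, 9].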
Improving Performance of Sparse Matrix-Vector Multiplication
, 1999
Cited by 61 (3 self)

Abstract
Sparse matrix-vector multiplication (SpMxV) is one of the most important computational kernels in scientific computing. It often suffers from poor cache utilization and extra load operations because of memory indirections used to exploit sparsity. We propose alternative data structures, along with reordering algorithms to increase effectiveness of these data structures, to reduce the number of memory indirections. Toledo proposed handling the 1x2 blocks of a matrix separately, doing only one indirection for each block. We propose packing all contiguous nonzeros into a block to reduce the number of memory indirections further. This reduces memory indirections per block to one for the cost of an extra array in storage and a loop during SpMxV. We also propose an algorithm to permute the nonzeros of the matrix into contiguous locations. We state this problem as the traveling salesperson problem and use associated heuristics. Experiments verify the effectiveness of our techniques....
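The packing idea can be sketched as follows: runs of column-adjacent nonzeros within a row are grouped into variable-length blocks, so each run needs a single column-index load instead of one per nonzero. This is a simplified illustration of the scheme the abstract describes, with hypothetical array names:

```python
def pack_contiguous(rowptr, colidx, vals):
    """Group runs of column-adjacent nonzeros into blocks.

    Each block stores one starting column and a length, so SpMxV
    performs one indirection per block rather than per nonzero.
    """
    b_rowptr, b_start, b_len, b_vals = [0], [], [], []
    n = len(rowptr) - 1
    for i in range(n):
        k = rowptr[i]
        while k < rowptr[i + 1]:
            j = k
            while j + 1 < rowptr[i + 1] and colidx[j + 1] == colidx[j] + 1:
                j += 1                       # extend the contiguous run
            b_start.append(colidx[k])
            b_len.append(j - k + 1)
            b_vals.extend(vals[k:j + 1])
            k = j + 1
        b_rowptr.append(len(b_start))
    return b_rowptr, b_start, b_len, b_vals

def spmv_blocked(b_rowptr, b_start, b_len, b_vals, x):
    """y = A @ x over the packed representation."""
    n = len(b_rowptr) - 1
    y = [0.0] * n
    v = 0
    for i in range(n):
        for b in range(b_rowptr[i], b_rowptr[i + 1]):
            c = b_start[b]                   # single indirection per block
            for t in range(b_len[b]):
                y[i] += b_vals[v] * x[c + t]
                v += 1
    return y
```

For a row with nonzeros in columns 1, 2, and 4, packing produces two blocks (length 2 and length 1) instead of three separate index loads; the TSP-based reordering in the paper permutes rows/columns to make such runs longer.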
Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply
 In Proceedings of Supercomputing
, 2002
Cited by 57 (9 self)

Abstract
We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpMV), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits.
Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY
 In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS
Cited by 49 (6 self)

Abstract
Sparse matrix-vector multiplication is an important computational kernel that tends to perform poorly on modern processors, largely because of its high ratio of memory operations to arithmetic operations. Optimizing this algorithm is difficult, both because of the complexity of memory systems and because the performance is highly dependent on the nonzero structure of the matrix. The Sparsity system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. The most difficult aspect of optimizing these algorithms is selecting among a large set of possible transformations and choosing parameters, such as block size. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that for matrices arising in scientific simulations, register-level optimizations are critical, and we focus here on the optimizations and parameter selection techniques used in Sparsity for register-level optimizations. We demonstrate speedups of up to 2x for the single vector case and 5x for the multiple vector case.
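Register-level blocking of the kind Sparsity applies stores small dense tiles so the inner product is fully unrolled and its operands can stay in registers; a sketch for a fixed 2x2 block size (the format details here are illustrative, not Sparsity's actual data structures):

```python
def spmv_bcsr_2x2(browptr, bcolidx, bvals, x):
    """y = A @ x with A in 2x2 block-CSR (BCSR).

    Each stored block is a dense row-major 2x2 tile, so the inner
    multiply is unrolled: one column-index load serves four values.
    """
    nb = len(browptr) - 1            # number of block rows
    y = [0.0] * (2 * nb)
    for bi in range(nb):
        y0 = y1 = 0.0                # accumulators held "in registers"
        for k in range(browptr[bi], browptr[bi + 1]):
            c = 2 * bcolidx[k]
            a = bvals[4 * k: 4 * k + 4]
            x0, x1 = x[c], x[c + 1]
            y0 += a[0] * x0 + a[1] * x1
            y1 += a[2] * x0 + a[3] * x1
        y[2 * bi], y[2 * bi + 1] = y0, y1
    return y
```

For the block-diagonal matrix [[1,2,0,0],[3,4,0,0],[0,0,5,6],[0,0,7,8]] and x = [1,1,1,1], the blocked arrays are browptr = [0,1,2], bcolidx = [0,1], bvals = [1..8], and the result is [3, 7, 11, 15]. Choosing the block size (here fixed at 2x2) is exactly the parameter-selection problem the abstract highlights.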
Autotuning Performance on Multicore Computers
, 2008
Cited by 32 (10 self)
Toward realistic performance bounds for implicit CFD codes
 Proceedings of Parallel CFD’99
, 1999
Cited by 32 (13 self)

Abstract
The performance of scientific computing applications often achieves a small fraction of peak performance [7,17]. In this paper, we discuss two causes of performance problems, insufficient memory bandwidth and a suboptimal instruction mix, in the context of a complete, parallel, unstructured mesh implicit CFD code. These results show that the performance of our code and of similar implicit codes is limited by the memory bandwidth of RISC-based processor nodes to as little as 10% of peak performance for some critical computational kernels. Limits on the number of basic operations that can be performed in a single clock cycle also limit the performance of “cache-friendly” parts of the code.
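The style of bound the paper develops can be reproduced with simple arithmetic: a kernel that must move b bytes of memory traffic per flop cannot sustain more than min(peak, bandwidth / b) flop/s. The machine numbers below are hypothetical, chosen only to show the shape of the calculation:

```python
def attainable_flops(peak_flops, bandwidth_bytes, bytes_per_flop):
    """Bandwidth-limited upper bound on sustained flop rate."""
    return min(peak_flops, bandwidth_bytes / bytes_per_flop)

# CSR SpMV streams roughly one 8-byte matrix value plus one 4-byte
# column index per multiply-add pair, i.e. about 6 bytes per flop.
peak = 1.0e9       # 1 Gflop/s peak (hypothetical machine)
bw = 1.0e9         # 1 GB/s sustained memory bandwidth (hypothetical)
bound = attainable_flops(peak, bw, 6.0)
fraction_of_peak = bound / peak      # about 1/6 of peak
```

Under these illustrative numbers the bound lands at roughly 17% of peak, the same order as the "as little as 10% of peak" the paper measures for its memory-bound kernels.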
Self-adapting numerical software for next generation applications
 Int. J. High Perf. Comput. Appl
, 2002
Cited by 26 (6 self)

Abstract
The challenge for the development of next generation software is the successful management of the complex grid environment while delivering to the scientist the full power of flexible compositions of the available algorithmic alternatives. Self-Adapting Numerical Software (SANS) systems are intended to meet this significant challenge. A SANS system comprises intelligent next generation numerical software that domain scientists – with disparate levels of knowledge of algorithmic and programmatic complexities of the underlying numerical software – can use to easily express and efficiently solve their problem. The components of a SANS system are:
• A SANS agent with:
  – an intelligent component that automates method selection based on data, algorithm and system attributes;
  – a system component that provides intelligent management of and access to the computational grid;
  – a history database that records relevant information generated by the intelligent component and maintains past performance data of the interaction (e.g., algorithmic, hardware specific, etc.) between SANS components.
• A simple scripting language that allows a structured multilayered implementation of the SANS while ensuring portability and extensibility of the user interface and underlying libraries.
• An XML/CCA-based vocabulary of metadata to describe behavioural properties of both data and algorithms.
• System components, including a runtime adaptive scheduler, and prototype libraries that automate the process of architecture-dependent tuning to optimize performance on different platforms.
A SANS system can dramatically improve the ability of computational scientists to model complex, interdisciplinary phenomena with maximum efficiency and a minimum of extra-domain expertise. SANS innovations (and their generalizations) will provide to the scientific and engineering community a dynamic computational environment in which the most effective library components are automatically selected based on the problem characteristics, data attributes, and the state of the grid.
Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks
 In SPAA
, 2009
Cited by 25 (1 self)

Abstract
This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and A^T x to be computed efficiently in parallel, where A is an n × n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz / (√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is essentially the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but A^T x is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and A^T x run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
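The CSR asymmetry the abstract mentions is easy to see in code: computing A^T x from CSR arrays turns the row-wise gather of Ax into a column-wise scatter, whose updates race if rows are processed in parallel. A sequential pure-Python sketch (array names are conventional, not from the paper):

```python
def spmv_csr_transpose(rowptr, colidx, vals, x):
    """y = A.T @ x using A's CSR arrays.

    Each nonzero scatters into y[colidx[k]], so parallelizing over
    rows would need atomics or a transposed copy -- the difficulty
    the CSB format is designed to avoid.
    """
    ncols = max(colidx) + 1 if colidx else 0
    y = [0.0] * ncols
    n = len(rowptr) - 1
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[colidx[k]] += vals[k] * x[i]   # scatter: races in parallel
    return y
```

For A = [[1,0,2],[0,3,0]] (rowptr = [0,2,3], colidx = [0,2,1], vals = [1,2,3]) and x = [1,2], the result is A^T x = [1, 6, 2].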
Self-Adapting Software for Numerical Linear Algebra and LAPACK for Clusters
 Parallel Computing
, 2003
Cited by 24 (16 self)

Abstract
This article describes the context, design, and recent development of the LAPACK for Clusters (LFC) project. It has been developed in the framework of Self-Adapting Numerical Software (SANS) since we believe such an approach can deliver the convenience and ease of use of existing sequential environments bundled with the power and versatility of highly-tuned parallel codes that execute on clusters. Accomplishing this task is far from trivial as we argue in the paper by presenting pertinent case studies and possible usage scenarios.