MemSpy: Analyzing Memory System Bottlenecks in Programs
In Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, 1992
Cited by 99 (9 self)
To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior: if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data-oriented, in addition to code-oriented, performance tuning. Thus, for both source level code objects and data objects, Mem...
pSather monitors: Design, Tutorial, Rationale and Implementation
1991
Cited by 6 (3 self)
pSather is a parallel extension of Sather aimed at shared memory parallel architectures. A prototype of the language is currently being implemented on a Sequent Symmetry and on SUN Sparc-Stations. pSather monitors are one of the basic new features introduced in the language to deal with parallelism. The current design is presented and discussed in detail.
Author affiliations: ICSI and Computer Science Division, U.C. Berkeley (jfeldman@icsi.berkeley.edu); ICSI and Computer Science Division, U.C. Berkeley (clim@icsi.berkeley.edu); ICSI and Istituto di Elaborazione dell'Informazione, CNR Pisa, Italy (mazz@icsi.berkeley.edu).
Contents: 1 Introduction; 2 Monitor Design; 2.1 Locking; 2.2 Signals; 2.3 Forking; 2.4 T...
Parallel Direct Methods For Sparse Linear Systems
1997
Cited by 5 (0 self)
We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of the numerically intensive factorization process, but we also briefly consider the other major components of direct methods, such as parallel ordering.
Introduction. In this paper we present a brief overview of parallel direct methods for solving sparse linear systems. Paradoxically, sparse matrix factorization offers additional opportunities for exploiting parallelism beyond those available with dense matrices, yet it is often more difficult to attain good efficiency in the sparse case. We examine both sides of this paradox: the additional parallelism induced by sparsity, and the difficulty in achieving high efficiency in spite of it. We focus on Cholesky factorization, primarily because th...
Evaluating Memory System Performance of a Large Scale NUMA Multiprocessor
1993
Cited by 2 (2 self)
The effectiveness of large scale computing depends to a great extent on the performance of the memory system. As shared memory multiprocessors grow in size, their memory hierarchy deepens, resulting in a design with non-uniform latencies. In this paper, we explore the implications of multi-valued memory latencies. In particular, we study the effect of a nonuniform traffic distribution on a hierarchical large scale NUMA multiprocessor named Hector. Memory analysis is of interest because memory is a frequent source of poor performance in large scale multiprocessors. We have developed an analytical model that includes the effects of increased contention for system resources, and the impact of the arbitration algorithm on the network traffic. Our analysis has been validated with a detailed simulator. Also, we have examined two techniques for reducing memory latency. We assess the potential performance gains from replication of data and investigate the improvement in memory utilization by al...
Stability Of The Partitioned Inverse Method For Parallel Solution Of Sparse Triangular Systems
SIAM J. Scientific Computing, 1994
Cited by 2 (1 self)
Several authors have recently considered a parallel method for solving sparse triangular systems with many right-hand sides. The method employs a partition into sparse factors of the product form of the inverse of the coefficient matrix. It is shown here that while the method can be unstable, stability is guaranteed if a certain scalar that depends on the matrix and the partition is small, and that this scalar is small when the matrix is well-conditioned. Moreover, when the partition is chosen so that the factors have the same sparsity structure as the coefficient matrix, the backward error matrix can be taken to be sparse.
Key words: sparse matrix, triangular system, substitution algorithm, parallel algorithm, rounding error analysis, matrix inverse.
AMS subject classifications: primary 65F05, 65F50, 65G05.
1. Introduction. The method of choice for solving triangular systems on a serial computer is the substitution algorithm. Several approaches have been suggested for parallel sol...
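For readers unfamiliar with the serial substitution algorithm named in this abstract, the following is a minimal sketch of forward substitution for a sparse lower-triangular system. It is not taken from the paper: the compressed sparse row (CSR) layout, the function name `forward_substitution_csr`, and the small example matrix are all assumptions chosen for illustration.

```python
def forward_substitution_csr(indptr, indices, data, b):
    """Solve L x = b for a lower-triangular L stored in CSR form.

    indptr, indices, data follow the usual CSR convention: the entries of
    row i are data[indptr[i]:indptr[i + 1]] at columns
    indices[indptr[i]:indptr[i + 1]].
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                # x[j] is already known because rows are processed in order.
                acc -= data[k] * x[j]
            elif j == i:
                diag = data[k]
            else:
                raise ValueError("matrix is not lower triangular")
        if diag is None or diag == 0.0:
            raise ZeroDivisionError(f"zero or missing diagonal in row {i}")
        x[i] = acc / diag
    return x


# Illustrative 3x3 system (hypothetical data):
# L = [[2, 0, 0],
#      [1, 3, 0],
#      [0, 4, 5]]
indptr = [0, 1, 3, 5]
indices = [0, 0, 1, 1, 2]
data = [2.0, 1.0, 3.0, 4.0, 5.0]
b = [2.0, 7.0, 23.0]
x = forward_substitution_csr(indptr, indices, data, b)
print(x)
```

Each row depends on the solution entries of earlier rows, which is why the loop is inherently sequential and why parallel alternatives such as the partitioned inverse method are of interest.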
Factors Impacting Performance of Multithreaded Sparse Triangular Solve
Cited by 1 (0 self)
Abstract. As computational science applications grow more parallel with multi-core supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in order to reduce the number of MPI tasks and increase the parallel efficiency of the algorithm. However, we need efficient threaded numerical kernels to run on the multi-core nodes in order to achieve good parallel efficiency. In this paper, we focus on improving the performance of a multithreaded triangular solver, an important kernel for preconditioning. We analyze three factors that affect the parallel performance of this threaded kernel and obtain good scalability on the multi-core nodes for a range of matrix sizes.
Scalable Solutions and Simulations
In this paper, scalability will be divided into two components: scalability of the numerical algorithm specifically on parallel computers, and scalability of the algorithm sequentially. The sequential implementation and scaling is initially presented
Incomplete-LU and Cholesky Preconditioned Iterative Methods Using CUSPARSE and CUBLAS
2011
In this white paper we show how to use the CUSPARSE and CUBLAS libraries to achieve a 2× speedup over CPU in the incomplete-LU and Cholesky preconditioned iterative methods. We focus on the Bi-Conjugate Gradient Stabilized and Conjugate Gradient iterative methods, which can be used to solve large sparse nonsymmetric and symmetric positive definite linear systems, respectively. Also, we comment on the parallel sparse triangular solve, which is an essential building block in these algorithms.
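One common way to expose parallelism in a sparse triangular solve is level scheduling: rows are grouped into levels so that every row in a level depends only on rows in earlier levels, and all rows within a level can be solved concurrently. The sketch below is an illustration of that general idea only. It is not taken from the white paper and does not use the CUSPARSE API; the function name `level_schedule` and the example sparsity pattern are assumptions.

```python
def level_schedule(indptr, indices):
    """Group the rows of a lower-triangular CSR pattern into dependency levels.

    Row i depends on row j whenever L[i, j] is a stored entry with j < i.
    Returns a list of levels; rows in the same level are independent.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                # Row i must wait until row j has been solved.
                level[i] = max(level[i], level[j] + 1)
    groups = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        groups[lv].append(i)
    return groups


# Illustrative 4x4 lower-triangular pattern (hypothetical):
# row 0: column 0
# row 1: column 1
# row 2: columns 0, 2
# row 3: columns 1, 2, 3
indptr = [0, 1, 2, 4, 7]
indices = [0, 1, 0, 2, 1, 2, 3]
levels = level_schedule(indptr, indices)
print(levels)
```

In this pattern rows 0 and 1 have no dependencies and form the first level, row 2 waits on row 0, and row 3 waits on rows 1 and 2. The number of levels bounds the number of sequential steps, so patterns with few, wide levels parallelize better than long dependency chains.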

