Results 1–10 of 10
MemSpy: Analyzing Memory System Bottlenecks in Programs
In Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, 1992
"... To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior if most references hit in th ..."
Abstract

Cited by 103 (9 self)
 Add to MetaCart
To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior: if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task. This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data-oriented, in addition to code-oriented, performance tuning. Thus, for both source-level code objects and data objects, Mem...
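The data-oriented view MemSpy introduces can be illustrated with a toy model: attribute cache hits and misses to named data objects rather than to code locations. The sketch below is purely hypothetical (a tiny direct-mapped cache simulator, not MemSpy's actual mechanism); all names and parameters are illustrative.

```python
# Hypothetical sketch of data-oriented profiling: a toy direct-mapped
# cache simulator that attributes misses to named data objects rather
# than to code locations. Not MemSpy's implementation; names illustrative.

class ToyCacheProfiler:
    def __init__(self, num_lines=64, line_size=32):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # tag currently stored per cache line
        self.stats = {}                  # data object name -> [hits, misses]

    def access(self, obj_name, address):
        line = (address // self.line_size) % self.num_lines
        tag = address // (self.line_size * self.num_lines)
        self.stats.setdefault(obj_name, [0, 0])
        if self.tags[line] == tag:
            self.stats[obj_name][0] += 1     # hit
        else:
            self.tags[line] = tag            # miss: fill the line
            self.stats[obj_name][1] += 1

    def report(self):
        # Rank data objects by miss count: the data-oriented view.
        return sorted(self.stats.items(), key=lambda kv: -kv[1][1])

prof = ToyCacheProfiler()
# A unit-stride sweep over "a" reuses cache lines; a pathological large
# stride over "b" conflicts on one line and misses on every access.
for i in range(4096):
    prof.access("a", i)
    prof.access("b", 0x100000 + i * 4096)
for name, (hits, misses) in prof.report():
    print(name, "hits:", hits, "misses:", misses)
```

Ranking objects this way points the programmer at the data structure to restructure, which is exactly the kind of guidance the abstract describes.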
pSather monitors: Design, Tutorial, Rationale and Implementation, 1991
"... pSather is a parallel extension of Sather aimed at shared memory parallel architectures. A prototype of the language is currently being implemented on a Sequent Symmetry and on SUN SparcStations. pSather monitors are one of the basic new features introduced in the language to deal with parallelism. ..."
Abstract

Cited by 6 (3 self)
 Add to MetaCart
pSather is a parallel extension of Sather aimed at shared-memory parallel architectures. A prototype of the language is currently being implemented on a Sequent Symmetry and on SUN SparcStations. pSather monitors are one of the basic new features introduced in the language to deal with parallelism. The current design is presented and discussed in detail. (Authors: ICSI and Computer Science Division, U.C. Berkeley, jfeldman@icsi.berkeley.edu and clim@icsi.berkeley.edu; ICSI and Istituto di Elaborazione dell'Informazione, CNR, Pisa, Italy, mazz@icsi.berkeley.edu.)
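The monitor concept the paper builds into the language (mutual exclusion plus condition signalling for waiting threads) can be approximated with library primitives. The sketch below is a hedged illustration using Python's threading module, not pSather syntax; the bounded-buffer example and all names are illustrative.

```python
# Hedged sketch of the monitor idea: one lock guards the shared state,
# and condition variables let threads block inside the monitor until
# signalled. Approximated with Python threading; not pSather's design.
import threading

class BoundedBufferMonitor:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.lock = threading.Lock()              # the monitor lock
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)

    def put(self, item):
        with self.lock:                           # enter the monitor
            while len(self.items) >= self.capacity:
                self.not_full.wait()              # block until signalled
            self.items.append(item)
            self.not_empty.notify()               # signal a waiting consumer

    def get(self):
        with self.lock:
            while not self.items:
                self.not_empty.wait()
            item = self.items.pop(0)
            self.not_full.notify()                # signal a waiting producer
            return item

buf = BoundedBufferMonitor(capacity=2)
results = []
consumer = threading.Thread(
    target=lambda: results.extend(buf.get() for _ in range(4)))
consumer.start()
for i in range(4):
    buf.put(i)        # producer blocks whenever the buffer is full
consumer.join()
print(results)
```

The while-loop around each wait() guards against spurious wakeups, the standard monitor discipline.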
Parallel Direct Methods For Sparse Linear Systems, 1997
"... We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
We present an overview of parallel direct methods for solving sparse systems of linear equations, focusing on symmetric positive definite systems. We examine the performance implications of the important differences between dense and sparse systems. Our main emphasis is on parallel implementation of the numerically intensive factorization process, but we also briefly consider the other major components of direct methods, such as parallel ordering.

Introduction. In this paper we present a brief overview of parallel direct methods for solving sparse linear systems. Paradoxically, sparse matrix factorization offers additional opportunities for exploiting parallelism beyond those available with dense matrices, yet it is often more difficult to attain good efficiency in the sparse case. We examine both sides of this paradox: the additional parallelism induced by sparsity, and the difficulty in achieving high efficiency in spite of it. We focus on Cholesky factorization, primarily because th...
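As a reference point for the factorization the survey centers on, here is a minimal dense Cholesky (A = L·Lᵀ) in pure Python. It is a sketch, not the paper's parallel algorithm: in the sparse setting, columns whose elimination-tree subtrees are disjoint can be eliminated concurrently, which is the extra parallelism the paper discusses.

```python
# Minimal dense Cholesky factorization A = L L^T, column by column.
# In sparse parallel codes, independent columns (disjoint subtrees of
# the elimination tree) can be factored concurrently.
import math

def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):                     # eliminate column j
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

A = [[4.0, 2.0, 0.0],
     [2.0, 5.0, 2.0],
     [0.0, 2.0, 5.0]]
L = cholesky(A)     # lower triangular factor of A
```

Note the zero in A(3,1) survives into L(3,1): that preserved sparsity is what creates the independent work a parallel sparse factorization exploits.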
Stability Of The Partitioned Inverse Method For Parallel Solution Of Sparse Triangular Systems
SIAM J. Scientific Computing, 1994
"... . Several authors have recently considered a parallel method for solving sparse triangular systems with many righthand sides. The method employs a partition into sparse factors of the product form of the inverse of the coefficient matrix. It is shown here that while the method can be unstable, stab ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
Several authors have recently considered a parallel method for solving sparse triangular systems with many right-hand sides. The method employs a partition into sparse factors of the product form of the inverse of the coefficient matrix. It is shown here that while the method can be unstable, stability is guaranteed if a certain scalar that depends on the matrix and the partition is small, and that this scalar is small when the matrix is well-conditioned. Moreover, when the partition is chosen so that the factors have the same sparsity structure as the coefficient matrix, the backward error matrix can be taken to be sparse.

Key words. sparse matrix, triangular system, substitution algorithm, parallel algorithm, rounding error analysis, matrix inverse.

AMS subject classifications. primary 65F05, 65F50, 65G05

1. Introduction. The method of choice for solving triangular systems on a serial computer is the substitution algorithm. Several approaches have been suggested for parallel sol...
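The partitioned inverse idea can be sketched concretely: write a unit lower triangular L as a product of factors, invert each factor explicitly, and replace sequential substitution with a series of matrix-vector products, each of which parallelizes over rows. The sketch below uses the finest possible partition, one column per factor, purely for illustration; real codes group columns so that each inverted factor stays sparse, and the paper's stability scalar governs how safe such a partition is.

```python
# Illustrative sketch of the partitioned inverse method for Lx = b.
# For unit lower triangular L, the elementary factor for column j is the
# identity plus column j's subdiagonal, and its inverse simply negates
# that subdiagonal. Then x = G_n ... G_1 b, applied as matvecs.

def column_factor_inverses(L):
    n = len(L)
    inverses = []
    for j in range(n):
        G = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(n)]
        for i in range(j + 1, n):
            G[i][j] = -L[i][j]           # inverse of the elementary factor
        inverses.append(G)
    return inverses

def solve(L, b):
    x = list(b)
    # Apply G_1 first, then G_2, ...; each matvec's rows are independent,
    # which is where the parallelism (over many right-hand sides too) lives.
    for G in column_factor_inverses(L):
        x = [sum(G[i][k] * x[k] for k in range(len(x))) for i in range(len(x))]
    return x

L = [[1.0, 0.0, 0.0],
     [2.0, 1.0, 0.0],
     [3.0, 4.0, 1.0]]
x = solve(L, [1.0, 4.0, 17.0])   # same answer as forward substitution
```

Forward substitution on this system gives x = (1, 2, 6); the product-form solve reproduces it, trading the sequential dependence of substitution for parallel matrix-vector products.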
Factors Impacting Performance of Multithreaded Sparse Triangular Solve
"... Abstract. As computational science applications grow more parallel with multicore supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
As computational science applications grow more parallel with multicore supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in order to reduce the number of MPI tasks and increase the parallel efficiency of the algorithm. However, we need efficient threaded numerical kernels to run on the multicore nodes in order to achieve good parallel efficiency. In this paper, we focus on improving the performance of a multithreaded triangular solver, an important kernel for preconditioning. We analyze three factors that affect the parallel performance of this threaded kernel and obtain good scalability on the multicore nodes for a range of matrix sizes.
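A standard way to expose parallelism in a sparse triangular solve, and a plausible basis for a threaded kernel like the one studied here, is level scheduling: unknowns whose dependencies are already resolved form a level and can be solved concurrently. The sketch below is an assumption-laden illustration (the sparse format and names are invented, and the per-level loop is sequential where a threaded kernel would fan rows out to workers).

```python
# Sketch of level scheduling for a sparse lower triangular solve.
# L is given as per-row dicts {column: value}; names are illustrative.

def build_levels(L_rows):
    n = len(L_rows)
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j in L_rows[i] if j < i]
        level[i] = 1 + max(deps) if deps else 0
    buckets = {}
    for i, lv in enumerate(level):
        buckets.setdefault(lv, []).append(i)
    return [buckets[lv] for lv in sorted(buckets)]

def solve_by_levels(L_rows, b):
    x = [0.0] * len(b)
    for rows in build_levels(L_rows):
        # All rows in this level are independent: a threaded kernel would
        # hand them to a worker pool; here we loop sequentially.
        for i in rows:
            s = sum(v * x[j] for j, v in L_rows[i].items() if j < i)
            x[i] = (b[i] - s) / L_rows[i][i]
    return x

# Rows 0 and 1 have no dependencies (level 0); rows 2 and 3 depend only
# on level-0 unknowns, so together they form level 1.
L_rows = [{0: 2.0},
          {1: 2.0},
          {0: 1.0, 2: 2.0},
          {1: 1.0, 3: 2.0}]
x = solve_by_levels(L_rows, [2.0, 4.0, 5.0, 8.0])
```

The number of levels bounds the critical path, so factors such as matrix structure (few wide levels versus many narrow ones) directly determine how well a kernel like this scales.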
Evaluating Memory System Performance of a Large Scale NUMA Multiprocessor, 1993
"... The effectiveness of large scale computing depends to a great extent on the performance of the memory system. As shared memory multiprocessors grow in size, their memory hierarchy deepens, resulting in a design with nonuniform latencies. In this paper, we explore the implications of multivalued me ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
The effectiveness of large scale computing depends to a great extent on the performance of the memory system. As shared memory multiprocessors grow in size, their memory hierarchy deepens, resulting in a design with nonuniform latencies. In this paper, we explore the implications of multivalued memory latencies. In particular, we study the effect of a nonuniform traffic distribution on a hierarchical large scale NUMA multiprocessor named Hector. Memory analysis is of interest because memory is a frequent source of poor performance in large scale multiprocessors. We have developed an analytical model that includes the effects of increased contention for system resources, and the impact of the arbitration algorithm on the network traffic. Our analysis has been validated with a detailed simulator. Also, we have examined two techniques for reducing memory latency. We assess the potential performance gains from replication of data and investigate the improvement in memory utilization by al...
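To give a feel for the contention term such an analytical model must capture (the paper's actual Hector model is considerably more detailed), a memory module can be crudely approximated as an M/M/1 queue, where mean latency blows up as utilization approaches one. This is a hedged, generic illustration, not the paper's model.

```python
# Crude M/M/1 illustration of memory latency under contention: mean time
# in system W = 1 / (mu - lambda) with service rate mu = 1 / service_time.
# Generic queueing sketch, not the Hector model from the paper.

def mm1_latency(service_time, arrival_rate):
    utilization = arrival_rate * service_time
    assert utilization < 1.0, "queue is unstable at utilization >= 1"
    return service_time / (1.0 - utilization)

base = mm1_latency(0.1, 0.0)    # no contention: latency = service time
loaded = mm1_latency(0.1, 9.0)  # 90% utilization: latency grows 10x
```

Even this toy formula shows why a nonuniform traffic distribution matters: hot modules sit at high utilization, on the steep part of the latency curve, while cold ones idle.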
Scalable Solutions and Simulations
"... this paper, scalability will be divided into two components; scalability of the numerical algorithm specifically on parallel computer and algorithm or sequential y. The sequential implementation and scaling is initial presented ..."
Abstract
 Add to MetaCart
this paper, scalability will be divided into two components; scalability of the numerical algorithm specifically on parallel computer and algorithm or sequential y. The sequential implementation and scaling is initial presented
Incomplete-LU and Cholesky Preconditioned Iterative Methods Using CUSPARSE and CUBLAS, 2011
"... In this white paper we show how to use the CUSPARSE and CUBLAS libraries to achieve a 2 × speedup over CPU in the incompleteLU and Cholesky preconditioned iterative methods. We focus on the BiConjugate Gradient Stabilized and Conjugate Gradient iterative methods, that can be used to solve large sp ..."
Abstract
 Add to MetaCart
In this white paper we show how to use the CUSPARSE and CUBLAS libraries to achieve a 2× speedup over the CPU in the incomplete-LU and Cholesky preconditioned iterative methods. We focus on the Bi-Conjugate Gradient Stabilized and Conjugate Gradient iterative methods, which can be used to solve large sparse nonsymmetric and symmetric positive definite linear systems, respectively. Also, we comment on the parallel sparse triangular solve, which is an essential building block in these algorithms.
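Where the triangular solves enter these methods can be shown with a minimal preconditioned conjugate gradient sketch: each application of a preconditioner M = L·Lᵀ costs one forward and one backward substitution. The code below is a pure Python illustration, not the white paper's CUSPARSE/CUBLAS implementation; dense lists stand in for sparse GPU structures, and for this tiny system L is an exact Cholesky factor rather than an incomplete one.

```python
# Minimal preconditioned CG, highlighting the two triangular solves per
# iteration that apply M^{-1} = (L L^T)^{-1}. Pure Python sketch; dense
# lists stand in for the CUSPARSE/CUBLAS data structures.

def forward_solve(L, b):                 # solves L y = b
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

def backward_solve(L, y):                # solves L^T x = y
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[j][i] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x

def pcg(A, L, b, tol=1e-10, max_iter=50):
    n = len(b)
    x = [0.0] * n
    r = list(b)
    z = backward_solve(L, forward_solve(L, r))    # z = M^{-1} r
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = backward_solve(L, forward_solve(L, r))  # the two triangular solves
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
L = [[2.0, 0.0], [0.5, 1.6583123951777]]  # Cholesky factor of A (approx.)
x = pcg(A, L, [1.0, 2.0])
```

Because the preconditioner's triangular solves sit on the critical path of every iteration, parallelizing them (the white paper's sparse triangular solve discussion) is what makes the GPU speedup possible.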
Scalable Solutions to Integral-Equation and Finite-Element Simulations
"... Abstract — When developing numerical methods, or applying them to the simulation and design of engineering components, it inevitably becomes necessary to examine the scaling of the method with a problem’s electrical size. The scaling results from the original mathematical development; for example, a ..."
Abstract
 Add to MetaCart
When developing numerical methods, or applying them to the simulation and design of engineering components, it inevitably becomes necessary to examine the scaling of the method with a problem's electrical size. The scaling results from the original mathematical development (for example, a dense system of equations in the solution of integral equations) as well as the specific numerical implementation. Scaling of the numerical implementation depends upon many factors: for example, direct or iterative methods for solution of the linear system, as well as the computer architecture used in the simulation. In this paper, scalability will be divided into two components: scalability of the numerical algorithm specifically on parallel computer systems, and algorithm or sequential scalability. The sequential implementation and scaling is initially presented, with the parallel implementation following. This progression is meant to illustrate the differences in using current parallel platforms and sequential machines and the resulting savings. Times to solution (wall-clock time) for differing problem sizes are the key parameters plotted or tabulated. Sequential and parallel scalability of time harmonic surface integral equation forms and the finite-element solution to the partial differential equations are considered in detail. Index Terms: finite-element methods, integral equations.