Results 1–10 of 39
Krylov subspace methods on supercomputers
 SIAM J. Sci. Stat. Comput.
, 1989
Abstract

Cited by 68 (4 self)
This paper presents a short survey of recent research on Krylov subspace methods with emphasis on implementation on vector and parallel computers. Conjugate gradient methods have proven very useful on traditional scalar computers, and their popularity is likely to increase as three-dimensional models gain importance. A conservative approach to deriving effective iterative techniques for supercomputers has been to find efficient parallel/vector implementations of the standard algorithms. The main source of difficulty in the incomplete factorization preconditionings is the solution of the triangular systems at each step. We describe in detail a few approaches consisting of implementing efficient forward and backward triangular solutions. Then we discuss polynomial preconditioning as an alternative to standard incomplete factorization techniques. Another efficient approach is to reorder the equations so as to improve the structure of the matrix to achieve better parallelism or vectorization. We give an overview of these ideas and others and attempt to comment on their effectiveness or potential for different types of architectures.
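As background for the conjugate gradient methods surveyed above, here is a minimal, illustrative sketch of the unpreconditioned conjugate gradient iteration for a symmetric positive-definite system Ax = b. It is pure Python and not taken from the paper; the `matvec` callback stands in for whatever sparse matrix-vector product an application provides.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A.

    matvec: function computing the product A @ v for a vector v.
    Returns the approximate solution x (starting from x = 0).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)            # residual r = b - A x (x = 0 initially)
    p = list(r)            # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / sum(pi * Api for pi, Api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break                      # residual small enough
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

In exact arithmetic the iteration terminates in at most n steps; the preconditioning variants the survey discusses replace `matvec` and the inner products with preconditioned counterparts.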
NUMA policies and their relation to memory architecture
 In Architectural Support for Programming Languages and Operating Systems
, 1991
Abstract

Cited by 49 (18 self)
Multiprocessor memory reference traces provide a wealth of information on the behavior of parallel programs. We have used this information to explore the relationship between kernel-based NUMA management policies and multiprocessor memory architecture. Our trace analysis techniques employ an offline, optimal-cost policy as a baseline against which to compare online policies, and as a policy-insensitive tool for evaluating architectural design alternatives. We compare the performance of our optimal policy with that of three implementable policies (two of which appear in previous work), on a variety of applications, with varying relative speeds for page moves and local, global, and remote memory references. Our results indicate that a good NUMA policy must be chosen to match its machine, and confirm that such policies can be both simple and effective. They also indicate that programs for NUMA machines must be written with care to obtain the best performance.
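To make the trace-driven methodology concrete, the following toy sketch scores a memory-reference trace under a fixed page placement and brute-forces the cheapest static placement as a crude stand-in for an offline baseline. The function names, cost values, and trace format are hypothetical illustrations, not the paper's policies or cost model.

```python
from itertools import product

def trace_cost(trace, placement, local_cost=1, remote_cost=5):
    """Total cost of a memory-reference trace under a fixed page placement.

    trace:     list of (processor, page) references
    placement: dict mapping page -> processor whose memory holds it
    A reference is cheap when the page is local to the referencing
    processor and expensive otherwise (toy NUMA cost model).
    """
    return sum(local_cost if placement[page] == cpu else remote_cost
               for cpu, page in trace)

def best_static_placement(trace, processors, pages):
    """Exhaustively find the cheapest fixed placement for this trace,
    a crude stand-in for an offline optimal-cost baseline."""
    best = None
    for assignment in product(processors, repeat=len(pages)):
        placement = dict(zip(pages, assignment))
        cost = trace_cost(trace, placement)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best
```

A real offline-optimal policy must also account for page moves during the trace; this static version only illustrates why the baseline is useful for comparing online policies.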
Efficient Support for Multicomputing on ATM Networks
, 1993
Abstract

Cited by 46 (3 self)
The emergence of a new generation of networks will dramatically increase the attractiveness of loosely-coupled multicomputers based on workstation clusters. The key to achieving high performance in this environment is efficient network access, because the cost of remote access dictates the granularity of parallelism that can be supported. Thus, in addition to traditional distribution mechanisms such as RPC, workstation clusters should support lightweight communication paradigms for executing parallel applications. This paper describes a simple communication model based on the notion of remote memory access. Applications executing on one host can perform direct memory read or write operations on user-defined remote memory buffers. We have implemented a prototype system based on this model using commercially available workstations and ATM networks. Our prototype uses kernel-based emulation of remote read and write instructions, implemented through unused processor opcodes; thus, applica...
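The remote-memory-access model described above can be illustrated with a toy in-process simulation: hosts export named buffers, and other hosts read or write them directly without any message-passing handshake. The class and method names are hypothetical, chosen only to mirror the model's operations; the real prototype works at the instruction level over ATM.

```python
class RemoteMemory:
    """Toy simulation of the remote-memory-access model."""

    def __init__(self):
        self._buffers = {}  # (host, name) -> bytearray

    def export(self, host, name, size):
        """A host registers a buffer of `size` bytes for remote access."""
        self._buffers[(host, name)] = bytearray(size)

    def remote_write(self, host, name, offset, data):
        """Write bytes directly into another host's exported buffer."""
        buf = self._buffers[(host, name)]
        buf[offset:offset + len(data)] = data

    def remote_read(self, host, name, offset, length):
        """Read bytes directly from another host's exported buffer."""
        buf = self._buffers[(host, name)]
        return bytes(buf[offset:offset + length])
```

The point of the model is visible even in this sketch: a transfer is a single memory operation on a pre-registered buffer, with no per-message protocol processing on the critical path.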
Communication Complexity for Parallel Divide-and-Conquer
 In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science
, 1991
Abstract

Cited by 29 (2 self)
This paper studies the relationship between parallel computation cost and communication cost for performing divide-and-conquer (D&C) computations on a parallel system of p processors. The parallel computation cost is the maximal number of D&C nodes that any processor in the parallel system may expand, whereas the communication cost is the total number of cross nodes. A cross node is a node which is generated by one processor but expanded by another processor. A new scheduling algorithm is proposed, whose parallel computation cost and communication cost are at most ⌈N/p⌉ and pdh, respectively, for any D&C computation tree with N nodes, height h, and degree d. Also, lower bounds on the communication cost are derived. In particular, it is shown that for each scheduling algorithm and for each positive ε_C < 1, which can be arbitrarily close to 0, there are values of N, h, d, p, and ε_T (> 0), for which, if the parallel computation cost is between N/p (the minimum) and (1 + ε_T ...
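The two cost measures above translate directly into code. The first function simply evaluates the upper bounds stated in the abstract; the second counts cross nodes for an arbitrary tree and node-to-processor assignment (the tree encoding is an illustrative choice, not the paper's):

```python
import math

def dc_cost_bounds(N, p, d, h):
    """Upper bounds from the abstract: the proposed scheduler expands at
    most ceil(N/p) nodes per processor and creates at most p*d*h cross
    nodes, for a D&C tree with N nodes, height h, and degree d."""
    return math.ceil(N / p), p * d * h

def count_cross_nodes(children, owner):
    """Count cross nodes for a given tree and node-to-processor map.

    children: dict node -> list of child nodes the node generates
    owner:    dict node -> processor that expands the node
    A cross node is generated by one processor but expanded by another.
    """
    return sum(1 for parent, kids in children.items()
               for kid in kids if owner[kid] != owner[parent])
```

For example, a complete binary tree of 7 nodes on 2 processors has parallel computation cost at most ⌈7/2⌉ = 4 under the proposed scheduler.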
Iterative algorithms for solution of large sparse systems of linear equations on hypercubes
 IEEE Transactions on Computers
, 1988
Abstract

Cited by 25 (12 self)
requires large amounts of computing power. The finite element method [1] is a powerful numerical technique for solving boundary value problems involving partial differential equations in engineering fields such as heat flow analysis, metal forming, and others. As a result of finite element discretization, linear equations in the form Ax = b are obtained, where A is large, sparse, and banded with proper ordering of the variables x. In this paper, the solution of such equations on distributed-memory message-passing multiprocessors implementing the hypercube [2] topology is addressed. Iterative algorithms based on the Conjugate Gradient method are developed for hypercubes designed for coarse-grain parallelism. Communication requirements of different schemes for mapping finite element meshes onto the processors of a hypercube are analyzed with respect to the effect of communication parameters of the architecture. Experimental results on a 16-node Intel 386-based iPSC/2 hypercube are presented and discussed in Section V. Index Terms: Finite element method, granularity, hypercube, linear equations, parallel algorithms.
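One classic technique for mapping a decomposed mesh onto hypercube processors (a standard device, not necessarily one of the schemes analyzed in the paper) is the binary-reflected Gray code: labeling consecutive mesh strips with successive Gray codes places neighbouring strips on directly connected hypercube nodes, since consecutive codes differ in exactly one bit.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)

def ring_to_hypercube(n_dims):
    """Map a ring of 2**n_dims mesh strips onto hypercube node labels so
    that consecutive strips (including the wrap-around pair) sit on
    processors whose labels differ in exactly one bit, i.e. on nodes
    joined by a physical hypercube link."""
    return [gray(i) for i in range(2 ** n_dims)]
```

Under such a mapping, the nearest-neighbour exchanges of a banded matrix-vector product in the Conjugate Gradient iteration travel a single hop.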
A Study of the Factorization Fill-In for a Parallel Implementation of the Finite Element Method
 Int. J. Numer. Meth. Engng
, 1994
Abstract

Cited by 16 (3 self)
In this paper we investigate the additional storage overhead needed for a parallel implementation of finite element applications. In particular, we compare the storage requirements for the factorization of the sparse matrices that would occur on a parallel processor versus a uniprocessor. This variation in storage results from the factorization fill-in. We address the question of whether the storage overhead is so large for parallel implementations that it imposes severe limitations on the problem size in contrast to the problems executed sequentially on a uniprocessor. The storage requirements for the parallel implementation are based upon a new ordering scheme, the Combination Mesh-Based scheme. This scheme uses a domain decomposition method which attempts to balance the processors' loads and decrease the interprocessor communication. The storage requirements for the sequential implementation are based upon the Minimum Degree algorithm. The difference between the two storage requirements...
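The fill-in that separates two orderings can be counted symbolically on the graph of the matrix, without doing any numeric factorization: eliminating a vertex connects all of its remaining neighbours pairwise, and each new edge is one fill entry. The sketch below illustrates that idea only; it is not the Combination Mesh-Based or Minimum Degree implementations from the paper.

```python
def count_fill_in(adj, order):
    """Count fill-in produced by Cholesky-style elimination of a sparse
    symmetric matrix in the given elimination order.

    adj:   dict vertex -> set of neighbours (nonzero pattern, no loops)
    order: list of all vertices in elimination order
    """
    g = {v: set(ns) for v, ns in adj.items()}  # working copy of the graph
    fill = 0
    for v in order:
        nbrs = [u for u in g[v] if u in g]     # not-yet-eliminated neighbours
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in g[a]:              # new edge => fill entry
                    g[a].add(b)
                    g[b].add(a)
                    fill += 1
        for u in nbrs:                         # remove v from the graph
            g[u].discard(v)
        del g[v]
    return fill
```

A star graph shows the ordering sensitivity at its starkest: eliminating the hub first fills in a clique among the leaves, while eliminating the leaves first produces no fill at all.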
Wavelet Bases Adapted to Pseudo-Differential Operators
 Appl. Comp. Harm. Anal
, 1992
Abstract

Cited by 11 (1 self)
This paper is concerned with the numerical treatment of pseudo-differential equations in ℝ^2, employing wavelet Galerkin methods. We construct wavelet bases adapted to a given pseudo-differential operator in the sense that functions on different refinement levels are orthogonal with respect to a certain bilinear form induced by the operator. Key words: wavelets, biorthogonal bases, pseudo-differential operators, Galerkin methods. AMS subject classification: 42C05, 47G30, 65N30. 1 Introduction. Lately, newly developed wavelet decompositions have been employed for the numerical treatment of partial differential equations; see e.g. [3, 4, 16, 17, 18, 23]. In general, a system of functions {ψ_i}, i = 1, ..., N, is called a family of (mother) wavelets if the scaled and integer-translated versions of {ψ_i} form an (orthonormal) basis of L^2(ℝ^n). These functions can be utilized as basis functions for a Galerkin approach. Since the structure of the resulting stiffness matrix ...
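For readers unfamiliar with wavelet decompositions, one level of the simplest example, the orthonormal Haar transform, can be sketched as follows. The operator-adapted biorthogonal bases constructed in the paper are far more general; this only illustrates the split into a coarser refinement level plus detail coefficients.

```python
import math

def haar_step(v):
    """One level of the orthonormal Haar wavelet transform.

    Splits a vector of even length into coarse averages (the next
    refinement level) and detail (wavelet) coefficients; the 1/sqrt(2)
    scaling keeps the transform orthonormal (norm-preserving).
    """
    assert len(v) % 2 == 0
    s = math.sqrt(2.0)
    averages = [(v[i] + v[i + 1]) / s for i in range(0, len(v), 2)]
    details = [(v[i] - v[i + 1]) / s for i in range(0, len(v), 2)]
    return averages, details
```

Applying `haar_step` recursively to the averages yields the full multilevel decomposition; a constant input produces zero detail coefficients at every level, which is the vanishing-moment property that makes wavelet stiffness matrices nearly sparse.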
Performance Considerations of Shared Virtual Memory Machines
 IEEE Transactions on Parallel and Distributed Systems
, 1995
Abstract

Cited by 10 (6 self)
Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relation with other existing performance metrics, such as traditional speedup, efficiency, scalability, etc., are carefully studied. In terms of the introduced asymptotic speed, we show that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue in shared virtual memory machines. A scientific application has been implemented on a KSR1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, an interesting relation between fixed-time and memory-bounded speedup is revealed. Various causes of superlinear speedup are also presented. Manuscript received March 5, 1994; revised Nov. 14, 1994 and March 14...
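The opening definition translates directly into code. The work/time parameterization below is an illustrative reading of "speed = work per unit time", not the paper's exact formulation; when the parallel and sequential runs perform identical work, the two metrics coincide.

```python
def traditional_speedup(t_seq, t_par):
    """Traditional speedup: sequential time over parallel time
    for the same fixed problem."""
    return t_seq / t_par

def generalized_speedup(work_par, t_par, work_seq, t_seq):
    """Generalized speedup: parallel speed over sequential speed,
    where speed = work / time. The parallel and sequential runs may
    solve different problem sizes (e.g. memory-bounded scaling)."""
    return (work_par / t_par) / (work_seq / t_seq)
```

The distinction matters precisely when the uniprocessor cannot hold the scaled problem: then `work_seq` is measured on a smaller problem the uniprocessor runs efficiently, instead of timing a thrashing sequential run of the full problem.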
Application and Accuracy of the Parallel Diagonal Dominant Algorithm
 Parallel Comput
, 1995
Abstract

Cited by 10 (9 self)
The Parallel Diagonal Dominant (PDD) algorithm is an efficient tridiagonal solver. In this paper, a detailed study of the PDD algorithm is given. First the PDD algorithm is extended to solve periodic tridiagonal systems and its scalability is studied. Then the reduced PDD algorithm, which has a smaller operation count than that of the conventional sequential algorithm for many applications, is proposed. Accuracy analysis is provided for a class of tridiagonal systems, the symmetric and skew-symmetric Toeplitz tridiagonal systems. Implementation results show that the analysis gives a good bound on the relative error, and the PDD and reduced PDD algorithms are good candidates for emerging massively parallel machines. Index Terms: Parallel processing, parallel numerical algorithms, scalable computing, tridiagonal systems, Toeplitz systems. Manuscript received April 7, 1993; revised April 7, 1994 and January 27, 1995. This research was supported in part by the National Aeronautics and S...
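As a reference point for the operation counts discussed above, the conventional sequential tridiagonal solver (the Thomas algorithm, a specialization of Gaussian elimination) can be sketched as follows. This is the baseline, not the PDD algorithm itself; like PDD, it assumes diagonal dominance and performs no pivoting.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the sequential Thomas algorithm.

    a: sub-diagonal   (a[0] unused)
    b: main diagonal
    c: super-diagonal (c[-1] unused)
    d: right-hand side
    Returns the solution vector. Assumes diagonal dominance (no pivoting).
    """
    n = len(b)
    cp = [0.0] * n        # modified super-diagonal
    dp = [0.0] * n        # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = (c[i] / denom) if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The forward and backward sweeps are inherently sequential, which is exactly why parallel alternatives such as PDD partition the system across processors and patch the subdomain solutions together.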