Results 1–10 of 97
Parallel Numerical Linear Algebra, 1993
"... We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illust ..."
Abstract

Cited by 575 (26 self)
 Add to MetaCart
We survey general techniques and open problems in numerical linear algebra on parallel architectures. We first discuss basic principles of parallel processing, describing the costs of basic operations on parallel machines, including general principles for constructing efficient algorithms. We illustrate these principles using current architectures and software systems, and by showing how one would implement matrix multiplication. Then, we present direct and iterative algorithms for solving linear systems of equations, linear least squares problems, the symmetric eigenvalue problem, the nonsymmetric eigenvalue problem, and the singular value decomposition. We consider dense, band and sparse matrices.
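The matrix-multiplication illustration mentioned above can be sketched serially. The blocked version below is my own minimal sketch, not the survey's code; it shows the block structure that a parallel implementation would distribute, with each b-by-b block product assignable to a different processor.

```python
# A minimal serial sketch (not from the survey) of blocked matrix
# multiplication. On a parallel machine, each b-by-b block product
# could be assigned to a different processor.

def blocked_matmul(A, B, n, b):
    """Multiply n-by-n matrices A, B (lists of lists) in b-by-b blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, b):          # block row of C
        for jj in range(0, n, b):      # block column of C
            for kk in range(0, n, b):  # block of the inner dimension
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Blocking does not change the arithmetic; it only reorders it so that each block product touches a small working set, which is the same locality property parallel distributions exploit.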
The NP-completeness column: an ongoing guide
Journal of Algorithms, 1985
"... This is the nineteenth edition of a (usually) quarterly column that covers new developments in the theory of NPcompleteness. The presentation is modeled on that used by M. R. Garey and myself in our book ‘‘Computers and Intractability: A Guide to the Theory of NPCompleteness,’ ’ W. H. Freeman & ..."
Abstract

Cited by 196 (0 self)
 Add to MetaCart
This is the nineteenth edition of a (usually) quarterly column that covers new developments in the theory of NP-completeness. The presentation is modeled on that used by M. R. Garey and myself in our book "Computers and Intractability: A Guide to the Theory of NP-Completeness," W. H. Freeman & Co., New York, 1979 (hereinafter referred to as "[G&J]"; previous columns will be referred to by their dates). A background equivalent to that provided by [G&J] is assumed, and, when appropriate, cross-references will be given to that book and the list of problems (NP-complete and harder) presented there. Readers who have results they would like mentioned (NP-hardness, PSPACE-hardness, polynomial-time solvability, etc.) or open problems they would like publicized, should
Models of Computation: Exploring the Power of Computing
"... Theoretical computer science treats any computational subject for which a good model can be created. Research on formal models of computation was initiated in the 1930s and 1940s by Turing, Post, Kleene, Church, and others. In the 1950s and 1960s programming languages, language translators, and oper ..."
Abstract

Cited by 62 (5 self)
 Add to MetaCart
Theoretical computer science treats any computational subject for which a good model can be created. Research on formal models of computation was initiated in the 1930s and 1940s by Turing, Post, Kleene, Church, and others. In the 1950s and 1960s programming languages, language translators, and operating systems were under development and therefore became both the subject and basis for a great deal of theoretical work. The power of computers of this period was limited by slow processors and small amounts of memory, and thus theories (models, algorithms, and analysis) were developed to explore the efficient use of computers as well as the inherent complexity of problems. The former subject is known today as algorithms and data structures, the latter as computational complexity. The focus of theoretical computer scientists in the 1960s on languages is reflected in the first textbook on the subject, Formal Languages and Their Relation to Automata by John Hopcroft and Jeffrey Ullman. This influential book led to the creation of many language-centered theoretical computer science courses; many introductory theory courses today continue to reflect the content of this book and the interests of theoreticians of the 1960s and early 1970s. Although
Communication Lower Bounds for Distributed-Memory Matrix Multiplication, 2004
"... this paper. More speci cally, we use the de nitions of [10]: (g(n)) is the set of functions f(n) such that there exist positive constants c 1 , c2 , and n0 such that 0 c1 g(n) f(n) c2 g(n) for all n n0 ; O(g(n)) is de ned similarly using the weaker condition 0 f(n) c 2 g(n); g(n)) is de ..."
Abstract

Cited by 53 (1 self)
 Add to MetaCart
this paper. More specifically, we use the definitions of [10]: Θ(g(n)) is the set of functions f(n) such that there exist positive constants c1, c2, and n0 such that 0 ≤ c1 g(n) ≤ f(n) ≤ c2 g(n) for all n ≥ n0; O(g(n)) is defined similarly using the weaker condition 0 ≤ f(n) ≤ c2 g(n); Ω(g(n)) is defined with the condition 0 ≤ c1 g(n) ≤ f(n). The set o(g(n)) consists of functions f(n) such that for any c2 > 0 there exists a constant n0 > 0 such that 0 ≤ f(n) < c2 g(n) for all n ≥ n0.
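These definitions can be illustrated by checking the Θ-style sandwich numerically for a concrete f and g. The helper below is hypothetical (not from the paper), and a finite check over tested n is only an illustration, not a proof.

```python
# Hypothetical helper (not from the paper) that numerically checks the
# Theta-style sandwich 0 <= c1*g(n) <= f(n) <= c2*g(n) for all tested
# n >= n0. A finite check is illustrative only, not a proof.

def sandwiched(f, g, c1, c2, n0, n_max=10_000):
    return all(0 <= c1 * g(n) <= f(n) <= c2 * g(n) for n in range(n0, n_max))

# Example: f(n) = 3n^2 + 10n is Theta(n^2); witnesses c1 = 3, c2 = 4, n0 = 10,
# since 3n^2 + 10n <= 4n^2 exactly when n >= 10.
f = lambda n: 3 * n * n + 10 * n
g = lambda n: n * n
```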
Optimal Communication Algorithms for Hypercubes
Journal of Parallel and Distributed Computing, 1991
"... We consider the following basic communication problems in a hypercube network of processors: the problem of a single processor sending a different packet to each of the other processors, the problem of simultaneous broadcast of the same packet from every processor to all other processors, and the pr ..."
Abstract

Cited by 43 (2 self)
 Add to MetaCart
We consider the following basic communication problems in a hypercube network of processors: the problem of a single processor sending a different packet to each of the other processors, the problem of simultaneous broadcast of the same packet from every processor to all other processors, and the problem of simultaneous exchange of different packets between every pair of processors. The algorithms proposed for these problems are optimal in terms of execution time and communication resource requirements; that is, they require the minimum possible number of time steps and packet transmissions. In contrast, algorithms in the literature are optimal only within an additive or multiplicative factor. The binary string corresponding to the coordinates of a node of the d-cube is referred to as the identity number of the node. We recall that a hypercube of any dimension can be constructed by connecting lower-dimensional cubes, starting with a 1-cube. In particular, we can start with two (d-1)-dimensional cubes and introduce a link connecting each pair of nodes with the same identity number (see, e.g., [1, Sect. 1.3]). This constructs a d-cube with the identity number of each node obtained by adding a leading 0 or a leading 1 to its previous identity, depending on whether the node belongs to the first (d-1)-dimensional cube or the second (see Fig. 1). When confusion cannot arise, we refer to a d-cube node interchangeably in terms of its identity number (a binary string of length d) and in terms of the decimal representation of its identity number. Thus,
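The recursive construction just described can be sketched directly. This is a minimal illustration assuming the standard construction, not the paper's code: two (d-1)-cubes get their identities prefixed with 0 and 1, and nodes with equal old identities are linked.

```python
# Sketch of building a d-cube recursively (standard construction, as in
# the abstract): take two (d-1)-cubes, prefix node identities with 0 and
# 1, and link each pair of nodes whose identity numbers are equal.

def hypercube_edges(d):
    """Return the edge list of a d-cube; nodes are binary-string identities."""
    if d == 1:
        return [("0", "1")]
    sub = hypercube_edges(d - 1)
    edges = [("0" + u, "0" + v) for u, v in sub]   # first (d-1)-cube
    edges += [("1" + u, "1" + v) for u, v in sub]  # second (d-1)-cube
    edges += [("0" + bin(i)[2:].zfill(d - 1),      # cross links between nodes
               "1" + bin(i)[2:].zfill(d - 1))      # with equal identities
              for i in range(2 ** (d - 1))]
    return edges
```

Every edge connects identities differing in exactly one bit, and a d-cube has d * 2^(d-1) edges, which the construction reproduces.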
Programming a Hypercube Multicomputer, 1988
"... We describe those features of distributed memory MIMD hypercube multicomputers that are necessary to obtain efficient programs. Several examples are developed. These illustrate the effectiveness of different programming strategies. ..."
Abstract

Cited by 34 (4 self)
 Add to MetaCart
We describe those features of distributed memory MIMD hypercube multicomputers that are necessary to obtain efficient programs. Several examples are developed. These illustrate the effectiveness of different programming strategies.
Sparse Matrix Computations on Parallel Processor Arrays
SIAM J. Sci. Comput., 1992
"... We investigate the balancing of distributed compressed storage of large sparse matrices on a massively parallel computer. For fast computation of matrixvector and matrixmatrix products on a rectangular processor array with efficient communications along its rows and columns we require that the non ..."
Abstract

Cited by 30 (0 self)
 Add to MetaCart
We investigate the balancing of distributed compressed storage of large sparse matrices on a massively parallel computer. For fast computation of matrix-vector and matrix-matrix products on a rectangular processor array with efficient communications along its rows and columns, we require that the nonzero elements of each matrix row or column be distributed among the processors located within the same array row or column, respectively. We construct randomized packing algorithms with such properties, and we prove that with high probability they produce well-balanced storage for sufficiently large matrices with a bounded number of nonzeros in each row and column, but no other restrictions on structure. Then we design basic matrix-vector multiplication routines with fully parallel interprocessor communications and intraprocessor gather and scatter operations. Their efficiency is demonstrated on the 16,384-processor MasPar computer.
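The randomized-placement idea can be sketched as follows. This is a simplified illustration with hypothetical names, not the paper's packing algorithm: the nonzeros of matrix row i are scattered uniformly at random over the processors of one processor-array row, so per-processor load evens out with high probability.

```python
# Simplified sketch (hypothetical, not the paper's algorithm) of
# randomized packing: nonzeros of matrix row i are scattered uniformly
# at random over the q processors of processor-array row i mod r.
import random

def pack_rows(nonzeros_per_row, r, q, seed=0):
    """nonzeros_per_row[i] = number of nonzeros in matrix row i.
    Returns an r-by-q grid of per-processor nonzero counts."""
    rng = random.Random(seed)
    load = [[0] * q for _ in range(r)]
    for i, nnz in enumerate(nonzeros_per_row):
        row = i % r                     # processor-array row for matrix row i
        for _ in range(nnz):
            load[row][rng.randrange(q)] += 1
    return load
```

With many matrix rows of bounded nonzero count, the ratio of maximum to mean per-processor load tends toward 1, which is the balance property the paper proves for its packing schemes.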
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
"... One can use extra memory to parallelize matrix multiplication by storing p 1/3 redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon’s algorithm [2], and be faster in practice [1]. We call this algorithm “3D ” because it arranges the p pr ..."
Abstract

Cited by 26 (16 self)
 Add to MetaCart
One can use extra memory to parallelize matrix multiplication by storing p^(1/3) redundant copies of the input matrices on p processors in order to do asymptotically less communication than Cannon's algorithm [2], and be faster in practice [1]. We call this algorithm "3D" because it arranges the p processors in a 3D array, and Cannon's algorithm "2D" because it stores a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we can take advantage of any amount of extra memory to store c copies of the data, for any c ∈ {1, 2, ..., ⌊p^(1/3)⌋}, to reduce the bandwidth cost of Cannon's algorithm by a factor of c^(1/2) and the latency cost by a factor of c^(3/2). We also show that these costs reach the lower bounds [13, 3], modulo polylog(p) factors. We similarly generalize LU decomposition to 2.5D and 3D, including communication-avoiding pivoting, a stable alternative to partial pivoting [7]. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while c copies of the data can also reduce the bandwidth by a factor of c^(1/2), the latency must increase by a factor of c^(1/2), so that the 2D LU algorithm (c = 1) in fact minimizes latency. Preliminary results of 2.5D matrix multiplication on a Cray XT4 machine also demonstrate a performance gain of up to 3X with respect to Cannon's algorithm. Careful choice of c also yields up to a 2.4X speedup over 3D matrix multiplication, due to a better balance between communication costs.
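Cannon's 2D algorithm, the baseline this paper generalizes, can be simulated serially. The sketch below is an illustration of the classic 2D scheme, not the paper's 2.5D implementation: after an initial skew, each of s steps multiplies the locally held blocks and cyclically shifts A-blocks left and B-blocks up on a simulated s-by-s processor grid.

```python
# Serial simulation of Cannon's 2D algorithm (an illustration, not the
# paper's 2.5D code). After an initial skew, each of s steps multiplies
# local blocks and cyclically shifts A-blocks left and B-blocks up.

def cannon(A, B, n, s):
    """Multiply n-by-n matrices on a simulated s-by-s grid (s divides n)."""
    b = n // s
    blk = lambda M, i, j: [[M[i*b + x][j*b + y] for y in range(b)]
                           for x in range(b)]
    # Initial skew: grid cell (i, j) holds A-block (i, (i+j) mod s)
    # and B-block ((i+j) mod s, j).
    Ab = [[blk(A, i, (i + j) % s) for j in range(s)] for i in range(s)]
    Bb = [[blk(B, (i + j) % s, j) for j in range(s)] for i in range(s)]
    C = [[0.0] * n for _ in range(n)]
    for _ in range(s):
        for i in range(s):
            for j in range(s):          # local block product on cell (i, j)
                for x in range(b):
                    for y in range(b):
                        C[i*b + x][j*b + y] += sum(
                            Ab[i][j][x][k] * Bb[i][j][k][y] for k in range(b))
        Ab = [[Ab[i][(j + 1) % s] for j in range(s)] for i in range(s)]  # left
        Bb = [[Bb[(i + 1) % s][j] for j in range(s)] for i in range(s)]  # up
    return C
```

At step t, cell (i, j) holds A-block (i, (i+j+t) mod s) and B-block ((i+j+t) mod s, j), so over s steps it accumulates the full inner-product sum for C-block (i, j) while storing only one block of each operand at a time.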
Minimizing the Communication Time for Matrix Multiplication on Multiprocessors
Parallel Computing, 1992
"... We present one matrix multiplication algorithm for twodimensional arrays of processing nodes, and one algorithm for threedimensional nodal arrays. Onedimensional nodal arrays are treated as a degenerate case. The algorithms are designed to utilize fully the communications bandwidth in high deg ..."
Abstract

Cited by 26 (9 self)
 Add to MetaCart
We present one matrix multiplication algorithm for two-dimensional arrays of processing nodes, and one algorithm for three-dimensional nodal arrays. One-dimensional nodal arrays are treated as a degenerate case. The algorithms are designed to utilize fully the communications bandwidth in high-degree networks in which the one-, two-, or three-dimensional arrays may be embedded. For binary n-cubes, our algorithms offer a speedup of the communication over previous algorithms for square matrices and square two-dimensional arrays by a factor of n^2. Configuring the N = 2^n processing nodes as a three-dimensional array may reduce the communication complexity by a factor of N^(1/6) compared to a two-dimensional nodal array. The three-dimensional algorithm requires temporary storage proportional to the length of the nodal-array axis aligned with the axis shared between the multiplier and the multiplicand. The optimal two-dimensional nodal array shape with respect to communicati...
Efficient Parallel Algorithms for Computing All Pair Shortest Paths in Directed Graphs, 1997
"... . We present parallel algorithms for computing all pair shortest paths in directed graphs. Our algorithm has time complexity O( f (n)/p + I (n) log n) on the PRAM using p processors, where I (n) is log n on the EREW PRAM, log log n on the CCRW PRAM, f (n) is o(n 3 ). On the randomized CRCW PRAM we a ..."
Abstract

Cited by 25 (0 self)
 Add to MetaCart
We present parallel algorithms for computing all pair shortest paths in directed graphs. Our algorithm has time complexity O(f(n)/p + I(n) log n) on the PRAM using p processors, where I(n) is log n on the EREW PRAM, log log n on the CRCW PRAM, and f(n) is o(n^3). On the randomized CRCW PRAM we are able to achieve time complexity O(n^3/p + log n) using p processors. Key Words. Analysis of algorithms, Design of algorithms, Parallel algorithms, Graph algorithms, Shortest path. 1. Introduction. A number of known algorithms compute the all pair shortest paths in graphs and digraphs with n vertices by using O(n^3) operations [D], [Fl], [J]. All these algorithms, however, use at least n-1 recursive steps in the worst case and thus require at least the order of n time in their parallel implementation, even if the number of available processors is not bounded. O(n) time and n^2 processor bounds can indeed be achieved, for instance, in the straightforward parallelization of th...
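For reference, the classical O(n^3) sequential baseline these parallel algorithms compete with (presumably Floyd's algorithm, the [Fl] citation) can be written as below. Its n outer iterations are inherently sequential, which is exactly the obstacle to sub-O(n) parallel time noted in the abstract.

```python
# The classical O(n^3) sequential all-pairs-shortest-paths baseline
# (Floyd-Warshall). The n iterations of the outer k-loop depend on one
# another, so a straightforward parallelization still takes order-n time.

def floyd_warshall(dist):
    """In-place all-pairs shortest paths on an n-by-n distance matrix.
    dist[i][j] is the direct edge weight, float('inf') if absent."""
    n = len(dist)
    for k in range(n):                  # allow vertex k as an intermediate
        dk = dist[k]
        for i in range(n):
            di = dist[i]
            dik = di[k]
            for j in range(n):
                if dik + dk[j] < di[j]:
                    di[j] = dik + dk[j]
    return dist
```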