Results 1 - 10 of 56
Communication-Efficient Parallel Algorithms for Distributed Random-Access Machines
- Algorithmica
, 1988
Cited by 34 (1 self)
This paper introduces a model for parallel computation, called the distributed random-access machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network. We introduce the notion of a conservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to "shortcut" pointers in a data structure so that remote processors can communicate without causing undue congestion. We give O(lg n)-step, linear-processor, linear-space, conservative algorithms for a variety of problems on n-node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions ...
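The "shortcut" idea mentioned in this abstract is related to the classic pointer-jumping (pointer-doubling) technique. The sketch below shows that textbook technique, simulated sequentially; it is not the paper's conservative DRAM algorithm and does not model congestion. The function name `list_rank` and the successor-array representation are assumptions made for illustration.

```python
def list_rank(succ):
    """Distance from each node to the tail of a linked list, by pointer jumping.

    succ[i] is the successor of node i; the tail points to itself.
    Each round is written as a whole-array update to mimic one parallel step.
    Returns (rank, rounds).
    """
    n = len(succ)
    nxt = list(succ)
    # Invariant: rank[i] is the number of hops from node i to nxt[i].
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    # Stop once every pointer already reaches the tail.
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # Read the old arrays, write new ones: all nodes shortcut at once.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
        rounds += 1
    return rank, rounds
```

For a five-node chain `[1, 2, 3, 4, 4]` the pointers reach the tail after two doubling rounds, which is the logarithmic behavior the O(lg n) bounds rely on.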
A Constructive Solution to the Juggling Problem in Systolic Array Synthesis
- In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS’00)
, 2000
Cited by 11 (9 self)
We describe a new, practical, constructive method for solving the well-known conflict-free scheduling problem for the locally sequential, globally parallel (LSGP) case of systolic array synthesis. Previous solutions have an important practical disadvantage. Here we provide a closed-form solution that enables the enumeration of all conflict-free schedules. The second part of the paper discusses reducing the cost of the hardware that controls the flow of data, enables or disables functional units, and generates memory addresses. We present a new technique for controlling the complexity of these housekeeping functions in a systolic array.
Parallel algorithms in linear algebra
- Computer Sciences Laboratory, ANU
, 1991
Cited by 11 (3 self)
This paper provides an introduction to algorithms for fundamental linear algebra problems on various parallel computer architectures, with the emphasis on distributed-memory MIMD machines. To illustrate the basic concepts and key issues, we consider the problem of parallel solution of a nonsingular linear system by Gaussian elimination with partial pivoting. This problem has come to be regarded as a benchmark for the performance of parallel machines. We consider its appropriateness as a benchmark, its communication requirements, and schemes for data distribution to facilitate communication and load balancing. In addition, we describe some parallel algorithms for orthogonal (QR) factorization and the singular value decomposition (SVD).
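For readers unfamiliar with the benchmark problem named in this abstract, here is a minimal sequential sketch of Gaussian elimination with partial pivoting. It is a plain textbook version, not the paper's parallel algorithm or data distribution; the name `solve_gepp` is an assumption made for illustration.

```python
def solve_gepp(a, b):
    """Solve a x = b for a nonsingular square matrix by Gaussian elimination
    with partial pivoting. Inputs are lists of numbers and are not modified."""
    n = len(a)
    # Work on an augmented copy [A | b].
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: pick the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

On a distributed-memory machine, the pivot search over a column and the use of the pivot row in every update are the steps that require communication, which is why the choice of data distribution matters.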
Advanced Copy Propagation for Arrays
- In Languages, Compilers, and Tools for Embedded Systems LCTES’03
, 2003
Cited by 11 (5 self)
The focus of this paper is on a data-flow transformation called advanced copy propagation. After an array is assigned, we can, under certain conditions, replace a read from this array by the right-hand side of the assignment. If so, the intermediate assignment can be skipped. In case it becomes dead code, it can be eliminated. Where necessary we distinguish between the different elements of arrays as well as the different runtime instances of statements, allowing us to do propagation over global loop and condition scopes. We have formalized two basic operations: non-recursive propagation that operates on two statements and recursive propagation that operates on one statement. A global algorithm uses these two operations to do propagation on code involving any number of statements. Running our prototype implementation on some multimedia kernels shows that we can get a decrease in memory accesses of between 22% and 43%.
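A small hand-worked illustration of the kind of rewrite this abstract describes may help. The two functions below are hypothetical examples, transformed by hand; they do not reproduce the paper's analysis or its prototype, and the names `before` and `after` are assumptions.

```python
def before(a, n):
    # An intermediate array b is written, then read back element by element.
    b = [0] * n
    c = [0] * n
    for i in range(n):
        b[i] = a[i] + 1      # assignment to the intermediate array
    for i in range(n):
        c[i] = b[i] * 2      # read from the intermediate array
    return c


def after(a, n):
    # The read b[i] has been replaced by the right-hand side a[i] + 1.
    # The assignment to b is then dead code and has been eliminated,
    # saving one array write and one array read per iteration.
    c = [0] * n
    for i in range(n):
        c[i] = (a[i] + 1) * 2
    return c
```

Both versions return the same result for the same input; the second simply never touches `b`.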
Efficient Hardware Data Mining with the Apriori Algorithm on FPGAs
Cited by 11 (1 self)
The Apriori algorithm is a popular correlation-based data-mining kernel. However, it is computationally expensive, and running times can stretch to days for large databases, as database sizes can extend to gigabytes. Through the use of a new extension to the systolic array architecture, the time required for processing can be significantly reduced. Our array architecture implementation on a Xilinx Virtex-II Pro 100 can be orders of magnitude faster than state-of-the-art software implementations. The system is easily scalable and introduces an efficient "systolic injection" method for intelligently reporting unpredictably generated mid-array results to a controller without any chance of collision or excessive stalling.
A Model of Computation for VLSI with Related Complexity Results
- Journal of the ACM
, 1985
Cited by 11 (0 self)
A new model of computation for VLSI, based on the assumption that time for propagating information is at least linear in the distance, is proposed. While accommodating basic laws of physics, the model is designed to be general and technology independent. Thus, from a complexity viewpoint, it is especially suited for deriving lower bounds and trade-offs. New results for a number of problems, including fan-in, transitive functions, matrix multiplication, and sorting are presented. As regards upper bounds, it must be noted that, because of communication costs, the model clearly favors regular and pipelined architectures (e.g., systolic arrays).
Convergence of Iteration Systems
, 1993
Cited by 8 (0 self)
An iteration system is a set of assignment statements whose computation proceeds in steps: at each step, an arbitrary subset of the statements is executed in parallel. The set of statements thus executed may differ at each step; however, it is required that each statement is executed infinitely often along the computation. The convergence of such systems (to a fixed point) is typically verified by showing that the value of a given variant function is decreased by each step that causes a state change. Such a proof requires an exponential number of cases (in the number of assignment statements) to be considered. In this paper, we present alternative methods for verifying the convergence of iteration systems. In most of these methods, up to a linear number of cases need to be considered. 1 Introduction Iteration systems are a useful abstraction for computational, physical and biological systems that involve "truly concurrent" events. In computing science, they can be used to represent se...
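The execution model described in this abstract can be simulated in a few lines. The sketch below is only an illustration of the definition, not one of the paper's verification methods. The function name, the `(target, function)` encoding of statements, the seeded random choice of subsets, and the round-robin rule used to approximate "executed infinitely often" are all assumptions; it also assumes each statement assigns a distinct variable.

```python
import random


def run_iteration_system(state, statements, seed=0, max_steps=1000):
    """Simulate an iteration system until a fixed point is reached.

    statements: list of (target, function) pairs; each function reads the
    current state dict and returns the new value for its target variable.
    Returns (final_state, steps_taken).
    """
    rng = random.Random(seed)
    state = dict(state)
    for step in range(max_steps):
        # Fixed point: executing any statement leaves the state unchanged.
        if all(f(state) == state[t] for t, f in statements):
            return state, step
        # Choose an arbitrary subset; statement (step mod k) is always
        # included, so no statement is neglected forever (fairness).
        forced = step % len(statements)
        chosen = [s for i, s in enumerate(statements)
                  if i == forced or rng.random() < 0.5]
        # Parallel execution: every right-hand side reads the old state.
        updates = {t: f(state) for t, f in chosen}
        state.update(updates)
    raise RuntimeError("no fixed point within max_steps")
```

For example, with the statements `x := min(x, y)`, `y := min(x, y)`, and `z := min(y, z)`, the sum `x + y + z` acts as a variant function: it never increases and strictly decreases whenever the state changes, so every fair schedule ends at the same fixed point.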
Faster sorting and routing on grids with diagonals
- Proceedings of the 11th Symposium on Theoretical Aspects of Computer Science, number 775 in Lecture Notes in Computer Science
, 1994
Floating Point Fault Tolerance with Backward Error Assertions
- IEEE Transactions on Computers
, 1995
Cited by 8 (1 self)
This paper introduces an assertion scheme based on the backward error analysis for error detection in algorithms that solve dense systems of linear equations, Ax = b. Unlike previous methods, this Backward Error Assertion Model is specifically designed to operate in an environment of floating point arithmetic subject to round-off errors, and can be easily instrumented in a Watchdog processor environment. The complexity of verifying assertions is O(n^2) compared to the O(n^3) complexity of algorithms solving Ax = b. Unlike other proposed error detection methods, this assertion model does not require any encoding of the matrix A. Experimental results under various error models are presented to validate the effectiveness of this assertion scheme. * The work of this author was supported in part by the National Science Foundation under Grant NSF CCR8821078. + The work of this author was supported in part by the Innovative Science and Technology Office of the Strategic Defense Initia...
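To make the O(n^2) versus O(n^3) contrast concrete, here is a minimal sketch of a residual-based check on a computed solution. It uses the standard normwise relative backward error, ||b - Ax|| / (||A|| ||x|| + ||b||), in the infinity norm; this is a generic textbook test and not necessarily the paper's exact assertion. The function name and the tolerance `tol` are assumptions; a real scheme would derive the threshold from floating-point error bounds.

```python
def backward_error_ok(a, x, b, tol=1e-10):
    """Accept a computed solution x of a x = b if its normwise relative
    backward error is at most tol. Costs O(n^2): one matrix-vector product."""
    n = len(a)
    # Residual r = b - A x.
    r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    norm_r = max(abs(v) for v in r)
    norm_a = max(sum(abs(v) for v in row) for row in a)  # matrix infinity norm
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    denom = norm_a * norm_x + norm_b
    if denom == 0.0:
        # Degenerate all-zero problem: only a zero residual is acceptable.
        return norm_r == 0.0
    return norm_r / denom <= tol
```

The check needs only the original A, the computed x, and b, so no encoding of the matrix is involved.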
A New Approach for Automatic Parallelization of Blocked Linear Algebra Computations
, 1991
Cited by 5 (0 self)
This paper describes a new approach for automatic generation of efficient parallel programs from sequential blocked linear algebra programs. By exploiting recent progress in fine-grain parallel architectures such as iWarp, and in libraries based on matrix-matrix block operations such as LAPACK, the approach is expected to be effective in parallelizing a large class of linear algebra computations. An implementation of LAPACK on iWarp is under development. In the implementation, block routines are executed on the iWarp processor array using highly parallel systolic algorithms. Matrices are distributed over the array in a way that allows parallel block routines to be used wherever the original program calls a sequential block routine. This data distribution scheme significantly simplifies the process of parallelization, and as a result, efficient parallel versions of programs can be generated automatically. We discuss experiences and performance results from our preliminary implementation,...

