Results 1–10 of 131
The parallel evaluation of general arithmetic expressions
 Journal of the ACM
, 1974
Abstract

Cited by 239 (1 self)
ABSTRACT. It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log₂ n + 10(n-1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed. KEY WORDS AND PHRASES: arithmetic expressions, compilation of arithmetic expressions, computational complexity, general arithmetic expressions, numerical stability, parallel computation.
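As a quick illustration of the stated bound, it can be tabulated for a few processor counts (a sketch only; the helper name `brent_bound` is ours, and the constants 4 and 10 are taken directly from the abstract):

```python
import math

def brent_bound(n: int, p: int) -> float:
    """Time bound quoted in the abstract: 4*log2(n) + 10*(n-1)/p
    unit-time steps for an expression of n variables on p processors."""
    return 4 * math.log2(n) + 10 * (n - 1) / p

# As p grows, the 10(n-1)/p work term shrinks and the
# 4*log2(n) depth term dominates.
for n, p in [(1024, 1), (1024, 64), (1024, 1024)]:
    print(n, p, round(brent_bound(n, p), 2))
```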
Programming Parallel Algorithms
, 1996
Abstract

Cited by 193 (9 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately, there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages ...
Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
Abstract

Cited by 171 (0 self)
Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest-performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
Prefix Sums and Their Applications
Abstract

Cited by 95 (2 self)
Experienced algorithm designers rely heavily on a set of building blocks and on the tools needed to put the blocks together into an algorithm. The understanding of these basic blocks and tools is therefore critical to the understanding of algorithms. Many of the blocks and tools needed for parallel
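The prefix-sums (scan) building block the paper is about can be stated in a few lines. A sequential sketch of the inclusive and exclusive variants (for illustration only; the paper's subject is the parallel formulation):

```python
from itertools import accumulate
from operator import add

def inclusive_scan(xs, op=add):
    """Inclusive prefix sums: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    return list(accumulate(xs, op))

def exclusive_scan(xs, op=add, identity=0):
    """Exclusive variant: out[i] combines everything strictly before i."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

print(inclusive_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
print(exclusive_scan([1, 2, 3, 4]))  # [0, 1, 3, 6]
```

Any associative operator with an identity (min, max, logical or, etc.) can replace `add`, which is what makes scan a general building block rather than just a summation.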
Evaluating Compiler Optimizations For Fortran D
, 1994
Abstract

Cited by 68 (4 self)
The Fortran D compiler uses data decomposition specifications to automatically translate Fortran programs for execution on MIMD distributed-memory machines. This paper introduces and classifies a number of advanced optimizations needed to achieve acceptable performance; they are analyzed and empirically evaluated for stencil computations. Communication optimizations reduce communication overhead by decreasing the number of messages and hide communication overhead by overlapping the cost of remaining messages with local computation. Parallelism optimizations exploit parallel and pipelined computations, and may need to restructure the computation to increase parallelism. Profitability formulas are derived for each optimization. Empirical results show that exploiting parallelism for pipelined computations, reductions, and scans is vital. Message vectorization, collective communication, and efficient coarse-grain pipelining also significantly affect performance. Scalability of communicatio...
Scan Primitives for Vector Computers
 In Proceedings Supercomputing '90
, 1990
Abstract

Cited by 38 (9 self)
This paper describes an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers and is applicable with minor modifications to any register-based vector computer. On the CRAY Y-MP, the asymptotic running time of the plus-scan is about 2.25 times that of a vector add, and is within 20% of optimal. An important aspect of our implementation is that the set of segmented versions of these scans is only marginally more expensive than the unsegmented versions. These segmented versions can be used to execute a scan on multiple data sets without having to pay the vector startup cost (n_{1/2}) for each set. The paper describes a radix sorting routine based on the scans that is 13 times faster ...
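A segmented scan, as mentioned above, restarts the running result at each segment boundary, so many independent data sets are scanned in one pass. A minimal sequential sketch (flags mark segment starts; the paper's vectorized CRAY implementation is of course very different):

```python
def segmented_plus_scan(values, flags):
    """Inclusive plus-scan that restarts wherever flags[i] == 1,
    so each flagged segment is scanned independently."""
    out, acc = [], 0
    for v, f in zip(values, flags):
        acc = v if f else acc + v
        out.append(acc)
    return out

# Two segments, [1, 2, 3] and [4, 5], scanned in a single pass:
print(segmented_plus_scan([1, 2, 3, 4, 5], [1, 0, 0, 1, 0]))  # [1, 3, 6, 4, 9]
```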
A Family of Adders
 In Proceedings of 14th IEEE Symposium on Computer Arithmetic
, 1999
Abstract

Cited by 37 (0 self)
Binary carry-propagating addition can be efficiently expressed as a prefix computation. Several examples of adders based on such a formulation have been published, and efficient implementations are numerous. Chief among the known constructions are those of Kogge & Stone and Ladner & Fischer. In this work we show that these are end cases of a large family of addition structures, all of which share the attractive property of minimum logical depth. The intermediate structures allow tradeoffs between the amount of internal wiring and the fanout of intermediate nodes, and can thus usually achieve a more attractive combination of speed and area/power cost than either of the known end cases. Rules for the construction of such adders are given, as are examples of realistic 32-bit designs implemented in an industrial 0.25 µm CMOS process.

1. Introduction

There are many ways of formulating the process of binary addition. Each different way provides different insight and thus suggests different impl...
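The prefix formulation referred to here treats each bit position as a (generate, propagate) pair combined with an associative operator; the carries then fall out of a prefix scan over those pairs. A small sequential Python sketch of that idea (illustrative only; Kogge-Stone and Ladner-Fischer differ in how the scan is parallelized in hardware, not in the operator):

```python
def carry_operator(left, right):
    """Associative operator on (generate, propagate) pairs."""
    g1, p1 = left
    g2, p2 = right
    return (g2 | (p2 & g1), p2 & p1)

def prefix_add(a, b, width=32):
    """Add two unsigned integers (mod 2**width) by scanning (g, p)
    pairs; each carry is the 'generate' part of a running prefix."""
    gp = [(((a >> i) & 1) & ((b >> i) & 1),        # generate: a_i AND b_i
           ((a >> i) & 1) | ((b >> i) & 1))        # propagate: a_i OR b_i
          for i in range(width)]
    carries, acc = [0], (0, 1)                     # carry into bit 0 is 0
    for pair in gp:
        acc = carry_operator(acc, pair)
        carries.append(acc[0])
    s = 0
    for i in range(width):
        s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ carries[i]) << i
    return s

print(prefix_add(123456, 654321))  # 777777
```

Because `carry_operator` is associative, the sequential loop above can be replaced by any parallel prefix network, which is exactly the design space the paper explores.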
Evaluation of Compiler Optimizations for Fortran D on MIMD Distributed-Memory Machines
 In Proceedings of the 1992 ACM International Conference on Supercomputing
, 1992
Abstract

Cited by 33 (11 self)
The Fortran D compiler uses data decomposition specifications to automatically translate Fortran programs for execution on MIMD distributed-memory machines. This paper introduces and classifies a number of advanced optimizations needed to achieve acceptable performance; they are analyzed and empirically evaluated for stencil computations. Profitability formulas are derived for each optimization. Results show that exploiting parallelism for pipelined computations, reductions, and scans is vital. Message vectorization, collective communication, and efficient coarse-grain pipelining also significantly affect performance.
Cost Reduction and Evaluation of a Temporary Faults Detecting Technique
, 2000
Abstract

Cited by 31 (3 self)
IC technologies are approaching the ultimate limits of silicon in terms of channel width, power supply, and speed. As they approach these limits, circuits are becoming increasingly sensitive to noise, which will result in unacceptable rates of soft errors. Furthermore, defect behavior is becoming increasingly complex, resulting in an increasing number of timing faults that can escape detection by fabrication testing. Thus, fault-tolerant techniques will become necessary even for commodity applications. This work considers the implementation and improvements of a new soft-error and timing-error detecting technique based on time redundancy. Arithmetic circuits were used as a test vehicle to validate the approach. Simulations and performance evaluations of the proposed detection technique were made using timing and logic simulators. The obtained results show that detection of such temporal faults can be achieved at meaningful hardware and performance cost.
Methods for True Energy-Performance Optimization
, 2004
Abstract

Cited by 31 (12 self)
This paper presents methods for efficient energy-performance optimization at the circuit and microarchitectural levels. The optimal balance between energy and performance is achieved when the sensitivity of energy to a change in performance is equal for all the design variables. The sensitivity-based optimizations minimize energy subject to a delay constraint. Energy savings of about 65% can be achieved without delay penalty by equalizing the sensitivities to sizing, supply, and threshold voltage in a 64-bit adder, compared to the reference design sized for minimum delay. Circuit optimization is effective only in a region of about 30% around the reference delay; outside this region the optimization becomes too costly in terms of either energy or delay. Using optimal energy-delay tradeoffs from the circuit level and introducing more degrees of freedom, the optimization is hierarchically extended to higher abstraction layers. We focus on microarchitectural optimization and demonstrate that the scope of energy-efficient optimization can be extended by the choice of circuit topology or the level of parallelism. In a 64-bit ALU example, parallelism of five provides a threefold performance increase while requiring the same energy as the reference design. Parallel or time-multiplexed solutions significantly affect the area of their respective designs, so the overall design cost is minimized when the optimal energy-area tradeoff is achieved.