Results 1–10 of 13
The parallel evaluation of general arithmetic expressions
Journal of the ACM, 1974
Cited by 273 (1 self)
Abstract: It is shown that arithmetic expressions with n > 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log₂ n + 10(n − 1)/p using p > 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed. Key words and phrases: arithmetic expressions, compilation of arithmetic expressions, computational complexity, general arithmetic expressions, numerical stability, parallel computation.
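As a quick sanity check on the abstract's bound, the following sketch (ours, not the paper's algorithm; the function name brent_bound is our own label) evaluates T(n, p) = 4·log₂(n) + 10·(n − 1)/p against the purely sequential time n − 1:

```python
# Hedged sketch: plugging numbers into the time bound stated in the
# abstract, T(n, p) = 4*log2(n) + 10*(n-1)/p, and comparing it with the
# sequential time n - 1. This only evaluates the formula; it does not
# implement the paper's evaluation algorithm.
import math

def brent_bound(n, p):
    return 4 * math.log2(n) + 10 * (n - 1) / p

n = 1024
print("sequential:", n - 1)
for p in [1, 16, 256]:
    # with more processors the (n-1)/p term shrinks toward the log term
    print(f"p = {p:3d}: {brent_bound(n, p):.1f}")
```

With n = 1024 the log term contributes only 40 time units, so the bound drops rapidly as p grows, matching the abstract's claim that the bound is within a constant factor of optimal.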
Exploiting Instruction Level Parallelism in the Presence of Conditional Branches
1996
Cited by 43 (2 self)
Abstract: Wide-issue superscalar and VLIW processors utilize instruction-level parallelism (ILP) to achieve high performance. However, if insufficient ILP is found, the performance potential of these processors suffers dramatically. Branch instructions, which are one of the major limitations to exploiting ILP, enforce strict ordering conditions in programs to ensure correct execution. It is therefore difficult to achieve the desired overlap of instruction execution with branches in the instruction stream. Effectively exploiting ILP in the presence of branches requires efficient handling of branches and the dependences they impose. This dissertation investigates two techniques for exposing and enhancing ILP in the presence of branches: speculative execution and predicated execution. Speculative execution enables an ILP compiler to remove dependences between instructions and prior branches, so that the execution of instructions and predicted future instructions may be overlapped. Compiler-controlled speculative execution is employed using an efficient structure called the superblock. The formation and optimization of superblocks increase ILP along important execution paths by systematically removing constraints due to unimportant paths. In conjunction with superblock optimizations, speculative execution is utilized to remove control dependences in the superblock ...
Compiler Code Transformations for Superscalar-Based High-Performance Systems
In Proceedings of Supercomputing '92, 1992
Cited by 30 (7 self)
Abstract: Exploiting parallelism at both the multiprocessor level and the instruction level is an effective means for supercomputers to achieve high performance. The amount of instruction-level parallelism available to superscalar or VLIW node processors can be limited, however, with conventional compiler optimization techniques. In this paper, a set of compiler transformations designed to increase instruction-level parallelism is described. The effectiveness of these transformations is evaluated using 40 loop nests extracted from a range of supercomputer applications. This evaluation shows that increasing execution resources in superscalar/VLIW node processors yields little performance improvement unless loop unrolling and register renaming are applied. It also reveals that these two transformations are sufficient for DOALL loops. However, more advanced transformations are required in order for serial and DOACROSS loops to fully benefit from the increased execution resources. The results show ...
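The loop unrolling plus register renaming combination the abstract highlights can be illustrated with a small sketch (our example, not the paper's benchmarks): a single accumulator forms one long dependence chain, while unrolling by four with renamed accumulators exposes four independent chains a superscalar/VLIW core could execute in parallel.

```python
# Sketch of unrolling + register renaming (our illustration; the function
# names are ours). In dot_serial every iteration depends on the previous
# value of s, a serial chain of length n. In dot_unrolled4 the four
# accumulators s0..s3 play the role of renamed registers: their update
# chains are independent and can proceed concurrently.

def dot_serial(xs, ys):
    s = 0.0
    for x, y in zip(xs, ys):
        s += x * y              # depends on the previous s
    return s

def dot_unrolled4(xs, ys):
    s0 = s1 = s2 = s3 = 0.0     # renamed accumulators: independent chains
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        s0 += xs[i] * ys[i]
        s1 += xs[i + 1] * ys[i + 1]
        s2 += xs[i + 2] * ys[i + 2]
        s3 += xs[i + 3] * ys[i + 3]
    for i in range(n, len(xs)):  # remainder loop for leftover elements
        s0 += xs[i] * ys[i]
    return s0 + s1 + s2 + s3

xs = list(range(10))
ys = list(range(10))
print(dot_serial(xs, ys), dot_unrolled4(xs, ys))  # both 285.0
```

Note the transformation is only valid here because floating-point reassociation is acceptable for the workload; a production compiler would need the same license.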
Optimizations and oracle parallelism with dynamic translation
Accepted for MICRO-32, 1999
Cited by 21 (5 self)
Abstract: We describe several optimizations which can be employed in a dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometimes even surpassing a static compiler employing more sophisticated and more time-consuming algorithms [9]. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism.
Integrating Program Transformations in the Memory-Based Synthesis of Image and Video Algorithms
In Proc. of the IEEE Intl. Conf. on Computer-Aided Design (ICCAD), 1994
Cited by 16 (3 self)
Abstract: In this paper we discuss the interaction and integration of two important program transformations in high-level synthesis: tree height reduction and redundant memory-access elimination. Intuitively, these program transformations do not interfere with one another, as they optimize different operations in the program graph and different resources in the synthesized system. However, we demonstrate that integration of the two tasks is necessary to better utilize available resources. Our approach involves the use of a "meta-transformation" to guide transformation application as possibilities arise. Results observed on several image and video benchmarks demonstrate that transformation integration increases performance through better resource utilization. Tree height reduction (THR) [1, 12] is a well-known technique for reducing the critical path length and increasing the parallelism of expressions and/or recurrences through the introduction of redundant computation. THR has ...
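The tree height reduction technique recurring throughout these entries can be sketched in a few lines. This is our minimal illustration of the core idea (flatten a chain of one associative operator into its operand list, then rebuild a balanced tree), not the integrated algorithm of any of the papers above:

```python
# Minimal tree-height-reduction sketch (our illustration). Expression
# trees are either a leaf (a string) or a tuple (op, left, right).
# Re-associating ((a+b)+c)+d into (a+b)+(c+d) shrinks depth from
# n-1 to ceil(log2 n) for an n-operand chain.

def depth(t):
    return 0 if isinstance(t, str) else 1 + max(depth(t[1]), depth(t[2]))

def flatten(t, op):
    # collect the operand list of a chain of one associative operator
    if isinstance(t, str) or t[0] != op:
        return [t]
    return flatten(t[1], op) + flatten(t[2], op)

def rebuild(op, leaves):
    # rebuild the operand list as a balanced binary tree
    if len(leaves) == 1:
        return leaves[0]
    mid = len(leaves) // 2
    return (op, rebuild(op, leaves[:mid]), rebuild(op, leaves[mid:]))

chain = ('+', ('+', ('+', 'a', 'b'), 'c'), 'd')   # ((a+b)+c)+d, depth 3
balanced = rebuild('+', flatten(chain, '+'))       # (a+b)+(c+d), depth 2
print(depth(chain), depth(balanced))               # 3 2
```

The rewrite is only legal for operators the compiler may re-associate, which is exactly the caveat the THR literature attaches to floating-point arithmetic.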
Timing-Driven Logic Bi-Decomposition
2003
Cited by 12 (1 self)
Abstract: An approach for logic decomposition that produces circuits with reduced logic depth is presented. It combines two strategies: logic bi-decomposition of Boolean functions and tree-height reduction of Boolean expressions. It is a technology-independent approach that enables one to find tree-like expressions with smaller depths than the ones obtained by state-of-the-art techniques. The approach can also be combined with technology mapping techniques aiming at timing optimization. Experimental results show that new points in the area/delay space can be explored, with tangible delay improvements when compared to existing techniques.
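A toy version of one bi-decomposition step may help fix ideas. This is our simplified sketch (exhaustive truth-table checking over a given disjoint variable split), not the paper's procedure: f decomposes as g(X1) OR h(X2) exactly when taking g(x1) to be 1 iff f is 1 for every assignment to X2 (and symmetrically for h) reproduces f.

```python
# Toy disjoint OR bi-decomposition check (our sketch, not the paper's
# algorithm). The maximal candidates are g(a1) = AND over all x2 of f,
# h(a2) = AND over all x1 of f; f is OR-bi-decomposable over the split
# iff f == g OR h everywhere.
from itertools import product

def or_bidecomposes(f, vars1, vars2):
    def assigns(vs):
        return [dict(zip(vs, bits))
                for bits in product([False, True], repeat=len(vs))]
    def g(a1):  # 1 iff f holds for every completion over vars2
        return all(f({**a1, **a2}) for a2 in assigns(vars2))
    def h(a2):  # 1 iff f holds for every completion over vars1
        return all(f({**a1, **a2}) for a1 in assigns(vars1))
    return all(bool(f({**a1, **a2})) == (g(a1) or h(a2))
               for a1 in assigns(vars1) for a2 in assigns(vars2))

f_or = lambda v: (v['a'] and v['b']) or (v['c'] and v['d'])  # decomposable
f_xor = lambda v: v['a'] ^ v['b'] ^ v['c'] ^ v['d']          # not OR-decomposable
print(or_bidecomposes(f_or, ['a', 'b'], ['c', 'd']))   # True
print(or_bidecomposes(f_xor, ['a', 'b'], ['c', 'd']))  # False
```

A successful split halves the support of each subexpression, which is what lets the subsequent tree-height reduction shrink logic depth.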
An Efficient Framework for Performing Execution-Constraint-Sensitive Transformations That Increase Instruction-Level Parallelism
1997
Cited by 10 (0 self)
Abstract: The increasing amount of instruction-level parallelism required to fully utilize high-issue-rate processors forces the compiler to perform increasingly advanced transformations, many of which require adding extra operations in order to remove the dependences constraining performance. Although aggressive application of these transformations is necessary in order to realize the full performance potential, overly aggressive application can negate their benefit or even degrade performance. This thesis investigates a general framework for applying these transformations at schedule time, which is typically the only time the processor's execution constraints are visible to the compiler. Feedback from the instruction scheduler is then used to aggressively and intelligently apply these transformations. This results in consistently better performance than traditional application methods, because the application of transformations can now be more fully adapted to the processor's execution constraints. Techniques for optimizing the processor's machine description for efficient use by the scheduler, and for incrementally updating the dependence graph after performing each transformation, allow the utilization of scheduler feedback with relatively small compile-time overhead.
Bi-Decomposition and Tree-Height Reduction for Timing Optimization
Proc. IWLS '02
Cited by 6 (0 self)
Abstract: A novel approach for timing-driven logic decomposition is presented. It is based on the combination of two strategies: logic bi-decomposition of Boolean functions and tree-height reduction of Boolean expressions. This technology-independent approach allows one to find tree-like expressions with smaller depths than the ones obtained by state-of-the-art techniques. Experimental results show an average delay reduction of more than 20% relative to the speed up obtained in SIS.
Optimal Huffman Tree-Height Reduction for Instruction-Level Parallelism
Cited by 2 (1 self)
Abstract: Exposing and exploiting instruction-level parallelism (ILP) is a key component of high performance for modern processors. For example, wide-issue superscalar, VLIW, and dataflow processors only attain high performance when they execute nearby instructions in parallel. This paper shows how to use and modify the Huffman coding tree weight minimization algorithm to expose ILP. We apply Huffman to two problems: (1) tree height reduction, rewriting expression trees of commutative and associative operations to minimize tree height and expose ILP; and (2) software fanout, generating software fanout trees to forward values to multiple consumers in a dataflow ISA. Huffman yields two improvements over prior work on tree height reduction: (1) it produces globally optimal trees even when expressions store intermediate values; and (2) it groups and folds constants. For fanout, we weight the targets by the length of the critical path from the target to the end of its block. Given perfect weights, the compiler can minimize the latency of the tree using Hartley and Casavant's modification to the Huffman algorithm. Experimental results show that these algorithms have practical benefits, providing modest but interesting improvements over prior work for exposing ILP.
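The greedy Huffman-style combining this paper builds on can be sketched as follows. This is our simplification with a uniform unit latency, not the paper's full algorithm with Hartley and Casavant's modification: repeatedly combine the two operands that become ready earliest; the partial result is ready one operation latency after the later of its two inputs.

```python
# Huffman-style scheduling sketch (our simplification). Given the times
# at which each operand of an associative reduction becomes available,
# greedily combining the two earliest-ready values minimizes the finish
# time of the whole reduction tree for a single operator.
import heapq

def reduction_finish_time(ready_times, latency=1):
    heap = list(ready_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)        # earliest-ready value
        b = heapq.heappop(heap)        # next earliest
        heapq.heappush(heap, max(a, b) + latency)  # combined result
    return heap[0]

# four operands all ready at time 0: a balanced tree, finishing at depth 2
print(reduction_finish_time([0, 0, 0, 0]))  # 2
# a late operand is folded in last, so it adds only one latency at the end
print(reduction_finish_time([0, 0, 0, 5]))  # 6
```

When all ready times are equal this degenerates to a balanced tree of depth ceil(log2 n), recovering ordinary tree height reduction as a special case.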
Critical Path Delay and Net Delay Reduced
"... Abstract—In this paper, a technique for synthesizing binary tree structure of a nonregenerative logic circuit functionality is proposed, that achieves delay optimization by reducing the logic depth. It also helps in minimizing the resources needed to implement the logic tree structure with FPGA as ..."
Abstract
 Add to MetaCart
Abstract—In this paper, a technique for synthesizing binary tree structure of a nonregenerative logic circuit functionality is proposed, that achieves delay optimization by reducing the logic depth. It also helps in minimizing the resources needed to implement the logic tree structure with FPGA as target technology. Although it is a technology–independent scheme, it guarantees better results overall, even after the technologymapping phase. This is evident from the experimental results obtained and is due to the nature of the proposed heuristic. The practical results derived by targeting Spartan II (XC2S306PQ208) and Virtex II Pro (XC2VP27FG256) FPGA logic families show that there is an explicit maximum combinational path delay optimization by about 7.14%, on an average; reduction in maximum net delay by about 11.8 % and overall decrease in resource utilization by 44.07%, and mean savings in inputoutput buffer count by 43.57%, in comparison with the results corresponding to a recent scheme in literature [12], for both the target devices. Keywords—Synthesis, Delay optimization, Binary logic tree, Logic depth, Boolean distance.