The parallel evaluation of general arithmetic expressions
- Journal of the ACM
, 1974
Cited by 227 (1 self)
ABSTRACT. It is shown that arithmetic expressions with n ≥ 1 variables and constants; operations of addition, multiplication, and division; and any depth of parenthesis nesting can be evaluated in time 4 log_2 n + 10(n - 1)/p using p ≥ 1 processors which can independently perform arithmetic operations in unit time. This bound is within a constant factor of the best possible. A sharper result is given for expressions without the division operation, and the question of numerical stability is discussed. KEY WORDS AND PHRASES: arithmetic expressions, compilation of arithmetic expressions, computational complexity, general arithmetic expressions, numerical stability, parallel computation, ...
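As a rough illustration of the bound quoted in the abstract above, the following sketch evaluates 4 log_2 n + 10(n - 1)/p for a few processor counts. The function name `brent_time_bound` is hypothetical, and the formula is taken as stated in the abstract, assuming unit-time operations and n, p of at least 1.

```python
import math


def brent_time_bound(n: int, p: int) -> float:
    """Upper bound 4*log2(n) + 10*(n - 1)/p, as stated in the abstract.

    n: number of variables and constants in the expression (assumed >= 1)
    p: number of processors, each doing one operation per time unit (assumed >= 1)
    """
    if n < 1 or p < 1:
        raise ValueError("n and p must be at least 1")
    return 4 * math.log2(n) + 10 * (n - 1) / p


if __name__ == "__main__":
    n = 1024
    for p in (1, 16, 256, 1023):
        print(f"n={n}, p={p}: bound = {brent_time_bound(n, p):.2f}")
```

With few processors the 10(n - 1)/p term dominates; as p approaches n the logarithmic term takes over.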
Exploiting Instruction Level Parallelism in the Presence of Conditional Branches
, 1996
Cited by 37 (3 self)
Wide issue superscalar and VLIW processors utilize instruction-level parallelism (ILP) to achieve high performance. However, if insufficient ILP is found, the performance potential of these processors suffers dramatically. Branch instructions, which are one of the major limitations to exploiting ILP, enforce strict ordering conditions in programs to ensure correct execution. Therefore, it is difficult to achieve the desired overlap of instruction execution with branches in the instruction stream. To effectively exploit ILP in the presence of branches requires efficient handling of branches and the dependences they impose. This dissertation investigates two techniques for exposing and enhancing ILP in the presence of branches, speculative execution and predicated execution. Speculative execution enables an ILP compiler to remove dependences between instructions and prior branches. In this manner, the execution of instructions and predicted future instructions may be overlapped. Compiler-controlled speculative execution is employed using an efficient structure called the superblock. The formation and optimization of superblocks increase ILP along important execution paths by systematically removing constraints due to unimportant paths. In conjunction with superblock optimizations, speculative execution is utilized to remove control dependences in the superblock
Compiler Code Transformations for Superscalar-Based High-Performance Systems
- in Proceedings of Supercomputing '92
, 1992
Cited by 28 (7 self)
Exploiting parallelism at both the multiprocessor level and the instruction level is an effective means for supercomputers to achieve high performance. The amount of instruction-level parallelism available to superscalar or VLIW node processors can be limited, however, with conventional compiler optimization techniques. In this paper, a set of compiler transformations designed to increase instruction-level parallelism is described. The effectiveness of these transformations is evaluated using 40 loop nests extracted from a range of supercomputer applications. This evaluation shows that increasing execution resources in superscalar/VLIW node processors yields little performance improvement unless loop unrolling and register renaming are applied. It also reveals that these two transformations are sufficient for DOALL loops. However, more advanced transformations are required in order for serial and DOACROSS loops to fully benefit from the increased execution resources. The results show ...
Optimizations and Oracle Parallelism with Dynamic Translation
- In Proc. 32nd International Symposium on Microarchitecture
, 1999
Cited by 18 (4 self)
We describe several optimizations which can be employed in a dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometimes even surpassing a static compiler employing more sophisticated, and more time-consuming algorithms [9]. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism.
Integrating Program Transformations in the Memory-Based Synthesis of Image and Video Algorithms
- In Proc. of the IEEE Intl. Conf. on Computer-Aided Design (ICCAD)
, 1994
Cited by 14 (3 self)
In this paper we discuss the interaction and integration of two important program transformations in high-level synthesis---Tree Height Reduction and Redundant Memory-access Elimination. Intuitively, these program transformations do not interfere with one another as they optimize different operations in the program graph and different resources in the synthesized system. However, we demonstrate that integration of the two tasks is necessary to better utilize available resources. Our approach involves the use of a "meta-transformation" to guide transformation application as possibilities arise. Results observed on several image and video benchmarks demonstrate that transformation integration increases performance through better resource utilization.
1 Introduction
Tree height reduction (THR) [1, 12] is a well-known technique for reducing the critical path length and increasing the parallelism of expressions and/or recurrences through the introduction of redundant computation. THR has ...
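The tree height reduction idea mentioned in the entry above can be illustrated with a minimal sketch. It is not taken from the paper: it only rebalances a chain of additions using associativity (no redundant computation is introduced), and the helper names `left_deep`, `balanced`, `height`, and `evaluate` are hypothetical.

```python
from functools import reduce


def left_deep(operands):
    """Serial chain ((a + b) + c) + d, as a naive left-to-right parse builds it."""
    return reduce(lambda acc, x: ("+", acc, x), operands)


def balanced(operands):
    """Rebalanced tree (a + b) + (c + d), valid because + is associative."""
    if not operands:
        raise ValueError("need at least one operand")
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balanced(operands[:mid]), balanced(operands[mid:]))


def height(tree):
    """Number of operations on the longest root-to-leaf path."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(height(tree[1]), height(tree[2]))


def evaluate(tree, env):
    """Evaluate a tree of '+' nodes with leaf values looked up in env."""
    if isinstance(tree, tuple):
        return evaluate(tree[1], env) + evaluate(tree[2], env)
    return env[tree]


if __name__ == "__main__":
    names = list("abcdefgh")
    env = {name: i for i, name in enumerate(names)}
    serial, tree = left_deep(names), balanced(names)
    print(height(serial), height(tree))  # 7 3
    print(evaluate(serial, env) == evaluate(tree, env))  # True
```

For eight operands the critical path drops from seven dependent additions to three, while the operation count stays at seven.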
An Efficient Framework For Performing Execution-Constraint-Sensitive Transformations That Increase Instruction-Level Parallelism
, 1997
Cited by 10 (0 self)
The increasing amount of instruction-level parallelism required to fully utilize high issue-rate processors forces the compiler to perform increasingly advanced transformations, many of which require adding extra operations in order to remove those dependences constraining performance. Although aggressive application of these transformations is necessary in order to realize the full performance potential, overly aggressive application can negate their benefit or even degrade performance. This thesis investigates a general framework for applying these transformations at schedule time, which is typically the only time the processor's execution constraints are visible to the compiler. Feedback from the instruction scheduler is then used to aggressively and intelligently apply these transformations. This results in consistently better performance than traditional application methods because the application of transformations can now be more fully adapted to the processor's execution constraints. Techniques for optimizing the processor's machine description for efficient use by the scheduler, and for incrementally updating the dependence graph after performing each transformation, allow the utilization of scheduler feedback with relatively small compile-time overhead.
Timing-Driven Logic Bi-Decomposition
, 2003
Cited by 5 (0 self)
An approach for logic decomposition that produces circuits with reduced logic depth is presented. It combines two strategies: logic bi-decomposition of Boolean functions and tree-height reduction of Boolean expressions. It is a technology-independent approach that enables one to find tree-like expressions with smaller depths than the ones obtained by state-of-the-art techniques. The approach can also be combined with technology mapping techniques aiming at timing optimization. Experimental results show that new points in the area/delay space can be explored, with tangible delay improvements when compared to existing techniques.
Bi-Decomposition and Tree-Height Reduction for Timing Optimization
- Proc. IWLS ’02
Cited by 2 (0 self)
A novel approach for timing-driven logic decomposition is presented. It is based on the combination of two strategies: logic bi-decomposition of Boolean functions and tree-height reduction of Boolean expressions. This technology-independent approach makes it possible to find tree-like expressions with smaller depths than the ones obtained by state-of-the-art techniques. Experimental results show an average delay reduction of more than 20% with regard to speed up in SIS.
Optimal Huffman Tree-Height Reduction for Instruction-Level Parallelism
Cited by 1 (1 self)
Exposing and exploiting instruction-level parallelism (ILP) is a key component of high performance for modern processors. For example, wide-issue superscalar, VLIW, and dataflow processors only attain high performance when they execute nearby instructions in parallel. This paper shows how to use and modify the Huffman coding tree weight minimization algorithm to expose ILP. We apply Huffman to two problems: (1) tree height reduction: rewriting expression trees of commutative and associative operations to minimize tree height and expose ILP; and (2) software fanout: generating software fanout trees to forward values to multiple consumers in a dataflow ISA. Huffman yields two improvements over prior work on tree height reduction: (1) it produces globally optimal trees even when expressions store intermediate values; and (2) it groups and folds constants. For fanout, we weigh the targets by the length of the critical path from the target to the end of its block. Given perfect weights, the compiler can minimize the latency of the tree using Hartley and Casavant’s modification to the Huffman algorithm. Experimental results show that these algorithms have practical benefits, providing modest but interesting improvements over prior work for exposing ILP.
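The Huffman-style approach to tree height reduction described in the entry above can be sketched roughly as follows. This is not the paper's implementation: it assumes each operand has a known ready time, that combining two operands finishes at max(a, b) + latency, and that the operation is commutative and associative. The function names are hypothetical.

```python
import heapq


def serial_completion_time(ready_times, latency=1):
    """Finish time when operands are combined strictly left to right."""
    if not ready_times:
        raise ValueError("need at least one operand")
    finish = ready_times[0]
    for t in ready_times[1:]:
        finish = max(finish, t) + latency
    return finish


def huffman_completion_time(ready_times, latency=1):
    """Finish time when the two earliest-ready values are always combined first.

    Assumed combining rule: a node is ready at max(a, b) + latency,
    in place of the weight sum used in classic Huffman coding.
    """
    if not ready_times:
        raise ValueError("need at least one operand")
    heap = list(ready_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + latency)
    return heap[0]


if __name__ == "__main__":
    ready = [3, 0, 0, 0]  # the first operand arrives late
    print(serial_completion_time(ready))   # 6
    print(huffman_completion_time(ready))  # 4
```

Combining the early operands among themselves keeps the late operand off the critical path until the last operation.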

