## Optimizations and Oracle Parallelism with Dynamic Translation (1999)


### Other Repositories/Bibliography

Venue: Proc. 32nd International Symposium on Microarchitecture

Citations: 19 (5 self)

### BibTeX

@INPROCEEDINGS{Ebcioglu99optimizationsand,

author = {Kemal Ebcioglu and Erik Altman and Sumedh Sathaye and Michael Gschwind},

title = {Optimizations and Oracle Parallelism with Dynamic Translation},

booktitle = {In Proc. 32nd International Symposium on Microarchitecture},

year = {1999},

pages = {284--295},

publisher = {ACM, IEEE, ACM Press}

}

### Abstract

We describe several optimizations that can be employed in a dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of instruction-level parallelism (ILP), sometimes even surpassing a static compiler that employs more sophisticated and more time-consuming algorithms [9]. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism.

### Citations

1027 | How to make a multiprocessor computer that correctly executes multiprocess programs
- Lamport
- 1979
Citation Context ...also require that no loads occur from the address at 8(r1), so as to maintain sequential consistency, i.e. so that a later load in the original code does not get an earlier value than an earlier load [16]. We also observe that the original register (e.g. r31) may have its value available even earlier, by copy propagation, combining, or further load-store telescoping. In other words, these optimization...

386 | Context-sensitive interprocedural points-to analysis in the presence of function pointers
- Emami, Ghiya, et al.
- 1994
Citation Context ...ound on the overhead of our approach, whereas hardware approaches may have to deal with incorrect guesses repeatedly. The problem of scheduling through indirect branches is related to memory aliasing [12, 13] and in particular the problem of determining all possible call sites for a function, and the problem of trying to determine all possible functions an indirect call may invoke. However, these all invo...

360 | Limits of instruction-level parallelism
- Wall
Citation Context ...(N). In addition, our approach supports generation of the intermediate results needed for precise exceptions. Considerable work has also been done on the limits of parallelism and oracle parallelism [29, 28, 15]. None of this work, however, examined the effect of performing optimizations such as load-store telescoping and combining while scheduling for oracle parallelism. 5. Conclusion We have described the im...

277 | Exceeding the dataflow limit via value prediction
- Lipasti, Shen
- 1996
Citation Context ...ime constants for use in dynamic constant propagation, as well as other forms of value prediction. There has also been work on hardware structures which can expedite various forms of value prediction [17, 19]. However, we are not aware of any published work describing ILP extraction techniques efficient enough for use in a dynamic binary translation (DBT) framework. Although there is no hard and fast rule...

238 | The parallel evaluation of general arithmetic expressions
- Brent
- 1974
Citation Context ...arithmetic expression with N operations, uses associativity, commutativity, and distribution to build, in $O(N \log N)$ time, a tree of height at most $\lceil 4\log_2(N-1) \rceil - 1$ and using at most $3N$ function units [3, 4]. Brent’s algorithm could be applied multiple times so as to compute each intermediate result not generated in creating the final result. However, in addition to increasing the time required for tree-...

92 | Streamlining Inter-operation Memory Communication via Data Dependence Prediction
- Moshovos, Sohi
- 1997
Citation Context ...ime constants for use in dynamic constant propagation, as well as other forms of value prediction. There has also been work on hardware structures which can expedite various forms of value prediction [17, 19]. However, we are not aware of any published work describing ILP extraction techniques efficient enough for use in a dynamic binary translation (DBT) framework. Although there is no hard and fast rule...

87 | Abstractions for recursive pointer data structures: Improving the analysis and transformation of imperative programs
- Hendren, Hummell, et al.
- 1992
Citation Context ...ound on the overhead of our approach, whereas hardware approaches may have to deal with incorrect guesses repeatedly. The problem of scheduling through indirect branches is related to memory aliasing [12, 13] and in particular the problem of determining all possible call sites for a function, and the problem of trying to determine all possible functions an indirect call may invoke. However, these all invo...

73 | Programming Languages and Their Compilers, Preliminary notes, second revised version
- Cocke, Schwartz
- 1970
Citation Context ...o compute the same xor result into both r63 and r62 in Figure 3(b). Unification notices this duplication and computes the result only once, as in Figure 3(c). Unification is a form of value numbering [6], which is often employed in common subexpression elimination. With value numbering, as each expression is encountered, it is assigned a value number, typically an index into a hash table based on the...
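The value-numbering idea in this excerpt can be illustrated with a short sketch. It is a minimal Python illustration, not the paper's implementation: the `(dest, op, src1, src2)` tuple format, the `unify` function name, the `COMMUTATIVE` set, and the source registers `r1` and `r2` are all assumptions made for the example. Only the destination registers `r63` and `r62` and the `xor` operation come from the excerpt.

```python
# Hypothetical set of operations whose operand order does not matter.
COMMUTATIVE = {"add", "and", "or", "xor"}


def unify(instructions):
    """Value-number a straight-line list of (dest, op, src1, src2) tuples.

    An instruction that recomputes an expression whose result is still
    held in some register is replaced by a copy from that register.
    """
    value_of = {}   # register -> value number it currently holds
    table = {}      # (op, vn_a, vn_b) -> value number of that expression
    holder = {}     # value number -> a register that was assigned it
    counter = 0
    out = []

    def fresh():
        nonlocal counter
        counter += 1
        return counter

    def vn_of(reg):
        # A register seen for the first time holds an unknown, unique value.
        if reg not in value_of:
            value_of[reg] = fresh()
            holder[value_of[reg]] = reg
        return value_of[reg]

    for dest, op, a, b in instructions:
        operands = [vn_of(a), vn_of(b)]
        if op in COMMUTATIVE:
            operands.sort()          # a op b and b op a share one key
        key = (op, operands[0], operands[1])
        vn = table.get(key)
        src = holder.get(vn) if vn is not None else None
        if src is not None and value_of.get(src) == vn:
            # The result is still available: reuse it instead of recomputing.
            out.append((dest, "copy", src, None))
        else:
            if vn is None:
                vn = table[key] = fresh()
            out.append((dest, op, a, b))
            holder[vn] = dest
        value_of[dest] = vn
    return out


program = [("r63", "xor", "r1", "r2"), ("r62", "xor", "r2", "r1")]
print(unify(program))
```

In this sketch the second `xor` is rewritten as a copy from `r63`, because both instructions map to the same hash-table key. The `value_of.get(src) == vn` check guards against reusing a register that has since been overwritten.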

52 | A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture
- Ebcioglu, Nakatani
- 1989
Citation Context ...erations required for tree height reduction is linear in the number of operations in the associative/commutative dependence chain. 2.7. Software Pipelining A variation of enhanced pipeline scheduling [8] can be used efficiently with dynamic binary translation. Figure 8 gives an example illustrating the use of the algorithm. Figure 8(a) depicts a simple PowerPC loop which counts in r6 the number of en...

51 | Some Design Ideas for a VLIW Architecture for Sequential Natured Software
- Ebcioglu
- 1988
Citation Context ...s for those values. A related approach was described in [30] for use in the Embra simulation system. Unification has roots both in value numbering [6] and the scheduling work of Nicolau and Ebcioğlu [22, 7]. The approach described here is global in scope and quite efficient, but not as general as the Nicolau and Ebcioğlu approach. Software pipelining has a long history. Rau and Glaeser proposed modulo ...

50 | DAISY: Dynamic Compilation for 100
- Ebcioğlu, Altman
- 1997
Citation Context ...ation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometimes even surpassing a static compiler employing more sophisticated, and more time-consuming algorithms [9]. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism. 1. Background and Motivation Binary translation has attracte...

48 | The Java HotSpot performance engine architecture, http://www.java.sun.com/products/hotspot/whitepaper.html
- Microsystems
- 1999
Citation Context ...ploy these optimizations in a dynamic binary translation system capable of computing oracle parallelism. 1. Background and Motivation Binary translation has attracted a great deal of attention lately [5, 10, 11, 18, 25, 26, 27]. Much (though not all) of this work has focused on functionally correct and efficient translation, as well as efficient translated code. There has been some work on optimizations uniquely suited to b...

40 | Mimic: A fast System/370 simulator
- May
- 1987
Citation Context ...ploy these optimizations in a dynamic binary translation system capable of computing oracle parallelism. 1. Background and Motivation Binary translation has attracted a great deal of attention lately [5, 10, 11, 18, 25, 26, 27]. Much (though not all) of this work has focused on functionally correct and efficient translation, as well as efficient translated code. There has been some work on optimizations uniquely suited to b...

27 | Incremental tree height reduction for high level synthesis
- Nicolau, Potasman
- 1991
Citation Context ...in performing its reduction. Brent [3] extended this approach by also making use of the distributive property. This early work applied only to operations in a single basic block. Nicolau and Potasman [23] proposed a tree height reduction technique capable of handling operations from multiple basic blocks. However, their algorithm was $O(N^2)$ in the number of operations N, whereas our algorithm is O(N ...

26 | An architectural framework for supporting heterogeneous instruction-set architectures
- Silberman, Ebcioglu
- 1993
Citation Context ...ploy these optimizations in a dynamic binary translation system capable of computing oracle parallelism. 1. Background and Motivation Binary translation has attracted a great deal of attention lately [5, 10, 11, 18, 25, 26, 27]. Much (though not all) of this work has focused on functionally correct and efficient translation, as well as efficient translated code. There has been some work on optimizations uniquely suited to b...

26 | Achieving High Performance via Co-designed Virtual Machines
- Smith, Heil, et al.
- 1999

22 | Execution-based scheduling for VLIW architectures
- Ebcioglu, Altman, et al.
- 1999

18 | Embra: Fast and Flexible
- Witchel, Rosenblum
- 1996
Citation Context ...void them. This can be accomplished for indirect branches by converting them into a series of conditional branches backstopped by an indirect branch. This is similar to the approach employed in Embra [30]. However, Embra checked only a single value for the indirect branch, whereas we check multiple values. For example, consider a PowerPC indirect branch blr, (Branch to Link Register) which, the first ...
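The conversion described in this excerpt, a chain of conditional branches backstopped by an indirect branch, might look roughly like the following. This is a hedged Python sketch that only emits illustrative pseudo-assembly strings: the function name `expand_indirect_branch`, the `max_checks` parameter, the label scheme, and the mnemonics `cmpi`, `beq`, and `b_indirect` are assumptions for the example, not the translator's actual output.

```python
def expand_indirect_branch(target_reg, observed_targets, max_checks=3):
    """Return pseudo-assembly for a guarded indirect branch.

    Each previously observed target address gets a compare and a
    conditional branch to its (hypothetical) translated code label.
    A plain indirect branch at the end handles every other target.
    """
    lines = []
    for addr in observed_targets[:max_checks]:
        lines.append(f"cmpi  {target_reg}, {addr:#x}")
        lines.append(f"beq   translated_{addr:#x}")
    # Backstop: fall back to the real indirect branch for unseen targets.
    lines.append(f"b_indirect {target_reg}")
    return lines


for line in expand_indirect_branch("lr", [0x1000, 0x2000]):
    print(line)
```

Checking several observed targets, rather than one, follows the distinction the excerpt draws with Embra. How many targets to check and how they are collected are left open here.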

13 | Run-time detection and recovery from incorrectly ordered memory operations
- Moudgill, Moreno
- 1997
Citation Context ...y r4,r4’ copy r7,r7’ Even if condition (2) does not hold and an ambiguous intervening store or load does occur, load-store telescoping may be employed if the hardware supports a load-verify operation [20] or some other means to detect incorrect memory aliasing when it occurs. Load-verify loads a value and compares it to the value in its “destination register”. This destination register typically hold...

12 | Compilation of arithmetic expressions for parallel computations
- Baer, Bovet
- 1968
Citation Context ...closely related to enhanced pipeline scheduling [8], and again emphasizes an efficient implementation. A comprehensive tree height reduction algorithm was first proposed by Baer and Bovet [2]. This approach used only associativity and commutativity in performing its reduction. Brent [3] extended this approach by also making use of the distributive property. This early work applied only to...

4 | The PowerPC Microprocessor Family: The Programming Environments Manual for 32-Bit Microprocessors
- IBM, Motorola
Citation Context ...Clearly the target architecture must have more registers than the original base architecture under this scheme. 2.1. Copy Propagation There are a variety of instructions in the PowerPC architecture [14], and in most architectures, for copying a value from one register to another. For example, `addi r3,r4,0`, `or r3,r4,r4`, `oril r3,r4,0`, and `rlinm r3,r4,0,0,31` are all examples of operations which copy r4 to r3 in...
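Recognizing the four copy idioms listed in this excerpt could be sketched as below. This is a minimal Python illustration under stated assumptions: the function name `as_copy` and the representation of an instruction as a mnemonic plus an operand tuple are hypothetical. The `r0` special case reflects the PowerPC rule that `addi` with `r0` as its source operand uses the literal value 0 rather than the contents of `r0`.

```python
def as_copy(mnemonic, operands):
    """Return (dest, src) if the instruction only copies src to dest.

    Covers the four idioms from the excerpt; returns None otherwise.
    Register operands are strings, immediates are ints (assumed format).
    """
    if mnemonic == "addi" and len(operands) == 3:
        dest, src, imm = operands
        # addi with r0 as source means "load immediate", not a copy of r0.
        if imm == 0 and src != "r0":
            return dest, src
    elif mnemonic == "or" and len(operands) == 3:
        dest, a, b = operands
        if a == b:                       # x OR x == x
            return dest, a
    elif mnemonic == "oril" and len(operands) == 3:
        dest, src, imm = operands
        if imm == 0:                     # x OR 0 == x
            return dest, src
    elif mnemonic == "rlinm" and len(operands) == 5:
        dest, src, shift, mask_begin, mask_end = operands
        # Rotate by 0 under a full 32-bit mask leaves the value unchanged.
        if (shift, mask_begin, mask_end) == (0, 0, 31):
            return dest, src
    return None


print(as_copy("addi", ("r3", "r4", 0)))
print(as_copy("or", ("r3", "r4", "r5")))
```

A copy-propagation pass could use such a predicate to record that `r3` currently equals `r4` and substitute `r4` in later uses. That substitution step is not shown here.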

1 | On the Time Required to Parse an Arithmetic Expression for
- Brent, Towle
- 1976
Citation Context ...arithmetic expression with N operations, uses associativity, commutativity, and distribution to build, in $O(N \log N)$ time, a tree of height at most $\lceil 4\log_2(N-1) \rceil - 1$ and using at most $3N$ function units [3, 4]. Brent’s algorithm could be applied multiple times so as to compute each intermediate result not generated in creating the final result. However, in addition to increasing the time required for tree-...

1 | Combining as a Compilation Technique for VLIW
- Nakatani, Ebcioglu
- 1989
Citation Context ...e is very limited. More specifically: Copy propagation is a basic optimization described in [1]. Combining is a form of constant propagation [1], whose application for extracting ILP was described in [21]. Load-store telescoping of the sort we describe is unique to dynamic binary translation, since in order to be generally applicable, it must be possible to recover if load-store telescoping turns out ...

1 | Percolation Scheduling: A Parallel Compilation
- Nicolau
- 1985
Citation Context ...s for those values. A related approach was described in [30] for use in the Embra simulation system. Unification has roots both in value numbering [6] and the scheduling work of Nicolau and Ebcioğlu [22, 7]. The approach described here is global in scope and quite efficient, but not as general as the Nicolau and Ebcioğlu approach. Software pipelining has a long history. Rau and Glaeser proposed modulo ...

1 | Some Scheduling Techniques and an Easily Schedulable Horizontal Architecture for High Performance
- Rau, Glaeser
- 1981
Citation Context ...ach described here is global in scope and quite efficient, but not as general as the Nicolau and Ebcioğlu approach. Software pipelining has a long history. Rau and Glaeser proposed modulo scheduling [24]. The approach here, however, is much more closely related to enhanced pipeline scheduling [8], and again emphasizes an efficient implementation. A comprehensive tree height reduction algorithm was fir...

1 | On the Limits of Program Parallelism and its
- Theobald, Gao, et al.
- 1992
Citation Context ...(N). In addition, our approach supports generation of the intermediate results needed for precise exceptions. Considerable work has also been done on the limits of parallelism and oracle parallelism [29, 28, 15]. None of this work, however, examined the effect of performing optimizations such as load-store telescoping and combining while scheduling for oracle parallelism. 5. Conclusion We have described the im...