Results 11 - 20
of
35
Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism
- In Proceedings of the Conference on Programming Language Design and Implementation
, 1995
"... Traditional list schedulers order instructions based on an optimistic estimate of the load latency imposed by the hardware and therefore cannot respond to variations in memory latency caused by cache hits and misses on non-blocking architectures. In contrast, balanced scheduling schedules instructio ..."
Abstract
-
Cited by 18 (2 self)
- Add to MetaCart
Traditional list schedulers order instructions based on an optimistic estimate of the load latency imposed by the hardware and therefore cannot respond to variations in memory latency caused by cache hits and misses on non-blocking architectures. In contrast, balanced scheduling schedules instructions based on an estimate of the amount of instruction-level parallelism in the program. By scheduling independent instructions behind loads based on what the program can provide, rather than what the implementation stipulates in the best case (i.e., a cache hit), balanced scheduling can hide variations in memory latencies more effectively. Since its success depends on the amount of instruction-level parallelism in the code, balanced scheduling should perform even better when more parallelism is available. In this study, we combine balanced scheduling with three compiler optimizations that increase instruction-level parallelism: loop unrolling, trace scheduling and cache locality analysis. Usi...
Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures
, 2004
"... Technology trends present new challenges for processor architectures and their instruction schedulers. Growing transistor density will increase the number of execution units on a single chip, and decreasing wire transmission speeds will cause long and variable on-chip latencies. These trends will se ..."
Abstract
-
Cited by 18 (11 self)
- Add to MetaCart
Technology trends present new challenges for processor architectures and their instruction schedulers. Growing transistor density will increase the number of execution units on a single chip, and decreasing wire transmission speeds will cause long and variable on-chip latencies. These trends will severely limit the two dominant conventional architectures: dynamic issue superscalars, and static placement and issue VLIWs. We present a new execution model in which the hardware and static scheduler instead work cooperatively, called Static Placement Dynamic Issue (SPDI). This paper focuses on the static instruction scheduler for SPDI. We identify and explore three issues SPDI schedulers must consider---locality, contention, and depth of speculation. We evaluate a range of SPDI scheduling algorithms executing on an Explicit Data Graph Execution (EDGE) architecture. We find that a surprisingly simple one achieves an average of 5.6 instructions-per-cycle (IPC) for SPEC2000 64-wide issue machine, and is within 80% of the performance without on-chip latencies. These results suggest that the compiler is effective at balancing on-chip latency and parallelism, and that the division of responsibilities between the compiler and the architecture is well suited to future systems. 1
Optimally Profiling and Tracing
- In Proceedings of the Conference on Principles of Programming Languages
, 1992
"... This paper describes algorithms for inserting monitoring code to profile and trace programs. These algorithms greatly reduce the cost of measuring programs with respect to the commonly used technique of placing code in each basic block. Program profiling counts the number of times each basic block i ..."
Abstract
-
Cited by 11 (0 self)
- Add to MetaCart
This paper describes algorithms for inserting monitoring code to profile and trace programs. These algorithms greatly reduce the cost of measuring programs with respect to the commonly used technique of placing code in each basic block. Program profiling counts the number of times each basic block in a program executes. Instruction tracing records the sequence of basic blocks traversed in a program execution. The algorithms optimize the placement of counting/tracing code with respect to the expected or measured frequency of each block or edge in a program’s control-flow graph. We have implemented the algorithms in a profiling/tracing tool, and they substantially reduce the overhead of profiling and tracing. We also define and study the hierarchy of profiling problems. These problems have two dimensions: what is profiled (i.e., vertices (basic blocks) or edges in a control-flow graph) and where the instrumentation code is placed (in blocks or along edges). We compare the optimal solutions to the profiling problems and describe a new profiling problem: basic-block profiling with edge counters. This problem is important because an optimal solution to any other profiling problem (for a given control-flow graph) is never better than an optimal solution to this problem. Unfortunately, finding an optimal placement of edge counters for vertex profiling appears to be a
A Compiler for Application-Specific Signal Processors
, 1988
"... nd family deserve thanks for many things. Special thanks go to William Daly of Dumont High School, who encouraged my interest in computers and even hired me as a programmer. iii iv Contents 1 Introduction 1 1.1 Application-specic signal processors : : : : : : : : : : : : : : : : : : : : : : 1 1. ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
nd family deserve thanks for many things. Special thanks go to William Daly of Dumont High School, who encouraged my interest in computers and even hired me as a programmer. iii iv Contents 1 Introduction 1 1.1 Application-specic signal processors : : : : : : : : : : : : : : : : : : : : : : 1 1.2 Generating horizontal microcode : : : : : : : : : : : : : : : : : : : : : : : : 5 2 Target Architectures 11 2.1 The Kappa datapath : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 12 2.2 The boolean and control units : : : : : : : : : : : : : : : : : : : : : : : : : : 14 3 The RL Compiler 16 3.1 RL : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 16 3.2 The register-transfer notation : : : : : : : : : : : : : : : : : : : : : : : : : : 25 3.3 The machine description : : : : : : :
An Aggressive Approach to Loop Unrolling
- Proc. Compiler Construction '96
, 1995
"... A well-known code transformation for improving the execution performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually, but not always, requires fewer instruction executions than the original loop. The reduction in instruction execu ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
A well-known code transformation for improving the execution performance of a program is loop unrolling. The most obvious benefit of unrolling a loop is that the transformed loop usually, but not always, requires fewer instruction executions than the original loop. The reduction in instruction executions comes from two sources: the number of branch instructions executed is reduced, and the index variable is modified fewer times. In addition, for architectures with features designed to exploit instruction-level parallelism, loop unrolling can expose greater levels of instructionlevel parallelism. Loop unrolling is an effective code transformation often improving the execution performance of programs that spend much of their execution time in loops by ten to thirty percent. Possibly because of the effectiveness of a simple application of loop unrolling, it has not been studied as extensively as other code improvements such as register allocation or common subexpression elimination. The r...
Fast, Effective Dynamic Compilation
, 1996
"... Dynamic compilation enables optimizations based on the values of invariant data computed at run-time. Using the values of these runtime constants, a dynamic compiler can eliminate their memory loads, perform constant propagation and folding, remove branches they determine, and fully unroll loops the ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Dynamic compilation enables optimizations based on the values of invariant data computed at run-time. Using the values of these runtime constants, a dynamic compiler can eliminate their memory loads, perform constant propagation and folding, remove branches they determine, and fully unroll loops they bound. However, the performance benefits of the more efficient, dynamically-compiled code are offset by the run-time cost of the dynamic compile. Our approach to dynamic compilation strives for both fast dynamic compilation and high-quality dynamically-compiled code: the programmer annotates regions of the programs that should be compiled dynamically; a static, optimizing compiler automatically produces pre-optimized machine-code templates, using a pair of dataflow analyses that identify which variables will be constant at run-time; and a simple, dynamic compiler copies the templates, patching in the computed values of the run-time constants, to produce optimized, executable code. Our work...
Allocating Registers in Multiple Instruction-Issuing Processors
- In Proceedings of the IFIP WG 10.3 Working Conference on Parallel Architectures and Compilation Techniques, PACT'95
, 1995
"... : This work addresses the problem of scheduling a basic block of operations on a multiple instruction-issuing processor. We show that integrating register constraints into operation sequencing algorithms is a complex problem in itself. Indeed, while scheduling a forest of unit time operations on a p ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
: This work addresses the problem of scheduling a basic block of operations on a multiple instruction-issuing processor. We show that integrating register constraints into operation sequencing algorithms is a complex problem in itself. Indeed, while scheduling a forest of unit time operations on a processor with P parallel instruction slots can be solved in polynomial time, the problem becomes NP-hard when P is unbounded but only R registers are available. As a result we have devised a concise integer linear programming formulation of this scheduling problem that accounts for both register and instruction issuing constraints. This allows the use of off-the-shelf routines to find optimum solutions, which can then be compared with the results obtained by polynomial-time heuristics. Two such heuristics are given, and their combined results are shown to be optimal in 99.5% of the cases for trees of height at most 6. A byproduct of these experiments is to show that our integer programming f...
Exploiting Multi-Grained Parallelism For Multiple-Instruction-Stream Architectures
, 1997
"... Exploiting parallelism is an essential part of maximizing the performance of an application on a parallel computer. Parallelism is traditionally exploited at two granularities: individual operations are executed in parallel within a processor to exploit instruction-level parallelism and loop iterati ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Exploiting parallelism is an essential part of maximizing the performance of an application on a parallel computer. Parallelism is traditionally exploited at two granularities: individual operations are executed in parallel within a processor to exploit instruction-level parallelism and loop iterations or processes are executed in parallel on different processors to exploit loop-level parallelism and process-level parallelism. A new generation of architectures that execute multiple instruction streams on a single chip has the potential of significantly reducing the gap between communication costs within a processor and between processors. This means that parallelism of multiple granularities can be exploited between instruction streams by overlapping regions of code that range in granularity from a small set of instructions to basic blocks, conditionals, loop iterations, loop nests, procedure calls, and collections of such constructs. This opens the way to exploiting more parallelism i...
The Use of Control-Flow and Control Dependence in Software Tools
, 1993
"... Program development, debugging, and maintenance can be greatly improved by the use of software tools that provide information about program behavior. This thesis focuses on a number of useful software tools and shows how their efficiency, generality, and precision can be increased through the use of ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Program development, debugging, and maintenance can be greatly improved by the use of software tools that provide information about program behavior. This thesis focuses on a number of useful software tools and shows how their efficiency, generality, and precision can be increased through the use of control-flow and control dependence analysis. We consider two classes of tools: execution measurement tools, which collect information about a particular program execution; and program analysis tools, which provide information about potential program behavior by statically analyzing the program. We consider three tools that measure aspects of a program's execution: profiling, tracing, and event counting tools. We describe algorithms for profiling and tracing programs that use a combination of control-flow analysis and program instrumentation to produce exact profiles and traces with low run-time overhead. Rather than record information at every point in a program, the algorithms record info...
Run-time versus Compile-time Instruction Scheduling in Superscalar (RISC) Processors: Performance and Tradeoffs
- In: Proceedings of the third International Conference of High Performance Computing, ACM
, 1995
"... The RISC revolution has spurred the development of processors with increasing degrees of instruction level parallelism (ILP). In order to realize the full potential of these processors, multiple instructions must continuouslybe issued and executed in a single cycle. Consequently, instruction schedu ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
The RISC revolution has spurred the development of processors with increasing degrees of instruction level parallelism (ILP). In order to realize the full potential of these processors, multiple instructions must continuouslybe issued and executed in a single cycle. Consequently, instruction scheduling plays a crucial role as an optimization in this context. While early attempts at instruction scheduling were limited to compile-time approaches, the current trends are aimed at providing dynamic support in hardware. In this paper, we present the results of a detailed comparative study of the performance advantages to be derived by the spectrum of instruction scheduling approaches: from limited basic-block schedulers in the compiler, to novel and aggressive schedulers in hardware. A significant portion of our experimental study via simulations, is devoted to understanding the performance advantages of run-time scheduling. Our results indicate it to be effective in extracting the ILP inh...

