DAISY: Dynamic Compilation for 100% Architectural Compatibility
, 1997
Cited by 173 (12 self)
Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read-only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
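The translate-on-first-execution behaviour described in this abstract can be illustrated with a small sketch. It is not DAISY's implementation: the class name `TranslationCache`, the stand-in `toy_translate` function, and the least-recently-used cast-out policy are all assumptions made for illustration, since the abstract only says that fragments are translated once, saved, and retranslated if cast out.

```python
from collections import OrderedDict


class TranslationCache:
    """Toy model: translate a fragment on first execution, reuse it afterwards.

    The LRU cast-out policy is an assumption; the abstract does not say
    which policy DAISY uses.
    """

    def __init__(self, capacity, translate):
        self.capacity = capacity      # maximum number of saved fragments
        self.translate = translate    # hypothetical translator callback
        self.cache = OrderedDict()    # fragment address -> translated code
        self.translations = 0         # how many times we had to translate

    def fetch(self, address):
        if address in self.cache:
            # Already translated: no translation cost, just mark as recently used.
            self.cache.move_to_end(address)
            return self.cache[address]
        # First execution (or previously cast out): translate and save.
        code = self.translate(address)
        self.translations += 1
        self.cache[address] = code
        if len(self.cache) > self.capacity:
            # Cast out the least recently used fragment.
            self.cache.popitem(last=False)
        return code


def toy_translate(address):
    # Stand-in for real translation and parallelization of a code fragment.
    return f"vliw@{address:#x}"


tc = TranslationCache(capacity=2, translate=toy_translate)
for addr in [0x100, 0x200, 0x100, 0x300, 0x200]:
    tc.fetch(addr)
print(tc.translations)  # 4: fragment 0x200 was cast out and translated again
```

In this run the second fetch of `0x100` is a hit, while `0x200` is cast out when `0x300` arrives and must be translated a second time.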
Characterizing the Impact of Predicated Execution on Branch Prediction
, 1994
Cited by 47 (9 self)
Branch instructions are recognized as a major impediment to exploiting instruction level parallelism. Even with sophisticated branch prediction techniques, many frequently executed branches remain difficult to predict. An architecture supporting predicated execution may allow the compiler to remove many of these hard-to-predict branches, reducing the number of branch mispredictions and thereby improving performance. We present an in-depth analysis of the characteristics of those branches which are frequently mispredicted and examine the effectiveness of an advanced compiler to eliminate these branches. Over the benchmarks studied, an average of 27% of the dynamic branches and 56% of the dynamic branch mispredictions are eliminated with predicated execution support.
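As a rough illustration of how predication removes a branch, the sketch below contrasts a branching clamp with a version in which both candidate values are available and a predicate selects between them. The function names are hypothetical, and Python is used only to show the data flow; a real predicated architecture would guard individual instructions with the predicate rather than use arithmetic selection.

```python
def clamp_branching(x, limit):
    # Control dependence: the assignment executes only if the branch is taken.
    if x > limit:
        x = limit
    return x


def clamp_predicated(x, limit):
    # If-converted form: compute a predicate, then select the result from it.
    # There is no branch left to predict; the control dependence has become
    # a data dependence on p.
    p = int(x > limit)            # predicate: 1 if the "then" path applies
    return p * limit + (1 - p) * x


for value in (-3, 0, 7, 10, 11, 250):
    assert clamp_branching(value, 10) == clamp_predicated(value, 10)
```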
Exploiting Instruction Level Parallelism in the Presence of Conditional Branches
, 1996
Cited by 37 (3 self)
Wide issue superscalar and VLIW processors utilize instruction-level parallelism (ILP) to achieve high performance. However, if insufficient ILP is found, the performance potential of these processors suffers dramatically. Branch instructions, which are one of the major limitations to exploiting ILP, enforce strict ordering conditions in programs to ensure correct execution. Therefore, it is difficult to achieve the desired overlap of instruction execution with branches in the instruction stream. To effectively exploit ILP in the presence of branches requires efficient handling of branches and the dependences they impose. This dissertation investigates two techniques for exposing and enhancing ILP in the presence of branches, speculative execution and predicated execution. Speculative execution enables an ILP compiler to remove dependences between instructions and prior branches. In this manner, the execution of instructions and predicted future instructions may be overlapped. Compiler-controlled speculative execution is employed using an efficient structure called the superblock. The formation and optimization of superblocks increase ILP along important execution paths by systematically removing constraints due to unimportant paths. In conjunction with superblock optimizations, speculative execution is utilized to remove control dependences in the superblock
Architectural Support for Compiler-Synthesized Dynamic Branch Prediction Strategies: Rationale and Initial Results
- In Proceedings of the Third International Symposium on High-Performance Computer Architecture
, 1997
Cited by 26 (3 self)
This paper introduces a new architectural approach that supports compiler-synthesized dynamic branch prediction. In compiler-synthesized dynamic branch prediction, the compiler generates code sequences that, when executed, digest relevant state information and execution statistics into a condition bit, or predicate. The hardware then utilizes this information to make predictions. Two categories of such architectures are proposed and evaluated. In Predicate Only Prediction (POP), the hardware simply uses the condition generated by the code sequence as a prediction. In Predicate Enhanced Prediction (PEP), the hardware uses the generated condition to enhance the accuracy of conventional branch prediction hardware. The IMPACT compiler currently provides a minimal level of compiler support for the proposed approach. Experiments based on current predicated code show that the proposed predictors achieve better performance than conventional branch predictors. Furthermore, they enable future c...
Bitwidth Cognizant Architecture Synthesis of Custom Hardware Accelerators
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
, 2001
Cited by 22 (2 self)
Keywords: application-specific design, architecture synthesis, bitwidth, clustering, embedded system, hardware accelerator, operation scheduling, resource allocation
PICO is a system for automatically synthesizing embedded hardware accelerators from loop nests specified in the C programming language. A key issue confronted when designing such accelerators is the optimization of hardware by exploiting information that is known about the varying number of bits required to represent and process operands. In this paper, we describe the handling and exploitation of integer bitwidth in PICO. A bitwidth analysis procedure is used to determine bitwidth requirements for all integer variables and operations in a C application. Given known bitwidths for all variables, complex problems arise when determining a program schedule that specifies on which function unit and at what time each operation executes. If operations are assigned to function units with no knowledge of bitwidth, bitwidth-related cost benefit is lost when each unit is built to accommodate the widest operation assigned. By carefully placing operations of similar width on the same unit, hardware costs are decreased. This problem is addressed using a preliminary clustering of operations that is based jointly on width and implementation cost. These clusters are then honored during resource allocation and operation scheduling to create an efficient width-conscious design. Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.
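The cost argument in this abstract (each function unit must be as wide as the widest operation assigned to it) can be made concrete with a toy model. This is not PICO's algorithm: the linear cost model, the function names, and the two assignment strategies are assumptions, and the sketch clusters on width alone, whereas the paper clusters jointly on width and implementation cost and must also respect the schedule.

```python
def unit_cost(widths):
    # Assumed cost model: a unit costs as much as its widest operation.
    return max(widths) if widths else 0


def total_cost(assignment):
    return sum(unit_cost(unit) for unit in assignment)


def assign_round_robin(widths, n_units):
    # Width-oblivious assignment: deal operations out in program order.
    units = [[] for _ in range(n_units)]
    for i, w in enumerate(widths):
        units[i % n_units].append(w)
    return units


def assign_by_width(widths, n_units):
    # Width-aware assignment: sort, then give each unit a run of similar widths.
    ordered = sorted(widths)
    size = -(-len(ordered) // n_units)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


op_widths = [8, 32, 8, 32, 16, 16]   # hypothetical operation bitwidths
oblivious = assign_round_robin(op_widths, 3)
clustered = assign_by_width(op_widths, 3)
print(total_cost(oblivious))  # 80: two units are forced up to 32 bits
print(total_cost(clustered))  # 56: units of 8, 16 and 32 bits
```

Under these assumptions, grouping operations of similar width lowers the summed unit width from 80 to 56 bits for the same six operations.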
Compiler Technology for Future Microprocessors
- Proceedings of the IEEE
, 1995
Cited by 19 (3 self)
Advances in hardware technology have made it possible for microprocessors to execute a large number of instructions concurrently (i.e., in parallel). These microprocessors take advantage of the opportunity to execute instructions in parallel to increase the execution speed of a program. As in other forms of parallel processing, the performance of these microprocessors can vary greatly depending on the quality of the software. In particular, the quality of compilers can make an order of magnitude difference in performance. This paper presents a new generation of compiler technology that has emerged to deliver the large amount of instruction-level-parallelism that is already required by some current state-of-the-art microprocessors and will be required by more future microprocessors. We introduce critical components of the technology which deal with difficult problems that are encountered when compiling programs for a high degree of instruction-level-parallelism. We present examples to i...
REGION-BASED COMPILATION
, 1996
Cited by 18 (1 self)
The increasing amount of instruction-level parallelism (ILP) required to fully utilize high issue-rate processors has forced the compiler to perform more aggressive analysis, optimization, parallelization and scheduling on the input programs. Yet, the compiler designer must scale back the use of aggressive transformations in order to contain compile time and memory usage. The root of the problem lies in the function-oriented framework assumed in conventional compilers. Traditionally the compilation process has been built using the function as a compilation unit, because the function provides a convenient partition of the program. However, the size and contents of a function may not provide the best environment for aggressive analysis and optimization. This dissertation presents a technique in which the compiler is allowed to repartition the program into more desirable compilation units, called regions. Placing the compiler in control of the size and contents of the compilation unit reduces the importance of the algorithmic complexity of the applied transformations, allowing more aggressive transformations to be applied while reducing compilation time. The region concept has been traditionally applied within an ILP compiler only in the context of code scheduling. This dissertation proposes extending the concept of region partitioning to
Enhancing Instruction Level Parallelism Through Compiler-Controlled Speculation
, 1995
Cited by 17 (0 self)
... depends on speculative support to achieve high performance [5]. Without speculative support, very little execution overlap between loop iterations is achieved. This dissertation discusses the problems that must be addressed to perform compile-time speculation for acyclic global scheduling, classifies existing speculation models based upon how they solve these problems, and discusses two new compile-time or compiler-controlled speculation models: write-back suppression speculation and safe speculation.
ShiftQ: A buffered interconnect for custom loop accelerators
, 2001
Cited by 8 (0 self)
ShiftQs are hardware structures consisting of registers and switches which buffer and transport operands among function units within custom hardware loop accelerators. ShiftQs help minimize buffering and interconnect costs by customizing the hardware to the given schedule and by intelligent sharing of register and interconnect resources. This paper describes the ShiftQ schema and a method to automatically synthesize them from modulo-scheduled loops. We also evaluate the cost savings by comparing them against traditional storage and interconnect mechanisms.
Modulo Scheduling for Control-Intensive General-Purpose Programs
, 1997
Cited by 7 (0 self)
It is increasingly necessary for the compiler to overlap successive loop iterations in order to find sufficient instruction-level parallelism to effectively utilize the resources of high-performance processors. Two competing methods have been developed for moving instructions across iteration boundaries: unrolling followed by global acyclic scheduling and software pipelining. This dissertation investigates modulo scheduling, a software pipelining technique. Much of the previous work on modulo scheduling has targeted the relatively well-behaved loops in numeric programs. This dissertation develops new techniques that allow modulo scheduling to be effectively applied to control-intensive non-numeric programs. These techniques overcome the restrictions imposed by problematic control flow and loop exits. This dissertation also demonstrates that unrolling-based optimization prior to scheduling improves the performance of modulo scheduled loops and is, in fact, necessary to allow modulo scheduling to surpass the performance of acyclic scheduling for control-intensive general-purpose programs. Modulo scheduling has the following advantages over the acyclic scheduling approach for control-intensive general-purpose programs. First, modulo scheduling increases performance by maintaining the overlap of loop iterations throughout the execution of the loop. Second,
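The benefit of overlapping loop iterations, which the abstract above attributes to modulo scheduling, can be estimated with a simple cycle count. The sketch assumes an idealized loop with no stalls or early exits; the term "initiation interval" (the number of cycles between the starts of successive iterations) is standard software-pipelining vocabulary rather than something stated in the abstract, and the function names and numbers are hypothetical.

```python
def cycles_without_overlap(n_iters, schedule_length):
    # Each iteration runs to completion before the next one starts.
    return n_iters * schedule_length


def cycles_modulo_scheduled(n_iters, schedule_length, initiation_interval):
    # A new iteration starts every `initiation_interval` cycles; the last
    # iteration to start still needs `schedule_length` cycles to finish.
    if n_iters == 0:
        return 0
    return (n_iters - 1) * initiation_interval + schedule_length


# Hypothetical loop: 100 iterations, 8-cycle iteration schedule, new start every 2 cycles.
print(cycles_without_overlap(100, 8))       # 800
print(cycles_modulo_scheduled(100, 8, 2))   # 206
```

When the initiation interval equals the schedule length there is no overlap, and the two counts coincide.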

