Instruction-Level Parallel Processing: History, Overview and Perspective
, 1992
Cited by 166 (0 self)
Instruction-level Parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built, and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP have become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.
The Performance Impact of Incomplete Bypassing in Processor Pipelines
- In Proceedings of the 28th Annual International Symposium on Microarchitecture
, 1995
Cited by 29 (0 self)
Pipelined processors employ hardware bypassing to eliminate certain pipeline hazards. Bypassing is logically simple but can be costly, especially in wide issue and deeply pipelined machines. In this paper bypassing is studied in detail, with an emphasis on designs in which the bypassing network is not complete. Cycle-level simulations of a model of integer and floating-point pipelines running some of the SPEC92 benchmarks show that at least half of the instructions executed used a bypassed register result from a previous instruction. Missing bypasses induce interlock stalls. The paper reports measurements of the performance impact of a number of pipeline configurations with incomplete bypassing networks. This impact ranges from a slowdown of just a few percent for a configuration with one late bypass missing to a slowdown of almost a factor of two for the integer pipe with no bypassing at all. Two types of code alterations reduce the new interlock stalls. A simple code transformation, th...
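The interlock stalls mentioned in this abstract can be illustrated with a toy model. The sketch below is not the paper's simulator: it assumes a single-issue, in-order pipeline in which a result becomes usable 1 cycle after its producer issues when a bypass exists, and 3 cycles after when the consumer must wait for the register file. Both latencies, the function name count_stalls, and the four-instruction program are illustrative assumptions.

```python
# Toy interlock model (assumed latencies, not taken from the paper).
FULL_BYPASS_LATENCY = 1   # assumption: result forwarded to the next instruction
NO_BYPASS_LATENCY = 3     # assumption: consumer waits for register write-back

# Each instruction is (destination register, list of source registers).
PROGRAM = [
    ("r1", ["r0"]),         # produces r1
    ("r2", ["r1"]),         # uses r1 immediately
    ("r3", ["r2", "r1"]),   # uses r2 immediately
    ("r4", ["r0"]),         # independent of the chain above
]


def count_stalls(program, result_latency):
    """Count interlock stall cycles for an in-order, single-issue pipeline."""
    ready_at = {}    # register -> first cycle at which a consumer may issue
    next_slot = 0    # earliest cycle at which the next instruction could issue
    stalls = 0
    for dest, sources in program:
        # Issue when the slot is free and every source operand is readable.
        issue = max([next_slot] + [ready_at.get(reg, 0) for reg in sources])
        stalls += issue - next_slot
        ready_at[dest] = issue + result_latency
        next_slot = issue + 1
    return stalls
```

Under these assumptions the dependent chain runs without stalls when results are forwarded, and each back-to-back dependence costs two stall cycles when they are not.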
Achieving High Levels of Instruction-Level Parallelism With Reduced Hardware Complexity
, 1997
Cited by 17 (3 self)
Keywords: instruction-level parallelism, VLIW processors, superscalar processors, overlapped execution, out-of-order execution, speculative execution, branch prediction, instruction scheduling, compile-time speculation, predicated execution, data speculation, HPL PlayDoh
Instruction-level parallel processing (ILP) has established itself as the only viable approach for achieving the goal of providing continuously increasing performance without having to fundamentally rewrite the application. ILP processors differ in their strategies for deciding exactly when, and on which functional unit, an operation should be executed. The alternatives lie somewhere on a spectrum depending on the extent to which these decisions are made by the compiler rather than by the hardware and on the manner in which information regarding parallelism is communicated by the compiler to the hardware via the program. HPL PlayDoh is a research architecture that has been defined to support research in ILP, with a bias towards VLIW processing. The overall objective of this research effort is to develop a suite of architectural features and compiler techniques that will enable a second generation of VLIW processors to achieve high levels of ILP, across both scientific and non-scientific computations, but with hardware that is simple compared to out-of-order superscalar processors. The basic approach is to provide the program (compiler) more control over capabilities that, in superscalar processors, are typically microarchitectural (i.e., controlled by the hardware) by raising them to the architectural level.
Operation Tables for Scheduling in the Presence of Incomplete Bypassing
- in CODES+ISSS ’04: Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
, 2004
Cited by 12 (9 self)
Register bypassing is a powerful and widely used feature in modern processors to eliminate certain data hazards. Although complete bypassing is ideal for performance, bypassing has significant impact on cycle time, area, and power consumption of the processor. Due to the strict constraints on performance, cost and power consumption in embedded processors, architects need to evaluate and implement incomplete register bypassing mechanisms. However, traditional data hazard detection and/or avoidance techniques used in retargetable schedulers break down in the presence of incomplete bypassing. In this paper, we present the concept of Operation Tables, which can be used to detect data hazards, even in the presence of incomplete bypassing. Furthermore, our technique integrates the detection of both data, as well as resource hazards, and can be easily employed in a compiler to generate better schedules. Our experimental results on the popular Intel XScale embedded processor platform show that even with a simple intra-basic block scheduling technique, we achieve up to 20% performance improvement over fully optimized GCC generated code on embedded applications from the MiBench suite.
PBExplore: A framework for compiler-in-the-loop exploration of partial bypassing in embedded processors
- In DATE ’05: Proceedings of the conference on Design, Automation and Test in Europe
, 2005
Cited by 8 (4 self)
Varying partial bypassing in pipelined processors is an effective way to make performance, area and energy tradeoffs in embedded processors. However, performance evaluation of partial bypassing in processors has been inaccurate, largely due to the absence of bypass-sensitive retargetable compilation techniques. Furthermore, no existing partial bypass exploration framework estimates the power and cost overhead of partial bypassing. In this paper we present PBExplore: A framework for Compiler-in-the-Loop exploration of partial bypassing in processors. PBExplore accurately evaluates the performance of a partially bypassed processor using a generic bypass-sensitive compilation technique. It synthesizes the bypass control logic and estimates the area and energy overhead of each bypass configuration. PBExplore is thus able to effectively perform multi-dimensional exploration of the partial bypass design space. We present experimental results on the Intel XScale architecture on MiBench benchmarks and demonstrate the need, utility and exploration capabilities of PBExplore.
Iterative Compilation and Performance Prediction for Numerical Applications
, 2004
Cited by 7 (5 self)
As the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits from different optimisations and there are no simple criteria for when to stop optimising, i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring using a new reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of
Electronic computers: a historical survey
- ACM Computing Surveys
, 1969
Cited by 6 (0 self)
The first large scale electronic computers were built in connection with university projects sponsored by government military and research organizations. Many established companies, as well as new companies, entered the computer field during the first generation, 1947-1959, in which the vacuum tube was almost universally used as the active component in the implementation of computer logic. The second generation was characterized by the transistorized computers that began to appear in 1959. Some of the computers built then and since are considered super computers; they attempt to go to the limit of current technology in terms of size, speed, and logical complexity. From 1965 onward, most new computers belong to a third generation, which features integrated circuit technology and multiprocessor multiprogramming systems.
Key words and phrases: electronic computers, computer history, time-sharing, vacuum tube computers, transistorized computers, super computers, magnetic drum computers, university computer projects
CR categories: 1.2, 1.3
A complete history of electronic comput-
FLASH: Foresighted latency-aware scheduling heuristic for processors with customized datapaths
- In CGO
, 2004
Cited by 4 (2 self)
Application-specific instruction set processors (ASIPs) have the potential to meet the challenging cost, performance, and power goals of future embedded processors by customizing the hardware to suit an application. A central problem is creating compilers that are capable of dealing with the heterogeneous and non-uniform hardware created by the customization process. The processor datapath provides an effective area to customize, but specialized datapaths often have non-uniform connectivity between the function units, making the effective latency of a function unit dependent on the consuming operation. Traditional instruction schedulers break down in this environment due to their locally greedy nature of binding the best choice for a single operation even though that choice may be poor due to a lack of communication paths. To effectively schedule with non-uniform connectivity, we propose a foresighted latency-aware scheduling heuristic (FLASH) that performs lookahead across future scheduling steps to estimate the effects of a potential binding. FLASH combines a set of lookahead heuristics to achieve effective foresight with low compile-time overhead.
Early Reply: A Basis for Pipebranching Parallelism with Sequential Reasoning
, 2002
Cited by 1 (1 self)
Pipelining is a hardware technique for realizing instruction-level parallelism during program execution. To minimize pipeline stalls due to data hazards, forwarding is used to make operands available to future instructions as soon as they are produced by earlier instructions. Thus, forwarding partitions the pipeline into two phases: the material computation required to produce the results, and the residual computation required to complete the instruction. We extend this distinction to software components by introducing Early-Reply as a basis for component-level parallelism. In general, a component method completes its material computation as soon as its postcondition is satisfied. At this point, the output parameters can be forwarded to the client caller using Early-Reply without relinquishing control to complete the method. This enables the client to proceed in parallel with the residual computation of the component method. Since residual computations cannot violate their postconditions, Early-Reply components enable layered applications to be designed and reasoned about as sequential programs, despite their potential for exploiting physical concurrency at run-time. When Early-Reply components are composed hierarchically, higher-level calls can cascade to form a superlinear pipeline of method invocations on lower-level subcomponents. This potential for pipebranching parallelism can realize order-of-magnitude improvements in method response time among concurrently executing component operations.
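The split between material and residual computation described in this abstract can be sketched with ordinary threads. This is only an illustration of the idea, not the authors' mechanism: the class EarlyReplyComponent, its increment method, the history list, and the use of a one-slot queue as the reply channel are all hypothetical choices made for this example.

```python
import queue
import threading


class EarlyReplyComponent:
    """Hypothetical component whose method replies before it finishes."""

    def __init__(self):
        self._lock = threading.Lock()  # serializes method bodies on this component
        self.history = []              # state updated by the residual computation

    def increment(self, value):
        """Start the method and return (reply_channel, worker_thread)."""
        reply = queue.Queue(maxsize=1)

        def body():
            with self._lock:
                result = value + 1            # material computation
                reply.put(result)             # early reply: output is forwarded now
                self.history.append(result)   # residual computation continues

        worker = threading.Thread(target=body)
        worker.start()
        return reply, worker


def demo():
    component = EarlyReplyComponent()
    reply, worker = component.increment(41)
    answer = reply.get()   # the client resumes as soon as the reply arrives
    worker.join()          # wait for the residual work before inspecting state
    return answer, component.history
```

The lock is held until the residual work ends, so a later call on the same component cannot observe a half-finished method. That is one simple way to keep the sequential view of the component that the abstract emphasizes.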

