Results 1 - 10 of 35
Cyclone: A broadcast-free dynamic instruction scheduler with selective replay
In Proceedings of the 30th annual International Symposium on Computer Architecture
Cited by 38 (0 self)
To achieve high instruction throughput, instruction schedulers must be capable of producing high-quality schedules that maximize functional unit utilization while at the same time enabling fast instruction issue logic. Many solutions exist to the scheduling problem, ranging from compile-time to run-time approaches. Compile-time solutions feature fast and simple hardware, but at the expense of conservative schedules. Dynamic schedulers produce high-quality schedules that incorporate run-time information and dependence speculation, but implementing these schedulers requires complex circuits that can slow processor clock speeds. In this paper, we present the Cyclone scheduler, a novel design that captures the benefits of both compile- and run-time scheduling. Our approach utilizes a list-based single-pass instruction scheduling algorithm, implemented by hardware at run-time in the front end of the processor pipeline. Once scheduled, instructions are injected into a timed queue that orchestrates their entry into execution. To accommodate branch and load/store dependence speculation, the Cyclone scheduler supports a simple selective replay mechanism. We implement this technique by overloading instruction register forwarding to also detect instructions dependent on incorrectly scheduled operations. Detailed simulation analyses suggest that with sufficient queue width, the Cyclone scheduler can rival the instruction throughput of similarly wide monolithic dynamic schedulers. Furthermore, the circuit complexity of the Cyclone scheduler is much more favorable than a broadcast-based scheduler, as our approach requires no global control signals.
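The single-pass idea in this abstract can be illustrated with a minimal sketch. It is not the paper's hardware design: the `Instr` record, the `schedule` function, and the fixed latencies are hypothetical, and the sketch ignores queue width, structural hazards, and replay. It only shows how one pass in program order can predict when each instruction's operands will be ready, which is the delay a timed queue would then count down.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str        # destination register (hypothetical naming)
    srcs: tuple      # source registers
    latency: int     # assumed fixed execution latency in cycles


def schedule(instrs):
    """One pass in program order: predict each instruction's issue cycle."""
    ready_at = {}        # register -> cycle its value is predicted available
    issue_cycles = []
    for ins in instrs:
        # An instruction can issue once its latest source operand is ready.
        issue = max((ready_at.get(s, 0) for s in ins.srcs), default=0)
        issue_cycles.append(issue)
        ready_at[ins.dest] = issue + ins.latency
    return issue_cycles


program = [
    Instr("r1", (), 3),            # e.g. a load with an assumed 3-cycle latency
    Instr("r2", ("r1",), 1),
    Instr("r3", (), 1),
    Instr("r4", ("r2", "r3"), 1),
]
cycles = schedule(program)
print(cycles)
```

If a predicted latency turns out to be wrong (for example, a load misses in the cache), the dependent instructions were scheduled too early; that is the case the abstract's selective replay mechanism addresses, and it is outside this sketch.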
Hierarchical scheduling windows
In Proceedings of the 35th International Symposium on Microarchitecture, 2002
Cited by 38 (1 self)
Large scheduling windows are an effective mechanism for increasing microprocessor performance through the extraction of instruction level parallelism. Current techniques do not scale effectively for very large windows, leading to slow wakeup and select logic as well as large complicated bypass networks. This paper introduces a new instruction scheduler implementation, referred to as Hierarchical Scheduling Windows or HSW, which exploits latency tolerant instructions in order to reduce implementation complexity. HSW yields a very large instruction window that tolerates wakeup, select, and bypass latency, while extracting significant far-flung ILP. Results: It is shown that HSW loses <0.5% performance per additional cycle of bypass/select/wakeup latency as compared to a monolithic window that loses ~5% per additional cycle. Also, HSW achieves the performance of traditional implementations with only 1/3 to 1/2 the number of entries in the critical timing path.
Macro-op scheduling: Relaxing scheduling loop constraints
In Proceedings of the International Symposium on Microarchitecture, 2003
Cited by 24 (2 self)
Ensuring back-to-back execution of dependent instructions in a conventional out-of-order processor requires scheduling logic that wakes up and selects instructions at the same rate as they are executed. To sustain high performance, integer ALU instructions typically have single-cycle latency, consequently requiring scheduling logic with the same single-cycle latency. Prior proposals have advocated the use of speculation in either the wakeup or select phases to enable pipelining of scheduling logic to achieve higher clock frequency. In contrast, this paper proposes macro-op scheduling, which systematically removes instructions with single-cycle latency from the machine by combining them into macro-ops, and performs nonspeculative pipelined scheduling of multi-cycle operations. Macro-op
Exploring Wakeup-Free Instruction Scheduling
Cited by 13 (0 self)
Design of wakeup-free issue queues is becoming desirable due to the increasing complexity associated with broadcast-based instruction wakeup. The effectiveness of most wakeup-free issue queue designs is critically based on their success in predicting the issue latency of an instruction accurately. Consequently, the goal of this paper is to explore the predictability of instruction issue latency under different design constraints and to identify the impediments to performance in such wakeup-free architectures. Our results indicate that structural problems in promoting instructions to the head of the instruction queue from where they are issued in wakeup-free architectures, the limited number of candidate instructions that can be considered for instruction issue, and the resource conflicts due to non-availability of issue ports all have a significant impact in degrading the performance of broadcast-free architectures. Based on these observations, we explore an architecture that attempts to overcome the structural limitations by employing traditional selection logic and by using pre-check logic to reduce the impact of resource conflicts while still employing a wakeup-free strategy based on predicted instruction issue latencies. Finally, we improve this technique by limiting the selection logic to a small segment of the issue queue.
Wire Delay is not a Problem for SMT (in the near future)
Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004
Cited by 9 (1 self)
Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper we show that the optimal pipeline for superscalar becomes shallower with technology, when wire delays are considered, tightening previous results that deeper pipelines perform only as well as shallower pipelines. The key reason for the lack of performance scaling is that superscalar does not have sufficient parallelism to hide the relatively-increased wire delays. However, Simultaneous Multithreading (SMT) provides the much-needed parallelism. We show that an SMT running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth over technology generations, enabling at least 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled modestly to maintain the scaling trends, at least till the near-future 50nm technology. While this result ignores bandwidth constraints, using SMT to tolerate latency due to wire delays is not that simple because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.
Defining Wakeup Width for Efficient Dynamic Scheduling
In ICCD 2004
Cited by 7 (3 self)
A larger Dynamic Scheduler (DS) exposes more Instruction Level Parallelism (ILP), giving better performance. However, a larger DS also results in a longer scheduler latency and a slower clock speed. In this paper, we propose a new DS design that reduces the scheduler critical path latency by reducing the wakeup width (defined as the effective number of results used for instruction wakeup). The design is based on the realization that the average number of results per cycle that are immediately required to wake up the dependent instructions is considerably less than the processor issue width. Our designs are evaluated using the simulation of the SPEC 2000 benchmarks and SPICE simulations of the actual issue queue layouts in a 0.18 micron process. We found that a significant reduction in scheduler latency, power consumption and area is achieved with less than 2% reduction in the Instructions per Cycle (IPC) count for the SPEC2K benchmarks.
Complexity-Effective Issue Queue Design Under Load-Hit Speculation
In Proceedings of the Workshop Complexity-Effective Design, 2002
Cited by 7 (0 self)
Current trends in microprocessor designs indicate increasing pipeline depth in order to keep up with higher clock frequencies and increased architectural complexity. Speculatively issued instructions may be particularly sensitive to increases in pipeline depth, assuming that issued instructions are kept in the issue queue. In this paper, we evaluate the effectiveness of load hit speculation as pipeline depth increases. Effectiveness is measured in terms of performance improvement, issue queue size requirements and re-issue policy. Our results indicate that load hit speculation increases the percentage of issue queue instructions that are waiting to be re-issued, or replayed. This trend increases even more as pipelines become deeper. We propose an alternative, complexity-effective design for the issue queue that takes into consideration the different utilization that load hit speculation demands from the issue queue.
Direct Instruction Wakeup for Out-Of-Order Processors
Cited by 6 (0 self)
Instruction queues consume a significant amount of power in high-performance processors, primarily due to instruction wakeup logic access to the queue structures. The wakeup logic delay is also a critical timing parameter. This paper proposes a new queue organization using a small number of successor pointers plus a small number of dynamically allocated full successor bit vectors for cases with a larger number of successors. The details of the new organization are described and it is shown to achieve the performance of CAM-based or full dependency matrix organizations using just one pointer per instruction plus eight full bit vectors. Only two full bit vectors are needed when two successor pointers are stored per instruction. Finally, a design and pre-layout of all critical structures in 70nm technology was performed for the proposed organization as well as for a CAM-based baseline. The new design is shown to use 1/2 to 1/5th of the baseline instruction queue power, depending on queue size. It is also shown to use significantly less power than the full dependency matrix based design.
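As a rough illustration of pointer-based wakeup, the sketch below has each producer hold an explicit list of its consumers and wake them directly on completion, instead of broadcasting a result tag to every queue entry. The function name `direct_wakeup` and the dictionaries are hypothetical. The sketch also uses unbounded successor lists, whereas the abstract describes a small fixed number of pointers with bit vectors as the overflow case, so it shows the principle only.

```python
def direct_wakeup(successors, pending, completed):
    """Wake the consumers of `completed` by following its successor list.

    successors: producer id -> list of consumer ids (assumed unbounded here)
    pending:    consumer id -> number of source operands still outstanding
    Returns the consumers that became ready as a result.
    """
    newly_ready = []
    for consumer in successors.get(completed, []):
        pending[consumer] -= 1
        if pending[consumer] == 0:
            newly_ready.append(consumer)
    return newly_ready


# Instruction 0 feeds 1 and 2; instruction 3 needs both 1 and 2.
successors = {0: [1, 2], 1: [3], 2: [3]}
pending = {1: 1, 2: 1, 3: 2}

first = direct_wakeup(successors, pending, 0)
second = direct_wakeup(successors, pending, 1)
third = direct_wakeup(successors, pending, 2)
print(first, second, third)
```

Only the entries named by a producer's pointers are touched on each completion, which is the intuition behind the power savings claimed relative to a CAM-based queue.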
A Dependency Chain Clustered Microarchitecture
In International Parallel and Distributed Processing Symposium, 2005
Cited by 5 (0 self)
In this paper we explore a new clustering approach for reducing the complexity of wide issue in-order processors based on EPIC architectures. Complexity effectiveness is achieved by heavily clustering the pipeline from decode to commit stage without the need for any direct bypass between clusters. This is made possible by assuming support for executing compiler-constructed traces. One trace is executed at a time by executing its coarse-grained dependency chains (DCs) in different in-order clusters. Since the DCs of a trace are mutually data independent of each other, they can be executed in different clusters without any direct communication between them. To execute DCs in narrower clusters without compromising ILP, a compiler algorithm that splits large DCs by duplicating instructions is proposed.
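One plausible reading of "mutually data independent" chains is that they are the connected components of a trace's data-dependence graph. The sketch below partitions a trace that way with a small union-find. This is an assumption made for illustration, not the paper's compiler algorithm; the function `dependency_chains` and the edge-list input format are hypothetical, and the DC-splitting step via instruction duplication is not modeled.

```python
def dependency_chains(num_instrs, deps):
    """Group instructions 0..num_instrs-1 into data-independent chains.

    deps is a list of (producer, consumer) pairs. Instructions linked by
    any dependence path end up in the same chain.
    """
    parent = list(range(num_instrs))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for producer, consumer in deps:
        parent[find(producer)] = find(consumer)

    chains = {}
    for i in range(num_instrs):
        chains.setdefault(find(i), []).append(i)
    return sorted(chains.values())


# 0 -> 2 -> 4 form one chain, 1 -> 3 another, and 5 stands alone.
chains = dependency_chains(6, [(0, 2), (2, 4), (1, 3)])
print(chains)
```

Because no dependence edge crosses between the resulting groups, each group could in principle be sent to a separate cluster without inter-cluster bypassing, which matches the property the abstract relies on.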
Static strands: Safely exposing dependence chains for increasing embedded power efficiency
In Proc. 2005 Conference on Languages, Compilers, and Tools for Embedded Systems, 2005
Cited by 4 (0 self)
Modern embedded processors are designed to maximize execution efficiency: the amount of performance achieved per unit of energy dissipated while meeting minimum performance levels. To increase this efficiency, we propose utilizing static strands, dependence chains without fan-out, which are exposed by a compiler pass. These dependent instructions are resequenced to be sequential and annotated to communicate their location to the hardware. Importantly, this modified application is binary compatible and functionally identical to the original, allowing transparent execution on a baseline processor. However, these static strands can be easily collapsed and optimized by simple processor modifications, significantly reducing the workload energy. Results show that over 30% of MediaBench and Spec2000int dynamic instructions can be collapsed, reducing issue logic energy by 20%, bypass energy by 19%, and register file energy by 14%. In addition, by increasing the effective capacity of pipeline resources by almost a third, average IPC can be improved up to 15%. This performance gain can then be traded in for a lower clock frequency to maintain a baseline

