Blueshift: Designing processors for timing speculation from the ground up
In Proceedings of the 15th IEEE International Symposium on High Performance Computer Architecture (HPCA), 2009
Abstract

Cited by 27 (1 self)
Several recent processor designs have proposed to enhance performance by increasing the clock frequency to the point where timing faults occur, and by adding error-correcting support to guarantee correctness. However, such Timing Speculation (TS) proposals are limited in that they assume traditional design methodologies that are suboptimal under TS. In this paper, we present a new approach where the processor itself is designed from the ground up for TS. The idea is to identify and optimize the most frequently exercised critical paths in the design, at the expense of the majority of the static critical paths, which are allowed to suffer timing errors. Our approach and design optimization algorithm are called BlueShift. We also introduce two techniques that, when applied under BlueShift, improve processor performance: On-demand Selective Biasing (OSB) and Path Constraint Tuning (PCT). Our evaluation with modules from the OpenSPARC T1 processor shows that, compared to conventional TS, BlueShift with OSB speeds up applications by an average of 8% while increasing the processor power by an average of 12%. Moreover, compared to a high-performance TS design, BlueShift with PCT speeds up applications by an average of 6% with an average processor power overhead of 23%, providing a way to speed up logic modules that is orthogonal to voltage scaling.
BTI-Aware Design Using Variable Latency Units
Abstract

Cited by 2 (2 self)
Circuit degradation due to bias temperature instability (BTI) can lead to timing failures in digital circuits. We develop variable latency unit (VLU) based BTI-aware designs, with a novel scheme for multi-output hold logic implementation for VLUs. A key observation is the identification and exploitation of specific supersetting patterns in the two-dimensional space of frequency and aging of the circuit. The multi-output hold logic scheme is used in conjunction with an adaptive body bias framework to achieve high performance, allowing the design to be easily incorporated in traditional synthesis flows. Compared with a conventional combinational BTI-resilience scheme, our design achieves an area reduction of 9.2%, with a significant throughput enhancement of 30.0%. Bias Temperature Instability (BTI) [1], in the form of negative BTI (NBTI) in PMOS and positive BTI (PBTI) in NMOS transistors, is
IMPROVING PER-THREAD PERFORMANCE ON CMPS THROUGH TIMING SPECULATION
2009
Abstract
The future of performance scaling lies in massively parallel workloads, but less-parallel applications will remain important. Unfortunately, future process technologies and core microarchitectures no longer promise major per-thread performance improvements, so microarchitects must find new ways to address a growing per-thread performance deficit. Moreover, they must do so without sacrificing parallel throughput. To meet these apparently conflicting demands, this dissertation proposes a Timing Speculation (TS) system for multicores that boosts core clock frequencies past their normal limits when an application demands per-thread performance and operates efficiently at nominal frequency when it demands throughput. This work’s contributions are organized into three interlocking proposals. It begins by introducing Paceline, the first TS microarchitecture designed specifically for multicores. Paceline enables two cores to work together to execute a single thread at high speed under TS, or independently to execute two threads at the rated frequency. In single-thread mode, one core in the pair (the “Leader”) executes at higher-than-normal frequency, while the other (the “Checker”) runs at the rated, safe frequency. The Leader runs the program faster but may experience timing errors. To detect and correct these errors, the Checker periodically compares a hash of its
Variation-Aware Variable Latency Design
 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Abstract
Although typical digital circuits are designed so that the clock period satisfies worst-case path delay constraints, the average input excitation often completes computation in less than a clock cycle. Variable latency units (VLUs) allow for improved throughput by allowing one clock cycle for some computations, and two clock cycles for others, using hold logic to differentiate between the two cases. However, they may experience significant throughput losses due to the effects of process variations. We develop a combined pre-silicon/post-silicon technique for variation-aware VLU design that ensures high throughput across all manufactured chips. We achieve this by identifying path clusters at the pre-silicon stage, such that each element of a path cluster is likely to be similarly critical in a manufactured part. We use sensors to determine which path clusters are critical at the post-silicon stage and then activate the appropriate hold logic. In practice, significant improvements in throughput are achievable with a small number of path clusters. On a set of 32 nm PTM-based ISCAS89 circuits, our scheme offers 15.1% throughput enhancement with only 3.3% area overhead.
The idea can be illustrated through the example of a 6-bit ripple carry adder (RCA) shown in Fig. 1 [5]. With unit gate delays, the conventional single-cycle fixed-latency circuit has a cycle time, Tclk = 13 units, equal to the delay of its longest path, corresponding to a throughput, η1 = 1/13. The VLU implementation of this adder operates at a reduced cycle time, Tclk < 13. For Tclk = 9, assuming that all primary input signals are mutually independent and have signal probabilities of 50%, 18.75% of the input patterns violate Tclk, and the VLU allows these to complete execution in two cycles. Under the 50% assumption above, since each pattern is equiprobable, the average VLU delay is 0.8125 × 9 + 0.1875 × 18 = 10.69 units, and the throughput η2 = 1/10.69 is 21.6% higher than η1.
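The arithmetic in the adder example can be checked with a short Python sketch. It is only an illustration of the averaging step, not code from the paper: the function names are hypothetical, and the 18.75% two-cycle fraction is taken as given from the text rather than derived from the adder's gate-level delays.

```python
def vlu_average_delay(t_clk, p_two_cycle):
    """Expected delay per operation when a fraction p_two_cycle of the
    input patterns needs two cycles and the rest finish in one."""
    return (1.0 - p_two_cycle) * t_clk + p_two_cycle * 2 * t_clk


def throughput_gain(t_fixed, t_avg):
    """Relative throughput improvement of the VLU over the fixed-latency
    design; throughput is the reciprocal of the delay per operation."""
    return t_fixed / t_avg - 1.0


# Values quoted in the text (unit gate delays).
T_FIXED = 13      # cycle time of the conventional single-cycle adder
T_VLU = 9         # reduced cycle time of the VLU
P_SLOW = 0.1875   # fraction of input patterns that violate T_VLU (assumed given)

avg = vlu_average_delay(T_VLU, P_SLOW)
gain = throughput_gain(T_FIXED, avg)

print(f"average VLU delay = {avg:.2f} units")   # 10.69
print(f"throughput gain   = {gain:.1%}")        # 21.6%
```

Under this model the average delay is (1 + p) × Tclk, so the VLU only pays off while (1 + p) × Tclk stays below the fixed-latency cycle time. Any shift that pushes more patterns into the two-cycle case raises p and erodes the gain.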