Results 1 - 10 of 18
Complexity-Effective Superscalar Processors
- In Proceedings of the 24th Annual International Symposium on Computer Architecture
, 1997
Cited by 385 (5 self)
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
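The queue-based issue scheme described in this abstract can be illustrated with a toy model. The sketch below is not the paper's simulator: the function names (`steer`, `issue_from_heads`, `run`), the `(dest, srcs)` instruction encoding, the single-cycle latency, and the assumption that registers are already renamed (each destination written once) are all hypothetical choices made for illustration. An instruction is appended to a queue whose tail produces one of its sources, otherwise to an empty queue, and only queue heads are considered for issue.

```python
from collections import deque


def steer(instr, fifos):
    """Pick a FIFO for instr = (dest, srcs); return None if dispatch must stall."""
    _dest, srcs = instr
    # Prefer a queue whose tail instruction produces one of our sources,
    # so a chain of dependent instructions stays in one queue.
    for i, q in enumerate(fifos):
        if q and q[-1][0] in srcs:
            return i
    # Otherwise start a new chain in an empty queue.
    for i, q in enumerate(fifos):
        if not q:
            return i
    return None


def issue_from_heads(fifos, ready):
    """Issue every queue head whose sources are all ready (heads only)."""
    issued = []
    for q in fifos:
        if q and all(s in ready for s in q[0][1]):
            issued.append(q.popleft())
    return issued


def run(program, num_fifos, initial_ready):
    """Return the destinations issued in each cycle (assumed 1-cycle latency)."""
    fifos = [deque() for _ in range(num_fifos)]
    ready = set(initial_ready)
    pending = deque(program)
    schedule = []
    while pending or any(fifos):
        # Dispatch in program order until an instruction cannot be steered.
        while pending:
            target = steer(pending[0], fifos)
            if target is None:
                break
            fifos[target].append(pending.popleft())
        issued = issue_from_heads(fifos, ready)
        if not issued:
            raise RuntimeError("no head can issue: a source is never produced")
        # Results become visible to consumers in the next cycle.
        ready.update(dest for dest, _ in issued)
        schedule.append([dest for dest, _ in issued])
    return schedule


program = [
    ("r1", ["r0"]),        # a
    ("r2", ["r1"]),        # b depends on a
    ("r3", ["r0"]),        # c is independent of a and b
    ("r4", ["r3", "r2"]),  # d depends on b and c
]
print(run(program, num_fifos=2, initial_ready={"r0"}))
# → [['r1', 'r3'], ['r2'], ['r4']]
```

In this example the chain a, b, d shares one queue while c gets its own, so only two heads ever need to be checked per cycle.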
Quantifying the Complexity of Superscalar Processors
, 1996
Cited by 72 (0 self)
The delay of pipeline structures in superscalar processors is studied to determine its potential for limiting clock cycle times in future designs. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm.
Interconnect opportunities for gigascale integration
- IEEE Micro
Cited by 19 (2 self)
Published by the IEEE Computer Society. In the early 1970s, the intrinsic switching delay of a MOSFET exceeded the time-of-flight dominated latency of a benchmark 1-mm aluminum interconnect by more than a factor of 100. In their classic 1974 work, “Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions,” Dennard et al. pointed out that for constant electric-field scaling, the latency of a MOSFET scales as 1/S, where S is the device scaling factor [1]. In contrast, the latency of an RC-dominated local interconnect scales as 1 (unity) for S > 1. This observation was a harbinger of the future adverse impact of interconnects on the
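The two scaling claims in this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration under textbook assumptions that are not stated in the abstract: a local wire's length, width, thickness, and dielectric spacing all shrink by 1/S, resistance follows R = ρL/(WH), and capacitance is treated as a simple parallel plate. The function name `scaled_delays` and its parameters are hypothetical.

```python
def scaled_delays(S, gate_delay=1.0, wire_rc=1.0):
    """Return (gate delay, local-wire RC delay) after scaling by factor S.

    Assumes ideal constant-field scaling: every linear dimension of the
    device and of the local wire shrinks by 1/S.
    """
    # Gate delay shrinks as 1/S under constant-field scaling.
    new_gate_delay = gate_delay / S
    # Wire resistance R = rho * L / (W * H): (1/S) / (1/S * 1/S) = S.
    r_factor = S
    # Parallel-plate capacitance C ~ eps * L * W / t: (1/S * 1/S) / (1/S) = 1/S.
    c_factor = 1.0 / S
    # The product RC is therefore unchanged (scales as unity).
    new_wire_rc = wire_rc * r_factor * c_factor
    return new_gate_delay, new_wire_rc


print(scaled_delays(2.0))
# → (0.5, 1.0)
```

With S = 2 the gate gets twice as fast while the local wire delay stays put, which is the widening gap the abstract refers to.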
Limits of Scaling MOSFETs
, 1995
Cited by 17 (2 self)
In this paper the fundamental electrical limits of MOSFETs are discussed and modeled to predict the scaling limits of digital bulk CMOS circuits. Limits discussed include subthreshold leakage, short channel effects (SCE), gate induced drain leakage (GIDL), gate tunneling current, time dependent dielectric breakdown (TDDB), and hot carrier effects (HCE). This paper predicts the scaling of bulk CMOS MOSFETs for high performance microprocessors to reach its limits at drawn lengths of approximately 0.08 µm. Trends in scaling interconnects are also discussed. The device limits presented are used to project the characteristics of future processor technologies and to find scaling factors for the SPICE level 3 model parameters. A SPICE device model which can be scaled to reflect a range of MOSFET technologies from drawn lengths of 0.5 µm to 0.1 µm is presented along with a scalable wire model. Key Words and Phrases: MOSFET, device scaling, interconnect scaling, subthreshold leakage, short channel...
Test challenges for deep sub-micron technologies
- in Proc. 37th Design Automation Conf
, 2000
Cited by 9 (2 self)
The use of deep submicron process technologies presents several new challenges in the area of manufacturing test. While a significant body of work has been devoted to identifying and investigating design challenges in nanometer technologies, the impact on test strategies and methodologies is still not well understood. This paper highlights the challenges to current test methodologies arising from technology-driven trends, and presents an overview of emerging techniques that address deep submicron test challenges.
A high-performance, hierarchical decoupled architecture
- In: Proceedings of the Memory Access Decoupling for SuperScalar and Multiple Issue Architecture Workshop
, 1997
Cited by 5 (1 self)
This paper presents a novel, high-performance decoupled architecture called the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The HiDISC provides high performance for loop-based scientific programs by exploiting instruction-level parallelism and improving memory system performance by providing decoupled prefetching. In this paper, we present the HiDISC architecture, a sample program to show how the architecture works, and simulation results for nine scientific benchmarks and one symbolic benchmark. The performance advantage of the HiDISC architecture increases as the miss penalty gets larger relative to processor cycles, making it an attractive architecture as the difference between processor speed and DRAM speed grows exponentially.
A Comparison of Analog and Digital Circuit Implementations of Low Power Matched Filters for Use in Portable Wireless Communication Terminals
- IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing
, 1997
Cited by 2 (0 self)
The types of circuits in which analog design techniques are employed typically differ from those in which digital design methods are used, with analog circuits being commonly applied to high speed, low precision functional blocks such as mixers and RF modulators, while digital circuits are chosen for high precision, high complexity blocks that operate at frequencies well below the fT of the transistors from which the circuits are comprised. Yet there still exist applications for which the superior circuit implementation---analog or digital---is unclear. The recent birth of commercial interest in spread-spectrum communications provides the motivation for investigating one such application, that of the parallel programmable matched filter. In this paper, analog and digital circuit realizations of a parallel programmable matched filter are examined. Through wide variations of the design space parameters, the general trend that is observed is that short, fast circuits tend to favor an anal...
Improving the Performance of Loop-Based Programs Using a Prefetch Processor
Cited by 1 (1 self)
We present an architecture called the CAPP (Computing And Prefetching Processor). The CAPP provides high performance for loop-based scientific and signal processing programs by improving memory system performance by providing a decoupled prefetch processor. The prefetch processor improves performance by relieving the main processor of prefetching instruction overhead and allowing the prefetch distance to vary adaptively at run-time. In this paper, we present the CAPP architecture, a sample program to show how the architecture works, and simulation results for five Livermore Loops, discrete convolution, and one other benchmark. The simulation results show a speedup of up to two to three for CAPP compared to a uniprocessor with prefetching. The performance advantage of the CAPP architecture increases as the miss penalty gets larger relative to processor cycles, making it an attractive architecture as the difference between processor speed and DRAM speed continues to grow exponentially.
Synthesized compact models of substrate noise coupling analysis and synthesis in mixed-signal ICs,” in DATE 2004, February 2004, pp. 836-841. A. List of test structures used to test the suitability of GMD. Table A-1. Dimensions of the test structures us
- Contact 2 (µm x µm): 1: 10 x 10, 10 x 10; 2: 20 x 2, 3 x 24; 3: 10 x 10, 2 x 2; 4: 4 x 7, 6 x 8; 5: 20 x 2, 5 x 5; 6: 4 x 16, 3 x 3; 7: 20 x 2, 18 x 3; 8: 3 x 25, 4 x 20
The BubbleWrap Many-Core: Popping Cores for Sequential Acceleration
Cited by 1 (0 self)
Many-core scaling now faces a power wall. The gap between the number of cores that fit on a die and the number that can operate simultaneously under the power budget is rapidly increasing with technology scaling. In future designs, many of the cores may have to be dormant at any given time to meet the power budget. To push back the many-core power wall, this paper proposes Dynamic Voltage Scaling for Aging Management (DVSAM), a new scheme for managing processor aging to attain higher performance or lower power consumption. In addition, this paper introduces the BubbleWrap many-core, a novel architecture that makes extensive use of DVSAM. BubbleWrap identifies the most power-efficient set of cores in a variation-affected chip (the largest set that can be simultaneously powered on) and designates them as Throughput cores dedicated to parallel-section execution. The rest of the cores are designated as Expendable and are dedicated to accelerating sequential sections. BubbleWrap attains maximum sequential acceleration by sacrificing Expendable cores one at a time, running each at elevated supply voltage for a significantly shorter service life, until they completely wear out and are discarded, figuratively as if popping bubbles in the bubble wrap that protects the Throughput cores. In simulated 32-core chips, BubbleWrap provides substantial improvements over a plain chip. For example, on average, one design runs fully sequential applications at a 16% higher frequency, and fully parallel ones with a 30% higher throughput.

