Results 1 - 10 of 84
A Delay Model and Speculative Architecture for Pipelined Routers
In International Symposium on High-Performance Computer Architecture, 2001
Cited by 94 (19 self)
This paper introduces a router delay model that accurately models key aspects of modern routers. The model accounts for the pipelined nature of contemporary routers, the specific flow control method employed, the delay of the flow-control credit path, and the sharing of crossbar ports across virtual channels. Motivated by this model, we introduce a microarchitecture for a speculative virtual-channel router that significantly reduces its router latency to that of a wormhole router. Simulations using our pipelined model give results that differ considerably from the commonly assumed 'unit-latency' model, which is unreasonably optimistic. Using realistic pipeline models, we compare wormhole [6] and virtual-channel flow control [4]. Our results show that a speculative virtual-channel router has the same per-hop router latency as a wormhole router, while improving throughput by up to 40%.
GasP: A Minimal FIFO Control
2001
Cited by 60 (1 self)
The GasP family of asynchronous circuits provides controls for simple pipelines, for branching and joining pipelines, for round-robin scatter and gather, for data-dependent scatter and gather, and for join on demand through arbitration. The family is designed so that each stage operates at the speed of a three-inverter ring oscillator. Test chips in 0.35 micron technology exhibit throughput in excess of 1.5 giga data items per second.
An Interconnect-Centric Design Flow for Nanometer Technologies
Proceedings of the IEEE, 1999
Cited by 58 (23 self)
As IC devices are scaled into nanometer dimensions and operate at gigahertz frequencies, interconnect design and optimization have become critical in determining system performance and reliability.
eCACTI: An enhanced power estimation model for on-chip caches
In Technical Report TR-04-28, CECS, UCI, 2004
Cited by 35 (3 self)
There is a growing need for accurate power models at the higher levels of the design hierarchy. CACTI is a micro-architecture level tool widely used (i) to estimate power dissipation in caches and (ii) to determine the cache configuration that best meets the desired optimization criterion. However, we observed several limitations in CACTI that lead to inaccuracies in cache power estimates, especially as we move to deep sub-micron (DSM) technologies: a) lack of models to account for leakage power, b) use of constant gate widths for most devices irrespective of their capacitive load, and c) lack of models to account for power dissipation in sub-blocks that are outside the time-critical path. As a result, the cache configuration determined by CACTI may not be optimal. In this paper, we describe eCACTI (enhanced CACTI), a tool that addresses these limitations in CACTI, thereby improving the accuracy of its power estimates. We validated eCACTI power estimates against SPICE-based simulations on industrial designs. Furthermore, we show that for DSM technologies, CACTI does not generate a power-optimal cache configuration, which highlights the need for the enhancements we developed in eCACTI. Finally, we demonstrate the use of eCACTI to study the effects of (i) technology on cache leakage and total cache power, (ii) dual-Vth optimization on sub-block and ...
A Delay Model for Router Micro-architectures
IEEE Micro, 2000
Cited by 31 (3 self)
Current router models [2, 3, 5, 6] assume that clock cycle time depends solely on router latency. However, in practice, routers are heavily pipelined, making cycle time largely independent of router latency. In this paper, we describe a router delay model that accurately accounts for pipelining based on technology-independent delay estimates derived through detailed gate-level analysis. Simulations of realistic router pipelines show significant performance differences compared with the commonly assumed unit-latency model. Using realistic pipeline models, we compared wormhole and virtual-channel flow control. Our results show that virtual channels incur a modest additional cycle of per-hop router latency, which is more than offset by the 25-40% throughput improvement over a wormhole router.
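As a rough illustration of why per-hop pipeline depth matters, the sketch below computes zero-load packet latency under a unit-latency model and under pipelined routers. The formula (hops times per-hop router cycles, plus serialization cycles), the function name `zero_load_latency`, and every numeric value (hop count, packet length in flits, pipeline depths of 3 and 4 cycles) are illustrative assumptions, not figures taken from the paper.

```python
def zero_load_latency(hops: int, router_cycles: int, packet_flits: int) -> int:
    """Zero-load latency in cycles: per-hop router delay plus serialization.

    Assumes one flit crosses a channel per cycle and ignores wire delay
    and contention (both are simplifying assumptions of this sketch).
    """
    if hops < 0 or router_cycles < 1 or packet_flits < 1:
        raise ValueError("invalid parameters")
    return hops * router_cycles + packet_flits


if __name__ == "__main__":
    hops, flits = 8, 5  # hypothetical path length and packet size
    models = [("unit-latency", 1), ("wormhole", 3), ("virtual-channel", 4)]
    for name, depth in models:
        print(f"{name:16s} {zero_load_latency(hops, depth, flits)} cycles")
```

Under these assumed numbers the unit-latency model reports far fewer cycles than either pipelined model, while the extra virtual-channel stage adds exactly one cycle per hop.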
Near-optimal worst-case throughput routing for two-dimensional mesh networks
In International Symposium on Computer Architecture, 2005
Cited by 25 (0 self)
Minimizing latency and maximizing throughput are important goals in the design of routing algorithms for interconnection networks. Ideally, we would like a routing algorithm to (a) route packets using the minimal number of hops to reduce latency and preserve communication locality, (b) deliver good worst-case and average-case throughput and (c) enable low-complexity (and hence, low-latency) router implementation. In this paper, we focus on routing algorithms for an important class of interconnection networks: two-dimensional (2D) mesh networks. Existing routing algorithms for mesh networks fail to satisfy one or more of the design goals mentioned above. Variously, the routing algorithms suffer from poor worst-case throughput (ROMM [13], DOR [23]), poor latency due to increased packet hops (VALIANT [31]) or increased latency due to hardware complexity (minimal-adaptive [7, 30]). The major contribution of this paper is the design of an oblivious routing algorithm, O1TURN, with provable near-optimal worst-case throughput, good average-case throughput, low design complexity and minimal number of network hops for 2D-mesh networks, thus satisfying all the stated design goals. O1TURN offers optimal worst-case throughput when the network radix (k in a k x k network) is even. When the network radix is odd, O1TURN is within a 1/k^2 factor of optimal worst-case throughput. O1TURN achieves superior or comparable average-case throughput with global traffic as well as local traffic. For example, O1TURN achieves 18.8%, 0.7% and 13.6% higher average-case throughput than DOR, ROMM and VALIANT routing, respectively, when averaged over one million random traffic patterns on an 8x8 network. Finally, we demonstrate that O1TURN is well suited for a partitioned router implementation that is of similar delay complexity as a simple dimension-ordered router. Our implementation incurs a marginal increase in switch arbitration delay that is completely hidden in pipelined routers as it is not on the clock-critical path.
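The abstract does not spell out how O1TURN selects routes. The following is a minimal sketch, assuming each packet chooses uniformly at random between the XY and YX dimension-ordered paths, so that every route is minimal and makes at most one turn. The function names, the `(x, y)` node encoding and the injectable random generator are hypothetical choices made for this illustration.

```python
import random
from typing import List, Optional, Tuple

Node = Tuple[int, int]  # (x, y) coordinates in a 2D mesh


def dimension_order_path(src: Node, dst: Node, x_first: bool) -> List[Node]:
    """Minimal path that finishes one dimension before starting the other."""
    x, y = src
    path = [src]
    for dim in (("x", "y") if x_first else ("y", "x")):
        if dim == "x":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path


def one_turn_path(src: Node, dst: Node,
                  rng: Optional[random.Random] = None) -> List[Node]:
    """Pick the XY or the YX path with equal probability (assumed policy)."""
    rng = rng if rng is not None else random.Random()
    return dimension_order_path(src, dst, x_first=rng.random() < 0.5)


if __name__ == "__main__":
    print(one_turn_path((0, 0), (2, 1)))
```

Because the choice does not depend on network state, the sketch is oblivious in the sense used by the abstract, and the path length always equals the Manhattan distance between source and destination.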
Methods for True Energy-Performance Optimization
2004
Cited by 24 (9 self)
This paper presents methods for efficient energy-performance optimization at the circuit and micro-architectural levels. The optimal balance between energy and performance is achieved when the sensitivity of energy to a change in performance is equal for all the design variables. The sensitivity-based optimizations minimize energy subject to a delay constraint. Energy savings of about 65% can be achieved without delay penalty with equalization of sensitivities to sizing, supply, and threshold voltage in a 64-bit adder, compared to the reference design sized for minimum delay. Circuit optimization is effective only in the region of about 30% around the reference delay; outside of this region the optimization becomes too costly either in terms of energy or delay. Using optimal energy-delay tradeoffs from the circuit level and introducing more degrees of freedom, the optimization is hierarchically extended to higher abstraction layers. We focus on the micro-architectural optimization and demonstrate that the scope of energy-efficient optimization can be extended by the choice of circuit topology or the level of parallelism. In a 64-bit ALU example, parallelism of five provides a three-fold performance increase, while requiring the same energy as the reference design. Parallel or time-multiplexed solutions significantly affect the area of their respective designs, so the overall design cost is minimized when optimal energy-area tradeoff is achieved.
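The balance condition described in the abstract can be written compactly as a constrained minimization. The symbols below ($E$ for energy, $D$ for delay, $x_i$ for the tuning variables such as sizing, supply and threshold voltage, $D_0$ for the delay target and $\lambda$ for the Lagrange multiplier) are notation assumed here for illustration and are not taken from the paper.

```latex
\[
  \min_{x_1,\dots,x_n} E(x_1,\dots,x_n)
  \quad \text{subject to} \quad
  D(x_1,\dots,x_n) \le D_0
\]
% With the delay constraint active, stationarity of the Lagrangian
% E + \lambda (D - D_0) gives the same energy-to-delay sensitivity
% for every tuning variable:
\[
  \frac{\partial E / \partial x_i}{\partial D / \partial x_i} = -\lambda
  \quad \text{for all } i = 1,\dots,n,
  \qquad \lambda \ge 0 .
\]
```

If one variable had a larger sensitivity magnitude than another, trading a little delay between them would lower energy at the same total delay, so the design could not yet be optimal.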
Gate Sizing Using Incremental Parameterized Statistical Timing Analysis
In ICCAD, 2005
Cited by 24 (1 self)
As technology scales into the sub-90nm domain, manufacturing variations become an increasingly significant portion of circuit delay. As a result, delays must be modeled as statistical distributions during both analysis and optimization. This paper uses incremental, parametric statistical static timing analysis (SSTA) to perform gate sizing with a required yield target. Both correlated and uncorrelated process parameters are considered by using a first-order linear delay model with fitted process sensitivities. The fitted sensitivities are verified to be accurate with circuit simulations. Statistical information in the form of criticality probabilities is used to actively guide the optimization process, which reduces run-time and improves area and performance. The gate sizing results show a significant improvement in worst slack at 99.86% yield over deterministic optimization.
Digital Circuit Optimization via Geometric Programming
Operations Research, 2005
Cited by 19 (6 self)
doi: 10.1287/opre.1050.0254
This paper concerns a method for digital circuit optimization based on formulating the problem as a geometric program (GP) or generalized geometric program (GGP), which can be transformed to a convex optimization problem and then very efficiently solved. We start with a basic gate scaling problem, with delay modeled as a simple resistor-capacitor (RC) time constant, and then add various layers of complexity and modeling accuracy, such as accounting for differing signal fall and rise times, and the effects of signal transition times. We then consider more complex formulations such as robust design over corners, multimode design, statistical design, and problems in which threshold and power supply voltage are also variables to be chosen. Finally, we look at the detailed design of gates and interconnect wires, again using a formulation that is compatible with GP or GGP.
Methods for true power minimization
In ICCAD, 2002
Cited by 18 (6 self)
This paper presents methods for efficient power minimization at circuit and micro-architectural levels. The potential energy savings are strongly related to the energy profile of a circuit. These savings are obtained by using gate sizing, supply voltage, and threshold voltage optimization to minimize energy consumption subject to a delay constraint. The true power minimization is achieved when the energy reduction potentials of all tuning variables are balanced. We derive the sensitivity of energy to delay for each of the tuning variables, connecting its energy saving potential to the physical properties of the circuit. This helps to develop understanding of optimization performance and identify the most efficient techniques for energy reduction. The optimizations are applied to some examples that span typical circuit topologies including inverter chains, SRAM decoders, and adders. At a delay 20% larger than the minimum, energy savings of 40% to 70% are possible, indicating that achieving peak performance is expensive in terms of energy. Energy savings of about 50% can be achieved without delay penalty with the balancing of sizes, supplies, and thresholds.

