Results 1 - 10 of 205
A Delay Model and Speculative Architecture for Pipelined Routers
 In International Symposium on High-Performance Computer Architecture
, 2001
Cited by 189 (25 self)
This paper introduces a router delay model that accurately models key aspects of modern routers. The model accounts for the pipelined nature of contemporary routers, the specific flow control method employed, the delay of the flow-control credit path, and the sharing of crossbar ports across virtual channels. Motivated by this model, we introduce a microarchitecture for a speculative virtual-channel router that significantly reduces its router latency to that of a wormhole router. Simulations using our pipelined model give results that differ considerably from the commonly assumed 'unit-latency' model, which is unreasonably optimistic. Using realistic pipeline models, we compare wormhole [6] and virtual-channel flow control [4]. Our results show that a speculative virtual-channel router has the same per-hop router latency as a wormhole router, while improving throughput by up to 40%. 1. Introduction Interconnection networks are used to connect processors to memories in multicomputers ...
GasP: A Minimal FIFO Control
, 2001
Cited by 81 (1 self)
The GasP family of asynchronous circuits provides controls for simple pipelines, for branching and joining pipelines, for round-robin scatter and gather, for data-dependent scatter and gather, and for join on demand through arbitration. The family is designed so that each stage operates at the speed of a three-inverter ring oscillator. Test chips in 0.35 micron technology exhibit throughput in excess of 1.5 giga data items per second.
An Interconnect-Centric Design Flow for Nanometer Technologies
 Proceedings of the IEEE
, 1999
Cited by 80 (26 self)
As IC devices are scaled into nanometer dimensions and operate at gigahertz frequencies, interconnect design and optimization have become critical in determining system performance and reliability.
Near-optimal worst-case throughput routing for two-dimensional mesh networks
 In International Symposium on Computer Architecture
, 2005
Cited by 53 (1 self)
Minimizing latency and maximizing throughput are important goals in the design of routing algorithms for interconnection networks. Ideally, we would like a routing algorithm to (a) route packets using the minimal number of hops to reduce latency and preserve communication locality, (b) deliver good worst-case and average-case throughput, and (c) enable low-complexity (and hence low-latency) router implementation. In this paper, we focus on routing algorithms for an important class of interconnection networks: two-dimensional (2D) mesh networks. Existing routing algorithms for mesh networks fail to satisfy one or more of the design goals mentioned above. Variously, the routing algorithms suffer from poor worst-case throughput (ROMM [13], DOR [23]), poor latency due to increased packet hops (VALIANT [31]), or increased latency due to hardware complexity (minimal-adaptive [7, 30]). The major contribution of this paper is the design of an oblivious routing algorithm, O1TURN, with provable near-optimal worst-case throughput, good average-case throughput, low design complexity, and a minimal number of network hops for 2D mesh networks, thus satisfying all the stated design goals. O1TURN offers optimal worst-case throughput when the network radix (k in a k x k network) is even. When the network radix is odd, O1TURN is within a 1/k^2 factor of optimal worst-case throughput. O1TURN achieves superior or comparable average-case throughput with global traffic as well as local traffic. For example, O1TURN achieves 18.8%, 0.7%, and 13.6% higher average-case throughput than DOR, ROMM, and VALIANT routing, respectively, when averaged over one million random traffic patterns on an 8x8 network. Finally, we demonstrate that O1TURN is well suited for a partitioned router implementation that is of similar delay complexity to a simple dimension-ordered router. Our implementation incurs a marginal increase in switch arbitration delay that is completely hidden in pipelined routers, as it is not on the clock-critical path.
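The abstract above does not spell out how O1TURN chooses routes. A minimal sketch, assuming the commonly described scheme in which each packet picks XY or YX dimension-ordered routing with equal probability, might look like the following. The function names `dor_path` and `o1turn_path` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random


def dor_path(src, dst, order):
    """Return the minimal dimension-ordered path on a 2D mesh.

    src and dst are (x, y) tuples; order is "xy" or "yx" and gives the
    dimension that is fully traversed first.
    """
    x, y = src
    path = [(x, y)]
    for dim in order:
        if dim == "x":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path


def o1turn_path(src, dst, rng=random):
    """Assumed O1TURN-style choice: XY or YX, each with probability 1/2."""
    order = "xy" if rng.random() < 0.5 else "yx"
    return dor_path(src, dst, order)
```

Either choice yields a minimal path with at most one turn, which is consistent with the abstract's claims of minimal hop count and low router complexity; the randomization between the two orders is what spreads load in this sketch.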
A Delay Model for Router Microarchitectures
 IEEE Micro
, 2000
Cited by 48 (3 self)
Current router models [2, 3, 5, 6] assume that clock cycle time depends solely on router latency. However, in practice, routers are heavily pipelined, making cycle time largely independent of router latency. In this paper, we describe a router delay model that accurately accounts for pipelining based on technology-independent delay estimates derived through detailed gate-level analysis. Simulations of realistic router pipelines show significant performance differences compared with the commonly assumed unit-latency model. Using realistic pipeline models, we compared wormhole and virtual-channel flow control. Our results show that virtual channels incur a modest additional cycle of per-hop router latency, which is more than offset by the 25-40% throughput improvement over a wormhole router. 1. Introduction Most current literature in interconnection networks reports comparisons of different flow control and routing techniques without considering implementation complexity and the impac...
eCACTI: An enhanced power estimation model for on-chip caches
 In Technical Report TR0428, CECS, UCI
, 2004
Cited by 45 (4 self)
There is a growing need for accurate power models at the higher levels of the design hierarchy. CACTI is a microarchitecture-level tool widely used (i) to estimate power dissipation in caches and (ii) to determine the cache configuration that best meets the desired optimization criterion. However, we observed several limitations in CACTI that lead to inaccuracies in cache power estimates, especially as we move to deep-submicron (DSM) technologies: (a) lack of models to account for leakage power, (b) use of constant gate widths for most devices irrespective of their capacitive loads, and (c) lack of models to account for power dissipation in subblocks that are outside the time-critical path. As a result, the cache configuration determined by CACTI may not be optimal. In this paper, we describe eCACTI (enhanced CACTI), a tool that addresses these limitations in CACTI, thereby improving the accuracy of its power estimates. We validated eCACTI power estimates against SPICE-based simulations on industrial designs. Furthermore, we show that for DSM technologies, CACTI does not generate the power-optimal cache configuration, which highlights the need for the enhancements we developed in eCACTI. Finally, we demonstrate the use of eCACTI to study the effects of (i) technology on cache leakage and total cache power, (ii) dual-Vth optimization on subblock and
Methods for true energy-performance optimization
 IEEE Journal of SolidState Circuits
, 2004
Digital Circuit Optimization via Geometric Programming
 Operations Research
, 2005
Cited by 42 (7 self)
(doi 10.1287/opre.1050.0254, © 2005 INFORMS) This paper concerns a method for digital circuit optimization based on formulating the problem as a geometric program (GP) or generalized geometric program (GGP), which can be transformed to a convex optimization problem and then solved very efficiently. We start with a basic gate scaling problem, with delay modeled as a simple resistor-capacitor (RC) time constant, and then add various layers of complexity and modeling accuracy, such as accounting for differing signal fall and rise times, and the effects of signal transition times. We then consider more complex formulations such as robust design over corners, multimode design, statistical design, and problems in which threshold and power supply voltage are also variables to be chosen. Finally, we look at the detailed design of gates and interconnect wires, again using a formulation that is compatible with GP or GGP.
Gate Sizing Using Incremental Parameterized Statistical Timing Analysis
 In ICCAD
, 2005
Cited by 36 (2 self)
As technology scales into the sub-90nm domain, manufacturing variations become an increasingly significant portion of circuit delay. As a result, delays must be modeled as statistical distributions during both analysis and optimization. This paper uses incremental, parametric statistical static timing analysis (SSTA) to perform gate sizing with a required yield target. Both correlated and uncorrelated process parameters are considered by using a first-order linear delay model with fitted process sensitivities. The fitted sensitivities are verified to be accurate with circuit simulations. Statistical information in the form of criticality probabilities is used to actively guide the optimization process, which reduces runtime and improves area and performance. The gate sizing results show a significant improvement in worst slack at 99.86% yield over deterministic optimization.
Methods for true power minimization
 In ICCAD
, 2002
Cited by 28 (6 self)
This paper presents methods for efficient power minimization at circuit and microarchitectural levels. The potential energy savings are strongly related to the energy profile of a circuit. These savings are obtained by using gate sizing, supply voltage, and threshold voltage optimization to minimize energy consumption subject to a delay constraint. True power minimization is achieved when the energy reduction potentials of all tuning variables are balanced. We derive the sensitivity of energy to delay for each of the tuning variables, connecting its energy-saving potential to the physical properties of the circuit. This helps to develop an understanding of optimization performance and to identify the most efficient techniques for energy reduction. The optimizations are applied to examples that span typical circuit topologies, including inverter chains, SRAM decoders, and adders. At a delay 20% larger than the minimum, energy savings of 40% to 70% are possible, indicating that achieving peak performance is expensive in terms of energy. Energy savings of about 50% can be achieved without delay penalty by balancing sizes, supplies, and thresholds.
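The balance condition stated in this abstract can be written compactly. The following is a sketch under assumed notation (the symbols $E$, $D$, $W$, $V_{DD}$, $V_{TH}$, $S_x$, and $\lambda$ are introduced here for illustration and are not taken from the abstract): minimizing energy $E$ subject to a delay constraint $D \le D_{\max}$ over gate sizes, supply voltage, and threshold voltage gives, at an interior optimum, the same energy-to-delay sensitivity for every tuning variable.

```latex
% Assumed notation: x ranges over the tuning variables (sizes W, supply V_DD, threshold V_TH).
\[
  S_x \;=\; -\,\frac{\partial E/\partial x}{\partial D/\partial x},
  \qquad x \in \{W,\; V_{DD},\; V_{TH}\}
\]
% Stationarity of the Lagrangian E + \lambda (D - D_max) in each variable x:
\[
  \frac{\partial E}{\partial x} + \lambda\,\frac{\partial D}{\partial x} = 0
  \;\;\Longrightarrow\;\;
  S_W \;=\; S_{V_{DD}} \;=\; S_{V_{TH}} \;=\; \lambda
\]
```

Read this way, "balanced" means that no variable can trade a unit of delay for more energy savings than any other; if one sensitivity were larger, shifting delay budget toward that variable would lower total energy at the same delay.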