Token flow control
Cited by 635 (35 self)
As companies move towards many-core chips, an efficient on-chip communication fabric to connect these cores assumes critical importance. To address limitations to wire delay scalability and increasing bandwidth demands, state-of-the-art on-chip networks use a modular packet-switched design with routers at every hop which allow sharing of network channels over multiple packet flows. This, however, leads to packets going through a complex router pipeline at every hop, resulting in the overall communication energy/delay being dominated by the router overhead, as opposed to just wire energy/delay. In this work, we propose token flow control (TFC), a flow control mechanism in which nodes in the network send out tokens in their local neighborhood to communicate information about their available resources. These tokens are then used in both routing and flow control: to choose less congested paths in the network and to bypass the router pipeline along those paths. These bypass paths are formed dynamically, can be arbitrarily long, and are highly flexible, with the ability to match a packet's exact route. Hence, this allows packets to potentially skip all routers along their path from source to destination, approaching the communication energy-delay-throughput of dedicated wires. Our detailed implementation analysis shows TFC to be highly scalable and realizable at an aggressive target clock cycle delay of 21 FO4 for large networks while requiring low hardware complexity. Evaluations of TFC using both synthetic traffic and traces from the SPLASH-2 benchmark suite show a reduction in packet latency of up to 77.1% with up to 39.6% reduction in average router energy consumption as compared to a state-of-the-art baseline packet-switched design. For the same saturation throughput as the baseline network, TFC is able to reduce the amount of buffering by 65%, leading to a 48.8% reduction in leakage energy and a 55.4% lower total router energy.
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
- IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE
, 2007
Cited by 122 (18 self)
A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor performance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence power/performance/area characteristics of large caches, such as wire models (width/spacing/repeaters), signaling strategy (RC/differential/transmission), router design, etc. Yet, to date, there exists no analytical tool that takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model Non-uniform Cache Access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, but also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and present a case study to showcase how the tool can improve architecture research methodologies.
A router architecture for connection-oriented service guarantees
, 2005
Cited by 99 (5 self)
On-chip networks for future system-on-chip designs need simple, high-performance implementations. In order to promote system-level integrity, guaranteed services (GS) need to be provided. We propose a network-on-chip (NoC) router architecture to support this, and demonstrate with a CMOS standard cell design. Our implementation is based on clockless circuit techniques, and thus inherently supports a modular, GALS-oriented design flow. Our router exploits virtual channels to provide connection-oriented GS, as well as connection-less best-effort (BE) routing. The architecture is highly flexible, in that support for different types of BE routing and GS arbitration can be easily plugged into the router.
Flattened butterfly topology for on-chip networks
- IEEE Comput. Arch. Lett
, 2007
Cited by 90 (7 self)
With the trend towards an increasing number of cores in chip multiprocessors, the on-chip interconnect that connects the cores needs to scale efficiently. In this work, we propose the use of high-radix networks in on-chip interconnection networks and describe how the flattened butterfly topology can be mapped to on-chip networks. By using high-radix routers to reduce the diameter of the network, the flattened butterfly offers lower latency and energy consumption than conventional on-chip topologies. In addition, by exploiting the two-dimensional planar VLSI layout, the on-chip flattened butterfly can exploit the bypass channels such that non-minimal routing can be used with minimal impact on latency and energy consumption. We evaluate the flattened butterfly and compare it to alternate on-chip topologies using synthetic traffic patterns and traces, and show that the flattened butterfly can increase throughput by up to 50% compared to a concentrated mesh and reduce latency by 28% while reducing the power consumption by 38% compared to a mesh network.
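The diameter reduction mentioned in this abstract can be illustrated with a small hop-count comparison. This is a minimal sketch, assuming a k x k grid of routers with no concentration, and assuming the usual description of a 2D flattened butterfly in which every router links directly to all routers in its row and column. The function names are illustrative and do not come from the paper.

```python
from itertools import product

def mesh_hops(src, dst):
    """Minimal hop count on a 2D mesh: Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def fbfly_hops(src, dst):
    """Minimal hop count on a 2D flattened butterfly (assumed model):
    one hop per dimension in which the coordinates differ."""
    return int(src[0] != dst[0]) + int(src[1] != dst[1])

def diameter(hops, k):
    """Largest minimal hop count over all router pairs in a k x k grid."""
    nodes = list(product(range(k), repeat=2))
    return max(hops(a, b) for a in nodes for b in nodes)

if __name__ == "__main__":
    for k in (4, 8):
        print(k, diameter(mesh_hops, k), diameter(fbfly_hops, k))
```

Under these assumptions the mesh diameter grows as 2(k - 1) while the flattened butterfly diameter stays at 2, at the cost of higher router radix.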
Express Virtual Channels: Towards the Ideal Interconnection Fabric
- In Proceedings of ISCA-34
, 2007
Cited by 80 (13 self)
Due to wire delay scalability and bandwidth limitations inherent in shared buses and dedicated links, packet-switched on-chip interconnection networks are fast emerging as the pervasive communication fabric to connect different processing elements in many-core chips. However, current state-of-the-art packet-switched networks rely on complex routers, which increase the communication overhead and energy consumption as compared to the ideal interconnection fabric. In this paper, we try to close the gap between the state-of-the-art packet-switched network and the ideal interconnect by proposing express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks. Our evaluation results using a detailed cycle-accurate simulator on a range of synthetic traffic and SPLASH benchmark traces show up to 84% reduction in packet latency and up to 23% improvement in throughput while reducing the average router energy consumption by up to 38% over an existing state-of-the-art packet-switched design. When compared to the ideal interconnect, EVCs add just two cycles to the no-load latency, and are within 14% of the ideal throughput. Moreover, we show that the proposed design incurs a minimal hardware overhead while exhibiting excellent scalability with increasing network sizes.
Design and management of 3D chip multiprocessors using network-in-memory
- in Proc. Int. Symp. Comput
, 2006
Cited by 72 (4 self)
Long interconnects are becoming an increasingly important problem from both power and performance perspectives. This motivates designers to adopt on-chip network-based communication infrastructures and three-dimensional (3D) designs where multiple device layers are stacked together. Considering the current trends towards increasing use of chip multiprocessing, it is timely to consider 3D chip multiprocessor design and memory networking issues, especially in the context of data management in large L2 caches. The overall goal of this paper is to study the challenges for L2 design and management in 3D chip multiprocessors. Our first contribution is to propose a router architecture and a topology design that makes use of a network architecture embedded into the L2 cache memory. Our second contribution is to demonstrate, through extensive experiments, that a 3D L2 memory architecture generates much better results than the conventional two-dimensional (2D) designs under different numbers of layers and vertical (inter-wafer) connections. In particular, our experiments show that a 3D architecture with no dynamic data migration generates better performance than a 2D architecture that employs data migration. This also helps reduce power consumption in L2 due to a reduced number of data movements.
A Case for Bufferless Routing in On-Chip Networks
- ISCA'09
, 2009
Cited by 67 (16 self)
Buffers in on-chip networks consume significant energy, occupy chip area, and increase design complexity. In this paper, we make a case for a new approach to designing on-chip interconnection networks that eliminates the need for buffers for routing or flow control. We describe new algorithms for routing without using buffers in router input/output ports. We analyze the advantages and disadvantages of bufferless routing and discuss how router latency can be reduced by taking advantage of the fact that input/output buffers do not exist. Our evaluations show that routing without buffers significantly reduces the energy consumption of the on-chip cache/processor-to-cache network, while providing similar performance to that of existing buffered routing algorithms at low network utilization (i.e., on most real applications). We conclude that bufferless routing can be an attractive and energy-efficient design option for on-chip cache/processor-to-cache networks where network utilization is low.
Near-optimal worst-case throughput routing for two-dimensional mesh networks
- In International Symposium on Computer Architecture
, 2005
Cited by 55 (1 self)
Minimizing latency and maximizing throughput are important goals in the design of routing algorithms for interconnection networks. Ideally, we would like a routing algorithm to (a) route packets using the minimal number of hops to reduce latency and preserve communication locality, (b) deliver good worst-case and average-case throughput, and (c) enable low-complexity (and hence, low-latency) router implementation. In this paper, we focus on routing algorithms for an important class of interconnection networks: two-dimensional (2D) mesh networks. Existing routing algorithms for mesh networks fail to satisfy one or more of the design goals mentioned above. Variously, the routing algorithms suffer from poor worst-case throughput (ROMM [13], DOR [23]), poor latency due to increased packet hops (VALIANT [31]), or increased latency due to hardware complexity (minimal-adaptive [7, 30]). The major contribution of this paper is the design of an oblivious routing algorithm, O1TURN, with provable near-optimal worst-case throughput, good average-case throughput, low design complexity, and minimal number of network hops for 2D-mesh networks, thus satisfying all the stated design goals. O1TURN offers optimal worst-case throughput when the network radix (k in a kxk network) is even. When the network radix is odd, O1TURN is within a 1/k^2 factor of optimal worst-case throughput. O1TURN achieves superior or comparable average-case throughput with global traffic as well as local traffic. For example, O1TURN achieves 18.8%, 0.7%, and 13.6% higher average-case throughput than DOR, ROMM, and VALIANT routing, respectively, when averaged over one million random traffic patterns on an 8x8 network. Finally, we demonstrate that O1TURN is well suited for a partitioned router implementation that is of similar delay complexity as a simple dimension-ordered router. Our implementation incurs a marginal increase in switch arbitration delay that is completely hidden in pipelined routers as it is not on the clock-critical path.
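The abstract does not spell out the routing rule itself. As O1TURN is commonly described, each packet picks one of the two dimension orders (X first, then Y, or Y first, then X) with equal probability and then follows ordinary dimension-ordered routing, so every route is minimal and has at most one turn. The sketch below illustrates that description on a 2D mesh. The function names, the coordinate convention, and the `rng` parameter are assumptions made for illustration and are not taken from the paper.

```python
import random

def dor_path(src, dst, order):
    """Dimension-ordered route on a 2D mesh.

    order is "XY" (finish X before Y) or "YX" (finish Y before X).
    Returns the list of (x, y) routers visited, including src and dst.
    """
    x, y = src
    path = [(x, y)]
    for dim in order:
        if dim == "X":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path

def o1turn_path(src, dst, rng=random):
    """Pick XY or YX with equal probability, then route dimension-ordered."""
    order = rng.choice(["XY", "YX"])
    return dor_path(src, dst, order)

if __name__ == "__main__":
    print(o1turn_path((0, 0), (3, 2)))
```

Because the choice of order ignores network state, the routing is oblivious, and the hop count always equals the Manhattan distance between source and destination.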
A Low Latency Router Supporting Adaptivity for On-Chip Interconnects
- ACM Design Automation Conf. (DAC'05)
, 2005
Cited by 53 (5 self)
The increased deployment of System-on-Chip designs has drawn attention to the limitations of on-chip interconnects. As a potential solution to these limitations, Networks-on-Chip (NoC) have been proposed. The NoC routing algorithm significantly influences the performance and energy consumption of the chip. We propose a router architecture which utilizes adaptive routing while maintaining low latency. The two-stage pipelined architecture uses look-ahead routing, speculative allocation, and optimal output path selection concurrently. The routing algorithm benefits from congestion-aware flow control, making better routing decisions. We simulate and evaluate the proposed architecture in terms of network latency and energy consumption. Our results indicate that the architecture is effective in balancing the performance and energy of NoC designs.
Outstanding Research Problems in NoC Design: Circuit-, Microarchitecture-, and System-Level Perspectives
Cited by 52 (1 self)
Networks-on-Chip (NoCs) have been recently proposed to replace global interconnects in order to alleviate complex communication problems. While several research problems concerning NoC design have been already addressed in the literature, many others remain to be solved. In this work, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem formulation, proposed approaches, and open issues are discussed for each problem enumerated in the paper from circuit, micro-architecture, and system-level perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective. Index terms: On-chip communication architecture, networks-on-chip, multiprocessor system-on-chip, CMP.