Results 1-10 of 27

Near-optimal worst-case throughput routing for two-dimensional mesh networks
In International Symposium on Computer Architecture, 2005
Cited by 41 (0 self)
Minimizing latency and maximizing throughput are important goals in the design of routing algorithms for interconnection networks. Ideally, we would like a routing algorithm to (a) route packets using the minimal number of hops to reduce latency and preserve communication locality, (b) deliver good worst-case and average-case throughput, and (c) enable low-complexity (and hence, low-latency) router implementation. In this paper, we focus on routing algorithms for an important class of interconnection networks: two-dimensional (2D) mesh networks. Existing routing algorithms for mesh networks fail to satisfy one or more of the design goals mentioned above. Variously, the routing algorithms suffer from poor worst-case throughput (ROMM [13], DOR [23]), poor latency due to increased packet hops (VALIANT [31]), or increased latency due to hardware complexity (minimal-adaptive [7, 30]). The major contribution of this paper is the design of an oblivious routing algorithm, O1TURN, with provable near-optimal worst-case throughput, good average-case throughput, low design complexity, and minimal number of network hops for 2D mesh networks, thus satisfying all the stated design goals. O1TURN offers optimal worst-case throughput when the network radix (k in a k×k network) is even. When the network radix is odd, O1TURN is within a 1/k^2 factor of optimal worst-case throughput. O1TURN achieves superior or comparable average-case throughput with global traffic as well as local traffic. For example, O1TURN achieves 18.8%, 0.7%, and 13.6% higher average-case throughput than DOR, ROMM, and VALIANT routing, respectively, when averaged over one million random traffic patterns on an 8×8 network. Finally, we demonstrate that O1TURN is well suited for a partitioned router implementation that is of similar delay complexity as a simple dimension-ordered router. Our implementation incurs a marginal increase in switch arbitration delay that is completely hidden in pipelined routers as it is not on the clock-critical path.
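The abstract above does not spell out how O1TURN picks a path. As a minimal sketch, assuming the commonly described rule that each packet takes either the XY or the YX dimension-ordered path with equal probability, path selection on a 2D mesh might look like the following. The function names `dimension_order_path` and `o1turn_path` are hypothetical, and the sketch ignores virtual channels and router pipelining entirely.

```python
import random


def dimension_order_path(src, dst, order):
    """Walk from src to dst on a 2D mesh, finishing one dimension before the next.

    order is a tuple of dimension indices, e.g. (0, 1) for X-then-Y.
    """
    pos = list(src)
    path = [tuple(pos)]
    for dim in order:
        step = 1 if dst[dim] > pos[dim] else -1
        while pos[dim] != dst[dim]:
            pos[dim] += step
            path.append(tuple(pos))
    return path


def o1turn_path(src, dst, rng=random):
    """Assumed O1TURN rule: pick XY or YX order uniformly at random per packet."""
    order = rng.choice([(0, 1), (1, 0)])
    return dimension_order_path(src, dst, order)


if __name__ == "__main__":
    print(o1turn_path((0, 0), (3, 2), random.Random(1)))
```

Both candidate paths are minimal, so the hop count always equals the Manhattan distance between source and destination; only the turn location differs.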

GOAL: A load-balanced adaptive routing algorithm for torus networks
International Symposium on Computer Architecture (ISCA), 2003
Cited by 22 (7 self)
We introduce a load-balanced adaptive routing algorithm for torus networks, GOAL (Globally Oblivious Adaptive Locally), that provides high throughput on adversarial traffic patterns, matching or exceeding fully randomized routing and exceeding the worst-case performance of Chaos [2], RLB [14], and minimal routing [8] by more than 40%. GOAL also preserves locality to provide up to 4.6× the throughput of fully randomized routing [19] on local traffic. GOAL achieves global load balance by randomly choosing the direction to route in each dimension. Local load balance is then achieved by routing in the selected directions adaptively. We compare the throughput, latency, stability, and hotspot performance of GOAL to six previously published routing algorithms on six specific traffic patterns and 1,000 randomly generated permutations.
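The "randomly choosing the direction to route in each dimension" step can be sketched as below. The abstract does not give the probabilities, so this sketch assumes a weighting in which, on a k-node ring, a direction is chosen with probability proportional to the hop count of the opposite direction (so the shorter way around is favored). The function name `choose_directions` is hypothetical, and the adaptive local routing phase is not modeled.

```python
import random


def choose_directions(src, dst, k, rng=random):
    """Pick a travel direction (+1, -1, or 0) per dimension of a k-ary torus.

    Assumed weighting (not stated in the abstract): the +1 direction is
    chosen with probability (k - fwd) / k, where fwd is the hop count
    when travelling in the +1 direction.
    """
    directions = []
    for s, d in zip(src, dst):
        fwd = (d - s) % k          # hops needed going in the +1 direction
        if fwd == 0:
            directions.append(0)   # already aligned in this dimension
            continue
        p_fwd = (k - fwd) / k      # shorter direction gets the larger weight
        directions.append(1 if rng.random() < p_fwd else -1)
    return directions


if __name__ == "__main__":
    rng = random.Random(0)
    print(choose_directions((0, 0), (1, 5), k=8, rng=rng))
```

Under this assumed weighting, a destination one hop away in the +1 direction on an 8-node ring is approached the short way about 7 times out of 8.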

Throughput-Centric Routing Algorithm Design
Past, Present, and Future, Proc. 20th Anniversary Conf. Advanced Research in Very Large Systems Intelligence, 2003
Cited by 18 (3 self)
The increasing application space of interconnection networks now encompasses several applications, such as packet routing and I/O interconnect, where the throughput of a routing algorithm, not just its locality, becomes an important performance metric. We show that the problem of designing oblivious routing algorithms that have high worst-case or average-case throughput can be cast as a linear program. Globally optimal solutions to these optimization problems can be efficiently found, yielding provably good oblivious routing algorithms.

Outstanding Research Problems in NoC Design: Circuit, Microarchitecture, and System-Level Perspectives
Cited by 11 (0 self)
Networks-on-Chip (NoCs) have recently been proposed to replace global interconnects in order to alleviate complex communication problems. While several research problems concerning NoC design have already been addressed in the literature, many others remain to be solved. In this work, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem formulation, proposed approaches, and open issues are discussed for each problem enumerated in the paper from circuit, microarchitecture, and system-level perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective. Index terms: on-chip communication architecture, networks-on-chip, multiprocessor system-on-chip, CMP.

Achieving predictable performance through better memory controller placement in many-core CMPs
In ISCA ’09: Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009
Cited by 11 (3 self)
In the near term, Moore’s law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper, we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric, as well as lower the variance in reference latency, which provides for predictable performance of memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e., mesh and torus), routing algorithms, and workloads. FBDIMM technology [8] and onboard memory buffers to serve as pin expanders converting from narrow serial channels to a wide address/data/control bus used by the memory part. Nonetheless, packaging constraints limited primarily by the number of available pins restrict the number of memory controllers to a small fraction relative to the number of processing cores. The reality of many cores with few memory controllers raises the important question of where the memory controllers should be located within the on-chip network. The Tilera “Tile Architecture” [26] is implemented as an 8×8 two-dimensional mesh of tiles (Figure 1a). Packets are routed using dimension-order routing and wormhole flow control. The Tilera on-chip network uses five independent physical networks to

Locality-preserving randomized oblivious routing on torus networks
In Proc. of the Symposium on Parallel Algorithms and Architectures, 2002
Cited by 5 (3 self)
We introduce Randomized Local Balance (RLB), a routing algorithm that strikes a balance between locality and load balance in torus networks, and analyze RLB’s performance for benign and adversarial traffic permutations. Our results show that RLB outperforms deterministic algorithms (25% more bandwidth than Dimension Order Routing) and minimal oblivious algorithms (50% more bandwidth than 2-phase ROMM [9]) on worst-case traffic. At the same time, RLB offers higher throughput on local traffic than a fully randomized algorithm (4.6 times more bandwidth than VAL (Valiant’s algorithm) [15] in the best case). RLBth (RLB threshold) improves the locality of RLB to match the throughput of minimal algorithms on very local traffic in exchange for a 4% reduction in worst-case throughput compared to RLB. Both RLB and RLBth give better throughput than all other algorithms we tested on randomly selected traffic permutations. While RLB algorithms have somewhat lower guaranteed bandwidth than VAL, they have much lower latency at low offered loads (up to 3.65 times less for RLBth).

Randomized Partially-Minimal Routing on Three-Dimensional Mesh Networks
Cited by 5 (3 self)
This letter presents a new oblivious routing algorithm for 3D mesh networks called Randomized Partially-Minimal (RPM) routing that provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even and within a factor of 1/k^2 of optimal when k is odd. Although this optimality result has been achieved with the minimal routing algorithm O1TURN [9] for the 2D case, the worst-case throughput of O1TURN degrades tremendously in higher dimensions. Other existing routing algorithms suffer from either poor worst-case throughput (DOR [10], ROMM [8]) or poor latency (VAL [14]). RPM, on the other hand, achieves near-optimal worst-case and good average-case throughput as well as good latency performance.

Optimal Load-Balancing
In Proceedings of IEEE Infocom, 2005
Cited by 4 (2 self)
This paper is about load-balancing packets across multiple paths inside a switch, or across a network. It is motivated by the recent interest in load-balanced switches. Load-balanced switches provide an appealing alternative to crossbars with centralized schedulers. A load-balanced switch has no scheduler, is particularly amenable to optics, and (most relevant here) guarantees 100% throughput. A uniform mesh is used to load-balance packets uniformly across all 2-hop paths in the switch. In this paper we explore whether this particular method of load-balancing is optimal in the sense that it achieves the highest throughput for a given capacity of interconnect. The method we use allows the load-balanced switch to be compared with ring, torus, and hypercube interconnects, too. We prove that for a given interconnect capacity, the load-balancing mesh has the maximum throughput. Perhaps surprisingly, we find that the best mesh is slightly non-uniform, or biased, and has a throughput of N/(2N-1), where N is the number of nodes.
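As a quick numerical reading of the throughput expression in this abstract, taken here as N/(2N-1), the snippet below evaluates the closed form exactly for a few node counts. The function name `biased_mesh_throughput` is hypothetical; it only evaluates the formula and does not model the switch or the mesh.

```python
from fractions import Fraction


def biased_mesh_throughput(n):
    """Evaluate N / (2N - 1) exactly for a mesh with n nodes (n >= 2)."""
    if n < 2:
        raise ValueError("expected at least two nodes")
    return Fraction(n, 2 * n - 1)


if __name__ == "__main__":
    for n in (2, 4, 16, 64):
        t = biased_mesh_throughput(n)
        print(f"N={n:3d}  throughput={t}  (~{float(t):.4f})")
```

The value stays strictly above 1/2 for every finite N and approaches 1/2 as N grows.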

Near-optimal oblivious routing on three-dimensional mesh networks
International Conference on Computer Design, 2008
Cited by 4 (4 self)
The increasing viability of three-dimensional (3D) silicon integration technology has opened new opportunities for chip architecture innovations. One direction is in the extension of two-dimensional (2D) mesh-based tiled chip-multiprocessor architectures into three dimensions. In this paper, we focus on efficient routing algorithms for such 3D mesh networks. As in the case of 2D mesh networks, throughput and latency are important design metrics for routing algorithms. Existing routing algorithms suffer from either poor worst-case throughput (DOR [1], ROMM [3]) or poor latency (VAL [2]). Although the minimal routing algorithm O1TURN proposed in [4] already achieves near-optimal worst-case throughput for the 2D case, the optimality result does not extend to higher dimensions. For 3D and higher-dimensional meshes, the worst-case throughput of O1TURN degrades tremendously. The main contribution of this paper is the design of a new oblivious routing algorithm for 3D mesh networks called Randomized Partially-Minimal (RPM) routing. RPM provably achieves optimal worst-case throughput for 3D meshes when the network radix k is even and within a factor of 1/k^2 of optimal worst-case throughput when k is odd. RPM also outperforms VAL, DOR, ROMM, and O1TURN in average-case throughput by 33.3%, 111%, 47%, and 30%, respectively, when averaged over one million random traffic patterns on an 8 × 8 × 8 topology. Finally, whereas VAL achieves optimal worst-case throughput at a penalty factor of 2 in average latency over DOR, RPM achieves (near) optimal worst-case throughput with a much smaller factor of 1.33. In practice, the average latency of RPM is expected to be closer to minimal routing because 3D mesh networks are not expected to be symmetric in 3D chip designs. The number of available device layers is expected to be much less than the number of processor tiles that can be placed along an edge of a device layer. For practical asymmetric 3D mesh configurations, the average latency of RPM reduces to just a factor of 1.11 of DOR.

Low-Overhead Error Detection for Networks-on-Chip
Cited by 3 (1 self)
In the current deep sub-micron age, interconnect reliability is a subject of major concern, and is crucial for a successful product. Coding is a widely used method to achieve communication reliability, which can be very useful in a Network-on-Chip (NoC). A key challenge for NoC error detection is to provide a defined detection level while minimizing the number of redundant parity bits, using small encoder and decoder circuits, and ensuring shortest-path routing. We present Parity Routing (PaR), a novel method that exploits NoC path diversity to reduce the number of redundant parity bits transmitted. Our analysis shows that, for example, on a 4x4 NoC with a demand of one parity bit, PaR reduces the redundant information transmitted by 75%, and the savings increase asymptotically to 100% with the size of the NoC. In addition, we show that PaR can yield power savings due to the reduced number of bit transmissions and simple decoding process. Furthermore, PaR utilizes low-complexity, small-area circuits.