Load-Balanced Combined Input-Crosspoint Buffered Packet Switches
, 2011
Cited by 5 (1 self)
Combined input-crosspoint buffered (CICB) switches can achieve high switching performance without speedup. However, the dedicated crosspoint buffers in a CICB switch may not be used efficiently, and throughput degradation may occur. This degradation is especially observable under flows with high data rates and long distances between the line cards and the buffered crossbar. This paper introduces two load-balanced CICB switches: the load-balancing CICB switch with full access (LB-CICB-FA) and the load-balancing CICB switch with single access (LB-CICB-SA). The proposed switches use the crosspoint buffers efficiently and support long distances between the line cards and the buffered crossbar with crosspoint buffers smaller than those in a CICB switch by a factor of N, where N is the number of ports. It is proven that the LB-CICB-FA switch with random selection of the configuration of the load-balancing stage, input queues, and crosspoint queues is weakly stable under admissible independent and identically distributed (i.i.d.) traffic. Additional simulation results support the correctness of the theoretical analysis. Furthermore, it is shown that the throughput of the LB-CICB-SA switch with longest-queue-first (LQF) and first-come first-served (FCFS) as input and output arbitrations, respectively, is 100% under admissible i.i.d. traffic. The proposed switches keep cells in sequence and use no speedup. The implementation complexity of the load-balancing stage is discussed and shown to be low.
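The two arbitration rules named in the abstract above, LQF at the inputs and FCFS at the outputs, can be illustrated with a minimal sketch. The function names, the list-based queue representation, and the lowest-index tie-breaking are assumptions made for illustration; the paper's own data structures are not given in this listing.

```python
def lqf_select(queue_lengths):
    """Longest-queue-first: return the index of the longest non-empty queue.

    Ties go to the lowest index (an assumption). Returns None if all
    queues are empty.
    """
    best = None
    for idx, length in enumerate(queue_lengths):
        if length > 0 and (best is None or length > queue_lengths[best]):
            best = idx
    return best


def fcfs_select(head_arrival_times):
    """First-come first-served: return the index of the crosspoint whose
    head-of-line cell arrived earliest.

    head_arrival_times[i] is the arrival time of the head cell at
    crosspoint i, or None if that crosspoint is empty. Ties go to the
    lowest index (an assumption).
    """
    best = None
    for idx, t in enumerate(head_arrival_times):
        if t is not None and (best is None or t < head_arrival_times[best]):
            best = idx
    return best
```

For example, `lqf_select([0, 3, 5, 5])` picks queue 2, and `fcfs_select([None, 7, 4, 4])` picks crosspoint 2.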
G.R.: A discussion in favor of Dynamic Scheduling for regular applications in Many-core Architectures
- In: Proceedings of 2012 Workshop on Multithreaded Architectures and Applications (MTAAP 2012); 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS
, 2012
Cited by 5 (4 self)
The recent evolution of many-core architectures has resulted in chips where the number of processor elements (PEs) is in the hundreds and continues to increase every day. In addition, many-core processors are more and more frequently characterized by the diversity of their resources and the way the sharing of those resources is arbitrated. On such machines, task scheduling is of paramount importance to orchestrate a satisfactory distribution of tasks with an efficient utilization of resources, especially when fine-grain parallelism is desired or required. In the past, the primary focus of scheduling techniques has been on achieving load balancing and reducing overhead with the aim of increasing total performance. This focus has resulted in a scheduling paradigm where Static Scheduling (SS) is preferred to Dynamic Scheduling (DS) for highly regular and embarrassingly parallel applications running on homogeneous architectures. We have revisited the task scheduling problem for these types of applications under the scenario imposed by many-core architectures to investigate whether there exist scenarios where DS is better than SS. Our main contribution is the idea that, for highly regular and embarrassingly parallel applications, DS is preferable to SS in some situations commonly found in many-core architectures. We present experimental evidence that shows how the performance of SS is degraded by the new environment on many-core chips. We analyze three reasons that contribute to the superiority of DS over SS on many-core architectures under the situations described: 1) A uniform mapping of work to processors without considering the granularity of tasks is not necessarily scalable under limited amounts of work. 2) The presence of shared resources (i.e., the crossbar switch) produces unexpected and stochastic variations in the duration of tasks that SS is unable to manage properly. 3) Hardware features, such as in-memory atomic operations, greatly contribute to decreasing the overhead of DS.
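The trade-off described in this abstract can be illustrated with a small model. This is a sketch under stated assumptions, not the authors' experimental setup. SS is modeled as a fixed split into contiguous, near-equal blocks. DS is modeled as idle workers taking the next task from a shared counter, paying a hypothetical fixed `overhead` per fetch. Task durations are supplied by the caller.

```python
import heapq


def static_makespan(durations, workers):
    """Static scheduling: contiguous, near-equal blocks fixed before execution."""
    base, extra = divmod(len(durations), workers)
    finish_times = []
    start = 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        finish_times.append(sum(durations[start:start + size]))
        start += size
    return max(finish_times)


def dynamic_makespan(durations, workers, overhead=0.0):
    """Dynamic scheduling: the earliest-idle worker takes the next task.

    `overhead` is a hypothetical per-task cost of fetching work from the
    shared counter (for example, one atomic increment).
    """
    free_at = [0.0] * workers  # min-heap of the times each worker becomes idle
    heapq.heapify(free_at)
    for d in durations:
        t = heapq.heappop(free_at)
        heapq.heappush(free_at, t + overhead + d)
    return max(free_at)
```

With `durations = [1, 1, 1, 1, 5, 1, 1, 1]` and two workers, the static split finishes at time 8, while dynamic assignment with zero overhead finishes at time 7. Raising the per-task overhead to 0.5 pushes the dynamic result to 8.5, which is why cheap atomic operations matter for DS in this model.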
The Crosspoint-Queued Switch
Cited by 3 (0 self)
This paper calls for rethinking packet-switch architectures by cutting all dependencies between the switch fabric and the linecards. Most single-stage packet-switch architectures rely on an instantaneous communication between the switch fabric and the linecards. Today, however, this assumption is breaking down, because effective propagation times are too high and keep increasing with the line rates. In this paper, we argue for a self-sufficient switch fabric by moving all the buffering from the linecards to the switch fabric. We introduce the crosspoint-queued (CQ) switch, a new buffered-crossbar switch architecture with large crosspoint buffers and no input queues, and show how it can be readily implemented in a single SRAM-based chip using current technology. For a crosspoint buffer size of one, we provide a closed-form throughput formula for all work-conserving schedules under uniform Bernoulli i.i.d. arrivals. Furthermore, we study the performance of the switch for larger buffer sizes and show that it nearly behaves as an ideal output-queued switch. Finally, we confirm our results using synthetic as well as trace-based simulations.
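A slot-based simulation gives a feel for the CQ architecture described above. The sketch below rests on several assumptions that are not taken from the paper: arrivals are processed before departures within a slot, a cell arriving at a full crosspoint buffer is dropped, and each output serves a uniformly random non-empty crosspoint in its column (one possible work-conserving schedule). The function name and return format are hypothetical.

```python
import random


def simulate_cq(ports, buffer_size, load, slots, seed=0):
    """Simulate a crosspoint-queued switch with no input queues.

    Uniform Bernoulli i.i.d. arrivals: in each slot, every input receives
    a cell with probability `load`, destined to a uniformly chosen output.
    """
    rng = random.Random(seed)
    # queue[i][j] counts cells waiting at the crosspoint of input i, output j.
    queue = [[0] * ports for _ in range(ports)]
    arrived = dropped = departed = 0
    for _ in range(slots):
        # Arrival phase: cells go directly to their crosspoint buffer.
        for i in range(ports):
            if rng.random() < load:
                arrived += 1
                j = rng.randrange(ports)
                if queue[i][j] < buffer_size:
                    queue[i][j] += 1
                else:
                    dropped += 1
        # Service phase: each output sends one cell if any crosspoint has one.
        for j in range(ports):
            busy = [i for i in range(ports) if queue[i][j] > 0]
            if busy:
                queue[rng.choice(busy)][j] -= 1
                departed += 1
    queued = sum(map(sum, queue))
    return {"arrived": arrived, "dropped": dropped,
            "departed": departed, "queued": queued}
```

The ratio `departed / arrived` estimates throughput. Running the function with `buffer_size=1` and then with larger values shows how the drop count shrinks as the crosspoint buffers grow.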
Performance Evaluation of Novel Combined Input Crossbar Queue Switch with Virtual Crossbar Queues
, 2007
A special version of the crossbar switch is the Combined Input Crossbar Queue (CICQ) switch, where buffers are employed at each crosspoint in addition to buffering at each input port. In this architecture, Round Trip Time (RTT) is defined as the sum of the delays of: i) the input arbitration, ii) the transmission of a packet from an input port to the crossbar, iii) the output arbitration, and iv) the transmission of the flow-control information back from the crossbar to the input port. The RTT depends on the distance between the switch core and the line cards. When the switch core is physically located far from the input ports, the RTT can be long. Throughput has been shown to degrade as a function of RTT and crosspoint buffer size. Supporting a longer RTT requires a larger crosspoint buffer, but a larger crosspoint buffer limits the port count. To support a longer RTT, a new architecture has recently been proposed, known as the Combined Input Crossbar Queue Switch with Virtual Crossbar Queues. In this paper, performance of
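The RTT definition above is a plain sum of four delays, and it can be tied to buffer sizing with a short sketch. The sizing rule used here, one buffered cell per cell time of RTT, is a common flow-control assumption and is not stated in the abstract. All function and parameter names are hypothetical.

```python
import math


def round_trip_time(input_arbitration, to_crossbar, output_arbitration,
                    flow_control_back):
    """RTT as the sum of the four delays listed in the abstract."""
    return (input_arbitration + to_crossbar + output_arbitration
            + flow_control_back)


def min_crosspoint_cells(rtt, cell_time):
    """Assumed sizing rule: enough cells to keep the output busy for one RTT."""
    return max(1, math.ceil(rtt / cell_time))


def crossbar_memory_cells(ports, cells_per_crosspoint):
    """Total buffering in an N x N crossbar with one buffer per crosspoint."""
    return ports * ports * cells_per_crosspoint
```

With delays of 1, 3, 1, and 3 time units and a cell time of 2, the RTT is 8 and each crosspoint needs 4 cells under this rule. A 32-port crossbar would then hold 32 x 32 x 4 = 4096 cells, which illustrates why a longer RTT constrains the port count.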
Performance Analysis of a Buffered Crossbar Switch
This paper analyzes the performance of a buffered crossbar switch under bursty traffic. Using the proposed analytic model, it derives the saturation throughput of a buffered crossbar switch with multiple queues at each input port. The saturation throughput decreases sharply from 1 and converges to 0.5 as the average burst length increases, and it approaches 1 as the number of queues per input increases. The accuracy of the theoretical analysis is also investigated by extensive simulation. Results from this paper can be used as guidance for designing optimal buffered crossbar switches. Key words: buffered crossbar switch; input queuing; scheduling; modeling; performance analysis
Efficient Variable Length Block Switching Mechanism
The most popular and widely used packet switch architecture is the crossbar. Its attractive characteristics are simplicity, non-blocking operation, and support for simultaneous transmission of multiple packets across the switch. A special version of the crossbar switch is the Combined Input Crossbar Queue (CICQ) switch. It overcomes the limitations of the unbuffered crossbar by employing buffers at each crosspoint in addition to buffering at each input port. Adopting a Crosspoint Buffer (CB) reduces scheduling complexity and allows scheduling to be distributed. As a result, a matching operation is not needed. Moreover, the switch supports transmission of variable-length packets without segmentation. Native switching of variable-length packets results in unfairness. To overcome this unfairness, the Fixed Length Block Transfer mechanism has been proposed. It has the following drawbacks: (a) Fragmented packets are reassembled at the CB. Hence, the minimum buffer requirement at each crosspoint is twice the maximum block size. When the number of ports is large, such a switch is infeasible because of the restricted memory available in the switch core. (b) The reassembly circuit at each crosspoint adds to the cost of the switch. (c) A packet is eligible for transfer from the CB to the output only when the entire packet has arrived at the CB, which increases the latency of fragmented packets in the switch. To overcome these drawbacks, this paper presents the Variable Length Block Transfer mechanism. It does not require internal speedup or segmentation and reassembly circuits. Simulation shows that the proposed mechanism is superior to the Fixed Length Block Transfer mechanism in terms of delay and throughput.
The Crosspoint-Queued Switch (Extended Version)
This paper calls for rethinking packet-switch architectures, by cutting all dependencies between the switch fabric and the linecards. Most single-stage packet-switch architectures rely on an instantaneous communication between the switch fabric and the linecards. Today, however, this assumption is breaking down, because effective propagation times are too high and keep increasing with the line rates. In this paper, we argue for a self-sufficient switch fabric, by moving all the buffering from the linecards to the switch fabric. We introduce the crosspoint-queued (CQ) switch, a new buffered-crossbar switch architecture with large crosspoint buffers and no input queues. We study the performance of the switch and show that it nearly behaves as an ideal output-queued switch as the buffer size increases. Further, with a small crosspoint buffer size, we provide a closed-form throughput formula for all work-conserving schedules with uniform Bernoulli i.i.d. arrivals. Finally, we show how the CQ switch can be practically implemented in a single SRAM-based chip.
Stability and Control of Acyclic Stochastic Processing N
We consider a general model framework for acyclic stochastic processing networks with shared resources that has many applications in telecommunication, computer, and manufacturing systems. A dynamic control policy that utilizes the maximal matching (for scheduling) and the join-the-shortest-queue (for routing) discipline is shown to maximize the throughput and stabilize the system in a sense called "uniform mean recurrence time property" under fairly mild stochastic assumptions. Owing to the non-Markovian nature of the states, system stability is established using a perturbed Lyapunov function method. Index Terms: Acyclic network, control, stability, maximal throughput, perturbed Lyapunov function method.
A STUDY ON BUFFERED CROSSBAR SWITCH SCHEDULING ALGORITHMS
The increasing demand for higher data rates on the Internet requires routers that deliver high performance for high-speed connections. High-speed routers now use buffered crossbar switches, which have attracted interest for both research and commercialization. This paper studies the importance of buffered crossbar switches and their scheduling algorithms, and presents a comparative analysis of these scheduling algorithms.
A Low Complexity Scheduling Algorithm for a Crosspoint Buffered Switch with 100 % Throughput
Crosspoint buffered switches are emerging as the focus of research in high-speed routers. They have simpler scheduling algorithms, and achieve better performance than a bufferless crossbar switch. Crosspoint buffered switches have a buffer at each crosspoint. A cell is first delivered to a crosspoint buffer, and then transferred to the output port. With a speedup of two, a crosspoint buffered switch has previously been proved to provide 100% throughput. In this paper, we propose a 100% throughput scheduling algorithm without speedup, called SQUID. With this design, each input/output keeps track of the previously served virtual output queues (VOQs)/crosspoint buffers. We prove that SQUID, with a time complexity of O(log N), can achieve 100% throughput without any speedup. Our simulation results also show a delay performance comparable to output-queued switches. We also present a novel queuing model that models crosspoint buffered switches under uniform traffic.