SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery
- In Proceedings of the 29th Annual International Symposium on Computer Architecture
, 2002
Cited by 90 (7 self)
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.
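The abstract above describes keeping several checkpoints, validating them in the background, and rolling back to a pre-fault checkpoint when a fault is detected. The following is only a minimal software sketch of that general idea using per-interval undo logs on a single toy memory; it is not SafetyNet's hardware design, and the class name `CheckpointedMemory` and all of its methods are hypothetical names chosen for illustration.

```python
_ABSENT = object()  # marks "address did not exist before this interval"


class CheckpointedMemory:
    """Toy memory with one undo log per checkpoint interval (illustrative only)."""

    def __init__(self):
        self.mem = {}      # address -> current value
        self.logs = [{}]   # oldest unvalidated interval first; last one is open

    def write(self, addr, value):
        log = self.logs[-1]
        if addr not in log:
            # Log the old value only on the first write in this interval.
            log[addr] = self.mem.get(addr, _ABSENT)
        self.mem[addr] = value

    def checkpoint(self):
        """Close the current interval and open a new one."""
        self.logs.append({})

    def validate_oldest(self):
        """The oldest closed interval was found fault-free; free its undo log."""
        if len(self.logs) > 1:
            self.logs.pop(0)

    def recover(self):
        """Undo every unvalidated interval, newest first."""
        while self.logs:
            log = self.logs.pop()
            for addr, old in log.items():
                if old is _ABSENT:
                    self.mem.pop(addr, None)
                else:
                    self.mem[addr] = old
        self.logs = [{}]


m = CheckpointedMemory()
m.write(0x10, 1)
m.checkpoint()
m.validate_oldest()      # state {0x10: 1} is now the recovery point
m.write(0x10, 2)
m.checkpoint()
m.write(0x10, 3)
m.write(0x20, 9)
m.recover()              # fault detected: roll back both unvalidated intervals
```

In this sketch, validation and execution are sequential calls; the pipelining of validation with continued execution that the abstract mentions is not modeled.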
HIPIQS: A High-Performance Switch Architecture using Input Queuing
- In Proceedings of the 12th International Parallel Processing Symposium
, 1998
Cited by 17 (3 self)
Switch-based interconnects are used in a number of application domains including parallel system interconnects, local area networks, and wide area networks. However, very few switches have been designed that are suitable for more than one of these application domains. Such a switch must offer both extremely low latency and very high throughput for a variety of different message sizes. While some architectures with output queuing have been shown to perform extremely well in terms of throughput, their performance can suffer when used in systems where a significant portion of the packets are extremely small. On the other hand, architectures with input queuing offer limited throughput, or require fairly complex and centralized arbitration that increases latency. In this paper we present a new input queue-based switch architecture called HIPIQS (HIgh-Performance Input-Queued Switch). It offers low latency for a range of message sizes, and provides throughput comparable to that of output qu...
The Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols
, 1998
Cited by 15 (5 self)
that I have read this dissertation and that in my opinion it is
A Case for Bufferless Routing in On-Chip Networks
- ISCA'09
, 2009
Cited by 15 (7 self)
Buffers in on-chip networks consume significant energy, occupy chip area, and increase design complexity. In this paper, we make a case for a new approach to designing on-chip interconnection networks that eliminates the need for buffers for routing or flow control. We describe new algorithms for routing without using buffers in router input/output ports. We analyze the advantages and disadvantages of bufferless routing and discuss how router latency can be reduced by taking advantage of the fact that input/output buffers do not exist. Our evaluations show that routing without buffers significantly reduces the energy consumption of the on-chip cache/processor-to-cache network, while providing similar performance to that of existing buffered routing algorithms at low network utilization (i.e., on most real applications). We conclude that bufferless routing can be an attractive and energy-efficient design option for on-chip cache/processor-to-cache networks where network utilization is low.
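One well-known way to route without input/output buffers is deflection routing: every flit that arrives in a cycle must leave in the same cycle, so a flit that loses arbitration for a useful port is sent out of some other free port instead. The sketch below illustrates that idea for an interior router of a 2D mesh. The oldest-first ranking, the port names, and the function names are assumptions made for this example, not details taken from the abstract.

```python
def productive_ports(pos, dest):
    """Ports that move a flit closer to dest on a 2D mesh (x grows east, y grows north)."""
    (x, y), (dx, dy) = pos, dest
    ports = []
    if dx > x:
        ports.append("E")
    if dx < x:
        ports.append("W")
    if dy > y:
        ports.append("N")
    if dy < y:
        ports.append("S")
    return ports


def assign_ports(pos, flits, ports=("N", "E", "S", "W")):
    """Give every flit an output port this cycle; no flit is ever stored.

    flits: list of (age, flit_id, dest) with dest != pos.
    Assumes an interior router, so len(flits) <= len(ports).
    """
    assert len(flits) <= len(ports)
    free = list(ports)
    out = {}
    # Assumed ranking: oldest flit chooses first.
    for age, fid, dest in sorted(flits, key=lambda f: -f[0]):
        wanted = [p for p in productive_ports(pos, dest) if p in free]
        port = wanted[0] if wanted else free[0]  # deflect if nothing useful is free
        free.remove(port)
        out[fid] = port
    return out


routes = assign_ports((1, 1), [(5, "a", (3, 1)), (2, "b", (3, 1)), (1, "c", (1, 0))])
```

Here flits `a` and `b` both want the east port; the older flit `a` gets it and `b` is deflected, which costs extra hops but needs no buffer.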
Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks
Cited by 14 (2 self)
Future chip multiprocessors (CMPs) may have hundreds to thousands of threads competing to access shared resources, and will require quality-of-service (QoS) support to improve system utilization. Although there has been significant work in QoS support within resources such as caches and memory controllers, there has been less attention paid to QoS support in the multi-hop on-chip networks that will form an important component in future systems. In this paper we introduce Globally-Synchronized Frames (GSF), a framework for providing guaranteed QoS in on-chip networks in terms of minimum bandwidth and a maximum delay bound. The GSF framework can be easily integrated in a conventional virtual channel (VC) router without significantly increasing the hardware complexity. We rely on a fast barrier network, which is feasible in an on-chip environment, to efficiently implement GSF. Performance guarantees are verified by both analysis and simulation. According to our simulations, all concurrent flows receive their guaranteed minimum share of bandwidth in compliance with a given bandwidth allocation. The average throughput degradation of GSF on an 8×8 mesh network is within 10% compared to the conventional best-effort VC router in most cases.
Immunet: A Cheap and Robust Fault-Tolerant Packet Routing Mechanism
- 31st Annual International Symposium on Computer Architecture
, 2004
Cited by 13 (4 self)
A new and efficient mechanism to tolerate failures in interconnection networks for parallel and distributed computers, denoted as Immunet, is presented in this work. In the presence of failures, Immunet automatically reacts with a hardware reconfiguration of the surviving network resources. Immunet has four important advantages over previous fault-tolerant switching mechanisms. Its low hardware costs minimize the overhead that the network must support in absence of faults. As long as the network remains connected, Immunet can tolerate any number of failures regardless of their spatial and temporal combinations. The resulting communication infrastructure provides optimized adaptive minimal routing over the surviving topology. The system behavior under successive failures exhibits graceful performance degradation. Immunet reconfiguration can be totally transparent to the applications running on the parallel system as they will only be affected by the loss of those data packets circulating through the broken components. The rest of the packets will suffer only a tolerable delay induced by the time employed to perform the automatic network reconfiguration. Descriptions of the hardware network architecture and detailed synthetic and execution-driven simulations will demonstrate the benefits of Immunet.
Pipelined Multi-Queue Management in a VLSI ATM Switch Chip with Credit-Based Flow-Control
, 1997
Cited by 10 (4 self)
We describe the queue management block of ATLAS I, a single-chip ATM switch (router) with optional credit-based (backpressure) flow control. ATLAS I is a 4-million-transistor 0.35-micron CMOS chip, currently under development, offering 20 Gbit/s aggregate I/O throughput, sub-microsecond cut-through latency, 256-cell shared buffer containing multiple logical output queues, priorities, multicasting, and load monitoring. The queue management block of ATLAS I is a dual parallel pipeline that manages the multiple queues of ready cells, the per-flow-group credits, and the cells that are waiting for credits. All cells, in all queues, share one common buffer space. These 3- and 4-stage pipelines handle events at the rate of one cell arrival or departure per clock cycle, and one credit arrival per clock cycle. The queue management block consists of two compiled SRAMs, pipeline bypass logic, and multi-port CAM and SRAM blocks that are laid out in full-custom and support special access Cop...
Credit-Flow-Controlled ATM for MP Interconnection: the ATLAS I Single-Chip ATM Switch
- In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture
, 1998
Cited by 9 (3 self)
Multiprocessing (MP) on networks of workstations (NOW) is a high-performance computing architecture of growing importance. In traditional MPs, wormhole routing interconnection networks use fixed-size flits and backpressure. In NOWs, ATM, one of the major contending interconnection technologies, uses fixed-size cells, while backpressure can be added to it. We argue that ATM with backpressure has interesting similarities with wormhole routing. We are implementing ATLAS I, a single-chip gigabit ATM switch, which includes credit flow control (backpressure), according to a protocol resembling Quantum Flow Control (QFC). We show by simulation that this protocol performs better than the traditional multi-lane wormhole protocol: high throughput and low latency are provided with less buffer space. Also, ATLAS I demonstrates little sensitivity to bursty traffic, and, unlike wormhole, it is fair in terms of latency in hot-spot configurations. We use detailed switch models, operating at clock-cyc...
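Credit flow control (backpressure), as mentioned in the abstract above, can be summarized as: the sender holds one credit per free buffer slot downstream, spends a credit for each cell it sends, and gets the credit back when the cell leaves the downstream buffer. The sketch below shows only that generic bookkeeping for a single link. It is not the ATLAS I or QFC protocol, the class name `CreditLink` is hypothetical, and credit return is assumed to be instantaneous for simplicity.

```python
from collections import deque


class CreditLink:
    """One sender and one receiver buffer joined by a credit-controlled link."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # sender's view of free downstream slots
        self.rx_buffer = deque()      # cells waiting in the receiver

    def try_send(self, cell):
        """Send only if a credit is available, so the buffer can never overflow."""
        if self.credits == 0:
            return False              # backpressure: sender must hold the cell
        self.credits -= 1
        self.rx_buffer.append(cell)
        return True

    def drain_one(self):
        """Receiver forwards one cell and returns its credit to the sender."""
        if not self.rx_buffer:
            return None
        cell = self.rx_buffer.popleft()
        self.credits += 1             # assumption: zero-latency credit return
        return cell


link = CreditLink(buffer_slots=2)
sent = [link.try_send(c) for c in ("c0", "c1", "c2")]   # third send is refused
first_out = link.drain_one()                            # frees one slot
retry = link.try_send("c2")                             # now succeeds
```

Because a cell is only sent against a credit, no cell is ever dropped for lack of buffer space; the cost is that the sender stalls when credits run out.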
The Impact of Link Arbitration on Switch Performance
- In Proceedings of the Fifth Symposium on High-Performance Computer Architecture
, 1999
Cited by 8 (1 self)
Switch design for interconnection networks plays an important role in the overall performance of multiprocessors and computer networks. In this paper we study the impact of one parameter in the switch design space, link arbitration. We demonstrate that link arbitration can be a determining factor in the performance of current networks. Moreover, we expect increased research focus on arbitration techniques to become a trend in the future, as switch architectures evolve towards increasing the number of virtual channels and input ports. In the context of a state-of-the-art switch design we use both synthetic workload and execution-driven simulations to compare several arbitration policies. Furthermore, we devise a new arbitration method, Look-Ahead arbitration. Under heavy traffic conditions the Look-Ahead policy provides significant improvements over traditional arbitration schemes without a significant increase in hardware complexity. Also, we propose a priority-based policy that is ca...
SMTp: An Architecture for Next-generation Scalable Multi-threading
- In ISCA’04
, 2004
Cited by 8 (1 self)
We introduce the SMTp architecture: an SMT processor augmented with a coherence protocol thread context that, together with a standard integrated memory controller, can enable the design of (among other possibilities) scalable cache-coherent hardware distributed shared memory (DSM) machines from commodity nodes. We describe the minor changes needed to a conventional out-of-order multithreaded core to realize SMTp, discussing issues related to both deadlock avoidance and performance. We then compare SMTp performance to that of various conventional DSM machines with normal SMT processors both with and without integrated memory controllers. On configurations from 1 to 32 nodes, with 1 to 4 application threads per node, we find that SMTp delivers performance comparable to, and sometimes better than, machines with more complex integrated DSM-specific memory controllers. Our results also show that the protocol thread has extremely low pipeline overhead. Given the simplicity and the flexibility of the SMTp mechanism, we argue that next-generation multithreaded processors with integrated memory controllers should adopt this mechanism as a way of building less complex high-performance DSM multiprocessors.

