Results 1 - 10 of 69
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery
In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002
Cited by 90 (7 self)
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
In ISCA-02, 2002
Cited by 79 (11 self)
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when the errors occur as often as once per day.
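To put the ReVive availability figure in perspective, the following is a rough back-of-the-envelope sketch, not the paper's own calculation. It assumes availability means the fraction of wall-clock time the machine is doing useful work and that exactly one error occurs per day; the function name `max_unavailable_seconds` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_unavailable_seconds(availability: float, interval_seconds: float) -> float:
    """Upper bound on the time lost per error interval for a target availability.

    Assumption (not from the abstract): availability = 1 - lost_time / interval,
    where lost_time covers everything unavailable after one error.
    """
    return (1.0 - availability) * interval_seconds


# One error per day at 99.999% availability.
budget = max_unavailable_seconds(0.99999, SECONDS_PER_DAY)
print(f"{budget:.3f} s of lost time per error")
```

Under these assumptions, 99.999% availability with one error per day leaves a budget of under one second of lost time per error.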
Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures
In International Parallel and Distributed Processing Symposium, 2002
Cited by 74 (9 self)
This paper examines the explicit communication characteristics of several sophisticated scientific applications, which, by themselves, constitute a representative suite of publicly available benchmarks for large cluster architectures. By focusing on the Message Passing Interface (MPI) and by using hardware counters on the microprocessor, we observe each application's inherent behavioral characteristics: point-to-point and collective communication, and floating-point operations. Furthermore, we explore the sensitivities of these characteristics to both problem size and number of processors. Our analysis reveals several striking similarities across our diverse set of applications including the use of collective operations, especially those collectives with very small data payloads. We also highlight a trend of novel applications parting with regimented, static communication patterns in favor of dynamically evolving patterns, as evidenced by our experiments on applications that use implicit linear solvers and adaptive mesh refinement. Overall, our study contributes a better understanding of the requirements of current and emerging paradigms of scientific computing in terms of their computation and communication demands.
STAMP: Stanford Transactional Applications for Multi-Processing
Cited by 66 (6 self)
Transactional Memory (TM) is emerging as a promising technology to simplify parallel programming. While several TM systems have been proposed in the research literature, we are still missing the tools and workloads necessary to analyze and compare the proposals. Most TM systems have been evaluated using microbenchmarks, which may not be representative of any real-world behavior, or individual applications, which do not stress a wide range of execution scenarios. We introduce the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems. STAMP includes eight applications and thirty variants of input parameters and data sets in order to represent several application domains and cover a wide range of transactional execution cases (frequent or rare use of transactions, large or small transactions, high or low contention, etc.). Moreover, STAMP is portable across many types of TM systems, including hardware, software, and hybrid systems. In this paper, we provide descriptions and a detailed characterization of the applications in STAMP. We also use the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.
Using Speculative Retirement and Larger Instruction Windows to Narrow the Performance Gap between Memory Consistency Models
In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures, 1997
Cited by 54 (8 self)
This paper studies techniques to improve the performance of memory consistency models for shared-memory multiprocessors with ILP processors. The first part of this paper extends earlier work by studying the impact of current hardware optimizations to memory consistency implementations, hardware-controlled non-binding prefetching and speculative load execution, on the performance of the processor consistency (PC) memory model. We find that the optimized implementation of PC performs significantly better than the best implementation of sequential consistency (SC) in some cases because PC relaxes the store-to-load ordering constraint of SC. Nevertheless, release consistency (RC) provides significant benefits over PC in some cases, because PC suffers from the negative effects of premature store prefetches and insufficient memory queue sizes. The second part of the paper proposes and evaluates a new technique, speculative retirement, to improve the performance of SC. Speculative retirement ...
DMP: Deterministic Shared Memory Multiprocessing
Cited by 39 (6 self)
Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
The Impact of Instruction-Level Parallelism on Multiprocessor Performance and Simulation Methodology
In 3rd International Symposium on High Performance Computer Architecture, 1997
Cited by 38 (2 self)
Current microprocessors exploit high levels of instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and nonblocking reads. This paper presents the first detailed analysis of the impact of such processors on shared-memory multiprocessors using a detailed execution-driven simulator. Using this analysis, we also examine the validity of common direct-execution simulation techniques that employ previous-generation processor models to approximate ILP-based multiprocessors. We find that ILP techniques substantially reduce CPU time in multiprocessors, but are less effective in reducing memory stall time. Consequently, despite the presence of inherent latency-tolerating techniques in ILP processors, memory stall time becomes a larger component of execution time and parallel efficiencies are generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. Examining the validity of direct-execution simulators with previous-gene...
Analytic evaluation of shared-memory systems with ILP processors
In ISCA ’98: Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998
Cited by 28 (2 self)
This paper develops and validates an analytical model for evaluating various types of architectural alternatives for shared-memory systems with processors that aggressively exploit instruction-level parallelism. Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds. The model input parameters characterize the ability of an application to exploit instruction-level parallelism as well as the interaction between the application and the memory system architecture. A trace-driven simulation methodology is developed that allows these parameters to be generated over 100 times faster than with a detailed execution-driven simulator. Finally, this paper shows that the analytical model can be used to gain insights into application performance and to evaluate architectural design trade-offs.
Code transformations to improve memory parallelism
In Proceedings of the 32nd Annual International Symposium on Microarchitecture, 1999
Cited by 27 (1 self)
Current microprocessors incorporate techniques to exploit instruction-level parallelism (ILP). However, previous work has shown that these ILP techniques are less effective in removing memory stall time than CPU time, making the memory system a greater bottleneck in ILP-based systems than in previous-generation systems. These deficiencies arise largely because applications present limited opportunities for an out-of-order issue processor to overlap multiple read misses, the dominant source of memory stalls. This work proposes code transformations to increase parallelism in the memory system by overlapping multiple read misses within the same instruction window, while preserving cache locality. We present an analysis and transformation framework suitable for compiler implementation. Our simulation experiments show execution time reductions averaging 20% in a multiprocessor and 30% in a uniprocessor. A substantial part of these reductions comes from increases in memory parallelism. We see similar benefits on a Convex Exemplar.
Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors
1997
Cited by 26 (5 self)
Shared-memory multiprocessors are becoming increasingly popular as a high-performance, easy to program, and relatively inexpensive choice for parallel computation. However, the performance of shared-memory multiprocessors is limited by memory latency. Memory latencies are higher in multiprocessors due to physical constraints and cache coherence overheads. In addition, synchronization operations, which are necessary to ensure correctness in parallel programs, add further communication overhead in shared-memory multiprocessors. Software-controlled non-binding data prefetching is a widely used consumer-initiated mechanism to hide communication latency and is currently supported on most architectures. However, on an invalidation-based cache-coherent multiprocessor, prefetching is inapplicable or insufficient for some communication patterns such as irregular communication, fine-grain pipelined loops, and synchronization. For these cases, a combination of two fine-grain, producer-initiated pr...

