Results 1 - 10
of
56
CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution
"... The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic ..."
Abstract
-
Cited by 35 (5 self)
- Add to MetaCart
The behavior of a multithreaded program does not depend only on its inputs. Scheduling, memory reordering, timing, and low-level hardware effects all introduce nondeterminism in the execution of multithreaded programs. This severely complicates many tasks, including debugging, testing, and automatic replication. In this work, we avoid these complications by eliminating their root cause: we develop a compiler and runtime system that runs arbitrary multithreaded C/C++ POSIX Threads programs deterministically. A trivial non-performant approach to providing determinism is simply deterministically serializing execution. Instead, we present a compiler and runtime infrastructure that ensures determinism but resorts to serialization rarely, for handling interthread communication and synchronization. We develop two basic approaches, both of which are largely dynamic with performance improved by some static compiler optimizations. First, an ownership-based approach detects interthread communication via an evolving table that tracks ownership of memory regions by threads. Second, a buffering approach uses versioned memory and employs a deterministic commit protocol to make changes visible to other threads. While buffering has larger single-threaded overhead than ownership, it tends to scale better (serializing less often). A hybrid system sometimes performs and scales better than either approach individually. Our implementation is based on the LLVM compiler infrastructure. It needs neither programmer annotations nor special hardware. Our empirical evaluation uses the PARSEC and SPLASH2 benchmarks and shows that our approach scales comparably to nondeterministic execution.
A better x86 memory model: x86-TSO
- In TPHOLs’09: Conference on Theorem Proving in Higher Order Logics, volume 5674 of LNCS
, 2009
"... Abstract. Real multiprocessors do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, typically described in ambiguous prose, which lead to widespread confusion. These are prime targets for mechanized fo ..."
Abstract
-
Cited by 20 (8 self)
- Add to MetaCart
Abstract. Real multiprocessors do not provide the sequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, typically described in ambiguous prose, which lead to widespread confusion. These are prime targets for mechanized formalization. In previous work we produced a rigorous x86-CC model, formalizing the Intel and AMD architecture specifications of the time, but those turned out to be unsound with respect to actual hardware, as well as arguably too weak to program above. We discuss these issues and present a new x86-TSO model that suffers from neither problem, formalized in HOL4. We believe it is sound with respect to real processors, reflects better the vendor’s intentions, and is also better suited for programming. We give two equivalent definitions of x86-TSO: an intuitive operational model based on local write buffers, and an axiomatic total store ordering model, similar to that of the SPARCv8. Both are adapted to handle x86-specific features. We have implemented the axiomatic model in ourmemevents tool, which calculates the set of all valid executions of test programs, and, for greater confidence, verify the witnesses of such executions directly, with code extracted from a third, more algorithmic, equivalent version of the definition. 1
LiteRace: effective sampling for lightweight data-race detection
- In PLDI
, 2009
"... Data races are one of the most common and subtle causes of pernicious concurrency bugs. Static techniques for preventing data races are overly conservative and do not scale well to large programs. Past research has produced several dynamic data race detectors that can be applied to large programs an ..."
Abstract
-
Cited by 20 (1 self)
- Add to MetaCart
Data races are one of the most common and subtle causes of pernicious concurrency bugs. Static techniques for preventing data races are overly conservative and do not scale well to large programs. Past research has produced several dynamic data race detectors that can be applied to large programs and are precise in the sense that they only report actual data races. However, these dynamic data race detectors incur a high performance overhead, slowing down a program’s execution by an order of magnitude. In this paper we present FeatherLite, a very lightweight data race detector that samples and analyzes only selected portions of a program’s execution. We show that it is possible to sample a multi-threaded program at a low frequency and yet find infrequently occurring data races. We implemented FeatherLite using Microsoft’s Phoenix compiler. Our experiments with several Microsoft programs show that FeatherLite is able to find more than 75 % of data races by sampling less than 5 % of memory accesses in a given program execution. 1.
Memory Models: A Case for Rethinking Parallel Languages and Hardware
- COMMUNICATIONS OF THE ACM
, 2010
"... The era of parallel computing for the masses is here, but writing correct parallel programs remains far more difficult than writing sequential programs. Aside from a few domains, most parallel programs are written using a shared-memory approach. The memory model, which specifies the meaning of share ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
The era of parallel computing for the masses is here, but writing correct parallel programs remains far more difficult than writing sequential programs. Aside from a few domains, most parallel programs are written using a shared-memory approach. The memory model, which specifies the meaning of shared variables, is at the heart of this programming model. Unfortunately, it has involved a tradeoff between programmability and performance, and has arguably been one of the most challenging and contentious areas in both hardware architecture and programming language specification. Recent broad community-scale efforts have finally led to a convergence in this debate, with popular languages such as Java and C++ and most hardware vendors publishing compatible memory model specifications. Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings in current
x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors
"... Exploiting the multiprocessors that have recently become ubiquitous requires high-performance and reliable concurrent systems code, for concurrent data structures, operating system kernels, synchronisation libraries, compilers, and so on. However, concurrent programming, which is always challenging, ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
Exploiting the multiprocessors that have recently become ubiquitous requires high-performance and reliable concurrent systems code, for concurrent data structures, operating system kernels, synchronisation libraries, compilers, and so on. However, concurrent programming, which is always challenging, is made much more so by two problems. First, realmultiprocessorstypicallydonotprovidethesequentially consistent memory that is assumed by most work on semantics and verification. Instead, they have relaxed memory models, varying in subtle ways between processor families, in which different hardware threads may have only loosely consistent views of a shared memory. Second, the public vendor architectures, supposedly specifying what programmers can rely on, are often in ambiguous informal prose (a particularly poor medium for loose specifications), leading to widespread confusion. In this paper we focus on x86 processors. We review several recent Intel and AMD specifications, showing that all contain serious ambiguities, some are arguably too weak to program above, and some are simply unsound with respect to actual hardware. We present a new x86-TSO programmer’s model that, to the best of our knowledge, suffers from none of these problems. It is mathematically precise (rigorously defined in HOL4) but can be presented as an intuitive abstract machine which should be widely accessible to working programmers. We illustrate how this can be used to reason about the correctness of a Linux spinlock implementationanddescribeageneraltheoryofdata-race-freedomfor x86-TSO. This should put x86 multiprocessor system building on a more solid foundation; it should also provide a basis for future work on verification of such systems. 1.
The Semantics of Power and ARM Multiprocessor Machine Code
"... We develop a rigorous semantics for Power and ARM multiprocessor programs, including their relaxed memory model and the behaviour of reasonable fragments of their instruction sets. The semantics is mechanised in the HOL proof assistant. This should provide a good basis for informal reasoning and for ..."
Abstract
-
Cited by 14 (3 self)
- Add to MetaCart
We develop a rigorous semantics for Power and ARM multiprocessor programs, including their relaxed memory model and the behaviour of reasonable fragments of their instruction sets. The semantics is mechanised in the HOL proof assistant. This should provide a good basis for informal reasoning and formal verification of low-level code for these weakly consistent architectures, and, together with our x86 semantics, for the design and compilation of high-level concurrent languages.
Ordering-Based Semantics for Software Transactional Memory
- IN PROC. OF THE 12TH INTL. CONF. ON PRINCIPLES OF DISTRIBUTED SYSTEMS
, 2008
"... It has been widely suggested that memory transactions should behave as if they acquired and released a single global lock. Unfortunately, this behavior can be expensive to achieve, particularly when— as in the natural publication/privatization idiom—the same data are accessed both transactionally an ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
It has been widely suggested that memory transactions should behave as if they acquired and released a single global lock. Unfortunately, this behavior can be expensive to achieve, particularly when— as in the natural publication/privatization idiom—the same data are accessed both transactionally and nontransactionally. To avoid overhead, we propose selective strict serializability (SSS) semantics, in which transactions have a global total order, but nontransactional accesses are globally ordered only with respect to explicitly marked transactions. Our definition of SSS formally characterizes the permissible behaviors of an STM system without recourse to locks. If all transactions are marked, then SSS, singlelock semantics, and database-style strict serializability are equivalent. We evaluate several SSS implementations in the context of a TL2-like STM system. We also evaluate a weaker model, selective flow serializability (SFS), which is similar in motivation to the asymmetric lock atomicity of Menon et al. We argue that ordering-based semantics are conceptually preferable to lock-based semantics, and just as efficient.
Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism
"... Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still an open problem. This paper presents Respec, a ne ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
Deterministic replay systems record and reproduce the execution of a hardware or software system. While it is well known how to replay uniprocessor systems, replaying shared memory multiprocessor systems at low overhead on commodity hardware is still an open problem. This paper presents Respec, a new way to support deterministic replay of shared memory multithreaded programs on commodity multiprocessor hardware. Respec targets online replay in which the recorded and replayed processes execute concurrently. Respec uses two strategies to reduce overhead while still ensuring correctness: speculative logging and externally deterministic replay. Speculative logging optimistically logs less information about shared memory dependencies than is needed to guarantee deterministic replay, then recovers and retries if the replayed process diverges from the recorded process. Externally deterministic replay relaxes the degree to which the two executions must match by requiring only their system output and final program states match. We show that the combination of these two techniques results in low recording and replay overhead for the common case of datarace-free execution intervals and still ensures correct replay for execution intervals that have data races. We modified the Linux kernel to implement our techniques. Our software system adds on average about 18 % overhead to the execution time for recording and replaying programs with two threads and 55 % overhead for programs with four threads.
Detecting and Tolerating Asymmetric races
"... Because data races represent a hard-to-manage class of errors in concurrent programs, numerous approaches to detect them have been proposed and evaluated. We specifically consider asymmetric races, a subclass of all race conditions, where a programmer’s thread correctly acquires and releases a lock ..."
Abstract
-
Cited by 10 (0 self)
- Add to MetaCart
Because data races represent a hard-to-manage class of errors in concurrent programs, numerous approaches to detect them have been proposed and evaluated. We specifically consider asymmetric races, a subclass of all race conditions, where a programmer’s thread correctly acquires and releases a lock for a given variable, while another thread causes a race by improperly accessing this variable. We introduce ToleRace, a runtime system that allows programs to either tolerate or detect asymmetric races based on local replication of shared state. ToleRace provides an approximation of atomicity in critical sections by creating local copies of shared variables when a critical section is entered and propagating the appropriate copy when the critical section is exited. We characterize the possible interleavings that can cause races and precisely describe the effect of ToleRace in each case. We study the theoretical aspects of an oracle that knows exactly what type of interleaving has occurred. Then, we present a software implementation of ToleRace on top of a dynamic instrumentation tool. We evaluate our implementation on multithreaded applications from the SPLASH2 and PARSEC suites and show that its overhead is acceptable, i.e., a factor of two on average.
Mathematizing C++ Concurrency
"... Shared-memory concurrency in C and C++ is pervasive in systems programming, but has long been poorly defined. This motivated an ongoing shared effort by the standards committees to specify concurrent behaviour in the next versions of both languages. They aim to provide strong guarantees for race-fre ..."
Abstract
-
Cited by 10 (6 self)
- Add to MetaCart
Shared-memory concurrency in C and C++ is pervasive in systems programming, but has long been poorly defined. This motivated an ongoing shared effort by the standards committees to specify concurrent behaviour in the next versions of both languages. They aim to provide strong guarantees for race-free programs, together with new (but subtle) relaxed-memory atomic primitives for highperformance concurrent code. However, the current draft standards, while the result of careful deliberation, are not yet clear and rigorous definitions, and harbour substantial problems in their details. In this paper we establish a mathematical (yet readable) semantics for C++ concurrency. We aim to capture the intent of the current (‘Final Committee’) Draft as closely as possible, but discuss changes that fix many of its problems. We prove that a proposed x86 implementation of the concurrency primitives is correct with respect to the x86-TSO model, and describe our CPPMEM tool for exploring the semantics of examples, using code generated from our Isabelle/HOL definitions. Having already motivated changes to the draft standard, this work will aid discussion of any further changes, provide a correctness condition for compilers, and give a much-needed basis for analysis and verification of concurrent C and C++ programs.

