Results 1 - 10
of
21
Atom-Aid: Detecting and Surviving Atomicity Violations
"... Writing shared-memory parallel programs is error-prone. Among the concurrency errors that programmers often face are atomicity violations, which are especially challenging. They happen when programmers make incorrect assumptions about atomicity and fail to enclose memory accesses that should occur a ..."
Abstract
-
Cited by 28 (6 self)
- Add to MetaCart
Writing shared-memory parallel programs is error-prone. Among the concurrency errors that programmers often face are atomicity violations, which are especially challenging. They happen when programmers make incorrect assumptions about atomicity and fail to enclose memory accesses that should occur atomically inside the same critical section. If these accesses happen to be interleaved with conflicting accesses from different threads, the program might behave incorrectly. Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce memory ordering at a coarse grain. This provides what we call implicit atomicity, as the atomic blocks are not derived from explicit program annotations. In this paper, we make the fundamental observation that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. We then propose Atom-Aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves. Atom-Aid is also able to report where atomicity violations might exist in the code, providing resilience and debuggability. We evaluate Atom-Aid using buggy code from applications including Apache, MySQL, and XMMS, showing that Atom-Aid virtually eliminates the manifestation of atomicity violations.
Tokentm: Efficient execution of large transactions with hardware transactional memory
- In Proceedings of the 35th Annual International Symposium on Computer Architecture
, 2008
"... Current hardware transactional memory systems seek to simplify parallel programming, but assume that large transactions are rare, so it is acceptable to penalize their performance or concurrency. However, future programmers may wish to use large transactions more often in order to integrate with hig ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Current hardware transactional memory systems seek to simplify parallel programming, but assume that large transactions are rare, so it is acceptable to penalize their performance or concurrency. However, future programmers may wish to use large transactions more often in order to integrate with higher-level programming models (e.g., database transactions) or perform selected I/O operations. To prevent the “small transactions are common” assumption from becoming self-fulfilling, this paper contributes TokenTM—an unbounded HTM that uses the abstraction of tokens to precisely track conflicts on an unbounded number of memory blocks. TokenTM implements tokens with new mechanisms, including
Flexible Decoupled Transactional Memory Support
- IN ISCA
, 2007
"... A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summari ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the threads with which conflicts have occurred; Programmable Data Isolation, which maintains speculative updates in the local cache and employs a thread-private buffer (in virtual memory) in the rare event of overflow; and Alert-On-Update, which selectively notifies threads about coherence events. All mechanisms are softwareaccessible, to enable virtualization and to support transactions of arbitrary length. FlexTM allows software to determine when to manage conflicts (either eagerly or lazily), and to employ a variety of conflict management and commit protocols. We describe an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits. In experiments with a prototype on Simics/GEMS, FlexTM exhibits ∼5 × speedup over high-quality software TM, with no loss in policy flexibility. Its distributed commit protocol is also more efficient than a central hardware manager. Our results highlight the importance of flexibility in determining when to manage conflicts: lazy maximizes concurrency and helps to ensure forward progress while eager provides better overall utilization in a multi-programmed system.
Notary: Hardware Techniques to Enhance Signatures
- In Proc. of the 41st Annual IEEE/ACM International Symp. on Microarchitecture
, 2008
"... Hardware signatures have been recently proposed as an efficient mechanism to detect conflicts amongst concurrently running transactions in transactional memory systems (e.g., Bulk, LogTM-SE, and SigTM). Signatures use fixed hardware to represent an unbounded number of addresses, but may lead to fals ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
Hardware signatures have been recently proposed as an efficient mechanism to detect conflicts amongst concurrently running transactions in transactional memory systems (e.g., Bulk, LogTM-SE, and SigTM). Signatures use fixed hardware to represent an unbounded number of addresses, but may lead to false conflicts (detecting a conflict when none exists). Previous work recommends that signatures be implemented with parallel Bloom filters with two or four hash functions (e.g., H3). Two problems exist with current signature designs. First, H3 implementations use many XOR gates. This increases hardware area and power overheads. Second, signature false positives can result from conflicts with signature bits set by private memory addresses that do not require isolation. This paper develops Notary, a coupling of two signature enhancements to ameliorate these problems. First, we use address entropy analysis to develop Page-Block-XOR (PBX) hashing and show it performs similar to H3 at lower hardware cost. Second, we introduce a privatization interface that explicitly allows the programmer to declare shared and private heap memory allocation. Privatization reduces false conflicts arising from private memory accesses and can lead to a reduction in the signature size used. Results from custom transistor-level layouts of H3 and PBX, along with full-system simulation of a 16-core chip-multiprocessor implementing LogTM-SE, show (a) PBX hashing performs similar to H3 hashing while requiring up to 24 % less area and 4.7 % less power overhead and (b) privatization can improve execution time by up to 86 % (by reducing false conflicts by up to 96%). 1.
A Tagless Coherence Directory
"... A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur signifi ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs. The Tagless Coherence Directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information. Conceptually, TL consists of a grid of small Bloom filters. The grid has one column per core and one row per cache set. TL uses 48 % less area, 57 % less leakage power, and 44 % less dynamic energy than a conventional coherence directory for a 16-core CMP with 1MB private L2 caches. Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance, and incurs only a 2.5 % increase in bandwidth utilization. Analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
iCFP: Tolerating All-Level Cache Misses in In-Order Processors
"... Growing concerns about power have revived interest in in-order pipelines. In-order pipelines sacrifice single-thread performance. Specifically, they do not allow execution to flow freely around data cache misses. As a result, they have difficulties overlapping independent misses with one another. Pr ..."
Abstract
-
Cited by 9 (5 self)
- Add to MetaCart
Growing concerns about power have revived interest in in-order pipelines. In-order pipelines sacrifice single-thread performance. Specifically, they do not allow execution to flow freely around data cache misses. As a result, they have difficulties overlapping independent misses with one another. Previously proposed techniques like Runahead execution and Multipass pipelining have attacked this problem. In this paper, we go a step further and introduce iCFP (in-order Continual Flow Pipeline), an adaptation of the CFP concept to an in-order processor. When iCFP encounters a primary data cache or L2 miss, it checkpoints the register file and transitions into an “advance” execution mode. Miss-independent instructions execute as usual and even update register state. Missdependent instructions are diverted into a slice buffer, un-blocking the pipeline latches. When the miss returns, iCFP “rallies ” and executes the contents of the slice buffer, merging miss-dependent state with missindependent state along the way. An enhanced register dependence tracking scheme and a novel store buffer design facilitate the merging process. Cycle-level simulations show that iCFP out-performs Runahead, Multipass, and SLTP, another non-blocking in-order pipeline design. 1.
SigRace: Signaturebased data race detection
- In ISCA
, 2009
"... Detecting data races in parallel programs is important for both software development and production-run diagnosis. Recently, there have been several proposals for hardware-assisted data race detection. Such proposals typically modify the L1 cache and cache coherence protocol messages, and largely lo ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
Detecting data races in parallel programs is important for both software development and production-run diagnosis. Recently, there have been several proposals for hardware-assisted data race detection. Such proposals typically modify the L1 cache and cache coherence protocol messages, and largely lose their capability when lines get displaced or invalidated from the cache. To avoid these shortcomings, this paper proposes a novel approach to hardwareassisted data race detection. The approach, called SigRace, relies on hardware address signatures. As a processor runs, the addresses of the data that it accesses are automatically encoded in signatures. At certain times, the signatures are automatically passed to a hardware module that intersects them with those of other processors. If the intersection is not null, a data race may have occurred. This paper presents the architecture of SigRace, an implementation, and its software interface. With SigRace, caches and coherence protocol messages are unmodi�ed. Moreover, cache lines can be displaced and invalidated with no effect. Our experiments show that SigRace is signi�cantly more effective than a state-ofthe-art conventional hardware-assisted race detector. SigRace �nds on average 29 % more static races and 107 % more dynamic races. Moreover, if we inject data races, SigRace �nds 150 % more static races than the conventional scheme.
Application-specific signatures for transactional memory in soft processors,”in ARC
, 2010
"... Abstract. As reconfigurable computing hardware and in particular FPGA-based systems-on-chip comprise an increasing number of processor and accelerator cores, supporting sharing and synchronization in a way that is scalable and easy to program becomes a challenge. Transactional memory (TM) is a poten ..."
Abstract
-
Cited by 5 (5 self)
- Add to MetaCart
Abstract. As reconfigurable computing hardware and in particular FPGA-based systems-on-chip comprise an increasing number of processor and accelerator cores, supporting sharing and synchronization in a way that is scalable and easy to program becomes a challenge. Transactional memory (TM) is a potential solution to this problem, and an FPGA-based system provides the opportunity to support TM in hardware (HTM). Although there are many proposed approaches to HTM support for ASICs, these do not necessarily map well to FPGAs. In particular in this work we demonstrate that while signature-based conflict detection schemes (essentially bit vectors) should intuitively be a good match to the bit-parallelism of FPGAs, previous schemes result in either unacceptable multicycle stalls, operating frequencies, or falseconflict rates. Capitalizing on the reconfigurable nature of FPGA-based systems, we propose an application-specific signature mechanism for HTM conflict detection. Using both real and projected FPGA-based soft multiprocessor systems that support HTM and implement threaded, shared-memory network packet processing applications, relative to signatures with bit selection we find that our application-specific approach (i) maintains a reasonable operating frequency of 125MHz, (ii) has an area overhead of only 5%, and (iii) achieves a 9 % to 71 % increase in packet throughput due to reduced false conflicts. 1
OS Support for Virtualizing Hardware Transactional Memory
, 2008
"... Transactional memory promises to simplify multithreaded programming. Hardware TM (HTM) implementations promise better performance by augmenting processors with transactional state. However, HTMs interact poorly with the operating system or virtual machine monitor. For example, they often do not tole ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Transactional memory promises to simplify multithreaded programming. Hardware TM (HTM) implementations promise better performance by augmenting processors with transactional state. However, HTMs interact poorly with the operating system or virtual machine monitor. For example, they often do not tolerate OS actions that virtualize processors and memory, such as context switching and paging. Without support for these actions, an HTM may not execute programs correctly or guarantee forward progress. We investigate virtualizing transactional memory in the context of LogTM-SE. First, we describe an implementation of a kernel module in OpenSolaris that implements transactional virtualization and requires only 1120 lines of code. Second, we find that LogTM-SE interacts poorly with virtual machine monitors due to a reliance on physical addresses. We propose an extension to LogTM-SE, called LogTM-VSE, that addresses these problems and improves context-switching performance. Third, through application tracing on real hardware and full system simulation, we show virtualizing transactions can be necessary for system stability and to support code that voluntarily context switches. However, we find that aborting a transaction is generally faster than virtualizing it, and hence preferable in some cases.
SIGNATURES IN TRANSACTIONAL MEMORY SYSTEMS
"... Transactional memory (TM) is an emerging parallel programming paradigm. It seeks to ease common problems with lock-based parallel programming (e.g., deadlock, composability) through the use of novel language-level constructs. Researchers have proposed many TM systems to implement TM programming sema ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Transactional memory (TM) is an emerging parallel programming paradigm. It seeks to ease common problems with lock-based parallel programming (e.g., deadlock, composability) through the use of novel language-level constructs. Researchers have proposed many TM systems to implement TM programming semantics, including all-hardware, all-software, or a hybrid of hardware and software support. We focus on hardware TM systems (HTMs) due to their performance advantages over the other implementations. Early HTM proposals restrict the type of transactions that can be run — either its size or its time — and this may unnecessarily burden TM programmers. Such burdens may slow or even squelch the adoption of TM as a wide-spread parallel programming model. Subsequent HTM proposals support virtualizing transactions (in both space and time), but may incur high overheads when handling common virtualization events (e.g., cache replacements of transactional data), or complicate existing hardware (HW) designs (e.g., L1 cache). An ideal HTM design should have the following characteristics: (1) Allow the execution of transactions of arbitrary size and time and (2) Minimize changes to existing processor cores and systems. This dissertation makes several contributions in the design and analysis of a HTM with these properties.

