Results 1 - 10
of
53
Logtm: Log-based transactional memory
- in HPCA
, 2006
"... Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (r ..."
Abstract
-
Cited by 173 (8 self)
- Add to MetaCart
Transactional memory (TM) simplifies parallel programming by guaranteeing that transactions appear to execute atomically and in isolation. Implementing these properties includes providing data version management for the simultaneous storage of both new (visible if the transaction commits) and old (retained if the transaction aborts) values. Most (hardware) TM systems leave old values “in place” (the target memory address) and buffer new values elsewhere until commit. This makes aborts fast, but penalizes (the much more frequent) commits. In this paper, we present a new implementation of transactional memory, Log-based Transactional Memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place. LogTM makes two additional contributions. First, LogTM extends a MOESI directory protocol to enable both fast conflict detection on evicted blocks and fast commit (using lazy cleanup). Second, LogTM handles aborts in (library) software with little performance penalty. Evaluations running micro- and SPLASH-2 benchmarks on a 32way multiprocessor support our decision to optimize for commit by showing that only 1-2 % of transactions abort. 1.
Making the fast case common and the uncommon case simple in unbounded transactional memory
- In ISCA
, 2007
"... Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory ..."
Abstract
-
Cited by 35 (4 self)
- Add to MetaCart
Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a generalpurpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex. This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible. Second, we propose ONETM to simplify the implementation of unbounded transactional memory by bounding the concurrency
Performance pathologies in Hardware Transactional Memory
- In: Proceedings of the 34rd Annual International Symposium on Computer Architecture. International Symposium on Computer Architecture
, 2007
"... pa•thol•o•gy any deviation from a healthy, normal, or efficient condition. Hardware Transactional Memory (HTM) systems reflect choices from three key design dimensions: conflict detection, version management, and conflict resolution. Previously proposed HTMs represent three points in this design spa ..."
Abstract
-
Cited by 29 (3 self)
- Add to MetaCart
pa•thol•o•gy any deviation from a healthy, normal, or efficient condition. Hardware Transactional Memory (HTM) systems reflect choices from three key design dimensions: conflict detection, version management, and conflict resolution. Previously proposed HTMs represent three points in this design space: lazy conflict detection, lazy version management, committer wins (LL); eager conflict detection, lazy version management, requester wins (EL); and eager conflict detection, eager version management, and requester stalls with conservative deadlock avoidance (EE). To isolate the effects of these high-level design decisions, we develop a common framework that abstracts away differences in cache write policies, interconnects, and ISA to compare these three design points. Not surprisingly, the relative performance of these systems depends on the workload. Under light transactional loads they perform similarly, but under heavy loads they differ by up to 80%. None of the systems performs best on all of our benchmarks. We identify seven performance pathologies—interactions between workload and system that degrade performance—as the root cause of many performance differences: FRIENDLYFIRE, STARVINGWRITER, SERIALIZEDCOMMIT, FUTILESTALL, STARVIN-GELDER, RESTARTCONVOY, and DUELINGUPGRADES. We discuss when and on which systems these pathologies can occur and show that they actually manifest within TM workloads. The insight provided by these pathologies motivated four enhanced systems that often significantly reduce transactional memory overhead. Importantly, by avoiding transaction pathologies, each enhanced system performs well across our suite of benchmarks. Categories and Subject Descriptors C.4 [Performance of Systems]: performance attributes, design
Implementing signatures for transactional memory
- 40th Intl. Symp. on Microarchitecture
, 2007
"... Transactional Memory (TM) systems must track the read and write sets—items read and written during a transaction—to detect conflicts among concurrent transactions. Several TMs use signatures, which summarize unbounded read/write sets in bounded hardware at a performance cost of false positives (conf ..."
Abstract
-
Cited by 23 (4 self)
- Add to MetaCart
Transactional Memory (TM) systems must track the read and write sets—items read and written during a transaction—to detect conflicts among concurrent transactions. Several TMs use signatures, which summarize unbounded read/write sets in bounded hardware at a performance cost of false positives (conflicts detected when none exists). This paper examines different organizations to achieve hardware-efficient and accurate TM signatures. First, we find that implementing each signature with a single k-hashfunction Bloom filter (True Bloom signature) is inefficient, as it requires multi-ported SRAMs. Instead, we advocate using k single-hash-function Bloom filters in parallel (Parallel Bloom signature), using area-efficient single-ported SRAMs. Our formal analysis shows that both organizations perform equally well in theory and our simulationbased evaluation shows this to hold approximately in practice. We also show that by choosing high-quality hash functions we can achieve signature designs noticeably more accurate than the previously proposed implementations. Finally, we adapt Pagh and Rodler’s cuckoo hashing to implement Cuckoo-Bloom signatures. While this representation does not support set intersection, it mitigates false positives for the common case of small read/write sets and performs like a Bloom filter for large sets. 1.
Transactional Memory with Strong Atomicity Using Off-the-Shelf Memory Protection Hardware
"... This paper introduces a new way to provide strong atomicity in an implementation of transactional memory. Strong atomicity lets us offer clear semantics to programs, even if they access the same locations inside and outside transactions. It also avoids differences between hardware-implemented transa ..."
Abstract
-
Cited by 19 (6 self)
- Add to MetaCart
This paper introduces a new way to provide strong atomicity in an implementation of transactional memory. Strong atomicity lets us offer clear semantics to programs, even if they access the same locations inside and outside transactions. It also avoids differences between hardware-implemented transactions and softwareimplemented ones. Our approach is to use off-the-shelf page-level memory protection hardware to detect conflicts between normal memory accesses and transactional ones. This page-level mechanism ensures correctness but gives poor performance because of the costs of manipulating memory protection settings and receiving notifications of access violations. However, in practice, we show how a combination of careful object placement and dynamic code update allows us to eliminate almost all of the protection changes. Existing implementations of strong atomicity in software rely on detecting conflicts by conservatively treating some non-transactional accesses as short transactions. In contrast, our page-level mechanism lets us be less conservative about how non-transactional accesses are treated; we avoid changes to non-transactional code until a possible conflict is detected dynamically, and we can respond to phase changes where a given instruction sometimes generates conflicts and sometimes does not. We evaluate our implementation with C # versions of many of the STAMP benchmarks, and show how it performs within 25 % of an implementation with weak atomicity on all the benchmarks we have studied. It avoids pathological cases in which other implementations of strong atomicity perform poorly.
Anatomy of a scalable software transactional memory
- 2009, 4th ACM SIGPLAN Workshop on Transactional Computing (TRANSACT’09
, 2009
"... Existing software transactional memory (STM) implementations often exhibit poor scalability, usually because of nonscalable mechanisms for read sharing, transactional consistency, and privatization; some STMs also have nonscalable centralized commit mechanisms. We describe novel techniques to elimin ..."
Abstract
-
Cited by 18 (7 self)
- Add to MetaCart
Existing software transactional memory (STM) implementations often exhibit poor scalability, usually because of nonscalable mechanisms for read sharing, transactional consistency, and privatization; some STMs also have nonscalable centralized commit mechanisms. We describe novel techniques to eliminate bottlenecks from all of these mechanisms, and present SkySTM, which employs these techniques. SkySTM is the first STM that supports privatization and scales on modern multicore multiprocessors with hundreds of hardware threads on multiple chips. A central theme in this work is avoiding frequent updates to centralized metadata, especially for multi-chip systems, in which the cost of accessing centralized metadata increases dramatically. A key mechanism we use to do so is a scalable nonzero indicator (SNZI), which was designed for this purpose. A secondary contribution of the paper is a new and simplified SNZI algorithm. Our scalable privatization mechanism imposes only about 4% overhead in low-contention experiments; when contention is higher, the overhead still reaches only 35 % with over 250 threads. In contrast, prior approaches have been reported as imposing over 100 % overhead in some cases, even with only 8 threads. 1.
NZTM: Nonblocking zero-indirection transactional memory
- In Workshop on Transactional Computing (TRANSACT
, 2007
"... This workshop paper reports work in progress on NZTM, a nonblocking, zero-indirection object-based hybrid transactional memory system. NZTM can execute transactions using best-effort hardware transactional memory if it is available and effective, but can execute transactions using NZSTM, our compati ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
This workshop paper reports work in progress on NZTM, a nonblocking, zero-indirection object-based hybrid transactional memory system. NZTM can execute transactions using best-effort hardware transactional memory if it is available and effective, but can execute transactions using NZSTM, our compatible software transactional memory system otherwise. Previous nonblocking software and hybrid transactional memory implementations pay a significant performance cost in the common case, as compared to simpler, blocking ones. However, blocking is problematic in some cases and unacceptable in others. NZTM is nonblocking, but shares the advantages of recent blocking STM proposals in the common case: it stores object data “in place”, thus avoiding the costly levels of indirection in previous nonblocking STMs, and improves cache performance by collocating object metadata with the data it controls. 1.
Tokentm: Efficient execution of large transactions with hardware transactional memory
- In Proceedings of the 35th Annual International Symposium on Computer Architecture
, 2008
"... Current hardware transactional memory systems seek to simplify parallel programming, but assume that large transactions are rare, so it is acceptable to penalize their performance or concurrency. However, future programmers may wish to use large transactions more often in order to integrate with hig ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Current hardware transactional memory systems seek to simplify parallel programming, but assume that large transactions are rare, so it is acceptable to penalize their performance or concurrency. However, future programmers may wish to use large transactions more often in order to integrate with higher-level programming models (e.g., database transactions) or perform selected I/O operations. To prevent the “small transactions are common” assumption from becoming self-fulfilling, this paper contributes TokenTM—an unbounded HTM that uses the abstraction of tokens to precisely track conflicts on an unbounded number of memory blocks. TokenTM implements tokens with new mechanisms, including
Pathological Interaction of Locks with Transactional Memory
- In Proceedings of the Third ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing
, 2008
"... Transactional memory (TM) promises to simplify multithreaded programming. Transactions provide mutual exclusion without the possibility of deadlock and the need to assign locks to data structures. To date, most investigations of transactional memory have looked at purely transactional systems that d ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
Transactional memory (TM) promises to simplify multithreaded programming. Transactions provide mutual exclusion without the possibility of deadlock and the need to assign locks to data structures. To date, most investigations of transactional memory have looked at purely transactional systems that do not interact with legacy code using locks. Unfortunately, the reality of software engineering is that such interaction is likely. We investigate the interaction of transactional memory implementations and lock-based code. We identify and discuss five pathologies that arise with different systems when a lock is accessed both within and outside a transaction: Blocking, Deadlock, Livelock, Early Release, and Invisible Locking. To address these pathologies we designed and implemented transaction-safe locks (TxLocks) by modifying the existing lock implementation of the OpenSolaris C Library and extending the conflict resolution policy of a hardware transactional memory system. 1.
Flexible Decoupled Transactional Memory Support
- IN ISCA
, 2007
"... A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summari ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the threads with which conflicts have occurred; Programmable Data Isolation, which maintains speculative updates in the local cache and employs a thread-private buffer (in virtual memory) in the rare event of overflow; and Alert-On-Update, which selectively notifies threads about coherence events. All mechanisms are softwareaccessible, to enable virtualization and to support transactions of arbitrary length. FlexTM allows software to determine when to manage conflicts (either eagerly or lazily), and to employ a variety of conflict management and commit protocols. We describe an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits. In experiments with a prototype on Simics/GEMS, FlexTM exhibits ∼5 × speedup over high-quality software TM, with no loss in policy flexibility. Its distributed commit protocol is also more efficient than a central hardware manager. Our results highlight the importance of flexibility in determining when to manage conflicts: lazy maximizes concurrency and helps to ensure forward progress while eager provides better overall utilization in a multi-programmed system.

