STAMP: Stanford Transactional Applications for Multi-Processing
Cited by 66 (6 self)
Transactional Memory (TM) is emerging as a promising technology to simplify parallel programming. While several TM systems have been proposed in the research literature, we are still missing the tools and workloads necessary to analyze and compare the proposals. Most TM systems have been evaluated using microbenchmarks, which may not be representative of any real-world behavior, or individual applications, which do not stress a wide range of execution scenarios. We introduce the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems. STAMP includes eight applications and thirty variants of input parameters and data sets in order to represent several application domains and cover a wide range of transactional execution cases (frequent or rare use of transactions, large or small transactions, high or low contention, etc.). Moreover, STAMP is portable across many types of TM systems, including hardware, software, and hybrid systems. In this paper, we provide descriptions and a detailed characterization of the applications in STAMP. We also use the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.
Is Transactional Programming Actually Easier?
Cited by 13 (1 self)
Chip multi-processors (CMPs) have become ubiquitous, while tools that ease concurrent programming have not. The promise of increased performance for all applications through ever more parallel hardware requires good tools for concurrent programming, especially for average programmers. Transactional memory (TM) has enjoyed recent interest as a tool that can help programmers write concurrent programs. The TM research community claims that programming with transactional memory is easier than alternatives (like locks), but evidence is scant. In this paper, we describe a user study in which 147 undergraduate students in an operating systems course implemented the same programs using coarse- and fine-grained locks, monitors, and transactions. We surveyed the students after the assignment, and examined their code to determine the types and frequency of programming errors for each synchronization technique. Inexperienced programmers found baroque syntax a barrier to entry for transactional programming. On average, subjective evaluation showed that students found transactions harder to use than coarse-grained locks, but slightly easier to use than fine-grained locks. Detailed examination of synchronization errors in the students’ code tells a rather different story. Overwhelmingly, the number and variety of programming errors the students made were much lower for transactions than for locks. On a similar programming problem, over 70% of students made errors with fine-grained locking, while fewer than 10% made errors with transactions.
Why STM can be more than a Research Toy
Cited by 8 (6 self)
Software Transactional Memory (STM) promises to simplify concurrent programming without requiring specific hardware support. Yet, STM’s credibility depends on the extent to which it can leverage multicores and outperform sequential code. A recent CACM paper [3] questioned this ability and suggested the confinement of STM to a research toy. We revisit these conclusions through the most extensive comparison to date of STM performance to sequential code. We evaluate a state-of-the-art STM system, SwissTM, on a wide range of benchmarks and two different multicore systems. We dissect the inherent costs of synchronization as well as the overheads of compiler instrumentation and transparent privatization. Our results show that an STM with manually instrumented benchmarks and explicit privatization outperforms sequential code by up to 29 times on SPARC with 64 concurrent threads and by up to 9 times on x86 with 16 concurrent threads. Indeed, the overheads of compiler instrumentation and transparent privatization are substantial, yet they do not prevent STM from generally outperforming sequential code.
Conflict Exceptions: Simplifying Concurrent Language Semantics with Precise Hardware Exceptions for Data-Races
2010
Cited by 8 (5 self)
We argue in this paper that concurrency errors should be treated as exceptions, i.e., have fail-stop behavior and precise semantics. We propose an exception model based on conflict of synchronization-free regions, which precisely detects a broad class of data-races. We show that our exceptions provide enough guarantees to simplify high-level programming language semantics and debugging, but are significantly cheaper to enforce than traditional data-race detection. To make the performance cost of enforcement negligible, we propose architecture support for accurately detecting and precisely delivering these exceptions. We evaluate the suitability of our model as well as the behavior of our architectural mechanisms using the PARSEC benchmark suite and commercial applications. Our results show that the exception model largely reflects how programmers are already writing code and that the main memory, traffic and performance overheads of the enforcement mechanisms we propose are very low.
Ad Hoc Synchronization Considered Harmful
Cited by 7 (0 self)
Many synchronizations in existing multi-threaded programs are implemented in an ad hoc way. The first part of this paper presents a comprehensive characteristic study of ad hoc synchronizations in concurrent programs. By studying 229 ad hoc synchronizations in 12 programs of various types (server, desktop and scientific), including Apache, MySQL, Mozilla, etc., we find several interesting and perhaps alarming characteristics: (1) Every studied application uses ad hoc synchronizations. Specifically, there are 6–83 ad hoc synchronizations in each program. (2) Ad hoc synchronizations are error-prone. Significant percentages (22–67%) of these ad hoc synchronizations introduced bugs or severe performance issues. (3) Ad hoc synchronization implementations are diverse and many of them cannot be easily recognized as synchronizations, i.e., they have poor readability and maintainability. The second part of our work builds a tool called SyncFinder to automatically identify and annotate ad hoc synchronizations in concurrent programs written in C/C++ to assist programmers in porting their code to better structured implementations, while also enabling other tools to recognize them as synchronizations. Our evaluation using 25 concurrent programs shows that, on average, SyncFinder can automatically identify 96% of ad hoc synchronizations with 6% false positives. We also build two use cases to leverage SyncFinder’s auto-annotation. The first one uses annotation to detect 5 deadlocks (including 2 new ones) and 16 potential issues missed by previous analysis tools in Apache, MySQL and Mozilla. The second use case reduces the false positive rates of Valgrind’s data race checker by 43–86%.
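To make the term concrete, the following is a minimal sketch of the contrast the abstract draws. It is written in Python for brevity, whereas the studied programs are C/C++; the function names `run_ad_hoc` and `run_structured` are hypothetical and not taken from the paper or from SyncFinder. The first function waits by spinning on an ordinary shared flag, which nothing in the code marks as synchronization; the second uses a library primitive that readers and tools can recognize.

```python
import threading
import time


def run_ad_hoc():
    """Ad hoc synchronization: the consumer spins on a plain shared flag."""
    state = {"ready": False, "value": None}
    out = []

    def producer():
        state["value"] = 42
        state["ready"] = True          # an ordinary write doubles as the signal

    def consumer():
        while not state["ready"]:      # busy-wait loop on an ordinary variable
            time.sleep(0)              # yield so the producer thread can run
        out.append(state["value"])

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[0]


def run_structured():
    """Structured synchronization: the same hand-off through threading.Event."""
    ready = threading.Event()
    state = {"value": None}
    out = []

    def producer():
        state["value"] = 42
        ready.set()                    # explicit, recognizable signal

    def consumer():
        ready.wait()                   # explicit, recognizable wait
        out.append(state["value"])

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[0]
```

Both functions hand over the same value; the difference is only in how visible the waiting is. In C or C++ the spinning version would additionally raise memory-visibility and compiler-reordering questions that this Python sketch does not model.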
Mapping Out a Path from Hardware Transactional Memory to Speculative Multithreading
Cited by 3 (2 self)
This research demonstrates that coming support for hardware transactional memory can be leveraged to significantly reduce the cost of implementing true speculative multithreading. In particular, it explores the path from eager conflict detection HTM to full support of efficient speculative multithreading, focusing on the case where frequent memory dependencies exist between speculative threads. The result is a unified memory architecture capable of effective support for transactional parallel workloads and efficient speculative multithreading.
Transactional Value Prediction
In Proceedings of the Fourth ACM SIGPLAN Workshop on Transactional Computing, 2009
Cited by 3 (0 self)
This workshop paper explores some ideas for value prediction and data speculation in hardware transactional memory. We present these ideas in the context of false sharing, at the cache line level, within hardware transactions. We distinguish coherence conflicts, which may result from false sharing, from true data conflicts, which we call transactional conflicts. We build on some of the ideas of Huh et al. [1] to speculate in the presence of coherence conflicts, assuming no true data conflicts. We then validate data before committing. This dual speculation avoids aborting and restarting many transactions that conflict through false sharing. We show how these ideas, which we call Transactional Value Prediction, can be applied to a conventional best-effort hardware transactional memory. Our preliminary model, β-TVP, does not alter the underlying cache coherence protocol beyond what is already present in hardware transactional memory. β-TVP requires only minor, processor-local modifications to a conventional best-effort hardware transactional memory. Simple benchmarks show that β-TVP can dramatically increase throughput in the presence of false sharing, while incurring little overhead in its absence.
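The distinction between coherence conflicts and transactional conflicts can be sketched as follows. This is an illustration of the definitions only, not of β-TVP itself: the 64-byte line size, the access tuple layout, and the function names are assumptions introduced here.

```python
LINE_SIZE = 64  # bytes per cache line; an assumed, typical value


def lines_touched(addr, size):
    """Return the set of cache-line indices covered by an access."""
    first = addr // LINE_SIZE
    last = (addr + size - 1) // LINE_SIZE
    return set(range(first, last + 1))


def classify_conflict(access_a, access_b):
    """Classify two accesses from different transactions.

    Each access is a tuple (address, size_in_bytes, is_write).
    Returns "none", "coherence" (same line, disjoint bytes: false sharing),
    or "transactional" (overlapping bytes: a true data conflict).
    """
    addr_a, size_a, write_a = access_a
    addr_b, size_b, write_b = access_b

    if not (write_a or write_b):
        return "none"                      # two reads never conflict

    if not (lines_touched(addr_a, size_a) & lines_touched(addr_b, size_b)):
        return "none"                      # different cache lines

    bytes_overlap = addr_a < addr_b + size_b and addr_b < addr_a + size_a
    return "transactional" if bytes_overlap else "coherence"
```

For example, a write to bytes 0..7 and a read of bytes 8..15 share a line but no data, so the sketch reports a coherence conflict; these are the cases in which the abstract proposes to keep speculating and validate before commit.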
Implementation Tradeoffs in the Design of Flexible Transactional Memory Support
Cited by 2 (0 self)
We present FlexTM (FLEXible Transactional Memory), a high performance TM framework that allows software to determine when (eagerly, lazily, or in a mixed fashion) and how to manage conflicts, while employing hardware to manage transactional state and to track conflicts. FlexTM coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the processors with which conflicts have occurred; Programmable Data Isolation, which buffers speculative updates in the local cache and uses an overflow table to handle unbounded updates; and Alert-On-Update, which notifies a thread immediately when a specified location is written by another processor. The CSTs enable an STM-inspired commit protocol that manages conflicts in a decentralized manner (no global arbitration) and allows parallel commits. We explore the implementation tradeoffs associated with FlexTM’s versioning and conflict detection mechanisms. Our results demonstrate that FlexTM exhibits ∼5× speedup over high-quality software TMs, and ∼1.8× speedup over hybrid TMs (those with software always in the loop), with no loss in policy flexibility. We find that the distributed commit protocol improves performance by 2–14% over an aggressive centralized-arbiter mechanism (similar to BulkTM [7]) that also allows parallel commits. Finally, we compare the use of an aggressive hardware controller (as used in the base FlexTM design) to manage and to access any speculative transaction state overflowed from the cache, to a hardware-software approach dubbed FlexTM-S (FlexTM-Streamlined), where software manages the overflow region but uses a metadata cache to accelerate speculative data replacements and their subsequent accesses. We demonstrate that FlexTM-S’s performance is within 10% of FlexTM’s despite its substantially simpler virtualization mechanism.
Hardware Acceleration of Transactional Memory on Commodity Systems
Cited by 2 (0 self)
The adoption of transactional memory is hindered by the high overhead of software transactional memory and the intrusive design changes required by previously proposed TM hardware. We propose that hardware to accelerate software transactional memory (STM) can reside outside an unmodified commodity processor core, thereby substantially reducing implementation costs. This paper introduces Transactional Memory Acceleration using Commodity Cores (TMACC), a hardware-accelerated TM system that does not modify the processor, caches, or coherence protocol. We present a complete hardware implementation of TMACC using a rapid prototyping platform. Using this hardware, we implement two unique conflict detection schemes that are accelerated using Bloom filters on an FPGA. These schemes employ novel techniques for tolerating the latency of fine-grained asynchronous communication with an out-of-core accelerator. We then conduct experiments to explore the feasibility of accelerating TM without modifying existing system hardware. We show that, for all but short transactions, it is not necessary to modify the processor to obtain substantial improvement in TM performance. In these cases, TMACC outperforms an STM by an average of 69% in applications using moderate-length transactions, showing maximum speedup within 8% of an upper bound on TM acceleration. Overall, we demonstrate that hardware can substantially accelerate the performance of an STM on unmodified commodity processors.
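The general idea of using a Bloom filter for conflict detection can be sketched in software as below. This is a generic illustration, not TMACC's hardware schemes, which the abstract does not describe in detail; the class, the filter size, the hash construction, and the function `may_conflict` are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter summarizing a set of memory addresses."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit vector stored as a Python int

    def _positions(self, addr):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def might_contain(self, addr):
        # True for every added address; may also be True for others
        # (false positives), but never False for an added address.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))


def may_conflict(write_addrs, reader_filter):
    """Report whether any written address may be in another transaction's read set."""
    return any(reader_filter.might_contain(addr) for addr in write_addrs)
```

Because a Bloom filter has no false negatives, a check like this never misses a real conflict; the cost of its compactness is an occasional false positive, which in a TM setting would show up as an unnecessary abort.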
LiteTM: Reducing Transactional State Overhead
Cited by 1 (0 self)
Transactional memory (TM) has been proposed to address some of the programmability issues of chip multiprocessors. Hardware implementations of transactional memory (HTMs) have made significant progress in providing support for features such as long transactions that spill out of the cache, and context switches, page and thread migration in the middle of transactions. While essential for the adoption of HTMs in real products, supporting these features has resulted in significant state overhead. For instance, TokenTM adds at least 16 bits per block in the caches, which is significant in absolute terms, and steals 16 of 64 (25%) memory ECC bits per block, weakening error protection. Also, the state bits nearly double the tag array size. These significant and practical concerns may impede the adoption of HTMs, squandering the progress achieved by HTMs. The overhead comes from tracking the thread identifier and the transactional read-sharer count at the L1-block granularity. The thread identifier is used to identify the transaction, if only one, to which an L1-evicted block belongs. The read-sharer count is used to identify conflicts involving multiple readers (i.e., write to a block with non-zero count). To reduce this overhead, we observe that the thread identifiers and read-sharer counts are not needed in a majority of cases. (1) Repeated misses to the same blocks are rare within a transaction (i.e., locality holds). (2) Transactional read-shared blocks that are both evicted from multiple sharers’ L1s and involved in conflicts are rare. Exploiting these observations, we propose a novel HTM, called LiteTM, which completely eliminates the count and identifier and uses software to infer the lost information. Using simulations of the STAMP benchmarks running on 8 cores, we show that LiteTM reduces TokenTM’s state overhead by about 87% while performing within 4%, on average, and 10%, in the worst case, of TokenTM.
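The per-block state that the abstract says LiteTM eliminates can be pictured with a small sketch. This is a loose software illustration of the two fields and the conflict rule quoted above (a write to a block with a non-zero read-sharer count), not a model of TokenTM or LiteTM hardware. The class and function names are hypothetical, and two simplifications are assumed: each transaction records a given block at most once, and a transaction that is the sole reader may write the block without conflict.

```python
class BlockState:
    """Per-block transactional metadata, as described in the abstract."""

    def __init__(self):
        self.read_sharers = 0   # transactional read-sharer count
        self.owner_tid = None   # thread id, meaningful only with exactly one sharer


def record_read(state, tid):
    """Register a transactional read of the block by thread tid."""
    state.read_sharers += 1
    state.owner_tid = tid if state.read_sharers == 1 else None


def write_conflicts(state, tid):
    """Return True if a transactional write by tid conflicts with readers."""
    if state.read_sharers == 0:
        return False            # no transactional readers
    if state.read_sharers == 1 and state.owner_tid == tid:
        return False            # assumed: the writer is the only reader
    return True                 # non-zero count held by other readers


def ecc_fraction_stolen(stolen_bits, total_bits):
    """Fraction of a block's ECC bits repurposed for transactional state."""
    return stolen_bits / total_bits
```

With the figures from the abstract, `ecc_fraction_stolen(16, 64)` evaluates to 0.25, matching the stated 25%.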

