Results 1 - 10
of
27
Virtualizing Transactional Memory
, 2005
"... Writing concurrent programs is difficult because of the complexity of ensuring proper synchronization. Conventional lock-based synchronization suffers from wellknown limitations, so researchers have considered nonblocking transactions as an alternative. Recent hardware proposals have demonstrated ho ..."
Abstract
-
Cited by 224 (2 self)
- Add to MetaCart
Writing concurrent programs is difficult because of the complexity of ensuring proper synchronization. Conventional lock-based synchronization suffers from wellknown limitations, so researchers have considered nonblocking transactions as an alternative. Recent hardware proposals have demonstrated how transactions can achieve high performance while not suffering limitations of lock-based mechanisms. However, current hardware proposals require programmers to be aware of platform-specific resource limitations such as buffer sizes, scheduling quanta, as well as events such as page faults, and process migrations. If the transactional model is to gain wide acceptance, hardware support for transactions must be virtualized to hide these limitations in much the same way that virtual memory shields the programmer from platform-specific limitations of physical memory. This paper proposes Virtual Transactional Memory (VTM), a user-transparent system that shields the programmer from various platform-specific resource limitations. VTM maintains the performance advantage of hardware transactions, incurs low overhead in time, and has modest costs in hardware support. While many system-level challenges remain, VTM takes a step toward making transactional models more widely acceptable.
Hybrid Transactional Memory
, 2006
"... High performance parallel programs are currently difficult to write and debug. One major source of difficulty is protecting concurrent accesses to shared data with an appropriate synchronization mechanism. Locks are the most common mechanism but they have a number of disadvantages, including possibl ..."
Abstract
-
Cited by 62 (0 self)
- Add to MetaCart
High performance parallel programs are currently difficult to write and debug. One major source of difficulty is protecting concurrent accesses to shared data with an appropriate synchronization mechanism. Locks are the most common mechanism but they have a number of disadvantages, including possibly unnecessary serialization, and possible deadlock. Transactional memory is an alternative mechanism that makes parallel programming easier. With transactional memory, a transaction provides atomic and serializable operations on an arbitrary set of memory locations. When a transaction commits, all operations within the transaction become visible to other threads. When it aborts, all operations in the transaction are rolled back. Transactional
The stampede approach to thread-level speculation
- ACM Transactions on Computer Systems
, 2005
"... Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multipr ..."
Abstract
-
Cited by 39 (6 self)
- Add to MetaCart
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software to allow single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first level caches are either private or shared. For our private-cache design, the program performance of two of 13 general purpose applications studied improves by 86 % and 56%, four others by more than 8%, and an average across all applications of 16%—confirming that TLS is a promising way
Eliminating Squashes Through Learning Cross-Thread Violations in Speculative Parallelization for Multiprocessors
- In HPCA 8
, 2002
"... With speculative thread-level parallelization, codes that cannot be fully compiler-analyzed are aggressively executed in parallel. If the hardware detects a cross-thread dependence violation, it squashes offending threads and resumes execution. Unfortunately, frequent squashing cripples performance. ..."
Abstract
-
Cited by 29 (2 self)
- Add to MetaCart
With speculative thread-level parallelization, codes that cannot be fully compiler-analyzed are aggressively executed in parallel. If the hardware detects a cross-thread dependence violation, it squashes offending threads and resumes execution. Unfortunately, frequent squashing cripples performance.
Tradeoffs in Buffering Memory State for Thread-Level Speculation in Multiprocessors
- In HPCA 9
, 2003
"... Thread-level speculation provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Su ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
Thread-level speculation provides architectural support to aggressively run hard-to-analyze code in parallel. As speculative tasks run concurrently, they generate unsafe or speculative memory state that needs to be separately buffered and managed in the presence of distributed caches and buffers. Such state may contain multiple versions of the same variable.
An All-Software Thread-Level Data Dependence Speculation System for Multiprocessors
- Journal of Instruction-Level Parallelism
, 2001
"... We present a software approach to design a thread-level data dependence speculation system targeting multiprocessors. Highly-tuned checking codes are associated with loads and stores whose addresses cannot be disambiguated by parallel compilers and that can potentially cause dependence violations at ..."
Abstract
-
Cited by 17 (2 self)
- Add to MetaCart
We present a software approach to design a thread-level data dependence speculation system targeting multiprocessors. Highly-tuned checking codes are associated with loads and stores whose addresses cannot be disambiguated by parallel compilers and that can potentially cause dependence violations at run-time. Besides resolving many name and true data dependencies through dynamic renaming and forwarding, respectively, our method supports parallel commit operations. Performance results collected on an architectural simulator and validated on a commercial multiprocessor show that the overhead can be reduced to less than ten instructions per speculative memory operation. Moreover, we demonstrate that a ten-fold speedup is possible on some of the difficult-toparallelize loops in the Perfect Club benchmark suite on a 16-way multiprocessor.
Tradeoffs in Buffering Speculative Memory State for Thread-Level Speculation in Multiprocessors
- ACM Transactions on Architecture and Code Optimization
, 2005
"... this paper, we introduce a novel taxonomy of approaches to buffer and manage multiversion speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of th ..."
Abstract
-
Cited by 16 (2 self)
- Add to MetaCart
this paper, we introduce a novel taxonomy of approaches to buffer and manage multiversion speculative memory state in multiprocessors. We also present a detailed complexity-benefit tradeoff analysis of the different approaches. Finally, we use numerical applications to evaluate the performance of the approaches under a single architectural framework. Our key insights are that support for buffering the state of multiple speculative tasks and versions per processor is more complexity-effective than support This paper extends an earlier version that appeared in the 9th International Symposium on High Performance Computer Architecture (HPCA), February 2003
Design space exploration of a software speculative parallelization scheme
- IEEE Transactions on Parallel and Distributed Systems
, 2005
"... Abstract—With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software schemes require no changes to the hardwar ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
Abstract—With speculative parallelization, code sections that cannot be fully analyzed by the compiler are optimistically executed in parallel. Hardware schemes are fast but expensive and require modifications to the processors and/or memory system. Software schemes require no changes to the hardware of existing shared-memory systems, but can suffer from significant overheads involved with the speculative execution. In fact, the performance of software schemes is highly dependent on application characteristics, the design and implementation of the scheme, and the system configuration and size. This paper explores the design space of a recently proposed software speculative parallelization scheme. In the process, we gain insight into the most beneficial features of software schemes for speculative parallelization, as well as the most influential application characteristics. For instance, experimental results show that, contrary to intuition, checking for data dependence violations on every speculative store, as opposed to at commit time, leads to little performance degradation in the worst case and to significantly better performance with large configurations. Also, scheduling policies based on windows can perform very close to fully dynamic policies with a fraction of the memory overhead. Finally, experimental results show consistent speedups in the execution of loops that cannot be parallelized at compile time, both with and without RAW data dependences, for 4 to 32 processors. Index Terms—Speculative parallelization, thread-level speculation, parallel architectures. 1
Tolerating dependences between large speculative threads via sub-threads
- In ISCA ’06
, 2006
"... Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and have minimal dependences between them. Recent work has ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
Thread-level speculation (TLS) has proven to be a promising method of extracting parallelism from both integer and scientific workloads, targeting speculative threads that range in size from hundreds to several thousand dynamic instructions and have minimal dependences between them. Recent work has shown that TLS can offer compelling performance improvements for database workloads, but only when targeting much larger speculative threads of more than 50,000 dynamic instructions per thread, with many frequent data dependences between them. To support such large and dependent speculative threads, hardware must be able to buffer the additional speculative state, and must also address the more challenging problem of tolerating the resulting cross-thread data dependences. In this paper we present hardware support for large speculative threads that integrates several previous proposals for TLS hardware. We also introduce support for subthreads: a mechanism for tolerating cross-thread data dependences by checkpointing speculative execution. When speculation fails due to a violated data dependence, with sub-threads the failed thread need only rewind to the checkpoint of the appropriate sub-thread rather than rewinding to the start of execution; this significantly reduces the cost of mis-speculation. We evaluate our hardware support for large and dependent speculative threads in the database domain and find that the transaction response time for three of the five transactions from TPC-C (on a simulated 4-processor chip-multiprocessor) speedup by a factor of 1.9 to 2.9. 1.
Supporting speculative parallelization in the presence of dynamic data structures
- In PLDI ’10
"... The availability of multicore processors has led to significant interest in compiler techniques for speculative parallelization of sequential programs. Isolation of speculative state from non-speculative state forms the basis of such speculative techniques as this separation enables recovery from mi ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
The availability of multicore processors has led to significant interest in compiler techniques for speculative parallelization of sequential programs. Isolation of speculative state from non-speculative state forms the basis of such speculative techniques as this separation enables recovery from misspeculations. In our prior work on CorD [35, 36] we showed that for array and scalar variable based programs copying of data between speculative and non-speculative memory can be highly optimized to support state separation that yields significant speedups on multicore machines available today. However, we observe that in context of heap-intensive programs that operate on linked dynamic data structures, state separation based speculative parallelization poses many challenges. The copying of data structures from non-speculative to speculative state (copy-in operation) can be very expensive due to the large sizes of dynamic data structures. The copying of updated data structures from speculative state to non-speculative state (copy-out operation) is made complex due to the changes in the shape and size of the dynamic data structure made by the speculative computation. In addition, we must contend with the need to translate pointers internal to dynamic data structures between their non-speculative and speculative memory addresses. In this paper we develop an augmented design for the representation of dynamic data structures such that all of the above operations can be performed efficiently. Our experiments demonstrate significant speedups on a real machine for a set of programs that make extensive use of heap based dynamic data structures.

