Results 1 -
2 of
2
Architectural Support for Fair Reader-Writer Locking
"... Abstract—Many shared-memory parallel systems use lockbased synchronization mechanisms to provide mutual exclusion or reader-writer access to memory locations. Software locks are inefficient either in memory usage, lock transfer time, or both. Proposed hardware locking mechanisms are either too speci ..."
Abstract
- Add to MetaCart
Abstract—Many shared-memory parallel systems use lockbased synchronization mechanisms to provide mutual exclusion or reader-writer access to memory locations. Software locks are inefficient either in memory usage, lock transfer time, or both. Proposed hardware locking mechanisms are either too specific (for example, requiring static assignment of threads to cores and vice-versa), support a limited number of concurrent locks, require tag values to be associated with every memory location, rely on the low latencies of single-chip multicore designs or are slow in adversarial cases such as suspended threads in a lock queue. Additionally, few proposals cover reader-writer locks and their associated fairness issues. In this paper we introduce the Lock Control Unit (LCU) which is an acceleration mechanism collocated with each core to explicitly handle fast reader-writer locking. By associating a unique thread-id to each lock request we decouple the hardware lock from the requestor core. This provides correct and efficient execution in the presence of thread migration. By making the LCU logic autonomous from the core, it seamlessly handles thread preemption. Our design offers richer semantics than previous proposals, such as trylock support while providing direct core-to-core transfers. We evaluate our proposal with microbenchmarks, a finegrain Software Transactional Memory system and programs from the Parsec and Splash parallel benchmark suites. The lock transfer time decreases in up to 30 % when compared to previous hardware proposals. Transactional Memory systems limited by reader-locking congestion boost up to 3 × while still preserving graceful fairness and starvation freedom properties. Finally, commonly used applications achieve speedups up to a 7 % when compared to software models. Keywords-reader-writer locks; fairness; Lock Control Unit; I.
Active Memory Controller
"... Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intellige ..."
Abstract
- Add to MetaCart
Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs’ performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1 % of the die area of a typical high performance processor, based on a standard cell implementation.

