Results 1 -
9 of
9
CPHASH: A Cache-Partitioned Hash Table
"... CPHASH is a concurrent hash table for multicore processors. CPHASH partitions its table across the caches of cores and uses message passing to transfer lookups/inserts to a partition. CPHASH’s message passing avoids the need for locks, pipelines batches of asynchronous messages, and packs multiple m ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
CPHASH is a concurrent hash table for multicore processors. CPHASH partitions its table across the caches of cores and uses message passing to transfer lookups/inserts to a partition. CPHASH’s message passing avoids the need for locks, pipelines batches of asynchronous messages, and packs multiple messages into a single cache line transfer. Experiments on a 80-core machine with 2 hardware threads per core show that CPHASH has ∼ 1.6 × higher throughput than a hash table implemented using fine-grained locks. An analysis shows that CPHASH wins because it experiences fewer cache misses and its cache misses are less expensive, because of less contention for the on-chip interconnect and DRAM. CPSERVER, a key/value cache server using CPHASH, achieves ∼ 5 % higher throughput than a key/value cache server that uses a hash table with fine-grained locks, but both achieve better throughput and scalability than MEMCACHED. The throughput of CPHASH and CPSERVER also scale near-linearly with the number of cores.
Sun Labs at Oracle
"... Abstract. In a synchronous queue, producers and consumers handshake to exchange data. Recently, new scalable unfair synchronous queues were added to the Java JDK 6.0 to support high performance thread pools. This paper applies flat-combining to the problem of designing a synchronous queue algorithm. ..."
Abstract
- Add to MetaCart
Abstract. In a synchronous queue, producers and consumers handshake to exchange data. Recently, new scalable unfair synchronous queues were added to the Java JDK 6.0 to support high performance thread pools. This paper applies flat-combining to the problem of designing a synchronous queue algorithm. We first use the original flat-combining algorithm, a single “combiner ” thread acquires a global lock and services the other threads ’ combined requests with very low synchronization overheads. As we show, this single combiner approach delivers superior performance up to a certain level of concurrency, but unfortunately does not continue to scale beyond that point. In order to continue to deliver scalable performance as concurrency increases, we introduce a new parallel flat-combining algorithm. The new algorithm dynamically adds additional concurrently executing flat-combiners that coordinate their work. It enjoys the low coordination overheads of sequential flat combining, with the added scalability that comes with parallelism. Our novel unfair synchronous queue using parallel flat combining exhibits scalability far and beyond that of the JDK 6.0 algorithm: it matches it in the case of a single producer and consumer, and is superior throughout the concurrency range, delivering up to 11 (eleven) times the throughput at high concurrency. 1
Smart Data Structures: An Online Machine Learning Approach to Multicore Data Structures
"... As multicores become prevalent, the complexity of programming is skyrocketing. One major difficulty is efficiently orchestrating collaboration among threads through shared data structures. Unfortunately, choosing and hand-tuning data structure algorithms to get good performance across a variety of m ..."
Abstract
- Add to MetaCart
As multicores become prevalent, the complexity of programming is skyrocketing. One major difficulty is efficiently orchestrating collaboration among threads through shared data structures. Unfortunately, choosing and hand-tuning data structure algorithms to get good performance across a variety of machines and inputs is a herculean task to add to the fundamental difficulty of getting a parallel program correct. To help mitigate these complexities, this work develops a new class of parallel data structures called Smart Data Structures that leverage online machine learning to adapt automatically. We prototype and evaluate an open source library of Smart Data Structures for common parallel programming needs and demonstrate significant improvements over the best existing algorithms under a variety of conditions. Our results indicate that learning is a promising technique for balancing and adapting to complex, time-varying tradeoffs and achieving the best performance available. Categories and Subject Descriptors
Project-Teams REGAL
"... Abstract: The scalability of multithreaded applications on current multicore systems is hampered by the performance of critical sections, due in particular to the costs of access contention and cache misses. In this paper, we propose a new locking technique, Remote Core Locking (RCL) that aims to im ..."
Abstract
- Add to MetaCart
Abstract: The scalability of multithreaded applications on current multicore systems is hampered by the performance of critical sections, due in particular to the costs of access contention and cache misses. In this paper, we propose a new locking technique, Remote Core Locking (RCL) that aims to improve the performance of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server core. RCL limits the performance collapse observed with regular locks when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the core acquiring the lock: such data can typically remain in the server core's cache. Our microbenchmark shows that under high contention, RCL is always more e cient than the other state-of-the-art lock mechanisms, and a preliminary macrobenchmark evaluation shows performance gains on SPLASH-2 benchmarks (speedup up to 4.85) and on the Web cache application memcached (speedup up to 2.62). Key-words:
The Third Workshop on the Theory of Transactional Memory ∗
"... In this issue, we have a short column. It features a review of WTTM 2011, the third Workshop on the Theory of Transactional Memory, by Petr Kuznetsov and Srivatsan Ravi. This annual workshop focuses on developing theory for understanding transactional memory systems. It discusses recent achievements ..."
Abstract
- Add to MetaCart
In this issue, we have a short column. It features a review of WTTM 2011, the third Workshop on the Theory of Transactional Memory, by Petr Kuznetsov and Srivatsan Ravi. This annual workshop focuses on developing theory for understanding transactional memory systems. It discusses recent achievements as well as remaining challenges. Petr and Srivatsan review the most recent instance of the workshop, which was co-located with DISC in September 2011 in Rome. Many thanks to Petr and Srivatsan for their contribution. Call for contributions: I welcome suggestions for material to include in this column, including news, reviews, open problems, tutorials and surveys, either exposing the community to new and interesting topics, or providing new insight on well-studied topics by organizing them in new ways.
Concurrent FIFO Queues ⋆
, 2011
"... Abstract. We introduce the notion of a k-FIFO queue which may dequeue elements out of FIFO order up to a constant k ≥ 0. Retrieving the oldest element from the queue may require up to k + 1 dequeue operations (bounded fairness), which may return elements not younger than the k + 1 oldest elements in ..."
Abstract
- Add to MetaCart
Abstract. We introduce the notion of a k-FIFO queue which may dequeue elements out of FIFO order up to a constant k ≥ 0. Retrieving the oldest element from the queue may require up to k + 1 dequeue operations (bounded fairness), which may return elements not younger than the k + 1 oldest elements in the queue (bounded age) or nothing even if there are elements in the queue. A k-FIFO queue is starvation-free for finite k where k + 1 is what we call the worst-case semantical deviation (WCSD) of the queue from a regular FIFO queue. The WCSD bounds the actual semantical deviation (ASD) of a k-FIFO queue from a regular FIFO queue when applied to a given workload. Intuitively, the ASD keeps track of the number of dequeue operations necessary to return oldest elements and the age of dequeued elements. We show that a number of existing concurrent algorithms implement k-FIFO queues whose WCSD are determined by configurable constants independent from any workload. We then introduce so-called Scal queues, which implement k-FIFO queues with generally larger, workloaddependent as well as unbounded WCSD. Since ASD cannot be obtained without prohibitive overhead we have developed a tool that computes lower bounds on ASD from time-stamped runs. Our micro- and macrobenchmarks on a state-ofthe-art 40-core multiprocessor machine show that Scal queues, as an immediate consequence of their weaker WCSD, outperform and outscale existing implementations at the expense of moderately increased lower bounds on ASD. 1
A Dynamic Elimination-Combining Stack Algorithm
"... Abstract. Two key synchronization paradigms for the construction of scalable concurrent data-structures are software combining and elimination. Elimination-based concurrent data-structures allow operations with reverse semantics (such as push and pop stack operations) to “collide” and exchange value ..."
Abstract
- Add to MetaCart
Abstract. Two key synchronization paradigms for the construction of scalable concurrent data-structures are software combining and elimination. Elimination-based concurrent data-structures allow operations with reverse semantics (such as push and pop stack operations) to “collide” and exchange values without having to access a central location. Software combining, on the other hand, is effective when colliding operations have identical semantics: when a pair of threads performing operations with identical semantics collide, the task of performing the combined set of operations is delegated to one of the threads and the other thread waits for its operation(s) to be performed. Applying this mechanism iteratively can reduce memory contention and increase throughput. The most highly scalable prior concurrent stack algorithm is the eliminationbackoff stack [5]. The elimination-backoff stack provides high parallelism for symmetric workloads in which the numbers of push and pop operations are roughly equal, but its performance deteriorates when workloads are asymmetric. We present DECS, a novel Dynamic Elimination-Combining Stack algorithm, that scales well for all workload types. While maintaining the simplicity and low-overhead of the elimination-bakcoff stack, DECS manages to benefit from collisions of both identical- and reverse-semantics operations. Our empirical evaluation shows that DECS scales significantly better than both blocking and non-blocking best prior stack algorithms. Supported by a grant from the Lynne and William Frankel Center for Computer
Algorithms, Performance
"... Lock-freedom is a progress guarantee that ensures overall program progress. Wait-freedom is a stronger progress guarantee that ensures the progress of each thread in the program. While many practical lock-free algorithms exist, wait-free algorithms are typically inefficient and hardly used in practi ..."
Abstract
- Add to MetaCart
Lock-freedom is a progress guarantee that ensures overall program progress. Wait-freedom is a stronger progress guarantee that ensures the progress of each thread in the program. While many practical lock-free algorithms exist, wait-free algorithms are typically inefficient and hardly used in practice. In this paper, we propose a methodology called fast-path-slow-path for creating efficient waitfree algorithms. The idea is to execute the efficient lock-free version most of the time and revert to the wait-free version only when things go wrong. The generality and effectiveness of this methodology is demonstrated by two examples. In this paper, we apply this idea to a recent construction of a wait-free queue, bringing the wait-free implementation to perform in practice as efficient as the lock-free implementation. In another work, the fast-path-slow-path methodology has been used for (dramatically) improving the performance of a wait-free linked-list.
Incorrect Systems: It’s not the Problem, It’s the Solution ∗
"... We present an overview of state-of-the-art work in the engineering of digital systems (hardware and software) where traditional correctness requirements are relaxed, usually for higher performance and lower resource consumption but possibly also for other non-functional properties such as more robus ..."
Abstract
- Add to MetaCart
We present an overview of state-of-the-art work in the engineering of digital systems (hardware and software) where traditional correctness requirements are relaxed, usually for higher performance and lower resource consumption but possibly also for other non-functional properties such as more robustness and less cost. The work presented here is categorized into work that involves just hardware, hardware and software, and just software. In particular, we discuss work on probabilistic and approximate design of processors, unreliable cores in asymmetric multi-core architectures, besteffort computing, stochastic processors, accuracy-aware program transformations, and relaxed concurrent data structures. As common theme we identify, at least intuitively, “metrics of correctness ” in each piece of work which appear to be important for understanding the effects of relaxed correctness requirements and their relationship to performance improvements and resource consumption.

