Results 1 - 10
of
14
Concurrent Programming Without Locks
, 2004
"... Mutual exclusion locks remain the de facto mechanism for concurrency control on shared-memory data structures. However, their apparent simplicity is deceptive: it is hard to design scalable locking strategies because locks can harbour problems such as priority inversion, deadlock and convoying. Furt ..."
Abstract
-
Cited by 64 (3 self)
- Add to MetaCart
Mutual exclusion locks remain the de facto mechanism for concurrency control on shared-memory data structures. However, their apparent simplicity is deceptive: it is hard to design scalable locking strategies because locks can harbour problems such as priority inversion, deadlock and convoying. Furthermore, scalable lock-based systems are not readily composable when building compound operations. In looking for solutions to these problems, interest has developed in nonblocking systems which have promised scalability and robustness by eschewing mutual exclusion while still ensuring safety. However, existing techniques for building non-blocking systems are rarely suitable for practical use, imposing substantial storage overheads, serialising non-conflicting operations, or requiring instructions not readily available on today’s CPUs. In this paper we present three APIs which make it easier to develop non-blocking implementations of arbitrary data structures. The first API is a multi-word compare-and-swap operation (MCAS) which atomically updates a set of memory locations. This can be used to advance a data structure from one consistent state to another. The second API is a word-based software transactional memory (WSTM) which can allow sequential code to be re-used more directly than with MCAS and which provides better scalability when locations are being read rather than being
Lock-free Dynamically Resizable Arrays
"... Abstract. We present a first lock-free design and practical implementation of a dynamically resizable array (vector). The most extensively used container in the C++ Standard Library is vector, offering a combination of dynamic memory management and efficient random access. Our approach is based on a ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
Abstract. We present a first lock-free design and practical implementation of a dynamically resizable array (vector). The most extensively used container in the C++ Standard Library is vector, offering a combination of dynamic memory management and efficient random access. Our approach is based on a single 32-bit word atomic compare-and-swap (CAS) instruction and our implementation is portable to all systems supporting CAS, and more. It provides a flexible, generic, linearizable and highly parallelizable STL like interface, effective lock-free memory allocation and management, and fast execution. Our current implementation is designed to be most efficient on the most recent multi-core architectures. The test cases on a dual-core Intel processor indicate that our lock-free vector outperforms its lock-based STL counterpart and the latest concurrent vector implementation provided by Intel by a factor of 10. The implemented approach is also applicable across a variety of symmetric multiprocessing (SMP) platforms. The performance evaluation on an 8-way AMD system with non-shared L2 cache demonstrated timing results comparable to the best available lock-based techniques for such systems. The presented design implements the most common STL vector’s interfaces, namely random access read and write, tail insertion and deletion, pre-allocation of memory, and query of the container’s size. Keywords: lock-free, STL, C++, vector, concurrency, real-time systems 1
Landing OpenMP on Cyclops-64: An Efficient Mapping of OpenMP to a Many-Core System-on-a-Chip
, 2006
"... This paper presents our experience mapping OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5MB) and communication hardware on the same die. Such a unique arc ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
This paper presents our experience mapping OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5MB) and communication hardware on the same die. Such a unique architecture presents new opportunities for optimization. Specifically, we consider the following three areas: (1) a memory aware runtime library that places frequently used data structures in scratchpad memory; (2) a unique spin lock algorithm for shared memory synchronization based on in-memory atomic instructions and native support for thread level execution; (3) a fast barrier that directly uses C64 hardware support for collective synchronization. All three optimizations together, result in an 80% overhead reduction for language constructs in OpenMP. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable for writing parallel programs on the C64 platform.
Simplifying concurrent algorithms by exploiting hardware transactional memory
- In SPAA ’10
"... We explore the potential of hardware transactional memory (HTM) to improve concurrent algorithms. We illustrate a number of use cases in which HTM enables significantly simpler code to achieve similar or better performance than existing algorithms for conventional architectures. We use Sun’s prototy ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
We explore the potential of hardware transactional memory (HTM) to improve concurrent algorithms. We illustrate a number of use cases in which HTM enables significantly simpler code to achieve similar or better performance than existing algorithms for conventional architectures. We use Sun’s prototype multicore chip, codenamed Rock, to experiment with these algorithms, and discuss ways in which its limitations prevent better results, or would prevent production use of algorithms even if they are successful. Our use cases include concurrent data structures such as double ended queues, work stealing queues and scalable non-zero indicators, as well as a scalable malloc implementation and a simulated annealing application. We believe that our paper makes a compelling case that HTM has substantial potential to make effective concurrent programming easier, and that we have made valuable contributions in guiding designers of future HTM features to exploit this potential.
Efficient and reliable lock-free memory reclamation based on reference counting
- In Proc. 8th I-SPAN
, 2005
"... We present an efficient and practical lock-free implementation of a memory reclamation scheme based on reference counting, aimed for use with arbitrary lock-free dynamic data structures. The scheme guarantees the safety of local as well as global references, supports arbitrary memory reuse, uses ato ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
We present an efficient and practical lock-free implementation of a memory reclamation scheme based on reference counting, aimed for use with arbitrary lock-free dynamic data structures. The scheme guarantees the safety of local as well as global references, supports arbitrary memory reuse, uses atomic primitives which are available in modern computer systems and provides an upper bound on the memory prevented for reuse. To the best of our knowledge, this is the first lock-free algorithm that provides all of these properties. Experimental results indicate significant performance improvements for lock-free algorithms of dynamic data structures that require strong garbage collection support. 1.
Wait-free Programming for General Purpose Computations on Graphics Processors
"... The fact that graphics processors (GPUs) are today’s most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. Recent GPUs feature the single-program multiple-data (SPMD) multicore architecture instead of ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
The fact that graphics processors (GPUs) are today’s most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. Recent GPUs feature the single-program multiple-data (SPMD) multicore architecture instead of the single-instruction multiple-data (SIMD). However, unlike CPUs, GPUs devote their transistors mainly to data processing rather than data caching and flow control, and consequently most of the powerful GPUs with many cores do not support any synchronization mechanisms between their cores. This prevents GPUs from being deployed more widely for general-purpose computing. This paper aims at bridging the gap between the lack of synchronization mechanisms in recent GPU architectures and the need of synchronization mechanisms in parallel applications. Based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects like wait-free and t-resilient read-modify-write objects for a general model of recent GPU architectures without strong hardware synchronization primitives like test-andset and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), whether N is the number of processes. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need of strong synchronization primitives in hardware and that wait-free programming is possible for GPUs.
NBMALLOC: Allocating Memory in a Lock-Free Manner
- ALGORITHMICA
, 2009
"... Efficient, scalable memory allocation for multithreaded applications on multiprocessors is a significant goal of recent research. In the distributed computing literature it has been emphasized that lock-based synchronization and concurrency-control may limit the parallelism in multiprocessor system ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Efficient, scalable memory allocation for multithreaded applications on multiprocessors is a significant goal of recent research. In the distributed computing literature it has been emphasized that lock-based synchronization and concurrency-control may limit the parallelism in multiprocessor systems. Thus, system services that employ such methods can hinder reaching the full potential of these systems. A natural research question is the pertinence and the impact of lock-free concurrency control in key services for multiprocessors, such as in the memory allocation service, which is the theme of this work. We show the design and implementation of NBMALLOC, a lock-free memory allocator designed to enhance the parallelism in the system. The architecture of NBMALLOC is inspired by Hoard, a well-known concurrent memory allocator, with modular design that preserves scalability and helps avoiding false-sharing and heap-blowup. Within our effort to design appropriate lockfree algorithms for NBMALLOC, we propose and show a lock-free implementation of a new data structure, flat-set, supporting conventional “internal” set operations as well as “inter-object ” operations, for moving items between flat-sets. The design of NBMALLOC also involved a series of other algorithmic problems, which are discussed in the paper. Further, we present the implementation of NBMALLOC and a
On The Power of Hardware Transactional Memory to Simplify Memory Management
"... Dynamic memory management is a significant source of complexity in the design and implementation of practical concurrent data structures. We study how hardware transactional memory (HTM) can be used to simplify and streamline memory reclamation for such data structures. We propose and evaluate sever ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Dynamic memory management is a significant source of complexity in the design and implementation of practical concurrent data structures. We study how hardware transactional memory (HTM) can be used to simplify and streamline memory reclamation for such data structures. We propose and evaluate several new HTMbased algorithms for the “Dynamic Collect ” problem that lies at the heart of many modern memory management algorithms. We demonstrate that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches. Despite recent theoretical arguments that HTM provides no worst-case advantages, our results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.
Reliable and Efficient Concurrent Synchronization for Embedded Real-Time Software
"... The high degree of autonomy and increased complexity of future robotic spacecraft pose significant challenges in assuring their reliability and efficiency. To achieve fast and safe concurrent interactions in mission critical code, we survey the practical state-of-the-art nonblocking programming tech ..."
Abstract
- Add to MetaCart
The high degree of autonomy and increased complexity of future robotic spacecraft pose significant challenges in assuring their reliability and efficiency. To achieve fast and safe concurrent interactions in mission critical code, we survey the practical state-of-the-art nonblocking programming techniques. We study in detail two nonblocking approaches: (1) CAS-based algorithms and (2) Software Transactional Memory. We evaluate the strengths and weaknesses of each approach by applying each methodology for engineering the design and implementation of a nonblocking shared vector. Our study investigates how the application of nonblocking synchronization can help eliminate the problems of deadlock, livelock, and priority inversion and at the same time deliver a performance improvement in embedded real-time software. 1
Scalable Nonblocking Concurrent Objects for Mission Critical Code
"... The high degree of complexity and autonomy of future robotic space missions, such as Mars Science Laboratory (MSL), poses serious challenges in assuring their reliability and efficiency. Providing fast and safe concurrent synchronization is of critical importance to such autonomous embedded software ..."
Abstract
- Add to MetaCart
The high degree of complexity and autonomy of future robotic space missions, such as Mars Science Laboratory (MSL), poses serious challenges in assuring their reliability and efficiency. Providing fast and safe concurrent synchronization is of critical importance to such autonomous embedded software systems. The application of nonblocking synchronization is known to help eliminate the hazards of deadlock, livelock, and priority inversion. The nonblocking programming techniques are notoriously difficult to implement and offer a variety of semantic guarantees and usability and performance trade-offs. The present software development and certification methodologies applied at NASA do not reach the level of detail of providing guidelines for the design of concurrent software. The complex task of engineering reliable and efficient concurrent synchronization is left to the programmer’s ingenuity.

