Lock-Free Reference Counting
- in Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing
, 2001
Cited by 22 (7 self)
Assuming the existence of garbage collection makes it easier to design implementations of concurrent data structures. However, this assumption limits their applicability. We present a methodology that, for a significant class of data structures, allows designers to first tackle the easier problem of designing a garbage-collection-dependent implementation, and then apply our methodology to achieve a garbage-collection-independent one. Our methodology is based on the well-known reference counting technique, and employs the double compare-and-swap operation.
The Performance of Work Stealing in Multiprogrammed Environments
- IN PROCEEDINGS OF THE 1998 ACM SIGMETRICS INTERNATIONAL CONFERENCE ON MEASUREMENT AND MODELING OF COMPUTER SYSTEMS, POSTER SESSION
, 1997
Cited by 19 (4 self)
We study the performance of user-level thread schedulers in multiprogrammed environments. Our goal is a user-level thread scheduler that delivers efficient performance under multiprogramming without any need for kernel-level resource management, such as coscheduling or process control. We show that a non-blocking implementation of the work-stealing algorithm achieves this goal. With this implementation, the execution time of a computation running with arbitrarily many processes on arbitrarily many processors can be modeled as a simple function of work and critical-path length. This model holds even when the processes run on a set of processors that arbitrarily grows and shrinks over time. We observe linear speedup whenever the number of processes is small relative to the average parallelism.
Understanding tradeoffs in software transactional memory
- IN PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION
, 2007
Cited by 19 (1 self)
There has been a flurry of recent work on the design of high-performance software and hybrid hardware/software transactional memories (STMs and HyTMs). This paper reexamines the design decisions behind several of these state-of-the-art algorithms, adopting some ideas, rejecting others, all in an attempt to make STMs faster. We created the transactional locking (TL) framework of STM algorithms and used it to conduct a range of comparisons of the performance of non-blocking, lock-based, and hybrid STM algorithms versus fine-grained hand-crafted ones. We were able to make several illuminating observations regarding lock acquisition order, the interaction of STMs with memory management schemes, and the role of overheads and abort rates in STM performance.
Hardware Acceleration of Software Transactional Memory
- DEPT. OF COMPUTER SCIENCE, UNIV. OF ROCHESTER
, 2006
Cited by 17 (8 self)
Transactional memory (TM) systems seek to increase scalability, reduce programming complexity, and overcome the various semantic problems associated with locks. Software TM proposals run on stock processors and provide substantial flexibility in policy, but incur significant overhead for data versioning and validation in the face of conflicting transactions. Hardware TM proposals have the advantage of speed, but are typically highly ambitious, embed significant amounts of policy in silicon, and provide no clear migration path for software that must also run on legacy machines. We advocate an intermediate approach, in which hardware is used to accelerate a TM implementation controlled fundamentally by software. We present a system, RTM, that embodies this approach. It consists of a novel transactional MESI (TMESI) protocol and accompanying TM software. TMESI eliminates the key software overheads of data copying, garbage collection, and validation, without introducing any global consensus algorithm in the cache coherence protocol (a commit is allowed to perform using only a few cycles of completely local operation). The only change to the snooping interface is a “threatened” signal analogous to the existing “shared” signal. By leaving policy to software, RTM allows us to experiment with a wide variety of policies for contention management, deadlock and livelock avoidance, data granularity, nesting, and virtualization.
Thread Quantification for Concurrent Shape Analysis
Cited by 14 (3 self)
We present new algorithms for automatically verifying properties of programs with an unbounded number of threads. Our algorithms are based on a new abstract domain whose elements represent thread-quantified invariants: i.e., invariants satisfied by all threads. We exploit existing abstractions to represent the invariants. Thus, our technique lifts existing abstractions by wrapping universal quantification around elements of the base abstract domain. Such abstractions are effective because they are thread-modular: e.g., they can capture correlations between the local variables of the same thread as well as correlations between the local variables of a thread and global variables, but forget correlations between the states of distinct threads. (The exact nature of the abstraction, of course, depends on the base abstraction lifted in this style.) We present techniques for computing sound transformers for the new abstraction by using transformers of the base abstract domain. We illustrate our technique in this paper by instantiating it to the Boolean Heap abstraction, producing a Quantified Boolean Heap abstraction. We have implemented an instantiation of our technique with Canonical Abstraction as the base abstraction and used it to successfully verify linearizability of data structures in the presence of an unbounded number of threads.
A Lock-Free Approach to Object Sharing in Real-Time Systems
, 1997
Cited by 14 (0 self)
This work aims to establish the viability of lock-free object sharing in uniprocessor real-time systems. Naive usage of conventional lock-based object-sharing schemes in real-time systems leads to unbounded priority inversion. A priority inversion occurs when a task is blocked by a lower-priority task that is inside a critical section. Mechanisms that bound priority inversion usually entail kernel overhead that is sometimes excessive. We propose that lock-free objects offer an attractive alternative to lock-based schemes because they eliminate priority inversion and its associated problems. On the surface, lock-free objects may seem to be unsuitable for hard real-time systems because accesses to such objects are not guaranteed to complete in bounded time. Nonetheless, we present scheduling conditions that demonstrate the applicability of lock-free objects in hard real-time systems. Our scheduling conditions are applicable to schemes such as rate-monotonic scheduling and earliest-deadline-...
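As an illustration of why accesses to a lock-free object "are not guaranteed to complete in bounded time", here is a minimal Python sketch of the usual read, compute, compare-and-swap retry loop. It is not taken from the paper. The names `CasCell`, `lock_free_update` and `run_demo` are hypothetical, and the internal lock only stands in for an atomic hardware compare-and-swap, which Python does not expose.

```python
import threading


class CasCell:
    """Hypothetical shared word; the internal lock only emulates an atomic hardware CAS."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Install `new` only if the cell still holds `expected`; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def lock_free_update(cell, fn):
    """Apply fn with a read-compute-CAS retry loop; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        old = cell.load()
        # The CAS fails, and the loop retries, if another task changed the cell
        # between the load above and this point.
        if cell.compare_and_swap(old, fn(old)):
            return attempts


def run_demo(num_threads=4, increments=1000):
    """Increment one shared counter from several threads; return its final value."""
    counter = CasCell(0)

    def worker():
        for _ in range(increments):
            lock_free_update(counter, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.load()
```

No task ever waits for another task to leave a critical section, so priority inversion cannot arise. The cost is that the number of retries in `lock_free_update` depends on how often other tasks interfere, which is the quantity a scheduling analysis would have to bound.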
Automatically verifying concurrent queue algorithms
- Electr. Notes Theor. Comput. Sci
, 2003
Cited by 13 (2 self)
Concurrent FIFO queues are a common component of concurrent systems. Using a single shared lock to prevent concurrent manipulations of queue contents reduces system concurrency. Therefore, many algorithms were suggested to increase concurrency while maintaining the correctness of queue manipulations. This paper shows how to automatically verify such algorithms using abstract interpretation techniques. In particular, we verify all the safety properties originally specified for two concurrent queue algorithms without imposing an a priori bound on the number of allocated objects and threads.
Multigrain parallel Delaunay mesh generation: Challenges and opportunities for multithreaded architectures
- In Proceedings of the 19th annual international conference on Supercomputing
, 2005
Cited by 12 (6 self)
Given the importance of parallel mesh generation in large-scale scientific applications and the proliferation of multilevel SMT-based architectures, it is imperative to obtain insight on the interaction between meshing algorithms and these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level and fine-grain parallelism at the element level. This multigrain data parallel approach targets clusters built from low-end, commercially available SMTs. Our experimental evaluation shows that current SMTs are not capable of executing fine-grain parallelism in PCDM. However, experiments on a simulated SMT indicate that with modest hardware support it is possible to exploit fine-grain parallelism opportunities. The exploitation of fine-grain parallelism results in higher performance than a pure MPI implementation and closes the gap between the performance of PCDM and the state-of-the-art sequential mesher on a single physical processor. Our findings extend to other adaptive and irregular multigrain, parallel algorithms.
Efficient and practical constructions of LL/SC variables
- In Proceedings of the 22nd annual ACM Symposium on Principles of Distributed Computing
, 2003
Cited by 12 (1 self)
Over the past decade, a pair of synchronization instructions known as LL/SC has emerged as the most suitable set of instructions to be used in the design of lock-free algorithms. However, no existing multiprocessor system supports these instructions in hardware. Instead, most modern multiprocessors support instructions such as CAS or RLL/RSC (e.g. POWER4, MIPS, SPARC, IA-64). This paper presents two efficient algorithms that implement 64-bit LL/SC from 64-bit CAS or RLL/RSC. Our results are summarized as follows. We present a practical algorithm for implementing a 64-bit LL/SC object from 64-bit CAS or RLL/RSC objects. Our result shows, for the first time, a practical way of simulating a 64-bit LL/SC memory word using 64-bit CAS memory words (or 64-bit RLL/RSC memory words), incurring only a small constant space overhead per process and a small constant factor slowdown. Although our first solution performs correctly in any practical system, its theoretical correctness depends on unbounded sequence numbers. We present a bounded algorithm that implements a 64-bit LL/SC object from 64-bit CAS or RLL/RSC objects, and has the same time and space complexities as the first algorithm. This and the previous algorithm improve on existing implementations of LL/SC objects by Anderson and Moir in 1995, and Moir in 1997.
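To show what emulating LL/SC on top of CAS with sequence numbers looks like in its simplest form, here is a minimal Python sketch. It is not the paper's construction: it uses the textbook idea of storing a (value, sequence number) pair in a single CAS object, relies on Python's unbounded integers for the sequence number, and ignores the difficulty the abstract targets, namely that a 64-bit CAS word has no spare bits for a counter next to a 64-bit value. The names `CasWord` and `LLSCVar` are hypothetical, and the lock only stands in for a hardware CAS.

```python
import threading


class CasWord:
    """Hypothetical CAS object; the lock only emulates an atomic hardware CAS."""

    def __init__(self, initial):
        self._word = initial
        self._guard = threading.Lock()

    def read(self):
        with self._guard:
            return self._word

    def cas(self, expected, new):
        """Replace the word with `new` only if it still equals `expected`."""
        with self._guard:
            if self._word == expected:
                self._word = new
                return True
            return False


class LLSCVar:
    """LL/SC variable emulated on one CAS word holding (value, sequence number)."""

    def __init__(self, value):
        self._cell = CasWord((value, 0))

    def ll(self):
        """Load-linked: return (value, link); the link records the pair that was read."""
        link = self._cell.read()
        return link[0], link

    def sc(self, link, new_value):
        """Store-conditional: succeed only if no SC has succeeded since the LL that produced link."""
        _, seq = link
        # Every successful SC bumps the sequence number, so a stale link fails
        # even when the stored value has since returned to what LL saw.
        return self._cell.cas(link, (new_value, seq + 1))
```

A caller does `value, link = var.ll()`, computes a new value, and calls `var.sc(link, new_value)`, retrying from `ll` when `sc` returns `False`. Passing the link explicitly is a choice made for this sketch; hardware LL/SC keeps the link implicitly per processor.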
CAS-based lock-free algorithm for shared deques
- In the 9th Euro-Par Conference on Parallel Processing
, 2003
Cited by 12 (0 self)
This paper presents the first lock-free algorithm for shared double-ended queues (deques) based on the single-address atomic primitives CAS (Compare-and-Swap) or LL/SC (Load-Linked and Store-Conditional). The algorithm can use single-word primitives, if the maximum deque size is static. To allow the deque’s size to be dynamic, the algorithm employs single-address double-width primitives. Prior lock-free algorithms for shared deques depend on the strong DCAS (Double-Compare-and-Swap) atomic primitive, not supported on most processor architectures. The new algorithm offers significant advantages over prior lock-free shared deque algorithms with respect to performance and the strength of required primitives. In turn, lock-free algorithms provide significant reliability and performance advantages over lock-based implementations.

