Results 1 - 7 of 7
Performance of memory reclamation for lockless synchronization
, 2007
Cited by 9 (8 self)
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes (quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation) using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance. We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel, an execution environment that is well suited to this scheme.
Design Evolution of the EROS Single-Level Store
- In Proceedings of the General Track: 2002 USENIX Annual Technical Conference
, 2002
Cited by 7 (0 self)
File systems have (at least) two undesirable characteristics: both the addressing model and the consistency semantics differ from those of memory, leading to a change in programming model at the storage boundary. Main memory is a single flat space of pages with a simple durability (persistence) model: all or nothing. File content durability is a complex function of implementation, caching, and timing. Memory is globally consistent. File systems offer no global consistency model. Following a crash recovery, individual files may be lost or damaged, or may be collectively inconsistent even though they are individually sound.
Studying Network Protocol Offload With Emulation: Approach And Preliminary Results
- In Proceedings of the 12th Annual IEEE Symposium on High Performance Interconnects
, 2004
Cited by 4 (0 self)
To take full advantage of high-speed networks while freeing CPU cycles for application processing, the industry is proposing new techniques that rely on an extended role for the network interface card, such as TCP Offload Engine and Remote Direct Memory Access. This paper presents an experimental study aimed at collecting the performance data needed to assess these techniques. This work is based on the emulation of an advanced network interface card plugged into the I/O bus. In the experimental setting, a processor of a partitioned SMP machine is dedicated to network processing. Achieving a faithful emulation of a network interface card is one of the main concerns, and it guides the design of the Offload Engine software. This setting has the advantage of being flexible, so that many different offload scenarios can be evaluated. Preliminary throughput results of an emulated TCP Offload Engine demonstrate a large benefit: the emulated TCP Offload Engine yields a 600 to 900% improvement while still relying on memory copies at the kernel boundary.
Making lockless synchronization fast: Performance implications of memory reclamation
- In 2006 International Parallel and Distributed Processing Symposium (IPDPS 2006)
, 2006
Cited by 4 (2 self)
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims nodes once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes (quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation) using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance.
A Portable Kernel Abstraction For Low-Overhead Ephemeral Mapping Management
Modern operating systems create ephemeral virtual-to-physical mappings for a variety of purposes, ranging from the implementation of interprocess communication to the implementation of process tracing and debugging. With succeeding generations of processors, the cost of creating ephemeral mappings is increasing, particularly when an ephemeral mapping is shared by multiple processors. To reduce the cost of ephemeral mapping management within an operating system kernel, we introduce the sf_buf ephemeral mapping interface. We demonstrate how, in several kernel subsystems (including pipes, memory disks, sockets, execve(), ptrace(), and the vnode pager), the current implementation can be replaced by calls to the sf_buf interface. We describe the implementation of the sf_buf interface on the 32-bit i386 architecture and the 64-bit amd64 architecture. This implementation reduces the cost of ephemeral mapping management by reusing existing virtual-to-physical address mappings wherever possible. We evaluate the sf_buf interface for the pipe, memory disk, and networking subsystems. Our results show that these subsystems perform significantly better when using the sf_buf interface. On a multiprocessor platform, interprocessor interrupts are greatly reduced in number or eliminated altogether.
A WAIT-FREE DYNAMIC STORAGE ALLOCATOR BY ADOPTING THE HELPING QUEUE PATTERN
Most dynamic storage allocators suitable for real-time use rely on conventional locking strategies to protect globally accessible data. However, lock compositions commonly do not scale well under high allocation and deallocation rates in parallel scenarios, because they lead to convoy effects. Furthermore, lock compositions lead to jitter, which is often a critical factor in real-time systems. Additionally, it is often desirable to guarantee the progress of threads in order to determine the worst-case execution time. This led us to design a wait-free dynamic storage allocator (DSA), which guarantees the progress of each thread and does not impede the progress of other threads. Our DSA implementation relies on a kind of buddy strategy with approximate best-fit. Hence, it exhibits the memory wastage typical of this kind of allocation strategy, which results from internal fragmentation. Preliminary tests show that we can outperform established DSA implementations, such as the well-known TLSF memory allocator, in terms of predictability. To the best of our knowledge, our DSA is the first known approach using a scalable and bounded non-blocking synchronization strategy. Because it uses a statically allocated heap, our approach towards a wait-free DSA algorithm is applicable in real-time applications where adequate a priori knowledge about the memory requirements is available. We think that most real-time systems, especially ones with hard timing constraints, fulfill this precondition.
Examiner: Per Lindström
, 2005
As multithreaded applications and multiprocessor systems become more and more common, the pressure on the underlying operating system increases, and the memory subsystem is often the primary bottleneck in terms of scalability. For multithreaded applications such as databases to function efficiently, they need an efficient, scalable memory allocator. Most standard memory allocators fail under a multithreaded application because they employ a single memory pool from which to draw memory. Using a single pool increases contention, as multiple threads attempt to allocate or free memory and spend much of their time waiting for exclusive access. The allocator implementation in this thesis employs multiple memory pools to reduce contention, and the results show an improvement of close to 14 times on a Solaris 9 machine using 16 threads.

