Privatization techniques for software transactional memory
- In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing (PODC’07)
Cited by 50 (17 self)
Early implementations of software transactional memory (STM) assumed that sharable data would be accessed only within transactions. Memory may appear inconsistent in programs that violate this assumption, even when program logic would seem to make extra-transactional accesses safe. Designing STM systems that avoid such inconsistency has been dubbed the privatization problem. We argue that privatization comprises a pair of symmetric subproblems: private operations may fail to see updates made by transactions that have committed but not yet completed; conversely, transactions that are doomed but have not yet aborted may see updates made by private code, causing them to perform erroneous, externally visible operations. We explain how these problems arise in different styles of STM, present strategies to address them, and discuss their implementation tradeoffs. We also propose a taxonomy of contracts between the system and the user, analogous to programmer-centric memory consistency models, which allow us to classify programs based on their privatization requirements. Finally, we present empirical comparisons of several privatization strategies. Our results suggest that the best strategy may depend on application characteristics.
An Integrated Hardware-Software Approach to Flexible Transactional Memory
, 2006
Cited by 24 (8 self)
There has been considerable recent interest in both hardware and software transactional memory (TM). We present an intermediate approach, in which hardware serves to accelerate a TM implementation controlled fundamentally by software. Specifically, we describe an alert on update mechanism (AOU) that allows a thread to receive fast, asynchronous notification when previously-identified lines are written by other threads, and a programmable data isolation mechanism (PDI) that allows a thread to hide its speculative writes from other threads, ignoring conflicts, until software decides to make them visible. These mechanisms reduce bookkeeping, validation, and copying overheads without constraining software policy on a host of design decisions. We have used AOU and PDI to implement a hardware-accelerated software transactional memory system we call RTM.
K42: Building a Complete Operating System
, 2006
Cited by 22 (2 self)
K42 is one of the few recent research projects examining operating system design structure issues in the context of new whole-system design. K42 is open source and was designed from the ground up to perform well and to be scalable, customizable, and maintainable. The project was begun in 1996 by a team at IBM Research. Over the last nine years, between six and twenty researchers and developers across IBM, collaborating universities, and national laboratories have worked on K42. K42 supports the Linux API and ABI, and is able to run unmodified Linux applications and libraries. The approach we took in K42 to achieve scalability and customizability has been successful. The project has produced positive research results, has resulted in contributions to Linux and the Xen hypervisor on Power, and continues to be a rich platform for exploring system software technology. Today, K42 is one of the key exploratory platforms in the DOE’s FAST-OS program, is being used as a prototyping vehicle in IBM’s PERCS project, and is being used by universities and national labs for exploratory research. In this paper, we provide insight into building an entire system by discussing the motivation and history of K42, describing its fundamental technologies, and presenting an overview of the research directions we have been pursuing.
An Analysis of Linux Scalability to Many Cores
Cited by 22 (3 self)
This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48-core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques (this paper introduces one new technique, sloppy counters), these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required in total 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet.
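The abstract names sloppy counters without describing them. As a rough illustration only, the sketch below shows one way a reference counter could be split into a shared central count plus per-core pools of spare references, so that most operations touch only core-local state. The class name `SloppyCounter`, its methods, and the `threshold` parameter are assumptions made for this example, not the paper's code, and the single-threaded simulation omits the per-core locking or atomics a real kernel would need.

```python
class SloppyCounter:
    """Single-threaded simulation of a counter split across cores.

    Hypothetical sketch: names and the threshold policy are assumptions.
    Invariant: central == references in use + sum of per-core spares.
    """

    def __init__(self, num_cores, threshold=8):
        self.central = 0                 # shared count (the contended state)
        self.spare = [0] * num_cores     # per-core spare references
        self.threshold = threshold       # max spares a core may hoard

    def acquire(self, core, n=1):
        """Take n references on behalf of `core`."""
        if self.spare[core] >= n:
            # Fast path: use local spares, no shared state touched.
            self.spare[core] -= n
        else:
            # Slow path: fall back to the shared central counter.
            self.central += n

    def release(self, core, n=1):
        """Return n references; they become local spares first."""
        self.spare[core] += n
        excess = self.spare[core] - self.threshold
        if excess > 0:
            # Too many spares hoarded locally: hand the excess back.
            self.spare[core] -= excess
            self.central -= excess

    def in_use(self):
        """Exact count of references in use (requires visiting every core)."""
        return self.central - sum(self.spare)
```

In this sketch, reading the exact value is the expensive operation because it visits every core, which is the usual price of making increments and decrements cheap.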
Performance of memory reclamation for lockless synchronization
, 2007
Cited by 9 (8 self)
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims elements once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes (quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation) using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect the memory reclamation performance. We discuss the consequences of our results for programmers and algorithm designers. Finally, we describe the use of one scheme, quiescent-state-based reclamation, in the context of an OS kernel, an execution environment that is well suited to this scheme.
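To make the idea behind quiescent-state-based reclamation concrete, here is a minimal single-threaded simulation. It assumes the common formulation in which a retired element may be reclaimed only after every thread has passed through at least one quiescent state since the retirement. The `QsbrDomain` class and its method names are hypothetical, and nothing here reflects the microbenchmark or implementations evaluated in the paper.

```python
class QsbrDomain:
    """Toy model of quiescent-state-based reclamation (QSBR).

    Hypothetical sketch: threads are modelled as integer ids and
    "freeing" an element just moves it to the `reclaimed` list.
    """

    def __init__(self, num_threads):
        # How many quiescent states each thread has announced so far.
        self.quiescent_counts = [0] * num_threads
        self.pending = []     # (snapshot of counts at retire time, element)
        self.reclaimed = []   # elements whose grace period has elapsed

    def retire(self, element):
        """Unlink `element`; readers may still hold references to it."""
        self.pending.append((list(self.quiescent_counts), element))

    def quiescent_state(self, tid):
        """Thread `tid` declares it holds no references to shared elements."""
        self.quiescent_counts[tid] += 1
        self._reclaim()

    def _reclaim(self):
        still_pending = []
        for snapshot, element in self.pending:
            # Safe once every thread has been quiescent since the retire.
            if all(now > then
                   for now, then in zip(self.quiescent_counts, snapshot)):
                self.reclaimed.append(element)
            else:
                still_pending.append((snapshot, element))
        self.pending = still_pending
```

In this model readers pay nothing per access; the cost is that a single thread that never announces a quiescent state delays reclamation indefinitely.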
Making RCU safe for deep sub-millisecond response realtime applications
- In Proceedings of the 2004 USENIX Annual Technical Conference (FREENIX Track)
, 2004
Cited by 8 (6 self)
Linux™ has long been used for soft realtime applications. More recent work is preparing Linux for more aggressive realtime use, with scheduling latencies in the small number of hundreds of microseconds (that is right, microseconds, not milliseconds). The current Linux 2.6 RCU implementation both helps and hurts. It helps by removing locks, thus reducing latency in general, but hurts by causing large numbers of RCU callbacks to be invoked all at once at the end of the grace period. This batching of callback invocation improves throughput, but unacceptably degrades realtime response for the more discerning realtime applications. This paper describes modifications to RCU that greatly reduce its effect on scheduling latency, without significantly degrading performance for non-realtime Linux servers. Although these modifications appear to prevent RCU from interfering with realtime scheduling, other Linux kernel components are still problematic. We are therefore working on tools to help identify the remaining problematic components and to definitively determine whether RCU is still an issue. In any case, to the best of our knowledge, this is the first time that anything resembling RCU has been modified to accommodate the needs of realtime applications.
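The abstract does not say how the modifications work. Purely to illustrate the trade-off it describes, the sketch below contrasts invoking every ready callback in one pass with invoking at most a fixed number per pass, which bounds how long any single pass can delay other work. `CallbackQueue`, `invoke_all`, and `invoke_some` are hypothetical names for this example; this is not the Linux RCU API and not necessarily the approach taken in the paper.

```python
from collections import deque


class CallbackQueue:
    """Toy queue of callbacks whose grace period has already ended."""

    def __init__(self):
        self._ready = deque()

    def enqueue(self, callback):
        self._ready.append(callback)

    def invoke_all(self):
        """Run every ready callback at once: good throughput, unbounded pass."""
        return self.invoke_some(len(self._ready))

    def invoke_some(self, limit):
        """Run at most `limit` callbacks, leaving the rest for a later pass."""
        invoked = 0
        while self._ready and invoked < limit:
            callback = self._ready.popleft()
            callback()
            invoked += 1
        return invoked
```

A caller that cares about latency would call `invoke_some` repeatedly with a small limit and let other work run between calls, accepting lower throughput in exchange.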
Experience Distributing Objects in an SMMP OS
Cited by 8 (2 self)
Designing and implementing system software so that it scales well on shared-memory multiprocessors (SMMPs) has proven to be surprisingly challenging. To improve scalability, most designers to date have focused on concurrency by iteratively eliminating the need for locks and reducing lock contention. However, our experience indicates that locality is just as, if not more, important and that focusing on locality ultimately leads to a more scalable system. In this paper, we describe a methodology and a framework for constructing system software structured for locality, exploiting techniques similar to those used in distributed systems. Specifically, we found two techniques to be effective in improving scalability of SMMP operating systems: (i) an object-oriented structure that minimizes sharing by providing a natural mapping from independent requests to independent code paths and data structures, and (ii) the selective partitioning, distribution, and replication of object implementations in order to improve locality. We describe concrete examples of distributed objects and our experience implementing them. We demonstrate that the distributed implementations improve the scalability of operating-system-intensive parallel workloads.
Enabling autonomic system software with hot-swapping
- IBM Systems Journal
, 2003
Cited by 6 (1 self)
Autonomic computing systems are designed to be self-diagnosing and self-modifying, such that they notice performance and correctness problems, pinpoint their causes, and react accordingly. These abilities can increase performance, uptime, and security, while simultaneously reducing the effort and knowledge required of system administrators. One way that systems can support these abilities is by allowing monitoring code, diagnostic code, and function implementations to be dynamically inserted and removed in live systems. This “hot swapping”
Introducing technology into the Linux kernel: a case study
- SIGOPS Oper. Syst. Rev
Cited by 6 (0 self)
There can be no doubt that a great many technologies have been added to Linux™ over the past ten years. What is less well-known is that it is often necessary to introduce a large amount of Linux into a given technology in order to successfully introduce that technology into Linux. This paper illustrates such an introduction of Linux into technology with Read-Copy Update (RCU). The RCU API’s evolution over time clearly shows that Linux’s extremely diverse set of workloads and platforms has changed RCU to a far greater degree than RCU has changed Linux, and it is reasonable to expect that other technologies that might be proposed for inclusion into Linux would face similar challenges. In addition, this paper presents a summary of lessons learned and an attempt to foresee what additional challenges Linux might present to RCU.
Making lockless synchronization fast: Performance implications of memory reclamation
- In 2006 International Parallel and Distributed Processing Symposium (IPDPS 2006)
, 2006
Cited by 4 (2 self)
Achieving high performance for concurrent applications on modern multiprocessors remains challenging. Many programmers avoid locking to improve performance, while others replace locks with non-blocking synchronization to protect against deadlock, priority inversion, and convoying. In both cases, dynamic data structures that avoid locking require a memory reclamation scheme that reclaims nodes once they are no longer in use. The performance of existing memory reclamation schemes has not been thoroughly evaluated. We conduct the first fair and comprehensive comparison of three recent schemes (quiescent-state-based reclamation, epoch-based reclamation, and hazard-pointer-based reclamation) using a flexible microbenchmark. Our results show that there is no globally optimal scheme. When evaluating lockless synchronization, programmers and algorithm designers should thus carefully consider the data structure, the workload, and the execution environment, each of which can dramatically affect memory reclamation performance.

