Results 1 - 10
of
10
An Analysis of Linux Scalability to Many Cores
"... This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard paralle ..."
Abstract
-
Cited by 22 (3 self)
- Add to MetaCart
This paper analyzes the scalability of seven system applications (Exim, memcached, Apache, PostgreSQL, gmake, Psearchy, and MapReduce) running on Linux on a 48core computer. Except for gmake, all applications trigger scalability bottlenecks inside a recent Linux kernel. Using mostly standard parallel programming techniques— this paper introduces one new technique, sloppy counters—these bottlenecks can be removed from the kernel or avoided by changing the applications slightly. Modifying the kernel required in total 3002 lines of code changes. A speculative conclusion from this analysis is that there is no scalability reason to give up on traditional operating system organizations just yet. 1
Project Kittyhawk: building a global-scale computer: Blue Gene/P as a generic computing platform
- SIGOPS Operating Systems Review
"... This paper describes Project Kittyhawk, an undertaking at IBM Research to explore the construction of a nextgeneration platform capable of hosting many simultaneous web-scale workloads. We hypothesize that for a large class of web-scale workloads the Blue Gene/P platform is an order of magnitude mor ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
This paper describes Project Kittyhawk, an undertaking at IBM Research to explore the construction of a nextgeneration platform capable of hosting many simultaneous web-scale workloads. We hypothesize that for a large class of web-scale workloads the Blue Gene/P platform is an order of magnitude more efficient to purchase and operate than the commodity clusters in use today. Driven by scientific computing demands the Blue Gene designers pursued an aggressive system-on-a-chip methodology that led to a scalable platform composed of air-cooled racks. Each rack contains more than a thousand independent computers with highspeed interconnects inside and between racks. We postulate that the same demands of efficiency and density apply to web-scale platforms. This project aims to develop the system software to enable Blue Gene/P as a generic platform capable of being used by heterogeneous workloads. We describe our firmware and operating system work to provide Blue Gene/P with generic system software, one of the results of which is the ability to run thousands of heterogeneous Linux instances connected by TCP/IP networks over the high-speed internal interconnects.
Your computer is already a distributed system. Why isn’t your OS?
"... We argue that a new OS for a multicore machine should be designed ground-up as a distributed system, using concepts from that field. Modern hardware resembles a networked system even more than past large multiprocessors: ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
We argue that a new OS for a multicore machine should be designed ground-up as a distributed system, using concepts from that field. Modern hardware resembles a networked system even more than past large multiprocessors:
Corey: An Operating System for Many Cores
"... Multiprocessor application performance can be limited by the operating system when the application uses the operating system frequently and the operating system services use data structures shared and modified by multiple processing cores. If the application does not need the sharing, then the opera ..."
Abstract
- Add to MetaCart
Multiprocessor application performance can be limited by the operating system when the application uses the operating system frequently and the operating system services use data structures shared and modified by multiple processing cores. If the application does not need the sharing, then the operating system will become an unnecessary bottleneck to the application’s performance. This paper argues that applications should control sharing: the kernel should arrange each data structure so that only a single processor need update it, unless directed otherwise by the application. Guided by this design principle, this paper proposes three operating system abstractions (address ranges, kernel cores, and shares) that allow applications to control inter-core sharing and to take advantage of the likely abundance of cores by dedicating cores to specific operating system functions. Measurements of microbenchmarks on the Corey prototype operating system, which embodies the new abstractions, show how control over sharing can improve performance. Application benchmarks, using MapReduce and a Web server, show that the improvements can be significant for overall performance: MapReduce on Corey performs 25 % faster than on Linux when using 16 cores. Hardware event counters confirm that these improvements are due to avoiding operations that are expensive on multicore machines. 1
Systems Group, ETH Zurich
"... Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but ..."
Abstract
- Add to MetaCart
Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants pose serious challenges for operating system structures. We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing. We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware. 1
The Multikernel: A new . . .
"... Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but ..."
Abstract
- Add to MetaCart
Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants pose serious challenges for operating system structures. We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing. We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.
Enhancing Operating System Support for . . .
"... Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how ..."
Abstract
- Add to MetaCart
Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We demonstrate three case studies on how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention in the shared cache among applications, (2) detecting memory regions with bad cache usage, which helps in isolating these regions to reduce cache pollution, and (3) detecting sharing among threads, which helps in clustering threads to improve locality. Using realistic applications from standard benchmark suites, the following performance improvements were achieved: (1) up to 27 % improvement in IPC (instructions-per-cycle) due to cache partition sizing; (2) up to 10 % reduction in cache miss rates due to reduced cache pollution, resulting in up to 7 % improvement in IPC; and (3) up to 70 % reduction in remote cache accesses due to thread clustering, resulting in up to 7 % application-level improvement.
Applications on Virtualized Platforms
, 2010
"... NOTES: This report has been submitted for early dissemination of its contents. It will thus be subjective to change without prior notice. It will also be probabaly copyrighted if accepted for publication in a referred conference of journal. Parallel Processing Institute makes no gurantee on the cons ..."
Abstract
- Add to MetaCart
NOTES: This report has been submitted for early dissemination of its contents. It will thus be subjective to change without prior notice. It will also be probabaly copyrighted if accepted for publication in a referred conference of journal. Parallel Processing Institute makes no gurantee on the consequences of using the viewpoints and results in the technical report. It requires prior specific permission to republish, to redistribute to copy the report elsewhere. Characterizing the Performance and Scalability of Many-core
A Case for Scaling Applications to Many-core with OS Clustering
"... This paper proposes an approach to scaling UNIX-like operating systems for many cores in a backward-compatible way, which still enjoys common wisdom in new operating system designs. The proposed system, called Cerberus, mitigates contention on many shared data structures within OS kernels by cluster ..."
Abstract
- Add to MetaCart
This paper proposes an approach to scaling UNIX-like operating systems for many cores in a backward-compatible way, which still enjoys common wisdom in new operating system designs. The proposed system, called Cerberus, mitigates contention on many shared data structures within OS kernels by clustering multiple commodity operating systems atop a VMM, and providing applications with the traditional shared memory interface. Cerberus extends a traditional VMM with efficient support for resource sharing and communication among the clustered operating systems. It also routes system calls of an application among operating systems, to provide applications with the illusion of running on a single operating system. We have implemented a prototype system based on Xen/Linux, which runs on an Intel machine with 16 cores and an AMD machine with 48 cores. Experiments with an unmodified MapReduce application, dbench, Apache Web Server and Memcached show that, given the nontrivial performance overhead incurred by the virtualization layer, Cerberus achieves up to 1.74X and 4.95X performance speedup compared to native Linux. It also scales better than a single Linux configuration. Profiling results further show that Cerberus wins due to mitigated contention and more efficient use of resources.
The Multikernel: A new OS architecture for . . .
, 2009
"... Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but ..."
Abstract
- Add to MetaCart
Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants pose serious challenges for operating system structures. We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing. We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

