Disco: Running commodity operating systems on scalable multiprocessors
- ACM Transactions on Computer Systems
, 1997
Cited by 164 (6 self)
In this paper we examine the problem of extending modern operating systems to run efficiently on large-scale shared memory multiprocessors without a large implementation effort. Our approach brings back an idea popular in the 1970s, virtual machine monitors. We use virtual machines to run multiple commodity operating systems on a scalable multiprocessor. This solution addresses many of the challenges facing the system software for these machines. We demonstrate our approach with a prototype called Disco that can run multiple copies of Silicon Graphics' IRIX operating system on a multiprocessor. Our experience shows that the overheads of the monitor are small and that the approach provides scalability as well as the ability to deal with the non-uniform memory access time of these systems. To reduce the memory overheads associated with running multiple operating systems, we have developed techniques where the virtual machines transparently share major data structures such as the program code and the file system buffer cache. We use the distributed system support of modern operating systems to export a partial single system image to the users. The overall solution achieves most of the benefits of operating systems customized for scalable multiprocessors yet it can be achieved with a significantly smaller implementation effort.
Unifying Data and Control Transformations for Distributed Shared-Memory Machines
, 1994
Cited by 150 (10 self)
We present a unified approach to locality optimization that employs both data and control transformations. Data transformations include changing the array layout in memory. Control transformations involve changing the execution order of programs. We have developed new techniques for compiler optimizations for distributed shared-memory machines, although the same techniques can be used for sequential machines with a memory hierarchy. Our compiler optimizations are based on an algebraic representation of data mappings and a new data locality model. We present a pure data transformation algorithm and an algorithm unifying data and control transformations. While there has been much work on control transformations, the opportunities for data transformations have been largely neglected. In fact, data transformations have the advantage of being applicable to programs that cannot be optimized with control transformations. The unified algorithm, which performs data and control transformations s...
Using Processor Affinity in Loop Scheduling on Shared-Memory Multiprocessors
- IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 133 (2 self)
Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempt to achieve the minimum completion time by distributing the workload as evenly as possible, while minimizing the number of synchronization operations required. In this paper we consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to non-local data. We show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. We propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. We compare the performance of this new algorithm to ot...
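The idea of co-locating iterations with their data can be sketched roughly as follows. This is only an illustrative sketch, not the algorithm from the paper: the function names `initial_partition` and `next_chunk`, the 1/P chunking rule, and the choice to steal from the tail of the most loaded queue are all assumptions made for the example. The point it shows is that a deterministic initial partition sends the same iterations to the same processor every time the loop runs, while stealing is used only to rebalance.

```python
def initial_partition(n_iters, n_procs):
    """Give each processor a fixed contiguous block of iterations.

    The assignment is deterministic, so repeated executions of the loop
    send iteration i to the same processor (an assumed affinity policy).
    """
    chunk = -(-n_iters // n_procs)  # ceiling division
    return [
        list(range(min(p * chunk, n_iters), min((p + 1) * chunk, n_iters)))
        for p in range(n_procs)
    ]


def next_chunk(queues, proc):
    """Return the next iterations for `proc`, mutating `queues`.

    Work comes from the processor's own queue while it has any; otherwise
    the processor steals from the most loaded queue (assumed policy).
    """
    n_procs = len(queues)
    if queues[proc]:
        victim = proc
    else:
        victim = max(range(n_procs), key=lambda q: len(queues[q]))
    if not queues[victim]:
        return []  # no work left anywhere
    take = -(-len(queues[victim]) // n_procs)  # 1/P of what remains
    if victim == proc:
        # Local work is taken from the front of the queue.
        chunk, queues[victim] = queues[victim][:take], queues[victim][take:]
    else:
        # Stolen work is taken from the tail, leaving the owner's front intact.
        chunk, queues[victim] = queues[victim][-take:], queues[victim][:-take]
    return chunk
```

In this sketch synchronization on a remote queue happens only when a processor runs out of local work, which is one plausible way to trade load balance against locality.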
Power Aware Page Allocation
- In Architectural Support for Programming Languages and Operating Systems
, 2000
Cited by 121 (9 self)
One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that power these mobile devices. Memory is a particularly important target for efforts to improve energy efficiency. Memory technology is becoming available that offers power management features such as the ability to put individual chips in any one of several different power modes. In this paper we explore the interaction of page placement with static and dynamic hardware policies to exploit these emerging hardware features. In particular, we consider page allocation policies that can be employed by an informed operating system to complement the hardware power management strategies. We perform experiments using two complementary simulation environments: a trace-driven simulator with workload traces that are representative of mobile computing and an execution-driven simulator with a detailed processor/memory model and a more memory-intensive set of benchmarks (SPEC2000). Our results make a compelling case for a cooperative hardware/software approach for exploiting power-aware memory, with down to as little as 45% of the Energy·Delay for the best static policy and 1% to 20% of the Energy·Delay for a traditional full-power memory.
Data Transformations for Eliminating Conflict Misses
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
Cited by 118 (12 self)
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occurring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PadLite only uses array and column dimension sizes, relying on assumptions about common array reference patterns. Pad analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PadLite can eliminate conflicts for benchmarks, but Pad is more effective over a range of cache and problem sizes. Padding reduces c...
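As a rough illustration of the intra-variable padding idea in this abstract, the sketch below maps the first element of each column of a column-major array to a set in a direct-mapped cache, then repeats the calculation with a padded column size. The cache geometry, the array sizes, and the helper names `set_index`, `column_sets`, and `distinct_column_sets` are assumptions chosen for the example; this is not the Pad or PadLite algorithm itself, only the kind of arithmetic such an analysis might rely on.

```python
from math import gcd

LINE_BYTES = 32   # assumed cache line size
NUM_SETS = 256    # assumed direct-mapped cache with 256 sets (8 KiB)
ELEM_BYTES = 8    # assumed double-precision array elements


def set_index(addr):
    """Cache set that a byte address maps to in a direct-mapped cache."""
    return (addr // LINE_BYTES) % NUM_SETS


def column_sets(column_elems, num_columns, base=0):
    """Set index of the first element of each column (column-major layout)."""
    stride = column_elems * ELEM_BYTES
    return [set_index(base + j * stride) for j in range(num_columns)]


def distinct_column_sets(column_elems):
    """Number of distinct sets that column starts can occupy.

    Assumes the column stride is a whole number of cache lines, so the
    count follows directly from the gcd of the stride and the set count.
    """
    stride_lines = (column_elems * ELEM_BYTES) // LINE_BYTES
    return NUM_SETS // gcd(stride_lines, NUM_SETS)


# Unpadded: a 1024-element column spans exactly the cache size, so every
# column starts in the same set and accesses along a row conflict.
unpadded = column_sets(1024, 4)

# Padded: growing the column to 1028 elements shifts each column by one line.
padded = column_sets(1028, 4)
```

With these assumed sizes the unpadded columns all start in set 0, while the padded columns start in consecutive sets, which is the effect intra-variable padding aims for.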
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations
- In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1994
Cited by 113 (1 self)
We have developed compiler algorithms that analyze coarse-grained, explicitly parallel programs and restructure their shared data to minimize the number of false sharing misses. The algorithms analyze the per-process data accesses to shared data, use this information to pinpoint the data structures that are prone to false sharing and choose an appropriate transformation to reduce it. The algorithms eliminated an average (across the entire workload) of 64% of false sharing misses, and in two programs more than 90%. However, how well the reduction in false sharing misses translated into improved execution time depended heavily on the memory subsystem architecture and previous programmer efforts to optimize for locality. On a multiprocessor with a large cache configuration and high cache miss penalty, the transformations improved the execution time of programmer-unoptimized applications by as much as 60%. However, on programs where previous programmer efforts to improve data locality had ...
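One common data restructuring that targets false sharing is padding per-process data out to cache-line boundaries. The sketch below is a hypothetical illustration of that effect, not the compiler algorithms from the paper: the 64-byte line size, the 8-byte counters, and the helper names `lines_touched` and `falsely_shared_pairs` are assumptions. It counts how many pairs of processes have their private counters on the same cache line before and after padding.

```python
LINE_BYTES = 64   # assumed cache line size
ELEM_BYTES = 8    # assumed size of one per-process counter


def lines_touched(num_procs, stride_bytes):
    """Cache line index holding each process's private counter."""
    return [(p * stride_bytes) // LINE_BYTES for p in range(num_procs)]


def falsely_shared_pairs(num_procs, stride_bytes):
    """Count pairs of processes whose counters fall on the same line."""
    lines = lines_touched(num_procs, stride_bytes)
    return sum(
        1
        for a in range(num_procs)
        for b in range(a + 1, num_procs)
        if lines[a] == lines[b]
    )


# Packed layout: eight 8-byte counters share a single 64-byte line.
packed_pairs = falsely_shared_pairs(8, ELEM_BYTES)

# Padded layout: each counter is placed on its own line.
padded_pairs = falsely_shared_pairs(8, LINE_BYTES)
```

The padded layout spends more memory per counter, which is one reason the benefit depends on the cache configuration, as the abstract notes for execution time.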
Adaptive Cache Coherency for Detecting Migratory Shared Data
- In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
Cited by 112 (3 self)
Parallel programs exhibit a small number of distinct data-sharing patterns. A common data-sharing pattern, migratory access, is characterized by exclusive read and write access by one processor at a time to a shared datum. We describe a family of adaptive cache coherency protocols that dynamically identify migratory shared data in order to reduce the cost of moving them. The protocols use a standard memory model and processor-cache interface. They do not require any compile-time or run-time software support. We describe implementations for bus-based multiprocessors and for shared-memory multiprocessors that use directory-based caches. These implementations are simple and would not significantly increase hardware cost. We use trace- and execution-driven simulation to compare the performance of the adaptive protocols to standard write-invalidate protocols. These simulations indicate that, compared to conventional protocols, the use of the adaptive protocol can almost halve the number of i...
Lazy Release Consistency for Distributed Shared Memory
, 1995
Cited by 95 (0 self)
A software distributed shared memory (DSM) system allows shared memory parallel programs to execute on networks of workstations. This thesis presents a new class of protocols that has lower communication requirements than previous DSM protocols, and can consequently achieve higher performance. The lazy release consistent protocols achieve this reduction in communication by piggybacking consistency information on top of existing synchronization transfers. Some of the protocols also improve performance by speculatively moving data. We evaluate the impact of these features by comparing the performance of a software DSM using lazy protocols with that of a DSM using previous eager protocols. We found that seven of our eight applications performed better on the lazy system, and four of the applications showed performance speedups of at least 18%. As part of this comparison, we show that the cost of executing the slightly more complex code of the lazy protocols is far less important than the ...
Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency
, 1994
Cited by 61 (5 self)
A distributed shared memory (DSM) system allows shared memory parallel programs to be executed on distributed memory multiprocessors. The challenge in building a DSM system is to achieve good performance over a wide range of shared memory programs without requiring extensive modifications to the source code. The performance challenge translates into reducing the amount of communication performed by the DSM system to that performed by an equivalent message passing program. This thesis describes four novel techniques for reducing the communication overhead of DSM, including: (i) the use of software release consistency, (ii) support for multiple consistency protocols, (iii) a multiple writer protocol, and (iv) an update timeout mechanism. Release consistency allows modifications of shared data to be handled via a delayed update queue, which masks network latencies. Providing multiple cons...
Hierarchical Clustering: A Structure for Scalable Multiprocessor Operating System Design
- Journal of Supercomputing
, 1993
Cited by 57 (18 self)
We introduce the concept of Hierarchical Clustering as a way to structure shared memory multiprocessor operating systems for scalability. As the name implies, the concept is based on clustering and hierarchical system design. Hierarchical Clustering leads to a modular system, composed of easy-to-design and efficient building blocks. The resulting structure is scalable because it (i) maximizes locality, which is key to good performance in NUMA systems, and (ii) provides for concurrency that increases linearly with the number of processors. At the same time, there is tight coupling within a cluster, so the system performs well for local interactions, which are expected to constitute the common case. A clustered system can easily be adapted to different hardware configurations and architectures by changing the size of the clusters. We show how this structuring technique is applied to the design of a microkernel-based operating system called HURRICANE. This prototype system is the first complete and running implementation of its kind, and demonstrates the feasibility of a hierarchically clustered system. We present performance results based on the prototype, demonstrating the characteristics and behavior of a clustered system. In particular, we show how clustering trades off the efficiencies of tight coupling for the advantages of replication, increased locality, and decreased lock contention. We describe some of the lessons we learned from our implementation efforts and close with a discussion of our future work.

