Helios: Heterogeneous multiprocessing with satellite kernels
- In Proceedings of the 22nd ACM Symposium on Operating Systems Principles
, 2009
Cited by 18 (0 self)
Helios is an operating system designed to simplify the task of writing, deploying, and tuning applications for heterogeneous platforms. Helios introduces satellite kernels, which export a single, uniform set of OS abstractions across CPUs of disparate architectures and performance characteristics. Access to I/O services such as file systems is made transparent via remote message passing, which extends a standard microkernel message-passing abstraction to a satellite kernel infrastructure. Helios retargets applications to available ISAs by compiling from an intermediate language. To simplify deploying and tuning application performance, Helios exposes an affinity metric to developers. Affinity provides a hint to the operating system about whether a process would benefit from executing on the same platform as a service it depends upon. We developed satellite kernels for an XScale programmable I/O card and for cache-coherent NUMA architectures. We offloaded several applications and operating system components, often by changing only a single line of metadata. We show up to a 28% performance improvement by offloading tasks to the XScale I/O card. On a mail-server benchmark, we show a 39% improvement in performance by automatically splitting the application among multiple NUMA domains.
Hardware support for spin management in overcommitted virtual machines
- In Proc. of 15th PACT
, 2006
Cited by 9 (6 self)
Multiprocessor operating systems (OSs) pose several unique and conflicting challenges to System Virtual Machines (System VMs). For example, most existing system VMs resort to gang scheduling a guest OS’s virtual processors (VCPUs) to avoid OS synchronization overhead. However, gang scheduling is infeasible for some application domains, and is inflexible in other domains. In an overcommitted environment, an individual guest OS has more VCPUs than available physical processors (PCPUs), precluding the use of gang scheduling. In such an environment, we demonstrate a more than two-fold increase in runtime when transparently virtualizing a chip multiprocessor’s cores. To combat this problem, we propose a hardware technique to detect several cases when a VCPU is not performing useful work, and suggest preempting that VCPU to run a different, more productive VCPU. Our technique can dramatically reduce cycles wasted on OS synchronization, without requiring any semantic information from the software. We then present a case study, typical of server consolidation, to demonstrate the potential of more flexible scheduling policies enabled by our technique. We propose one such policy that logically partitions the CMP cores between guest VMs. This policy increases throughput by 10–25% for consolidated server workloads due to improved cache locality and core utilization, and substantially improves performance isolation in private caches.
HMTT: a platform independent full-system memory trace monitoring system
- In SIGMETRICS ’08: Proceedings of the 2008 ACM SIGMETRICS international
, 2008
Cited by 5 (3 self)
Memory trace analysis is an important technology for architecture research, system software (i.e., OS, compiler) optimization, and application performance improvements. Many approaches have been used to collect memory traces, such as simulation, binary instrumentation, and hardware snooping. However, they usually have limitations in time, accuracy, and capacity. In this paper we propose a platform-independent memory trace monitoring system, which is able to track the virtual memory reference traces of full systems (including OS, VMMs, libraries, and applications). The system adopts a DIMM-snooping mechanism that uses hardware boards plugged into DIMM slots to snoop on memory traffic. This approach has several advantages: it is fast, complete, undistorted, and portable. Three key techniques are proposed to address the system design challenges with this
Ganev, “Re-architecting VMMs for Multicore Systems: The Sidecore Approach”
Cited by 5 (0 self)
Future many-core platforms present scalability challenges to VMMs, including the need to efficiently utilize their processor and cache resources. Focusing on platform virtualization, we address these challenges by devising new virtualization methods that not only work with, but actually exploit the many-core nature of future processors. Specifically, we utilize the fact that cores will differ with respect to their current internal processor and memory states. The hypervisor, or VMM, then leverages these differences to substantially improve VMM performance and better utilize these cores. The key idea underlying this work is simple: to carry out some privileged VMM operation, rather than forcing a core to undergo an expensive internal state change via traps, such as VMexit in Intel’s VT architecture, why not have the operation carried out by a remote core that is already in the appropriate state? Termed the sidecore approach to running VMM-level functionality, it can be used to run VMM services more efficiently on remote cores that are already in VMM state. This paper demonstrates the viability and utility of the sidecore approach for two VMM-level classes of functionality: (1) efficient VM-VMM communication in VT-enabled processors and (2) interrupt virtualization for self-virtualized devices.
A case for an over-provisioned multicore system: Energy efficient processing of multithreaded programs
, 2007
Cited by 4 (2 self)
Technology scaling has provided system designers with an exploding transistor budget, far more than what was available when the core principles behind many existing commodity microprocessors were envisioned. With this tremendous growth, however, comes a whole new set of engineering challenges involving power density, thermal efficiency, programmability and so on. In this paper, we study another important trend in high performance microprocessors: the reduction in the Simultaneously Active Fraction (SAF), the fraction of the entire chip resources that can be active simultaneously, given a target power envelope. As the improvement in the energy efficiency of individual transistor devices is lagging behind the growth in their integration capacity, we find that the SAF is monotonically decreasing for each successive technology generation. Given this increasing constraint on the SAF, we examine the utility of temporarily suspending computation on a core as a means of reducing the SAF and hence remaining within the confines of cost-effective cooling and power delivery. We investigate an SAF-aware over-provisioned multicore system (OPMS), where only a subset of the available cores is employed to perform active computation at any given time, by allowing the individual cores to transition between active and inactive states. Though there are several possible directions for utilizing such an over-provisioned system, this paper focuses on energy efficient dynamic task redistribution. In particular, this paper examines the use of Computation Spreading, a recently proposed technique for runtime specialization of homogeneous multicores, in an OPMS. We show several benefits for such an OPMS design, including reductions in energy, runtime, and superior thermal characteristics. Overall, our technique improves the energy-delay product of the commercial workloads we examine by 5–20%.
Forwardflow: a scalable core for power-constrained CMPs
- In Proc. 37th Intl. Symp. on Computer Architecture
, 2010
Cited by 4 (1 self)
Chip Multiprocessors (CMPs) are now commodity hardware, but commoditization of parallel software remains elusive. In the near term, the current trend of increased core-per-socket count will continue, despite a lack of parallel software to exercise the hardware. Future CMPs must deliver thread-level parallelism when software provides threads to run, but must also continue to deliver performance gains for single threads by exploiting instruction-level parallelism and memory-level parallelism. However, power limitations will prevent conventional cores from exploiting both simultaneously. This work presents the Forwardflow Architecture, which can scale its execution logic up to run single threads, or down to run multiple threads in a CMP. Forwardflow dynamically builds an explicit internal dataflow representation from a conventional instruction set architecture, using forward dependence pointers to guide instruction wakeup, selection, and issue. Forwardflow’s backend is organized into discrete units that can be individually (de-)activated, allowing each core’s performance to be scaled by system software at the architectural level. On single threads, Forwardflow core scaling yields a mean runtime reduction of 21% for a 37% increase in power consumption. For multithreaded workloads, a Forwardflow-based CMP allows system software to select the performance point that best matches available power.
Software Data Spreading: Leveraging Distributed Caches to Improve Single Thread Performance
Cited by 4 (3 self)
Single thread performance remains an important consideration even for multicore, multiprocessor systems. As a result, techniques for improving single thread performance using multiple cores have received considerable attention. This work describes a technique, software data spreading, that leverages the cache capacity of extra cores and extra sockets rather than their computational resources. Software data spreading is a software-only technique that uses compiler-directed thread migration to aggregate cache capacity across cores and chips and improve performance. This paper describes an automated scheme that applies data spreading to various types of loops. Experiments with a set of SPEC2000, SPEC2006, NAS, and microbenchmark workloads show that data spreading can provide a speedup of over 2, averaging 17% for the SPEC and NAS applications on two systems. In addition, despite using more cores for the same computation, data spreading actually saves power since it reduces access to DRAM.
Serializing Instructions in System-Intensive Workloads: Amdahl’s Law Strikes Again
- 14TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA ’08)
, 2008
Cited by 3 (0 self)
Serializing instructions (SIs), such as writes to control registers, have many complex dependencies, and are difficult to execute out-of-order (OoO). To avoid unnecessary complexity, processors often serialize the pipeline to maintain sequential semantics for these instructions. We observe frequent SIs across several system-intensive workloads and three ISAs: SPARC V9, X86-64, and PowerPC. As explained by Amdahl’s Law, these SIs, which create serial regions within the instruction-level parallel execution of a single thread, can have a significant impact on performance. For the SPARC ISA (after removing TLB and register window effects), we show that operating system (OS) code incurs an 8–45% performance drop from SIs. We observe that the values produced by most control register writes are quickly consumed, but the writes are often effectively useless (EU), i.e., they do not actually change the execution of the consuming instructions. We propose EU prediction, which allows younger instructions to proceed, possibly reading a stale value, and yet still execute correctly. This technique improves the performance of OS code by 6–35%, and overall performance by 2–12%.
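The Amdahl's Law argument in this abstract can be illustrated with a short sketch. This is only the textbook formula, not the paper's performance model; the function names and the example fractions are hypothetical and chosen purely for illustration.

```python
import math


def amdahl_speedup(serial_fraction: float, parallel_speedup: float) -> float:
    """Textbook Amdahl's Law: overall speedup when only the non-serial
    part of the execution is accelerated by `parallel_speedup`.

    `serial_fraction` is the share of execution time that cannot be
    overlapped (here, an assumed stand-in for time spent in serial
    regions created by serializing instructions).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)


def speedup_limit(serial_fraction: float) -> float:
    """Upper bound on speedup as the parallel part becomes infinitely fast."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    return math.inf if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical numbers: how a growing serial share erodes a 4x gain.
    for s in (0.0, 0.05, 0.25, 0.5):
        print(f"serial={s:.2f} speedup={amdahl_speedup(s, 4.0):.2f}")
```

Even a small serial share caps the achievable gain, which is the general reason serial regions inside otherwise parallel execution matter.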
OS Execution on Multi-Cores: Is Out-Sourcing Worthwhile?
Cited by 3 (0 self)
Large-scale multi-core chips open up the possibility of implementing heterogeneous cores on a single chip, where some cores can be customized to execute common code patterns. The operating system is an example of a common code pattern that is constantly executing on every processor. It is therefore a prime candidate for core customization. Recent work has begun to explore this possibility, where some fraction of system calls and other OS functionality is off-loaded to a separate special-purpose core. Studies have shown that this can improve overall system performance and power consumption. However, our explorations in this arena reveal that the primary benefits of off-loading can be captured with alternative mechanisms that eliminate the negative effects of off-loading. This position paper articulates this alternative mechanism with initial results that demonstrate promise.
A Helper Thread Based EDP Reduction Scheme for Adapting Application Execution in CMPs
Cited by 2 (0 self)
In parallel to the changes in both the architecture domain – the move toward chip multiprocessors (CMPs) – and the application domain – the move toward increasingly data-intensive workloads – issues such as performance, energy efficiency and CPU availability are becoming increasingly critical. CPU availability can change dynamically for several reasons, such as thermal overload, an increase in transient errors, or operating system scheduling. An important question in this context is how to adapt, in a CMP, the execution of a given application to changes in CPU availability at runtime. Our paper studies this problem, targeting the energy-delay product (EDP) as the main metric to optimize. We first observe that, in adapting the application execution to the varying CPU availability, one needs to consider the number of CPUs to use, the number of application threads to accommodate, and the voltage/frequency levels to employ (if the CMP has this capability). We then propose to use helper threads to adapt the application execution to changes in CPU availability, with the goal of minimizing the EDP. The helper thread runs in parallel with the application execution threads and tries to determine the ideal number of CPUs, threads, and voltage/frequency levels to employ at any given point in execution. We illustrate this idea using two applications (Fast Fourier Transform and MultiGrid) under different execution scenarios. The results collected through our experiments are very promising and indicate that significant EDP reductions are possible using helper threads. For example, we achieved up to 66.3% and 83.3% savings in EDP when adjusting all the parameters properly in applications FFT and MG, respectively.
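As a rough illustration of the metric named in this abstract, the sketch below computes the energy-delay product (energy multiplied by execution time) for a few candidate configurations and picks the smallest. It is not the paper's helper-thread scheme: the `Config` type, its field names, and every number are hypothetical placeholders, and a real helper thread would have to estimate energy and delay at runtime.

```python
from typing import Iterable, NamedTuple


class Config(NamedTuple):
    """One candidate execution setting (all fields are illustrative)."""
    cpus: int
    threads: int
    freq_ghz: float
    energy_j: float  # assumed energy estimate for this setting
    delay_s: float   # assumed execution-time estimate for this setting


def edp(cfg: Config) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return cfg.energy_j * cfg.delay_s


def best_config(candidates: Iterable[Config]) -> Config:
    """Return the candidate with the smallest EDP."""
    return min(candidates, key=edp)


if __name__ == "__main__":
    # Made-up estimates, only to show the selection step.
    candidates = [
        Config(cpus=4, threads=4, freq_ghz=2.0, energy_j=100.0, delay_s=2.0),
        Config(cpus=2, threads=2, freq_ghz=1.0, energy_j=60.0, delay_s=3.0),
        Config(cpus=1, threads=1, freq_ghz=1.0, energy_j=50.0, delay_s=5.0),
    ]
    choice = best_config(candidates)
    print(choice, edp(choice))
```

Because EDP multiplies the two quantities, a setting that is neither the fastest nor the lowest-energy can still be the best choice.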

