Results 1 - 10 of 76
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors
"... With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computer systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical. If one c ..."
Abstract - Cited by 63 (0 self)
With the shift towards chip multiprocessors (CMPs), exploiting and managing parallelism has become a central problem in computer systems. Many issues of parallelism management boil down to discerning which running threads or processes are critical, or slowest, versus which are non-critical. If one can accurately predict critical threads in a parallel program, then one can respond in a variety of ways. Possibilities include running the critical thread at a faster clock rate, performing load balancing techniques to offload work onto currently non-critical threads, or giving the critical thread more on-chip resources to execute faster. This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications. We show that accurate predictors can be built using counters that are typically already available on-chip. Our predictor, based on memory hierarchy statistics, identifies thread criticality with an average accuracy of 93% across a range of architectures. We also demonstrate two applications of our predictor. First, we show how Intel’s Threading Building Blocks (TBB) parallel runtime system can benefit from task stealing techniques that use our criticality predictor to reduce load imbalance. Using criticality prediction to guide TBB’s task-stealing decisions improves performance by 13-32% for TBB-based PARSEC benchmarks running on a 32-core CMP. As a second application, criticality prediction guides dynamic energy optimizations in barrier-based applications. By running the predicted critical thread at the full clock rate and frequency-scaling non-critical threads, this approach achieves average energy savings of 15% while negligibly degrading performance for SPLASH-2 and PARSEC benchmarks.
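As a rough illustration of the predictor described above, the sketch below scores each thread by its cache-miss counters, weighted by an approximate per-level miss penalty, and flags the highest-scoring thread as critical. The counter fields, weights, and scoring function are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a memory-hierarchy-based thread criticality predictor.
# Counter names and penalty weights are hypothetical.
from dataclasses import dataclass

@dataclass
class ThreadCounters:
    tid: int
    l1_misses: int   # per-interval hardware counter samples (hypothetical)
    l2_misses: int
    llc_misses: int

def criticality_score(c: ThreadCounters,
                      w_l1: float = 1.0, w_l2: float = 10.0,
                      w_llc: float = 60.0) -> float:
    # Weight misses by an approximate miss penalty per cache level.
    return w_l1 * c.l1_misses + w_l2 * c.l2_misses + w_llc * c.llc_misses

def predict_critical(threads: list[ThreadCounters]) -> int:
    # The thread stalling longest on memory is predicted to be critical.
    return max(threads, key=criticality_score).tid

# Example: thread 2's LLC misses dominate, so it is flagged as critical.
sample = [ThreadCounters(0, 900, 40, 2),
          ThreadCounters(1, 800, 55, 3),
          ThreadCounters(2, 700, 60, 45)]
print(predict_critical(sample))  # -> 2
```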
Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach
"... Abstract—Efficient sharing of system resources is critical to obtaining high utilization and enforcing system-level performance objectives on chip multiprocessors (CMPs). Although several proposals that address the management of a single microarchitectural resource have been published in the literat ..."
Abstract - Cited by 61 (1 self)
Efficient sharing of system resources is critical to obtaining high utilization and enforcing system-level performance objectives on chip multiprocessors (CMPs). Although several proposals that address the management of a single microarchitectural resource have been published in the literature, coordinated management of multiple interacting resources on CMPs remains an open problem. We propose a framework that manages multiple shared CMP resources in a coordinated fashion to enforce higher-level performance objectives. We formulate global resource allocation as a machine learning problem. At runtime, our resource management scheme monitors the execution of each application, and learns a predictive model of system performance as a function of allocation decisions. By learning each application’s performance response to different resource distributions, our approach makes it possible to anticipate the system-level performance impact of allocation decisions at runtime with little runtime overhead. As a result, it becomes possible to make reliable comparisons among different points in a vast and dynamically changing allocation space, allowing us to adapt our allocation decisions as applications undergo phase changes. Our evaluation concludes that a coordinated approach to managing multiple interacting resources is key to delivering high performance in multiprogrammed workloads, but this is possible only if accompanied by efficient search mechanisms. We also show that it is possible to build a single mechanism that consistently delivers high performance under various important performance metrics.
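The abstract above describes learning a per-application performance model and searching the allocation space at runtime. The paper trains artificial neural network ensembles; the hedged sketch below substitutes a trivial nearest-neighbor lookup for the learned model and brute-forces a two-application cache/bandwidth split, just to make the monitor, learn, search loop concrete.

```python
# Illustrative sketch of coordinated allocation search; all names and
# the nearest-neighbor "model" are stand-ins, not the paper's ANNs.
import itertools

class PerfModel:
    """Online model: remembers observed (allocation -> IPC) samples."""
    def __init__(self):
        self.samples = {}            # (cache_ways, bw_share) -> measured IPC

    def observe(self, alloc, ipc):
        self.samples[alloc] = ipc

    def predict(self, alloc):
        # Nearest observed configuration stands in for an ANN prediction.
        key = min(self.samples,
                  key=lambda a: (a[0] - alloc[0])**2 + (a[1] - alloc[1])**2)
        return self.samples[key]

def best_allocation(models, cache_ways=16, bw_shares=8):
    """Enumerate two-application splits of cache ways and bus bandwidth,
    returning the split with the highest predicted aggregate IPC."""
    best, best_perf = None, float("-inf")
    for ways0, bw0 in itertools.product(range(1, cache_ways),
                                        range(1, bw_shares)):
        alloc0, alloc1 = (ways0, bw0), (cache_ways - ways0, bw_shares - bw0)
        perf = models[0].predict(alloc0) + models[1].predict(alloc1)
        if perf > best_perf:
            best, best_perf = (alloc0, alloc1), perf
    return best
```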
A Framework for Providing Quality of Service in Chip Multi-Processors
- In Proc. of the 40th Intl. Symp. on Microarchitecture (MICRO), 2007
"... The trends in enterprise IT toward service-oriented computing, server consolidation, and virtual computing point to a future in which workloads are becoming increasingly diverse in terms of performance, reliability, and availability requirements. It can be expected that more and more applications wi ..."
Abstract - Cited by 48 (0 self)
The trends in enterprise IT toward service-oriented computing, server consolidation, and virtual computing point to a future in which workloads are becoming increasingly diverse in terms of performance, reliability, and availability requirements. It can be expected that more and more applications with diverse requirements will run on a CMP and share platform resources such as the lowest level cache and off-chip bandwidth. In this environment, it is desirable to have microarchitecture and software support that can provide a guarantee of a certain level of performance, which we refer to as performance Quality of Service. In this paper, we investigate a framework that would be needed for a CMP to fully provide QoS. We found that the ability of a CMP to partition platform resources alone is not sufficient for fully providing QoS. We also need an appropriate way to specify a QoS target, and an admission control policy that accepts jobs only when their QoS targets can be satisfied. We also found that providing strict QoS often leads to a significant reduction in throughput due to resource fragmentation. We propose novel throughput optimization techniques that include: (1) exploiting various QoS execution modes, and (2) a microarchitecture technique that steals excess resources from a job while still meeting its QoS target. We evaluated our QoS framework with a full system simulation of a 4-core CMP and a recent version of the Linux Operating System. We found that compared to an unoptimized scheme, the throughput can be improved by up to 47%, making the throughput significantly closer to a non-QoS CMP.
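A minimal sketch of the admission-control idea from the abstract: a job is accepted only if its QoS target fits within the unallocated, partitionable resources, and its reservation is deducted on admission. Resource names and the job format are hypothetical.

```python
# Hedged sketch of QoS admission control over partitionable resources.
def admit(job, free):
    """job and free map resource name -> required / available units."""
    return all(free.get(r, 0) >= need for r, need in job.items())

free = {"cache_ways": 6, "offchip_bw": 3}
job = {"cache_ways": 4, "offchip_bw": 2}   # the job's QoS target
if admit(job, free):
    for r, need in job.items():
        free[r] -= need        # reserve the partitioned resources
else:
    pass                       # reject: QoS target cannot be guaranteed
```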
Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks
"... Future chip multiprocessors (CMPs) may have hundreds to thousands of threads competing to access shared resources, and will require quality-of-service (QoS) support to improve system utilization. Although there has been significant work in QoS support within resources such as caches and memory contr ..."
Abstract - Cited by 34 (5 self)
Future chip multiprocessors (CMPs) may have hundreds to thousands of threads competing to access shared resources, and will require quality-of-service (QoS) support to improve system utilization. Although there has been significant work in QoS support within resources such as caches and memory controllers, less attention has been paid to QoS support in the multi-hop on-chip networks that will form an important component in future systems. In this paper we introduce Globally-Synchronized Frames (GSF), a framework for providing guaranteed QoS in on-chip networks in terms of minimum bandwidth and a maximum delay bound. The GSF framework can be easily integrated in a conventional virtual channel (VC) router without significantly increasing the hardware complexity. We rely on a fast barrier network, which is feasible in an on-chip environment, to efficiently implement GSF. Performance guarantees are verified by both analysis and simulation. According to our simulations, all concurrent flows receive their guaranteed minimum share of bandwidth in compliance with a given bandwidth allocation. The average throughput degradation of GSF on an 8×8 mesh network is within 10% compared to the conventional best-effort VC router in most cases.
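To make the frame mechanism concrete, here is a hedged sketch of GSF-style source accounting: each flow may inject a bounded number of flits per frame in proportion to its bandwidth allocation, and the global barrier retires the oldest frame so the window slides. Constants, the window depth, and the class layout are illustrative.

```python
# Sketch of per-source, frame-based bandwidth accounting in the GSF spirit.
FRAME_SLOTS = 1000        # flits injectable per flow-weight unit per frame

class GsfSource:
    def __init__(self, weight, window=4):
        self.budget_per_frame = weight * FRAME_SLOTS
        self.spent = [0] * window     # flits charged to each open frame

    def try_inject(self):
        """Charge a flit to the earliest frame with remaining budget;
        refuse injection (back-pressure) if the window is exhausted."""
        for f, used in enumerate(self.spent):
            if used < self.budget_per_frame:
                self.spent[f] += 1
                return f              # frame number carried by the flit
        return None                   # throttled: guarantees preserved

    def frame_complete(self):
        """Global barrier signalled: retire the oldest frame, open a new one."""
        self.spent.pop(0)
        self.spent.append(0)
```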
Resource Allocation Algorithms for Virtualized Service Hosting Platforms
, 2010
"... Commodity clusters are used routinely for deploying service hosting platforms. Due to hardware and operation costs, clusters need to be shared among multiple services. Crucial for enabling such shared hosting platforms is virtual machine (VM) technology, which allows consolidation of hardware resour ..."
Abstract - Cited by 29 (4 self)
Commodity clusters are used routinely for deploying service hosting platforms. Due to hardware and operation costs, clusters need to be shared among multiple services. Crucial for enabling such shared hosting platforms is virtual machine (VM) technology, which allows consolidation of hardware resources. A key challenge, however, is to make appropriate decisions when allocating hardware resources to service instances. In this work we propose a formulation of the resource allocation problem in shared hosting platforms for static workloads with servers that provide multiple types of resources. Our formulation supports a mix of best-effort and QoS scenarios, and, via a precisely defined objective function, promotes performance, fairness, and cluster utilization. Further, this formulation makes it possible to compute a bound on the optimal resource allocation. We propose several classes of resource allocation algorithms, which we evaluate in simulation. We are able to identify an algorithm that achieves average performance close to the optimal across many experimental scenarios. Furthermore, this algorithm runs in only a few seconds for large platforms and thus is usable in practice.
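One way to make the formulation concrete: the sketch below, under assumptions not taken from the paper, treats each service as a (cpu, memory) demand vector, packs scaled demands onto servers first-fit, and binary-searches the largest uniform yield (fraction of demand granted) that still packs, in the spirit of the bound-and-heuristic comparison described above.

```python
# Illustrative greedy vector packing with a max-min-yield objective.
def fits(needs, servers, yield_frac):
    """First-fit: place each service's scaled (cpu, mem) demand vector."""
    free = [list(s) for s in servers]
    for cpu, mem in sorted(needs, reverse=True):
        for slot in free:
            if slot[0] >= cpu * yield_frac and slot[1] >= mem * yield_frac:
                slot[0] -= cpu * yield_frac
                slot[1] -= mem * yield_frac
                break
        else:
            return False
    return True

def max_min_yield(needs, servers, steps=20):
    """Binary-search the largest uniform yield all services can receive."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if fits(needs, servers, mid) else (lo, mid)
    return lo

# Two services on one server: each gets ~91% of its demand.
print(max_min_yield([(0.6, 0.4), (0.5, 0.7)], [(1.0, 1.0)]))
```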
Resource-Freeing Attacks: Improve Your Cloud Performance (at Your Neighbor’s Expense)
- In ACM CCS, 2012
"... Cloud computing promises great efficiencies by multiplexing resources among disparate customers. For example, Amazon’s Elastic Compute Cloud (EC2), Microsoft Azure, Google’s Compute Engine, and Rackspace Hosting all offer Infrastructure as a Service (IaaS) solutions that pack multiple customer virtu ..."
Abstract - Cited by 24 (1 self)
Cloud computing promises great efficiencies by multiplexing resources among disparate customers. For example, Amazon’s Elastic Compute Cloud (EC2), Microsoft Azure, Google’s Compute Engine, and Rackspace Hosting all offer Infrastructure as a Service (IaaS) solutions that pack multiple customer virtual machines (VMs) onto the same physical server. The gained efficiencies have some cost: past work has shown that the performance of one customer’s VM can suffer due to interference from another. In experiments on a local testbed, we found that the performance of a cache-sensitive benchmark can degrade by more than 80% because of interference from another VM. This interference incentivizes a new class of attacks that we call resource-freeing attacks (RFAs). The goal is to modify the workload of a victim VM in a way that frees up resources for the attacker’s VM. We explore in depth a particular example of an RFA. Counter-intuitively, by adding load to a co-resident victim, the attack speeds up a class of cache-bound workloads. In a controlled lab setting we show that this can improve performance of synthetic benchmarks by up to 60% over not running the attack. In the noisier setting of Amazon’s EC2, we still show improvements of up to 13%.
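The toy model below is not an attack tool; it only illustrates, with made-up numbers, the mechanism the abstract describes: injected load shifts the victim's bottleneck away from the shared cache, its cache footprint shrinks, and the attacker's cache-bound throughput recovers.

```python
# Toy model of the resource-freeing effect; all functions and constants
# are invented for illustration.
def attacker_throughput(victim_cache_share):
    # Cache-bound workload: speed scales with the LLC share the attacker wins.
    return 1.0 * (1.0 - victim_cache_share)

def victim_cache_share(extra_load):
    # As injected load grows, the victim stalls on another bottleneck
    # (e.g. CPU or network) and touches the shared cache less often.
    return 0.8 / (1.0 + extra_load)

for load in (0.0, 1.0, 4.0):
    print(load, attacker_throughput(victim_cache_share(load)))
# Throughput rises from 0.2 toward 1.0 as the victim's bottleneck shifts.
```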
An Approach to Resource-Aware Coscheduling for CMPs
- In Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10, 2010
"... ABSTRACT We develop real-time scheduling techniques for improving performance and energy for multiprogrammed workloads that scale nonuniformly with increasing thread counts. Multithreaded programs generally deliver higher throughput than single-threaded programs on chip multiprocessors, but perform ..."
Abstract - Cited by 24 (0 self)
We develop real-time scheduling techniques for improving performance and energy for multiprogrammed workloads that scale nonuniformly with increasing thread counts. Multithreaded programs generally deliver higher throughput than single-threaded programs on chip multiprocessors, but performance gains from increasing threads decrease when there is contention for shared resources. We use analytic metrics to derive local search heuristics for creating efficient multiprogrammed, multithreaded workload schedules. Programs are allocated fewer cores than requested, and scheduled to space-share the CMP to improve global throughput. Our holistic approach attempts to co-schedule programs that complement each other with respect to shared resource consumption. We find application co-scheduling for performance and energy in a resource-aware manner achieves better results than solely targeting total throughput or concurrently co-scheduling all programs. Our schedulers improve overall energy delay (E*D) by a factor of 1.5 over time-multiplexed gang scheduling.
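As a sketch of the local-search idea, the code below swaps programs between co-schedule groups and keeps swaps that reduce a contention score; the score used here (penalizing groups whose summed memory intensity is high) is a stand-in for the paper's analytic metrics.

```python
# Hedged sketch of local search over co-schedules; the cost metric is assumed.
import random

def schedule_cost(groups, mem_intensity):
    # High summed memory intensity within one group implies contention.
    return sum(sum(mem_intensity[p] for p in g) ** 2 for g in groups)

def local_search(groups, mem_intensity, iters=1000):
    cost = schedule_cost(groups, mem_intensity)
    for _ in range(iters):
        a, b = random.sample(range(len(groups)), 2)
        i = random.randrange(len(groups[a]))
        j = random.randrange(len(groups[b]))
        groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
        new = schedule_cost(groups, mem_intensity)
        if new < cost:
            cost = new                                  # keep improving swap
        else:
            groups[a][i], groups[b][j] = groups[b][j], groups[a][i]  # revert
    return groups, cost
```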
Contention Aware Execution: Online Contention Detection and Response
"... Cross-core application interference due to contention for shared on-chip and off-chip resources pose a significant challenge to providing application level quality of service (QoS) guarantees on commodity multicore micro-architectures. Unexpected cross-core interference is especially problematic whe ..."
Abstract - Cited by 24 (11 self)
Cross-core application interference due to contention for shared on-chip and off-chip resources poses a significant challenge to providing application-level quality of service (QoS) guarantees on commodity multicore micro-architectures. Unexpected cross-core interference is especially problematic when considering latency-sensitive applications that are present in the web service data center application domains, such as web search. The commonly used solution is to simply disallow the co-location of latency-sensitive applications and throughput-oriented batch applications on a single chip, leaving much of the processing capabilities of multicore micro-architectures underutilized. In this work we present a Contention Aware Execution Runtime (CAER) environment that provides a lightweight runtime solution that minimizes cross-core interference due to contention, while maximizing utilization. CAER leverages the ubiquitous performance monitoring capabilities present in current multicore processors to infer and respond to contention and requires no added hardware support. We present the design and implementation of the CAER environment, two separate contention detection heuristics, and approaches to respond to contention online. We evaluate our solution using the SPEC2006 benchmark suite. Our experiments show that when allowing co-location with CAER, as opposed to disallowing co-location, we are able to increase the utilization of the multicore CPU by 58% on average. Meanwhile, CAER brings the overhead due to allowing co-location from 17% down to just 4% on average.
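A hedged sketch of a CAER-style detect-and-respond loop follows. The counter-reading and throttling functions are placeholders (CAER builds on hardware performance monitors and its own heuristics); only the control flow is meant to be representative.

```python
# Sketch of an online contention detect-and-respond loop; the two helper
# functions are hypothetical stubs, not CAER's actual mechanisms.
import time

def llc_misses_per_kilo_instr(pid):
    raise NotImplementedError("read via hardware performance counters")

def throttle(pid, on):
    raise NotImplementedError("e.g. pause/resume the batch process")

def contention_loop(latency_pid, batch_pid, threshold=30.0, period=0.05):
    while True:
        mpki = llc_misses_per_kilo_instr(latency_pid)
        # A spike in the latency-sensitive app's miss rate is taken as
        # evidence of cross-core contention from the batch co-runner.
        throttle(batch_pid, on=(mpki > threshold))
        time.sleep(period)
```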
Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip
"... Future many-core chip multiprocessors (CMPs) and systemson-a-chip (SOCs) will have numerous processing elements executing multiple applications concurrently. These applications and their respective threads will interfere at the on-chip network level and compete for shared resources such as cache ban ..."
Abstract - Cited by 22 (8 self)
Future many-core chip multiprocessors (CMPs) and systems-on-a-chip (SOCs) will have numerous processing elements executing multiple applications concurrently. These applications and their respective threads will interfere at the on-chip network level and compete for shared resources such as cache banks, memory controllers, and specialized accelerators. Often, the communication and sharing patterns of these applications will be impossible to predict off-line, making fairness guarantees and performance isolation difficult through static thread and link scheduling. Prior techniques for providing network quality-of-service (QOS) have too much algorithmic complexity, cost (area and/or energy), or performance overhead to be attractive for on-chip implementation. To better understand the preferred solution space, we define desirable features and evaluation metrics for QOS in a network-on-a-chip (NOC). Our insights lead us to propose a novel QOS system called Preemptive Virtual Clock (PVC). PVC provides strong guarantees, reduces packet delay variation, and enables efficient reclamation of idle network bandwidth without per-flow buffering at the routers and with minimal buffering at the source nodes. PVC averts priority inversion through preemption of lower-priority packets. By controlling preemption aggressiveness, PVC enables a trade-off between the strength of the guarantees and overall throughput. Finally, PVC simplifies network management through a flexible allocation mechanism that enables per-application bandwidth provisioning independent of thread count and supports transparent bandwidth recycling among an application’s threads.
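The sketch below shows the classic virtual-clock bookkeeping that PVC builds on: each flow's clock advances inversely to its provisioned share, and the arbiter favors the earliest stamp, treating later-stamped buffered packets as preemption victims. Preemption, dropping, and retransmission are omitted, and the class layout is illustrative.

```python
# Minimal virtual-clock sketch; PVC's preemption and recovery paths omitted.
class Flow:
    def __init__(self, share):
        self.share = share       # provisioned bandwidth fraction
        self.vclock = 0.0        # virtual finish time of the last packet

    def stamp(self, pkt_len):
        # A flow with a larger share accumulates virtual time more slowly,
        # so its packets tend to win arbitration.
        self.vclock += pkt_len / self.share
        return self.vclock       # priority stamp carried by the packet

def arbitrate(candidates):
    """candidates: list of (stamp, packet). The smallest stamp wins;
    a buffered packet with a larger stamp is the preemption victim."""
    return min(candidates, key=lambda c: c[0])
```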
Coordinated Control of Multiple Prefetchers in Multi-Core Systems
, 2009
"... Aggressive prefetching is very beneficial for memory latency tolerance of many applications. However, it faces significant challenges in multi-core systems. Prefetchers of different cores on a chip multiprocessor (CMP) can cause significant interference with prefetch and demand accesses of other cor ..."
Abstract - Cited by 22 (4 self)
Aggressive prefetching is very beneficial for memory latency tolerance of many applications. However, it faces significant challenges in multi-core systems. Prefetchers of different cores on a chip multiprocessor (CMP) can cause significant interference with prefetch and demand accesses of other cores. Because existing prefetcher throttling techniques do not address this prefetcher-caused inter-core interference, aggressive prefetching in multi-core systems can lead to significant performance degradation and wasted bandwidth consumption. To make prefetching effective in CMPs, this paper proposes a low-cost mechanism to control prefetcher-caused inter-core interference by dynamically adjusting the aggressiveness of multiple cores’ prefetchers in a coordinated fashion. Our solution consists of a hierarchy of prefetcher aggressiveness control structures that combine per-core (local) and prefetcher-caused inter-core (global) interference feedback to maximize the benefits of prefetching on each core while optimizing overall system performance. These structures improve system performance by 23% while reducing bus traffic by 17% compared to employing aggressive prefetching, and improve system performance by 14% compared to a state-of-the-art prefetcher aggressiveness control technique on an eight-core system.
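To illustrate the hierarchical control structure, the sketch below combines local feedback (per-core prefetch accuracy) with a global override driven by inter-core interference estimates; thresholds, level counts, and the feedback inputs are assumptions for illustration.

```python
# Sketch of two-level prefetcher aggressiveness control; all thresholds
# and feedback inputs are hypothetical.
LEVELS = 5                      # e.g. prefetch distance/degree presets

def local_adjust(level, accuracy):
    if accuracy > 0.75:
        return min(level + 1, LEVELS - 1)   # accurate: ramp up
    if accuracy < 0.40:
        return max(level - 1, 0)            # wasteful: back off
    return level

def global_override(levels, interference):
    """interference[i]: estimated pollution + bandwidth harm core i
    inflicts on others; throttle the worst offender when harm is high."""
    if sum(interference) > 1.0:
        worst = max(range(len(levels)), key=lambda i: interference[i])
        levels[worst] = max(levels[worst] - 1, 0)
    return levels
```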