Results 1 - 10 of 77
Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware
"... Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation ..."
Abstract - Cited by 69 (8 self)
Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today’s predominant processor micro-architecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core micro-architecture. Moreover, while today’s predominant micro-architecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
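As a rough illustration of the measurement methodology the abstract mentions, the following minimal C++ sketch reads hardware performance counters around a code region on Linux via perf_event_open; the workload loop and the chosen counter pair are illustrative assumptions, not part of CloudSuite, and error handling is omitted for brevity.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

// Open one hardware counter on the calling thread (disabled until enabled).
static int open_counter(unsigned type, unsigned config) {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = type;              // e.g. PERF_TYPE_HARDWARE
    attr.config = config;          // e.g. PERF_COUNT_HW_CPU_CYCLES
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0 /*this thread*/,
                        -1 /*any cpu*/, -1 /*no group*/, 0);
}

int main() {
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int instrs = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);

    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0;         // stand-in for the measured workload
    for (int i = 0; i < 10000000; i++) x += i * 0.5;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_DISABLE, 0);

    long long c = 0, n = 0;
    read(cycles, &c, sizeof(c));
    read(instrs, &n, sizeof(n));
    printf("IPC = %.2f (%lld instructions / %lld cycles)\n",
           (double)n / (double)c, n, c);
    return 0;
}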
Neural Acceleration for General-Purpose Approximate Programs
"... This paper describes a learning-based approach to the acceleration of approximate programs. We describe the Parrot transformation, a program transformation that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code ..."
Abstract - Cited by 51 (12 self)
This paper describes a learning-based approach to the acceleration of approximate programs. We describe the Parrot transformation, a program transformation that selects and trains a neural network to mimic a region of imperative code. After the learning phase, the compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU). The NPU is tightly coupled to the processor pipeline to accelerate small code regions. Since neural networks produce inherently approximate results, we define a programming model that allows programmers to identify approximable code regions—code that can produce imprecise but acceptable results. Offloading approximable code regions to NPUs is faster and more energy efficient than executing the original code. For a set of diverse applications, NPU acceleration provides whole-application speedup of 2.3× and energy savings of 3.0× on average with quality loss of at most 9.6%.
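To make the programming model concrete, here is a hedged C++ sketch of the shape of a Parrot-style transformation: a pure region marked approximable, and its call replaced by evaluation of a small neural network. The topology and weights below are placeholders, not output of the actual Parrot toolchain.

#include <cmath>
#include <cstdio>

// Original precise region; in the paper, this is what the network is trained
// to mimic. The annotation is illustrative, not a real C++ attribute.
// [[approximable]]
static float region(float x) { return x * x; }

// Stand-in for the NPU: a 1-4-1 multilayer perceptron with sigmoid hidden
// units. In the real system the weights come from training on observed
// input/output pairs of region(); here they are arbitrary placeholders.
static float npu_invoke(float x) {
    static const float w1[4] = {1.0f, -1.0f, 0.5f, -0.5f};
    static const float b1[4] = {0.0f,  0.0f, 0.5f, -0.5f};
    static const float w2[4] = {2.0f,  2.0f, -1.0f, -1.0f};
    float y = 0.0f;
    for (int i = 0; i < 4; i++)
        y += w2[i] / (1.0f + std::exp(-(w1[i] * x + b1[i])));  // sigmoid layer
    return y;
}

int main() {
    // After the Parrot transformation, calls to region() would become
    // npu_invoke(); with trained weights the two columns would track closely.
    for (float x = 0.0f; x <= 2.0f; x += 0.5f)
        printf("x=%.1f precise=%.3f approx=%.3f\n", x, region(x), npu_invoke(x));
    return 0;
}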
Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers
"... Heterogeneous computers with processors and accelerators are becoming widespread in scientific computing. However, it is difficult to program hybrid architectures and there is no commonly accepted programming model. Ideally, applications should be written in a way that is portable to many platforms, ..."
Abstract - Cited by 48 (3 self)
Heterogeneous computers with processors and accelerators are becoming widespread in scientific computing. However, it is difficult to program hybrid architectures and there is no commonly accepted programming model. Ideally, applications should be written in a way that is portable to many platforms, but providing this portability for general programs is a hard problem. By restricting the class of programs considered, we can make this portability feasible. We present Liszt, a domain-specific language for constructing mesh-based PDE solvers. We introduce language statements for interacting with an unstructured mesh and for storing data at its elements. Program analysis of these statements enables our compiler to expose the parallelism, locality, and synchronization of Liszt programs. Using this analysis, we generate applications for multiple platforms: a cluster, an SMP, and a GPU. This approach allows Liszt applications to perform within 12% of hand-written C++, scale to large clusters, and achieve order-of-magnitude speedups on GPUs.
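Liszt itself is a domain-specific language, not C++; the hedged sketch below only illustrates the programming pattern the abstract describes, with data held in fields over mesh elements and solver loops written against mesh topology rather than raw arrays, which is what gives the compiler something to analyze. All names here are invented.

#include <cstdio>
#include <vector>

struct Edge { int head, tail; };
struct Mesh {
    int nVertices;
    std::vector<Edge> edges;                 // unstructured connectivity
};
template <typename T> using Field = std::vector<T>;  // one value per element

int main() {
    // A tiny 1D chain mesh: 5 vertices, 4 edges.
    Mesh mesh{5, {{0,1},{1,2},{2,3},{3,4}}};
    Field<double> temp(mesh.nVertices, 0.0), flux(mesh.nVertices, 0.0);
    temp[0] = 1.0;                           // boundary condition

    for (int step = 0; step < 100; step++) {
        // Gather: loop over edges, accumulate flux at endpoints. Loops of
        // this shape are what a Liszt-style compiler analyzes to derive
        // parallelism, locality, and needed synchronization.
        for (const Edge& e : mesh.edges) {
            double d = 0.1 * (temp[e.head] - temp[e.tail]);
            flux[e.tail] += d;
            flux[e.head] -= d;
        }
        // Update: apply accumulated flux, holding the boundary vertex fixed.
        for (int v = 0; v < mesh.nVertices; v++) {
            if (v != 0) temp[v] += flux[v];
            flux[v] = 0.0;
        }
    }
    for (int v = 0; v < mesh.nVertices; v++) printf("T[%d]=%.3f\n", v, temp[v]);
    return 0;
}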
Dynamically Specialized Datapaths for Energy Efficient Computing
"... Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must impro ..."
Abstract - Cited by 35 (7 self)
Due to limits in technology scaling, energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general-purpose programmable processors. The key insights of this work are the following. First, applications execute in phases, and these phases can be determined by creating a path-tree of basic blocks rooted at the inner-most loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and we evaluate the PARSEC, SPEC, and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1× (1.15× to 10×), a geometric mean energy reduction of 40% (up to 70%), and a 60% energy reduction if no performance improvement is required.
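The following hedged C++ sketch models the DySER idea at the software level: a block of functional units is configured once for a hot path-tree, after which invocations stream operands through the resulting dataflow graph instead of re-fetching and re-decoding the same instructions. The 3-FU datapath and the sample computation are invented for illustration.

#include <cstdio>
#include <functional>
#include <vector>

struct FU {                        // one functional unit in the DySER block
    std::function<int(int,int)> op;
    int lhs, rhs;                  // indices of producer FUs (-1 = external input)
};

// "Configuration" step, done once per program phase: wire FUs into a small
// dataflow graph computing (a+b) * (a-b), the body of some hot path-tree.
static std::vector<FU> configure() {
    return {
        {[](int a, int b){ return a + b; }, -1, -1},  // FU0 = a + b
        {[](int a, int b){ return a - b; }, -1, -1},  // FU1 = a - b
        {[](int a, int b){ return a * b; },  0,  1},  // FU2 = FU0 * FU1
    };
}

// "Invocation" step: stream one pair of operands through the configured
// block; FUs are listed in topological order, so one pass suffices.
static int invoke(const std::vector<FU>& block, int a, int b) {
    std::vector<int> out(block.size());
    for (size_t i = 0; i < block.size(); i++) {
        int l = block[i].lhs < 0 ? a : out[block[i].lhs];
        int r = block[i].rhs < 0 ? b : out[block[i].rhs];
        out[i] = block[i].op(l, r);
    }
    return out.back();
}

int main() {
    auto block = configure();
    for (int a = 1; a <= 3; a++)
        printf("(%d+1)*(%d-1) = %d\n", a, a, invoke(block, a, 1));
    return 0;
}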
OpenRadio: A Programmable Wireless Dataplane
"... We present OpenRadio, a novel design for a programmable wireless dataplane that provides modular and declarative programming interfaces across the entire wireless stack. Our key conceptual contribution is a principled refactoring of wireless protocols into processing and decision planes. The process ..."
Abstract - Cited by 26 (2 self)
We present OpenRadio, a novel design for a programmable wireless dataplane that provides modular and declarative programming interfaces across the entire wireless stack. Our key conceptual contribution is a principled refactoring of wireless protocols into processing and decision planes. The processing plane includes directed graphs of algorithmic actions (e.g., 54 Mbps OFDM WiFi or special encoding for video). The decision plane contains the logic that dictates which directed graph is used for a particular packet (e.g., picking between data and video graphs). The decoupling provides a declarative interface to program the platform while hiding all underlying complexity of execution. An operator only expresses decision plane rules and corresponding processing plane action graphs to assemble a protocol. The scoped interface allows us to build a dataplane that arguably provides the right tradeoff between performance and flexibility. Our current system is capable of realizing modern wireless protocols (WiFi, LTE) on off-the-shelf DSP chips while providing flexibility to modify the PHY and MAC layers to implement protocol optimizations.
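A hedged C++ sketch of the decision-plane/processing-plane split described above: declarative rules select a directed graph of actions per packet. The packet fields, rules, and actions are invented examples, not the OpenRadio API, and the action graphs are linear chains for brevity.

#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Packet { std::string kind; int bytes; };
using Action = std::function<void(Packet&)>;
using Graph  = std::vector<Action>;           // processing plane: action graph

struct Rule {                                 // decision plane: predicate -> graph
    std::function<bool(const Packet&)> match;
    const Graph* graph;
};

int main() {
    // Processing plane: two action graphs (generic data vs. video).
    Graph dataGraph  = {[](Packet& p){ printf("scramble %d B\n", p.bytes); },
                        [](Packet&){ printf("OFDM encode\n"); }};
    Graph videoGraph = {[](Packet& p){ printf("video FEC %d B\n", p.bytes); },
                        [](Packet&){ printf("OFDM encode\n"); }};

    // Decision plane: the first matching rule picks the graph for a packet.
    std::vector<Rule> rules = {
        {[](const Packet& p){ return p.kind == "video"; }, &videoGraph},
        {[](const Packet&){ return true; },                 &dataGraph},
    };

    for (Packet p : {Packet{"video", 1400}, Packet{"data", 600}})
        for (const Rule& r : rules)
            if (r.match(p)) { for (const Action& a : *r.graph) a(p); break; }
    return 0;
}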
QsCores: Trading dark silicon for scalable energy efficiency with quasi-specific cores
- In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2011
"... Transistor density continues to increase exponentially, but power dissipation per transistor is improving only slightly with each generation of Moore’s law. Given the constant chip-level power budgets, this exponentially decreases the percentage of transistors that can switch at full frequency with ..."
Abstract - Cited by 24 (3 self)
Transistor density continues to increase exponentially, but power dissipation per transistor is improving only slightly with each generation of Moore’s law. Given the constant chip-level power budgets, this exponentially decreases the percentage of transistors that can switch at full frequency with each technology generation. Hence, while the transistor budget continues to increase exponentially, the power budget has become the dominant limiting factor in processor design. In this regime, utilizing transistors to design specialized cores that optimize energy-per-computation becomes an effective approach to improve system performance. To trade transistors for energy efficiency in a scalable manner, we propose Quasi-specific Cores, or QSCORES, specialized processors capable of executing multiple general-purpose computations while providing an order of magnitude more energy efficiency than a general-purpose processor. The QSCORE design flow is based on the insight that similar code patterns exist within and across applications. Our approach exploits these similar code patterns to ensure that a small set of specialized cores support a large number of commonly used computations. We evaluate QSCORE’s ability to target both a single application library (e.g., data structures) as well as a diverse workload consisting of applications selected from different domains (e.g., SPECINT, EEMBC, and Vision). Our results show that QSCORES can provide 18.4× better energy efficiency than general-purpose processors while reducing the amount of specialized logic required to support the workload by up to 66%.
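The sketch below illustrates, in software, the insight behind QSCORES: two similar lookup loops from hypothetical applications differ only in their comparison step, so one slightly generalized routine covers both, analogous to one quasi-specific core replacing two fixed-function ones. The data structure and configuration knob are invented for illustration.

#include <cstdio>
#include <cstring>

struct Node { const char* key; int value; Node* next; };

// Quasi-general routine: the key comparison is the only configured part;
// everything else is shared between the two original code patterns.
enum class Cmp { Exact, Prefix };
static int lookup(Node* head, const char* key, Cmp cmp) {
    for (Node* n = head; n; n = n->next) {
        bool hit = (cmp == Cmp::Exact) ? strcmp(n->key, key) == 0
                                       : strncmp(n->key, key, strlen(key)) == 0;
        if (hit) return n->value;
    }
    return -1;
}

int main() {
    Node c{"carrot", 3, nullptr}, b{"banana", 2, &c}, a{"apple", 1, &b};
    // "App 1" needed exact matches, "app 2" needed prefix matches; both map
    // onto the same merged routine.
    printf("exact banana -> %d\n", lookup(&a, "banana", Cmp::Exact));
    printf("prefix carr  -> %d\n", lookup(&a, "carr",   Cmp::Prefix));
    return 0;
}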
Bundled execution of recurring traces for energy-efficient general purpose processing
- In MICRO 2011
"... Technology scaling has delivered on its promises of increasing device density on a single chip. However, the voltage scaling trend has failed to keep up, introducing tight power constraints on manufactured parts. In such a scenario, there is a need to incorporate energy-efficient processing resource ..."
Abstract - Cited by 15 (1 self)
Technology scaling has delivered on its promises of increasing device density on a single chip. However, the voltage scaling trend has failed to keep up, introducing tight power constraints on manufactured parts. In such a scenario, there is a need to incorporate energy-efficient processing resources that can enable more computation within the same power budget. Energy efficiency solutions in the past have typically relied on application-specific hardware and accelerators. Unfortunately, these approaches do not extend to general purpose applications due to their irregular and diverse code base. Towards this end, we propose BERET, an energy-efficient co-processor that can be configured to benefit a wide range of applications. Our approach identifies recurring instruction sequences as phases of "temporal regularity" in a program’s execution, and maps suitable ones to the BERET hardware, a three-stage pipeline with a bundled execution model. This judicious off-loading of program execution to reduced-complexity hardware yields significant energy savings on instruction fetch, decode, and register file accesses. On average, BERET reduces energy consumption by a factor of 3-4× for the program regions selected across a range of general-purpose and media applications. The average energy savings for the entire application run were 35% over a single-issue in-order processor.
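As a hedged illustration of the first step the abstract describes, this toy C++ profiler counts recurring basic-block sequences in a simulated execution trace; sequences that repeat often enough are the candidates a BERET-style compiler would map to the bundled-execution hardware. The block IDs, window length, and threshold are invented.

#include <cstdio>
#include <map>
#include <vector>

int main() {
    // Simulated trace of basic-block IDs; 2-3-4 is the recurring inner loop.
    std::vector<int> trace = {1, 2, 3, 4, 2, 3, 4, 2, 3, 4, 5, 2, 3, 4, 6};
    const size_t W = 3;                          // candidate trace length

    // Count every length-W window of the trace.
    std::map<std::vector<int>, int> counts;
    for (size_t i = 0; i + W <= trace.size(); i++)
        counts[{trace.begin() + i, trace.begin() + i + W}]++;

    // Sequences recurring often enough exhibit the "temporal regularity"
    // BERET targets; everything else stays on the main core.
    for (const auto& [seq, n] : counts)
        if (n >= 3)
            printf("hot trace (%d,%d,%d) seen %d times -> offload candidate\n",
                   seq[0], seq[1], seq[2], n);
    return 0;
}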
Efficient Complex Operators for Irregular Codes
"... Complex “fat operators ” are important contributors to the efficiency of specialized hardware. This paper introduces two new techniques for constructing efficient fat operators featuring up to dozens of operations with arbitrary and irregular data and memory dependencies. These techniques focus on m ..."
Abstract - Cited by 10 (5 self)
Complex “fat operators” are important contributors to the efficiency of specialized hardware. This paper introduces two new techniques for constructing efficient fat operators featuring up to dozens of operations with arbitrary and irregular data and memory dependencies. These techniques focus on minimizing critical path length and load-use delay, which are key concerns for irregular computations. Selective Depipelining (SDP) is a pipelining technique that allows fat operators containing several, possibly dependent, memory operations. SDP allows memory requests to operate at a faster clock rate than the datapath, saving power in the datapath and improving memory performance. Cachelets are small, customized, distributed L0 caches embedded in the datapath to reduce load-use latency. We apply these techniques to Conservation Cores (c-cores) to produce coprocessors that accelerate irregular code regions while still providing superior energy efficiency. On average, these enhanced c-cores reduce EDP by 2× and area by 35% relative to baseline c-cores. They are up to 2.5× faster than a general-purpose processor and reduce energy consumption by up to 8× for a variety of irregular applications including several SPECINT benchmarks.
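A hedged software model of a cachelet follows: a tiny direct-mapped L0 cache in front of loads, capturing a small per-region working set to cut load-use latency. The sizes and the backing array are invented, and real cachelets are hardware structures embedded in the datapath, not code.

#include <cstdio>

constexpr int kLines = 4;                 // tiny by design
static int backing[64];                   // stand-in for the L1/memory side

struct Cachelet {
    int tag[kLines] = {-1, -1, -1, -1};
    int data[kLines];
    int hits = 0, misses = 0;

    int load(int addr) {
        int idx = addr % kLines;
        if (tag[idx] == addr) { hits++; return data[idx]; }  // fast path
        misses++;                                            // fill from backing
        tag[idx] = addr;
        data[idx] = backing[addr];
        return data[idx];
    }
};

int main() {
    for (int i = 0; i < 64; i++) backing[i] = i * i;
    Cachelet c0;
    // A loop with a small working set: exactly what a cachelet customized to
    // one code region's loads would capture after the first iteration.
    int sum = 0;
    for (int iter = 0; iter < 100; iter++)
        for (int a = 0; a < 4; a++) sum += c0.load(a);
    printf("sum=%d hits=%d misses=%d\n", sum, c0.hits, c0.misses);
    return 0;
}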
The Accelerator Store: A Shared Memory Framework For Accelerator-Based Systems, 2012
"... This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we chara ..."
Abstract - Cited by 8 (3 self)
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures with the high-performance, power-efficient hardware accelerators of systems-on-chip. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component that minimizes accelerator area by sharing memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
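The hedged C++ sketch below models the accelerator-store idea at the software level: instead of each accelerator owning private SRAM, a shared pool hands out buffer handles on demand. The pool size, handle scheme, and client accelerators are invented, and the real design also arbitrates access and reclaims memory rather than bump-allocating.

#include <cstdint>
#include <cstdio>
#include <vector>

class AcceleratorStore {
    std::vector<uint8_t> pool;
    size_t next = 0;                   // bump allocation, for brevity only
public:
    explicit AcceleratorStore(size_t bytes) : pool(bytes) {}
    // Returns a handle (offset into the shared pool) or -1 when exhausted.
    long alloc(size_t bytes) {
        if (next + bytes > pool.size()) return -1;
        long h = (long)next;
        next += bytes;
        return h;
    }
    uint8_t* at(long handle) { return pool.data() + handle; }
};

int main() {
    AcceleratorStore store(4096);      // one shared SRAM pool

    long fftBuf = store.alloc(2048);   // "FFT accelerator" working set
    long aesBuf = store.alloc(1024);   // "AES accelerator" working set
    long jpgBuf = store.alloc(2048);   // denied: pool exhausted

    printf("fft=%ld aes=%ld jpeg=%ld (handles; -1 = denied)\n",
           fftBuf, aesBuf, jpgBuf);
    if (fftBuf >= 0) store.at(fftBuf)[0] = 42;  // accelerators share one pool
    return 0;
}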
Meet the Walkers: Accelerating Index Traversals for In-Memory Databases
- In Proceedings of MICRO, 2013
"... The explosive growth in digital data and its growing role in real-time decision support motivate the design of high-performance database management systems (DBMSs). Meanwhile, slowdown in supply voltage scaling has stymied improvements in core performance and ushered an era of power-limited chips. T ..."
Abstract - Cited by 8 (1 self)
The explosive growth in digital data and its growing role in real-time decision support motivate the design of high-performance database management systems (DBMSs). Meanwhile, slowdown in supply voltage scaling has stymied improvements in core performance and ushered in an era of power-limited chips. These developments motivate the design of DBMS accelerators that (a) maximize utility by accelerating the dominant operations, and (b) provide flexibility in the choice of DBMS, data layout, and data types. We study data analytics workloads on contemporary in-memory databases and find hash index lookups to be the largest single contributor to the overall execution time. The critical path in hash index lookups consists of ALU-intensive key hashing followed by pointer chasing through a node list. Based on these observations, we introduce Widx, an on-chip accelerator for database hash index lookups, which achieves both high performance and flexibility by (1) decoupling key hashing from the list traversal, and (2) processing multiple keys in parallel on a set of programmable walker units. Widx reduces design cost and complexity through its tight integration with a conventional core, thus eliminating the need for a dedicated TLB and cache. An evaluation of Widx on a set of modern data analytics workloads (TPC-H, TPC-DS) using full-system simulation shows an average speedup of 3.1× over an aggressive OoO core on bulk hash table operations, while reducing the OoO core energy by 83%.
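To illustrate the decoupling Widx exploits, the hedged C++ sketch below splits a batch of index lookups into an ALU-heavy hashing stage and a pointer-chasing walker stage. In Widx these are parallel hardware units rather than sequential code, and the hash function, table layout, and batch size here are invented.

#include <cstdio>
#include <vector>

struct Node { int key, value; Node* next; };
constexpr int kBuckets = 8;

static int hashKey(int k) { return (k * 2654435761u) % kBuckets; }  // toy hash

int main() {
    // Build a small chained hash index (reserve keeps node pointers stable).
    std::vector<Node> nodes;
    nodes.reserve(16);
    Node* buckets[kBuckets] = {};
    for (int k = 0; k < 16; k++) {
        nodes.push_back({k, k * 10, buckets[hashKey(k)]});
        buckets[hashKey(k)] = &nodes.back();
    }

    int probe[4] = {3, 7, 11, 15};

    // Stage 1: ALU-intensive hashing for the whole batch, decoupled from
    // traversal (in Widx, a dedicated unit runs this ahead of the walkers).
    int bucket[4];
    for (int i = 0; i < 4; i++) bucket[i] = hashKey(probe[i]);

    // Stage 2: pointer chasing; each iteration stands in for one of the
    // parallel programmable walker units that hide memory latency.
    for (int i = 0; i < 4; i++)
        for (Node* n = buckets[bucket[i]]; n; n = n->next)
            if (n->key == probe[i]) {
                printf("key %d -> %d\n", probe[i], n->value);
                break;
            }
    return 0;
}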