Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
- In MICRO-43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010
Cited by 10 (2 self)
To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also diminish given the scarcity of projected bandwidth in the future. To understand the relative merits between different approaches in the face of technology constraints, this work builds on prior modeling of heterogeneous multicores to support U-cores. Unlike prior models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation of future designs from scaling trends predicted by the ITRS roadmap. The predictive power of our model depends upon U-core-specific parameters derived by measuring performance and power of tuned applications on today’s state-of-the-art multicores, GPUs, FPGAs, and ASICs. Our results reinforce some current-day understandings of the potential and limitations of U-cores and also provide new insights on their relative merits.
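As a rough illustration of the kind of model this abstract describes, the sketch below extends an Amdahl's-law-style multicore speedup formula with a U-core term. The formula, the square-root performance rule for the sequential core, and the names `f`, `n`, `r`, and `mu` are assumptions chosen for illustration; none of them is stated in the abstract.

```python
import math


def seq_perf(r: float) -> float:
    """Assumed performance of a sequential core built from r base-core units.

    Uses the common square-root rule of thumb (an assumption, not from the abstract).
    """
    return math.sqrt(r)


def hetero_speedup(f: float, n: float, r: float, mu: float) -> float:
    """Speedup over a single base core for a chip with one big core plus U-cores.

    f  : fraction of the work that is parallelizable (0..1)
    n  : total chip area, in base-core units
    r  : area spent on the sequential core, in base-core units
    mu : hypothetical U-core performance per unit area, relative to a base core
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if not 0.0 < r < n:
        raise ValueError("r must be positive and smaller than n")
    if mu <= 0.0:
        raise ValueError("mu must be positive")
    serial_time = (1.0 - f) / seq_perf(r)   # serial part runs on the big core
    parallel_time = f / (mu * (n - r))      # parallel part runs on the U-cores
    return 1.0 / (serial_time + parallel_time)
```

With these assumptions, raising `mu` only helps the parallel term, so the serial fraction still bounds the achievable speedup. The sketch leaves out the power and bandwidth limits that the abstract says the actual model considers.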
Dark Silicon and the End of Multicore Scaling
Cited by 5 (1 self)
Since 2005, processor designers have increased core counts to exploit Moore’s Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9× average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
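The closing figures can be checked with a short calculation. This sketch assumes the "target of doubled performance per generation" is compounded over the five generations the abstract mentions, and that the "nearly 24-fold gap" is the difference between that target and the projected 7.9× speedup. Both readings are interpretations of the abstract, not statements from it.

```python
def target_speedup(generations: int) -> float:
    """Cumulative speedup if performance doubles every technology generation."""
    return 2.0 ** generations


GENERATIONS = 5          # "next five technology generations" in the abstract
PROJECTED_SPEEDUP = 7.9  # average speedup reported in the abstract

target = target_speedup(GENERATIONS)    # 32.0
shortfall = target - PROJECTED_SPEEDUP  # about 24.1, read here as the "24-fold gap"
ratio = target / PROJECTED_SPEEDUP      # about 4.05, target relative to projection

print(f"target {target:.0f}x, shortfall {shortfall:.1f}x, ratio {ratio:.2f}")
```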
Efficient Complex Operators for Irregular Codes
Cited by 4 (3 self)
Complex “fat operators” are important contributors to the efficiency of specialized hardware. This paper introduces two new techniques for constructing efficient fat operators featuring up to dozens of operations with arbitrary and irregular data and memory dependencies. These techniques focus on minimizing critical path length and load-use delay, which are key concerns for irregular computations. Selective Depipelining (SDP) is a pipelining technique that allows fat operators containing several, possibly dependent, memory operations. SDP allows memory requests to operate at a faster clock rate than the datapath, saving power in the datapath and improving memory performance. Cachelets are small, customized, distributed L0 caches embedded in the datapath to reduce load-use latency. We apply these techniques to Conservation Cores (c-cores) to produce coprocessors that accelerate irregular code regions while still providing superior energy efficiency. On average, these enhanced c-cores reduce EDP by 2× and area by 35% relative to c-cores. They are up to 2.5× faster than a general-purpose processor and reduce energy consumption by up to 8× for a variety of irregular applications including several SPECINT benchmarks.
Forwardflow: A Scalable Core for Power-Constrained CMPs
- In Proc. 37th Intl. Symp. on Computer Architecture, 2010
Cited by 4 (1 self)
Chip Multiprocessors (CMPs) are now commodity hardware, but commoditization of parallel software remains elusive. In the near term, the current trend of increased core-per-socket count will continue, despite a lack of parallel software to exercise the hardware. Future CMPs must deliver thread-level parallelism when software provides threads to run, but must also continue to deliver performance gains for single threads by exploiting instruction-level parallelism and memory-level parallelism. However, power limitations will prevent conventional cores from exploiting both simultaneously. This work presents the Forwardflow Architecture, which can scale its execution logic up to run single threads, or down to run multiple threads in a CMP. Forwardflow dynamically builds an explicit internal dataflow representation from a conventional instruction set architecture, using forward dependence pointers to guide instruction wakeup, selection, and issue. Forwardflow’s backend is organized into discrete units that can be individually (de-)activated, allowing each core’s performance to be scaled by system software at the architectural level. On single threads, Forwardflow core scaling yields a mean runtime reduction of 21% for a 37% increase in power consumption. For multithreaded workloads, a Forwardflow-based CMP allows system software to select the performance point that best matches available power.
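To see what the single-thread trade-off implies for energy, the sketch below combines the two reported percentages. It assumes the 21% runtime reduction and the 37% power increase can be treated as applying to the same run at average power, which is a simplification, since the abstract reports means across workloads.

```python
def relative_energy(runtime_reduction: float, power_increase: float) -> float:
    """Energy relative to the baseline, using energy = average power x time."""
    relative_time = 1.0 - runtime_reduction
    relative_power = 1.0 + power_increase
    return relative_time * relative_power


def relative_edp(runtime_reduction: float, power_increase: float) -> float:
    """Energy-delay product relative to the baseline."""
    relative_time = 1.0 - runtime_reduction
    return relative_energy(runtime_reduction, power_increase) * relative_time


energy = relative_energy(0.21, 0.37)  # 0.79 * 1.37, about 1.08
edp = relative_edp(0.21, 0.37)        # about 0.855

print(f"relative energy {energy:.3f}, relative EDP {edp:.3f}")
```

Under this simplification, scaling up costs roughly 8% more energy per run while lowering the energy-delay product by about 14%.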
Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware
Cited by 4 (2 self)
Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today’s predominant processor micro-architecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core micro-architecture. Moreover, while today’s predominant micro-architecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores
- In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2011
Cited by 3 (1 self)
Transistor density continues to increase exponentially, but power dissipation per transistor is improving only slightly with each generation of Moore’s law. Given the constant chip-level power budgets, this exponentially decreases the percentage of transistors that can switch at full frequency with each technology generation. Hence, while the transistor budget continues to increase exponentially, the power budget has become the dominant limiting factor in processor design. In this regime, utilizing transistors to design specialized cores that optimize energy-per-computation becomes an effective approach to improve system performance. To trade transistors for energy efficiency in a scalable manner, we propose Quasi-specific Cores, or QSCORES, specialized processors capable of executing multiple general-purpose computations while providing an order of magnitude more energy efficiency than a general-purpose processor. The QSCORE design flow is based on the insight that similar code patterns exist within and across applications. Our approach exploits these similar code patterns to ensure that a small set of specialized cores support a large number of commonly used computations. We evaluate QSCORE’s ability to target both a single application library (e.g., data structures) as well as a diverse workload consisting of applications selected from different domains (e.g., SPECINT, EEMBC, and Vision). Our results show that QSCORES can provide 18.4× better energy efficiency than general-purpose processors while reducing the amount of specialized logic required to support the workload by up to 66%.
Reducing the Energy Cost of Irregular Code Bases in Soft Processor Systems
Cited by 2 (1 self)
This paper describes an architecture and FPGA synthesis toolchain for building specialized, energy-saving coprocessors called Irregular Code Energy Reducers (ICERs) for a wide range of unmodified C programs. FPGAs are increasingly used to build large-scale systems, and many large software systems contain relatively little code that is amenable to automatic, semi-automatic, or even manual parallelization. Whereas accelerator approaches have traditionally achieved energy benefits as a side effect from increasing performance via parallel execution, ICERs aim to achieve energy gains even on code with little exploitable parallelism. Traditional approaches to automatically generating accelerators from existing software rely on inferring parallel execution from serial code, so they face the same code analysis challenges as parallelizing compilers. In contrast, because the ICER approach targets energy rather than performance, it easily scales to large, irregular applications that are poor candidates for traditional acceleration. Our results show that, compared to a baseline system with soft processor cores, ICERs can reduce energy consumption by up to 9.5× for the code they target and 2.8× for whole applications.
Keywords: Accelerator architectures; Reconfigurable architectures; Energy efficiency; High level synthesis
An Evaluation of Selective Depipelining for FPGA-based Energy-Reducing Irregular Code Coprocessors
Cited by 1 (0 self)
As the complexity of FPGA-based systems scales, the importance of efficiently handling irregular code increases. Recent work has proposed Irregular Code Energy Reducers (ICERs), a high-level synthesis approach for FPGAs that offers significant energy reduction for irregular code compared to a soft core processor. ICERs target the hot-spots of programs, and are seamlessly connected via a shared L1 cache with a soft processor that executes the cold code. This paper evaluates the application of the selective depipelining (SDP) technique to ICERs, which greatly reduces both the execution time and energy of irregular computations. SDP enables irregular computations to be expressed as large, fast, low-power combinational blocks. SDP maintains high memory bandwidth by scheduling the many potentially dependent memory operations within these blocks onto a high-frequency, highly-multiplexed coherent memory while scheduling combinational operations at a much lower frequency. SDP is a key enabler for improving the execution properties of irregular computations that are difficult to parallelize. We show that applying SDP to ICERs reduces energy-delay by 2.62× relative to ICERs. ICERs with SDP are up to 2.38× faster than a soft core processor and reduce energy consumption by up to 15.83× for a variety of irregular applications.
The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software
Cited by 1 (0 self)
On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics. On the software side, managed workloads written in languages such as C#, Java, JavaScript, and PHP are ubiquitous. Managed languages abstract over hardware using Virtual Machine (VM) services (garbage collection, interpretation, and/or just-in-time compilation) that together impose substantial energy and performance costs, ranging from 10% to over 80%. We show that these services manifest a differentiated performance and power workload. To differing degrees, they are parallel, asynchronous, communicate infrequently, and are not on the application’s critical path. We identify a synergy between AMP and VM services that we exploit to attack the 40% average energy overhead due to VM services. Using measurements and very conservative models, we show that adding small cores tailored for VM services should deliver, at least, improvements in performance of 13%, energy of 7%, and performance per energy of 22%. The yin of VM services is overhead, but it meets the yang of small cores on an AMP. The yin of AMP is exposed hardware complexity, but it meets the yang of abstraction in managed languages. VM services fulfill the AMP requirement for an asynchronous, non-critical, differentiated, parallel, and ubiquitous workload to deliver energy efficiency. Generalizing this approach beyond system software to applications will require substantially more software and hardware investment, but these results show the potential energy efficiency gains are significant.
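The three reported improvements are consistent with one another under a simple reading, sketched below. It assumes "performance per energy" is the ratio of relative performance to relative energy, and that a 7% energy improvement means energy falls to 0.93 of the baseline. Both are interpretations rather than definitions taken from the abstract.

```python
def perf_per_energy_gain(perf_gain: float, energy_saving: float) -> float:
    """Fractional gain in performance per energy.

    perf_gain     : fractional performance improvement (0.13 for 13%)
    energy_saving : fractional energy reduction (0.07 for 7%)
    """
    relative_perf = 1.0 + perf_gain
    relative_energy = 1.0 - energy_saving
    return relative_perf / relative_energy - 1.0


gain = perf_per_energy_gain(0.13, 0.07)  # 1.13 / 0.93 - 1, about 0.215
print(f"performance per energy improves by about {gain * 100:.0f}%")
```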
Scalable Cores in Chip Multiprocessors
Chip design is at an inflection point. It is now clear that chip multiprocessors (CMPs) will dominate product offerings for the foreseeable future. Such designs integrate many processing cores onto a single chip. However, debate remains about the relative merits of explicit software threading necessary to use these designs. At the same time, the pursuit of improved performance for single threads must continue, as legacy applications and hard-to-parallelize codes will remain important. These concerns lead computer architects to a quandary with each new design. Too much focus on per-core performance will fail to encourage software (and software developers) to migrate programs toward explicit concurrency; too little focus on cores will hurt performance of vital existing applications. Furthermore, because future chips will be constrained by power, it may not be possible to deploy both aggressive cores and many hardware threads in the same chip. To address the need for chips delivering both high single-thread performance and many hardware threads, this thesis evaluates Scalable Cores in Chip Multiprocessors: CMPs equipped with cores that deliver high-performance (at high per-core power) when the situation merits, but can also operate at lower-power modes, to enable concurrent execution of many threads. Toward this vision, I make several contributions. First, I discuss a method for representing inter-instruction

