M.: Understanding sources of inefficiency in general-purpose chips
- In ISCA ’10: Proceedings of the 37th Annual International Symposium on Computer Architecture, to appear (2010), IEEE Press
Cited by 16 (0 self)
Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s of fJ in 90 nm) mean that over 90% of the energy used in these solutions is still “overhead”. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution’s performance within 3x of its energy and within comparable area.
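The energy figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes the 7x and 25x energy gains compose multiplicatively, which the abstract implies but does not state; the constant and function names are illustrative only.

```python
# Energy-efficiency factors quoted in the abstract, all relative to the
# original four-processor CMP.
ASIC_ENERGY_GAIN = 500.0   # ASIC vs. original CMP
BROAD_ENERGY_GAIN = 7.0    # broadly applicable optimizations (e.g. SIMD)
SPECIFIC_GAIN = 25.0       # additional gain from algorithm-specific units


def combined_gain(*factors: float) -> float:
    """Multiply successive improvement factors.

    Assumption: the gains compose multiplicatively.
    """
    result = 1.0
    for factor in factors:
        result *= factor
    return result


customized_energy_gain = combined_gain(BROAD_ENERGY_GAIN, SPECIFIC_GAIN)
remaining_gap = ASIC_ENERGY_GAIN / customized_energy_gain

print(f"customized CMP energy gain: {customized_energy_gain:.0f}x")  # 175x
print(f"remaining gap to ASIC: {remaining_gap:.2f}x")                # 2.86x
```

Under that assumption the customized CMP is about 175x more energy efficient than the baseline, leaving a gap of roughly 2.9x to the ASIC, which is consistent with the "within 3x of its energy" claim.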
E.: Speculative DMA for Architecturally Visible Storage in Instruction Set Extensions
- In: Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, 2008
Cited by 4 (4 self)
Instruction set extensions (ISEs) can accelerate embedded processor performance. Many algorithms for ISE generation have shown good potential; some of them have recently been expanded to include Architecturally Visible Storage (AVS)—compiler-controlled memories, similar to scratchpads, that are accessible only to ISEs. To achieve a speedup using AVS, Direct Memory Access (DMA) transfers are required to move data from the main memory to the AVS; unfortunately, this creates coherence problems between the AVS and the cache, which previous methods for ISEs with AVS failed to address; additionally, these methods need to leave many conservative DMA transfers in place, whose execution significantly limits the achievable speedup. This paper presents a memory coherence scheme for ISEs with AVS, which can ensure execution correctness and memory consistency with minimal area overhead. We also present a method that speculatively removes redundant DMA transfers. Cycle-accurate experimental results were obtained using an FPGA-emulation platform. These results show that the application-specific instruction-set extended processors with speculative DMA-enhanced AVS gain significantly over previous techniques, despite the overhead of the coherence mechanism.
Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage
Present application-specific embedded systems tend to choose instruction set extensions (ISEs) based on limitations imposed by the available data bandwidth to custom functional units (CFUs). Adoption of the optimal ISE for an application would, in many cases, impose a formidable cost increase in order to achieve the required data bandwidth. In this paper we propose a novel methodology for laying out data in memories, generating high-bandwidth memory systems by making use of existing low-bandwidth, low-cost ones, and designing custom functional units, all with the desirable data bandwidth for only a fraction of the additional cost required by traditional techniques.
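The abstract does not describe the proposed layout methodology, so the following is not the authors' method. It is a minimal sketch of plain low-order interleaving, one generic way that several single-port, low-bandwidth memories can together serve a wide access. The function names and the one-read-per-bank-per-cycle cost model are assumptions made for illustration.

```python
from typing import List, Tuple


def interleave(data: List[int], num_banks: int) -> List[List[int]]:
    """Place element i in bank i % num_banks (low-order interleaving)."""
    banks: List[List[int]] = [[] for _ in range(num_banks)]
    for i, value in enumerate(data):
        banks[i % num_banks].append(value)
    return banks


def read_window(banks: List[List[int]], start: int, width: int) -> Tuple[List[int], int]:
    """Read `width` consecutive elements beginning at index `start`.

    Assumed cost model: each bank serves one read per cycle, so the access
    takes as many cycles as the most heavily used bank has reads.
    """
    num_banks = len(banks)
    accesses = [0] * num_banks
    values = []
    for i in range(start, start + width):
        bank = i % num_banks
        values.append(banks[bank][i // num_banks])
        accesses[bank] += 1
    return values, max(accesses)


data = list(range(16))
single = interleave(data, 1)   # one low-bandwidth memory
banked = interleave(data, 4)   # four such memories, interleaved

print(read_window(single, 5, 4))  # ([5, 6, 7, 8], 4)
print(read_window(banked, 5, 4))  # ([5, 6, 7, 8], 1)
```

With four banks, any four consecutive elements fall in distinct banks, so a custom functional unit could fetch them in a single cycle under this model, whereas a single memory needs four cycles.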
Architecture-Driven Synthesis of Reconfigurable Cells
- In: 2009 12th Euromicro Conference on Digital System Design / Architectures, Methods and Tools
In this paper, we present a novel method for merging sets of computational patterns into a reconfigurable cell while respecting design constraints and optimizing specific design aspects. Each cell can then be used in a run-time reconfigurable processor extension. Our method uses constraint programming to define the pattern merging problem and can therefore easily include design constraints and optimize different design aspects. Experiments carried out on the MediaBench test suite indicate a 50% average reduction of cell area without increasing the critical path.
Multiple Output Complex Instruction Matching Algorithm for Extensible Processors
To meet the increasing performance and power demands of embedded applications, processors are now extended with application-specific functional units. Customized functional units, both the hardware and the corresponding instructions, are added to the base processor to improve computational efficiency for a target application. When generating these complex instructions, and when generating code for the extended processor, one of the critical challenges for the compiler is to perform fast and efficient instruction matching and selection automatically. In this project, we developed a novel and efficient algorithm for matching multiple-output complex functional units (FUs). We also illustrate that the assumption underlying most current covering methodologies may not always hold. Current covering algorithms generally aim to find the optimal cover within each basic block, that is, the cover that minimizes the number of selected matches. Fewer matches translate to fewer operations to schedule, and the increased scheduling freedom is expected to lead to a better (shorter) schedule. We provide some examples showing that minimizing the number of matches does not necessarily minimize the execution time.
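The abstract's own counterexamples are not reproduced in this listing. The sketch below is a hypothetical illustration of the same point: two covers of one basic block, with invented match names and latencies, scheduled under an assumed unlimited issue width. The cover with fewer matches ends up with the longer schedule.

```python
from typing import Dict, List, Tuple

# A cover is modelled as a DAG of selected matches:
# match name -> (latency in cycles, names of predecessor matches).
Cover = Dict[str, Tuple[int, List[str]]]


def schedule_length(cover: Cover) -> int:
    """Critical-path length of a cover, assuming unlimited issue width."""
    finish: Dict[str, int] = {}

    def finish_time(node: str) -> int:
        if node not in finish:
            latency, preds = cover[node]
            start = max((finish_time(p) for p in preds), default=0)
            finish[node] = start + latency
        return finish[node]

    return max(finish_time(node) for node in cover)


# Hypothetical cover with few matches: two large complex instructions
# that must run one after the other (latencies are invented).
cover_few: Cover = {"M1": (3, []), "M2": (3, ["M1"])}

# Hypothetical cover with more matches: three independent small matches
# feeding a fourth (latencies are invented).
cover_many: Cover = {
    "p": (1, []),
    "q": (1, []),
    "r": (1, []),
    "s": (2, ["p", "q", "r"]),
}

print(len(cover_few), schedule_length(cover_few))    # 2 6
print(len(cover_many), schedule_length(cover_many))  # 4 3
```

A selector that only minimizes the number of matches would pick the two-match cover here, although the four-match cover finishes in half the cycles under these assumed latencies.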

