Results 1 - 10 of 30
Understanding sources of inefficiency in general-purpose chips
- In ISCA ’10: Proceedings of the 37th Annual International Symposium on Computer Architecture (2010), IEEE Press
"... Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose proces ..."
Abstract
-
Cited by 77 (2 self)
- Add to MetaCart
(Show Context)
Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s of fJ in 90nm) mean that over 90% of the energy used in these solutions is still “overhead”. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution’s performance within 3x of its energy and within comparable area.
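As a quick back-of-the-envelope check on how the quoted factors compose, using only the numbers in the abstract:

```python
# Back-of-the-envelope check of the energy factors quoted above.
baseline = 1.0                 # CMP energy, normalized
asic = baseline / 500          # ASIC quoted as 500x more energy efficient

# Broadly applicable optimizations: 7x energy; algorithm-specific
# functional units: a further 25x.
customized_cmp = baseline / (7 * 25)   # = 1/175 of the baseline energy

print(customized_cmp / asic)   # 500/175 ~ 2.86, i.e. "within 3x" of the ASIC
```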
MOVE-Pro: a low power and high code density TTA architecture
- H. Corporaal, in Proceedings of the 11th International Conference on Embedded Computer Systems (SAMOS-XI), Springer-Verlag
, 2011
"... Abstract—Transport Triggered Architectures (TTAs) possess many advantageous, such as modularity, flexibility, and scalabil-ity. As an exposed datapath architecture, TTAs can effectively reduce the register file (RF) pressure in both number of accesses and number of RF ports. However, the conventiona ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
(Show Context)
Transport Triggered Architectures (TTAs) possess many advantages, such as modularity, flexibility, and scalability. As an exposed-datapath architecture, TTAs can effectively reduce register file (RF) pressure in both the number of accesses and the number of RF ports. However, conventional TTAs also have some evident disadvantages, such as relatively low code density, wasted dynamic power due to the separate scheduling of source operands, and inefficient support for variable immediate values. To preserve the merits of conventional TTAs while solving these issues, we propose MOVE-Pro, a novel low-power, high code density TTA architecture. With optimizations at the instruction set architecture (ISA), architecture, circuit, and compiler levels, the low-power potential of TTAs is fully exploited. Moreover, with a much denser code size, TTA performance is also improved accordingly. In a head-to-head comparison between a two-issue MOVE-Pro processor and its RISC counterpart, we show that up to 80% of RF accesses can be eliminated, and the reduction in RF power is successfully transferred into total core power savings. Our MOVE-Pro processor achieves up to an 11% reduction in total core power, while its code density is almost the same as that of its RISC counterpart.
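For readers unfamiliar with the transport-triggered style, the toy sketch below illustrates the exposed-datapath idea the abstract builds on. It models a generic TTA, not MOVE-Pro's actual ISA, and all names are hypothetical.

```python
# Toy transport-triggered sketch (generic TTA, not the MOVE-Pro ISA).
# A TTA program is a list of data moves between ports; writing to a
# function unit's trigger port is what fires the operation.
class AddUnit:
    def __init__(self):
        self.operand = 0           # non-triggering operand port
        self.result = 0

    def trigger(self, value):      # a move into this port fires the add
        self.result = self.operand + value

adder = AddUnit()
rf = {"r1": 3, "r2": 4}            # tiny register file

# Two moves replace one RISC "add r3, r1, r2":
adder.operand = rf["r1"]           # move r1 -> add.operand
adder.trigger(rf["r2"])            # move r2 -> add.trigger (fires)

# The result could feed another unit directly; writing it back to the
# RF is itself an explicit move, which is how TTAs cut RF accesses.
rf["r3"] = adder.result
print(rf["r3"])                    # 7
```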
Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability
"... Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these deviceswillbedrivenbyprovidinganevenricheruserexperienceand compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, games, and voice interfac ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
Mobile computing as exemplified by the smart phone has become an integral part of our daily lives. The next generation of these devices will be driven by providing an even richer user experience and compelling capabilities: higher definition multimedia, 3D graphics, augmented reality, games, and voice interfaces. To address these goals, the core computing capabilities of the smart phone must be scaled. However, energy budgets are increasing at a much lower rate, requiring fundamental improvements in computing efficiency. SIMD accelerators offer the combination of high performance and low energy consumption through low control and interconnect overhead. However, SIMD accelerators are not a panacea. Many applications lack sufficient vector parallelism to effectively utilize a large number of SIMD lanes. Further, the use of symmetric hardware lanes leads to low utilization and high static power dissipation as SIMD width is scaled. To address these inefficiencies, this paper focuses on breaking two traditional rules of SIMD processing: homogeneity and static configuration. The Libra accelerator increases SIMD utility by blurring the divide between vector and instruction parallelism to support efficient execution of a wider range of loops, and it increases hardware utilization through the use of heterogeneous hardware across the SIMD lanes. Experimental results show that the 32-lane Libra outperforms traditional SIMD accelerators by an average of 1.58x due to higher loop coverage, with 29% less energy consumption through heterogeneous hardware.
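The heterogeneous-lane idea can be illustrated with a toy scheduler. The lane mix and greedy policy below are assumptions for illustration, not Libra's actual microarchitecture.

```python
# Toy scheduler for heterogeneous SIMD lanes (illustrative assumptions,
# not Libra's microarchitecture): only some lanes carry a multiplier,
# and each op is greedily steered to the first free capable lane.
lanes = [
    {"id": 0, "ops": {"add", "mul"}},   # "fat" lane with a multiplier
    {"id": 1, "ops": {"add"}},          # cheap lanes: adders only
    {"id": 2, "ops": {"add"}},
    {"id": 3, "ops": {"add", "mul"}},
]

def schedule(op_stream):
    cycles, pending = [], list(op_stream)
    while pending:
        free, issued = {l["id"] for l in lanes}, []
        for op in list(pending):
            lane = next((l for l in lanes
                         if l["id"] in free and op in l["ops"]), None)
            if lane:
                free.discard(lane["id"])
                issued.append((lane["id"], op))
                pending.remove(op)
        if not issued:                  # op with no capable lane: give up
            break
        cycles.append(issued)
    return cycles

# Four multiplies need two cycles: only two lanes have a multiplier.
for c, issued in enumerate(schedule(["mul"] * 4)):
    print("cycle", c, issued)
```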
WebCore: Architectural support for mobile Web browsing
- In Proceedings of the 2014 ACM/IEEE 41st International Symposium on Computer Architecture
, 2014
"... The Web browser is undoubtedly the single most impor-tant application in the mobile ecosystem. An average user spends 72 minutes each day using the mobile Web browser. Web browser internal engines (e.g., WebKit) are also growing in importance because they provide a common substrate for developing va ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
The Web browser is undoubtedly the single most important application in the mobile ecosystem. An average user spends 72 minutes each day using the mobile Web browser. Web browser internal engines (e.g., WebKit) are also growing in importance because they provide a common substrate for developing various mobile Web applications. In a user-driven, interactive, and latency-sensitive environment, the browser’s performance is crucial. However, the battery-constrained nature of mobile devices limits the performance that we can deliver for mobile Web browsing. As traditional general-purpose techniques to improve performance and energy efficiency fall short, we must employ domain-specific knowledge while still maintaining general-purpose flexibility. In this paper, we first perform design-space exploration to identify appropriate general-purpose architectures that uniquely fit the characteristics of a popular Web browsing engine. Despite our best effort, we discover sources of energy inefficiency in these customized general-purpose architectures. To mitigate these inefficiencies, we propose, synthesize, and evaluate two new domain-specific specializations, called the Style Resolution Unit and the Browser Engine Cache. Our optimizations boost energy efficiency and at the same time improve mobile Web browsing performance. As emerging mobile workloads increasingly rely on Web browser technologies, the type of optimizations we propose will become important in the future and are likely to have lasting widespread impact.
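The Style Resolution Unit and Browser Engine Cache are hardware structures; as a software analogue, the hypothetical sketch below shows the kind of redundancy in style resolution that caching can exploit. The rules and matching logic are invented for illustration.

```python
# Hypothetical software analogue of style-resolution caching (the
# paper's Style Resolution Unit is hardware; rules here are invented).
from functools import lru_cache

RULES = [                            # (selector, property, value)
    ("p",    "margin", "8px"),
    (".nav", "color",  "blue"),
]

@lru_cache(maxsize=1024)
def resolve_style(tag, classes):     # classes: tuple, so it is hashable
    style = {}
    for selector, prop, value in RULES:
        if selector == tag or selector.lstrip(".") in classes:
            style[prop] = value
    return tuple(style.items())

# Repeated <p class="nav"> elements hit the cache after the first miss:
for _ in range(3):
    resolve_style("p", ("nav",))
print(resolve_style.cache_info())    # hits=2, misses=1
```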
SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures
"... Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of the SIMD hardware into real application performance. In scie ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of the SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting large levels of data-level parallelism (DLP). However, vectorization is often much less effective for media applications due to low trip count loops, complex control flow, and non-uniform execution behavior. As a result, SIMD lanes remain idle due to insufficient DLP. To attack this problem, this paper proposes a new vectorization pass called SIMD Defragmenter to uncover hidden DLP that lurks below the surface in the form of instruction-level parallelism (ILP). The difficulty is managing the data packing/unpacking overhead that can easily exceed the benefits gained through SIMD execution. The SIMD Defragmenter overcomes this problem by identifying groups of compatible instructions (subgraphs) that can be executed in parallel across the SIMD lanes. By SIMDizing in bulk at the subgraph level, packing/unpacking overhead is minimized. On a 16-lane SIMD processor, experimental results show that SIMD defragmentation achieves a mean 1.6x speedup over traditional loop vectorization and a 31% gain over prior research approaches for converting ILP to DLP.
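A minimal sketch of the subgraph-level SIMDization idea (the example and packing strategy are illustrative assumptions, not the paper's compiler pass): four independent occurrences of the same scalar subgraph, a*b + c, are packed once, executed across lanes, and unpacked once.

```python
# Sketch of subgraph-level SIMDization (assumptions mine, not the
# paper's compiler pass): four independent copies of the scalar
# subgraph a*b + c are packed once, run across lanes, unpacked once.
import numpy as np

scalars = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)]

# One bulk pack per operand position instead of per operation.
a, b, c = (np.array(col) for col in zip(*scalars))

# Both ops of the subgraph stay vectorized; no repacking in between.
result = a * b + c       # lanewise multiply feeds lanewise add directly
print(result)            # [  5  26  65 122]
```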
On-Chip Vector Coprocessor Sharing for Multicores
- In the 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing
"... Abstract — For most of the applications that make use of a vector coprocessor, the resources are not highly utilized due to the lack of sustained data parallelism, which sometimes occurs due to vector-length changes in dynamic environments. The motivation of our work stems from (a) the mandate for m ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
(Show Context)
For most of the applications that make use of a vector coprocessor, the resources are not highly utilized due to the lack of sustained data parallelism, which sometimes occurs due to vector-length changes in dynamic environments. The motivation of our work stems from (a) the mandate for multicore designs to make efficient use of the on-chip resources; (b) the frequent presence of vector operations in high-performance scientific and embedded applications; (c) the increased probability that different cores may deal with different vector lengths at various times; and (d) the fact that different vector kernels in the same or different application suites may have diverse computation needs. Our objective is to provide a versatile design framework that can facilitate vector coprocessor sharing among multiple cores in a manner that maximizes resource utilization while also yielding very high performance at reduced cost. We propose three basic shared vector coprocessor architectures for multicores, based on coarse-grain, fine-grain, and vector lane sharing. We benchmark these distinct vector architectures for a dual-core system using floating-point performance and resource utilization metrics. Our analysis shows that vector lane sharing, where the number of vector lanes assigned to a core can be controlled dynamically, provides the greatest flexibility and generally yields very good results. Since, however, each of the three design choices has its own performance advantages under certain vector-load conditions, we ultimately suggest a hybrid vector coprocessor design that can support all three architectural choices as per the collective needs of the cores and applications. Keywords: vector coprocessor, coprocessor sharing, multicore, FPGA prototyping, MicroBlaze.
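As an illustration of dynamic vector lane sharing, the sketch below splits a lane pool between two cores in proportion to demand. The policy and numbers are assumptions, not the paper's hardware arbitration.

```python
# Illustrative lane-sharing sketch (assumed policy, not the paper's
# hardware): a pool of vector lanes is split between two cores in
# proportion to each core's requested vector length.
TOTAL_LANES = 8

def allocate_lanes(req_core0, req_core1):
    """Split lanes proportionally to demand; each core keeps at least
    one lane while it has work."""
    total = req_core0 + req_core1
    if total == 0:
        return 0, 0
    share0 = max(1, round(TOTAL_LANES * req_core0 / total)) if req_core0 else 0
    share0 = min(share0, TOTAL_LANES - (1 if req_core1 else 0))
    return share0, TOTAL_LANES - share0

print(allocate_lanes(64, 64))   # balanced demand -> (4, 4)
print(allocate_lanes(96, 32))   # skewed demand   -> (6, 2)
print(allocate_lanes(16, 0))    # one idle core   -> (8, 0)
```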
Customizing Wide-SIMD Architectures for H.264
"... Abstract—In recent years, the mobile phone industry has become one of the most dynamic technology sectors. The increasing demands of multimedia services on the cellular networks have accelerated this trend. This paper presents a low power SIMD architecture that has been tailored for efficient implem ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
In recent years, the mobile phone industry has become one of the most dynamic technology sectors. The increasing demands of multimedia services on cellular networks have accelerated this trend. This paper presents a low-power SIMD architecture that has been tailored for efficient implementation of H.264 encoder/decoder kernel algorithms. Several customized features have been added to improve processing performance and lower power consumption. These include support for different SIMD widths to increase SIMD utilization efficiency, a diagonal memory organization to support both column and row access, temporary buffer and bypass support to reduce register file power consumption, fused operation support to increase processing performance, and a fast programmable crossbar to support complex data permutation patterns. The proposed architecture increases the throughput of H.264 encoder/decoder kernel algorithms by a factor of 2.13 while achieving a 29% energy-delay improvement on average compared to our previous SIMD architecture, SODA.
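The diagonal memory organization can be sketched directly: if element (r, c) of an N x N block is stored in bank (r + c) mod N, any full row and any full column touch each bank exactly once, so both access patterns are conflict-free. This is a standard skewing scheme; the paper's exact mapping may differ.

```python
# Diagonal (skewed) memory layout sketch: element (r, c) of an N x N
# block lives in bank (r + c) % N, so rows and columns are both
# conflict-free across N single-ported banks.
N = 4

def bank(r, c):
    return (r + c) % N

row_banks = [bank(0, c) for c in range(N)]   # row 0
col_banks = [bank(r, 0) for r in range(N)]   # column 0
print(sorted(row_banks))   # [0, 1, 2, 3] - every bank hit once
print(sorted(col_banks))   # [0, 1, 2, 3] - every bank hit once
```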
Mighty-Morphing Power-SIMD
"... In modern wireless devices, two broad classes of compute-intensive applications are common: those with high amounts of data-level parallelism, such as signal processing used in wireless baseband applications, and those that have little data-level parallelism, such as encryption. Wide single-instruct ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
In modern wireless devices, two broad classes of compute-intensive applications are common: those with high amounts of data-level parallelism, such as the signal processing used in wireless baseband applications, and those that have little data-level parallelism, such as encryption. Wide single-instruction multiple-data (SIMD) processors have become popular for providing high-performance yet power-efficient data engines for applications with abundant data parallelism. However, non-data-parallel applications are relegated to a low-performance scalar datapath on these data engines while the SIMD resources sit idle. To accelerate both types of applications, we propose the design of a more flexible SIMD datapath called SIMD-Morph. In SIMD-Morph, code with data-level parallelism can be executed across the lanes in the traditional manner, but the lanes can be morphed into a feed-forward subgraph accelerator to execute scalar applications more efficiently. The morphed SIMD lanes form an accelerator that exploits both instruction-level parallelism and operation chaining to improve the performance of scalar code using the resources available in the SIMD lanes. Experimental results show a 2.6x performance improvement for purely non-SIMD applications and a 1.4x improvement for the non-SIMD-ized portions of applications with data parallelism.
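A toy model of the morphed feed-forward chain follows; the operations and the three-lane mapping are hypothetical. Each stage stands in for one morphed lane, and results are forwarded lane to lane rather than written back to the register file between operations.

```python
# Toy model of a morphed feed-forward chain (assumed mapping, not the
# SIMD-Morph hardware): a scalar dependence chain executes as chained
# lane operations with no register-file round trips in between.
def morphed_chain(x):
    stage0 = x + 7          # lane 0: add
    stage1 = stage0 << 1    # lane 1: shift, fed directly by lane 0
    stage2 = stage1 ^ 0xFF  # lane 2: xor, fed directly by lane 1
    return stage2

print(morphed_chain(5))     # ((5 + 7) << 1) ^ 0xFF = 231
```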
Performance-Energy Optimizations for Shared Vector Accelerators in Multicores
- Preprint: http://www.computer.org/csdl/trans/tc/preprint/06718035.pdf
, 2014
"... Abstract For multicore processors with a private vector coprocessor (VP) per core, VP resources may not be highly utilized due to limited data-level parallelism (DLP) in applications. Also, under low VP utilization static power dominates the total energy consumption. We enhance here our previously ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
For multicore processors with a private vector coprocessor (VP) per core, VP resources may not be highly utilized due to limited data-level parallelism (DLP) in applications. Also, under low VP utilization, static power dominates the total energy consumption. We enhance here our previously proposed VP-sharing framework for multicores in order to increase VP utilization while reducing static energy. We describe two power-gating (PG) techniques that dynamically control the VP's width based on utilization figures. Floating-point results on an FPGA prototype show that the PG techniques reduce energy needs by 30-35% with negligible performance reduction compared to a multicore with the same amount of hardware resources in which each core is attached to a private VP.
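A minimal sketch of a utilization-driven power-gating policy of the kind described; the thresholds and the halving/doubling steps are assumptions, not the paper's measured policy.

```python
# Utilization-driven power-gating sketch (thresholds are assumptions):
# halve the active vector-lane count when utilization is low, and
# restore lanes when it climbs back up.
def adjust_width(active_lanes, utilization, max_lanes=8):
    if utilization < 0.25 and active_lanes > 1:
        return active_lanes // 2                 # gate off half the lanes
    if utilization > 0.75 and active_lanes < max_lanes:
        return min(max_lanes, active_lanes * 2)  # wake lanes back up
    return active_lanes                          # within band: no change

width = 8
for util in (0.10, 0.15, 0.80, 0.90):
    width = adjust_width(width, util)
    print(f"utilization={util:.2f} -> {width} active lanes")
```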
Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors
"... Abstract—Wideband channelization is a computationally in-tensive task within software-defined radio (SDR). To support this task, the underlying hardware should provide high performance and allow flexible implementations. Traditional solutions use field-programmable gate arrays (FPGAs) to satisfy the ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Wideband channelization is a computationally intensive task within software-defined radio (SDR). To support this task, the underlying hardware should provide high performance and allow flexible implementations. Traditional solutions use field-programmable gate arrays (FPGAs) to satisfy these requirements. While FPGAs allow for flexible implementations, realizing an FPGA implementation is a difficult and time-consuming process. Multicore processors, on the other hand, while more programmable, fail to satisfy performance requirements. Graphics processing units (GPUs) overcome the above limitations. However, traditional GPUs are power-hungry and can consume as much as 350 watts, making them ill-suited for many SDR environments, particularly those that are battery-powered. Here we explore the viability of low-power mobile graphics processors to simultaneously overcome the limitations of performance, flexibility, and power. Via execution profiling and performance analysis, we identify the major bottlenecks in mapping the wideband channelization algorithm onto these devices and adopt several optimization techniques to achieve multiplicative speed-up over a multithreaded implementation. Overall, our approach delivers a speedup of up to 43-fold on the discrete AMD Radeon HD 6470M GPU and 27-fold on the integrated AMD Radeon HD 6480G GPU, when compared to a vectorized and multithreaded version running on the AMD A4-3300M CPU. Index Terms: polyphase filter banks; mobile GPU; wideband channelization; software-defined radio.
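Since the index terms mention polyphase filter banks, here is a minimal critically sampled polyphase channelizer in the standard textbook form. The channel count, prototype filter, and indexing convention are assumptions, not the paper's configuration.

```python
# Minimal critically sampled polyphase channelizer (standard textbook
# structure; sizes and the prototype filter are toy assumptions). The
# wideband input is split into M channels by M polyphase FIR branches
# followed by an M-point FFT per output step.
import numpy as np

M = 4                                  # number of channels
taps = np.ones(16) / 16                # toy prototype low-pass filter
poly = taps.reshape(-1, M)             # polyphase decomposition: 4 branches

x = np.random.randn(1024)              # wideband input samples

def channelize(x):
    frames = x.reshape(-1, M)          # one M-sample frame per output step
    out = []
    for n in range(poly.shape[0] - 1, frames.shape[0]):
        # Each branch filters its own decimated sub-stream.
        window = frames[n - poly.shape[0] + 1 : n + 1][::-1]  # newest first
        branch = np.sum(window * poly, axis=0)
        out.append(np.fft.fft(branch)) # FFT recombines branches -> channels
    return np.array(out)               # shape: (steps, M) channel samples

print(channelize(x).shape)             # (253, 4)
```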