Results 1 - 4 of 4
Value-Based Clock Gating and Operation Packing: Dynamic Strategies for Improving Processor Power and Performance
ACM Transactions on Computer Systems, 2000
Cited by 25 (0 self)
This article presents our observations demonstrating that operations on "narrow-width" quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or less. Based on this data, we propose two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations. The first, power-oriented optimization reduces processor power consumption by using operand-value-based clock gating to turn off portions of arithmetic units that will be unused by narrow-width operations. This optimization results in a 45%-60% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. Applying this optimization to SPECfp95 benchmarks results in slightly smaller power reductions, but still seems warranted. These reductions in integer unit power consumption equate to a 5%-10% full-chip power savings. Our second, performance-oriented optimization improves processor performance by packing together narrow-width operations so that they share a single arithmetic unit. Conceptually similar to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench. Overall, these optimizations highlight an increasing opportunity for value-based optimizations to improve both power and performance in current microprocessors.
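The narrow-width test that this abstract relies on can be illustrated in software. The following is a minimal sketch, not the hardware mechanism from the article: it assumes 64-bit two's-complement operands and treats a value as narrow when all bits above the low 16 are copies of the sign bit. The names `is_narrow` and `narrow_fraction` are hypothetical.

```python
def is_narrow(value, width=64, narrow_bits=16):
    """Return True if a two's-complement value fits in `narrow_bits` bits.

    Assumption: `width`-bit operands; the value is narrow when every bit
    from position narrow_bits-1 upward is identical (all 0s or all 1s).
    """
    value &= (1 << width) - 1                # view the value as a width-bit pattern
    upper = value >> (narrow_bits - 1)       # sign bit of the narrow part and all bits above it
    all_ones = (1 << (width - narrow_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def narrow_fraction(operand_pairs):
    """Fraction of two-operand operations whose operands are both narrow."""
    narrow = sum(1 for a, b in operand_pairs if is_narrow(a) and is_narrow(b))
    return narrow / len(operand_pairs)
```

In a hardware setting, a check of this kind would decide whether the upper portion of an arithmetic unit can be gated off or whether two operations can share one unit; here it only counts how often that opportunity would arise in a list of operand pairs.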
Reduced Power Dissipation Through Truncated Multiplication
in IEEE Alessandro Volta Memorial Workshop on Low Power Design, 1999
Cited by 19 (5 self)
Reducing the power dissipation of parallel multipliers is important in the design of digital signal processing systems. In many of these systems, the products of parallel multipliers are rounded to avoid growth in word size. The power dissipation and area of rounded parallel multipliers can be significantly reduced by a technique known as truncated multiplication. With this technique, the least significant columns of the multiplication matrix are not used. Instead, the carries generated by these columns are estimated. This estimate is added with the most significant columns to produce the rounded product. This paper presents the design and implementation of parallel truncated multipliers. Simulations indicate that truncated parallel multipliers dissipate between 29 and 40 percent less power than standard parallel multipliers for operand sizes of 16 and 32 bits. 1: Introduction High-speed parallel multipliers are fundamental building blocks in digital signal processing systems [1]. In...
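The idea of dropping the least significant columns and estimating their contribution can be sketched in software. This is an illustrative model only, assuming unsigned `n`-bit operands; the constant correction used here (one quarter of the largest value the dropped columns can take) is an assumption chosen for simplicity, not the estimate used in the paper. The name `truncated_multiply` is hypothetical.

```python
def truncated_multiply(a, b, n=8, k=4):
    """Approximate the rounded n-bit product of two unsigned n-bit operands.

    Partial-product bits in the k least significant columns are never summed.
    Assumption: their contribution is replaced by a fixed constant.
    """
    assert 0 <= k <= n
    total = 0
    for i in range(n):
        if not (a >> i) & 1:
            continue
        for j in range(n):
            # Keep only partial-product bits that fall in column k or higher.
            if (b >> j) & 1 and i + j >= k:
                total += 1 << (i + j)
    # Column c (for c < n) holds c + 1 partial-product bits of weight 2**c.
    dropped_max = sum((c + 1) << c for c in range(k))
    correction = dropped_max // 4            # assumed estimate of the dropped columns
    # Add the rounding constant and keep the n most significant bits.
    return (total + correction + (1 << (n - 1))) >> n
```

With `k=0` no columns are dropped and the function returns the exactly rounded product; with `n=8, k=4` the result stays within one unit in the last place of that value, since the dropped columns contribute less than `2**n`.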
Implementation of Low Power Digital Multipliers Using 10 Transistor Adder Blocks
2005
Cited by 2 (0 self)
The increasing demand for the high fidelity portable devices has laid emphasis on the development of low power and high performance systems. In the next generation processors, the low power design has to be incorporated into fundamental computation units, such as multipliers. The characterization and optimization of such low power multipliers will aid in comparison and choice of multiplier modules in system design. In this paper we performed a comparative analysis of the power, delay, and power-delay product (PDP) optimization characteristics of four parallel digital multipliers implemented using low power 10 transistor (10T) adders and conventional CMOS adder cells. In order to achieve optimal power savings at smaller geometry sizes, we proposed a heuristic approach known as hybrid adder models. Multipliers realized using the Static Energy Recovery Full adder (SERF) circuit consumed considerably less power compared to 10T and static CMOS based multipliers for all the configurations studied. Furthermore, the difference between the power consumption of the 10 transistor based multipliers and 28T multipliers is significant at 180 nm, but not at 70 nm. For smaller geometry sizes down to 70 nm, the propagation delay of the multipliers implemented with 10 transistors translates to a better performance measure. Carry-Save Multipliers had better PDP range than the other multipliers for all the three adder submodule designs. The
Arithmetic Circuits for Energy-Precision Tradeoffs in Mobile Graphics Processing Units
In mobile devices, limiting the Graphics Processing Unit's (GPU's) energy usage is of great importance to extending battery life. This work shows that significant energy savings can be obtained by reducing the precision of graphics computations, yet maintaining acceptable quality of the final rendered image. In particular, we focus on a portion of a typical graphics processor pipeline (the vertex transformation stage) and evaluate the tradeoff between energy efficiency and image fidelity. We first develop circuit-level designs of arithmetic components whose precision can be varied dynamically with fine-grained power gating techniques. SPICE simulation is used to characterize each component's energy consumption, based on which a system-level energy model for the entire vertex stage is developed. We then use this energy model in conjunction with a graphics hardware simulator to determine the energy savings for real workloads. Results show that significant energy savings (>60%) can be obtained by lowering the arithmetic precision of this stage without causing any noticeable artifacts in the final image. Furthermore, our approach allows for even greater energy savings for only a modest loss of image quality. Thus, this work finds that the
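The effect of lowering arithmetic precision in a vertex transformation can be emulated in software. The sketch below is only a numerical illustration under stated assumptions: values are IEEE 754 single-precision floats, precision is reduced by truncating low mantissa bits after every multiply and add, and the transform is a plain 4x4 matrix-vector product. It says nothing about the circuits or energy model in the paper. The names `reduce_precision` and `transform_vertex` are hypothetical.

```python
import struct


def reduce_precision(x, mantissa_bits):
    """Truncate a float32 value to `mantissa_bits` of its 23-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # raw float32 bit pattern
    drop = 23 - mantissa_bits
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF                # clear the dropped mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def transform_vertex(matrix, vertex, mantissa_bits):
    """Multiply a 4x4 matrix by a 4-component vertex at reduced precision."""
    result = []
    for row in matrix:
        acc = 0.0
        for m, v in zip(row, vertex):
            product = reduce_precision(m * v, mantissa_bits)
            acc = reduce_precision(acc + product, mantissa_bits)
        result.append(acc)
    return result
```

Comparing the output at, say, 8 mantissa bits against the output at 23 bits gives a rough feel for how much positional error a given precision setting introduces.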