Optimizing pipelines for power and performance
 in International Symposium on Microarchitecture (MICRO-35), Nov. 2002. Selected as one of the four Best IBM Research Papers in Computer Science, Electrical Engineering and Math published in 2002.
Abstract

Cited by 46 (4 self)
During the concept phase and definition of next-generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive the optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation-based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.
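The trade-off in this abstract can be sketched in a few lines: deeper pipelines shorten the clock period but add latch overhead, raise CPI through hazard penalties, and increase clock/latch power. A toy model, maximising a BIPS^3/W-like metric, illustrates the shape of the optimization; all constants here are illustrative assumptions, not the paper's fitted values, so the optimum below will generally differ from the paper's ~18 FO4 result.

```python
# Toy analytical pipeline-depth optimisation in the spirit of the paper's
# methodology.  Every constant is an illustrative assumption.

def stage_delay(depth, total_logic_fo4=180.0, latch_fo4=3.0):
    """Clock period per stage in FO4: the logic path split across stages,
    plus a fixed latch/skew overhead per stage."""
    return total_logic_fo4 / depth + latch_fo4

def performance(depth, base_cpi=1.0, hazard_penalty=0.02):
    """Instructions per unit time (arbitrary units): deeper pipelines raise
    frequency but also raise CPI via longer hazard penalties (linear model)."""
    cpi = base_cpi + hazard_penalty * depth
    return 1.0 / (cpi * stage_delay(depth))

def power(depth, dynamic=1.0, latch_cost=0.08, leakage=0.3):
    """Relative power: clock and latch power grow with pipeline depth."""
    return dynamic + latch_cost * depth + leakage

def bips3_per_watt(depth):
    """Cubed-performance-per-watt metric, analogous to BIPS^3/W."""
    return performance(depth) ** 3 / power(depth)

best = max(range(4, 40), key=bips3_per_watt)
print("optimal depth:", best, "clock period: %.1f FO4" % stage_delay(best))
```

Because performance is cubed, the metric rewards frequency less aggressively than a pure-performance objective, pulling the optimum toward shallower pipelines, which is the qualitative point of the paper.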
The Sum-Absolute-Difference Motion Estimation Accelerator
 In Proceedings of the 24th Euromicro Conference
Abstract

Cited by 28 (15 self)
In this paper we investigate the Sum Absolute Difference (SAD) operation, an operation frequently used by a number of algorithms for digital motion estimation. For this operation, we propose a single vector instruction that can be performed (in hardware) on an entire block of data in parallel, and we investigate possible implementations for such an instruction. Assuming a machine cycle comparable to the cycle of a two-cycle multiply, we show that for a block of 16x1 or 16x16, the SAD operation can be performed in 3 or 4 machine cycles, respectively. The proposed implementation operates as follows: first, we determine in parallel which operand is the smallest in each pair of operands. Second, we compute the absolute value of the difference of each pair by subtracting the smallest value from the largest, and finally we compute the accumulation. The operations associated with the second and third steps are performed in parallel, resulting in a multiply (accumulate) type of operation. Our approach also covers the Mean Absolute Difference (MAD) operation, excluding a shifting (division) operation.
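The three-step scheme in the abstract can be sketched behaviourally; the function names here are illustrative, and where the hardware fuses steps 2 and 3 into one block-wide operation, this model runs them sequentially per element.

```python
def sad_block(a, b):
    """Sum of absolute differences over a block, following the abstract's
    three-step scheme: (1) compare each operand pair to find the smaller,
    (2) subtract the smallest from the largest, (3) accumulate.  In the
    proposed hardware, steps 2 and 3 fuse into a multiply-accumulate-style
    operation over the whole block in parallel."""
    assert len(a) == len(b)
    total = 0
    for x, y in zip(a, b):
        lo, hi = (x, y) if x <= y else (y, x)   # step 1: select min/max
        total += hi - lo                         # steps 2+3: |x - y|, accumulated
    return total

def mad_block(a, b):
    """Mean absolute difference: SAD followed by the division/shift that the
    proposed instruction itself excludes."""
    return sad_block(a, b) / len(a)
```

Computing |x − y| as max − min avoids a conditional negation after the subtraction, which is why the hardware determines the smaller operand first.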
Reduced Power Dissipation Through Truncated Multiplication
 in IEEE Alessandro Volta Memorial Workshop on Low Power Design
, 1999
Abstract

Cited by 26 (7 self)
Reducing the power dissipation of parallel multipliers is important in the design of digital signal processing systems. In many of these systems, the products of parallel multipliers are rounded to avoid growth in word size. The power dissipation and area of rounded parallel multipliers can be significantly reduced by a technique known as truncated multiplication. With this technique, the least significant columns of the multiplication matrix are not used. Instead, the carries generated by these columns are estimated. This estimate is added to the most significant columns to produce the rounded product. This paper presents the design and implementation of parallel truncated multipliers. Simulations indicate that truncated parallel multipliers dissipate between 29 and 40 percent less power than standard parallel multipliers for operand sizes of 16 and 32 bits.
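One common variant of the idea in this abstract is constant-correction truncation: drop the partial-product bits in the least significant columns and add a fixed estimate of their expected contribution. The sketch below is a behavioural model under that assumption (the paper's exact correction scheme and parameters may differ); for uniform random inputs each partial-product bit is 1 with probability 1/4.

```python
def truncated_multiply(a, b, n=16, k=8):
    """Behavioural model of constant-correction truncated multiplication of
    two unsigned n-bit operands.  Partial-product bits in the k least
    significant columns are omitted; a constant equal to their expected sum
    replaces the carries they would have generated."""
    kept = 0
    for i in range(n):
        for j in range(n):
            if i + j >= k:                       # keep only significant columns
                kept += ((a >> i) & 1) * ((b >> j) & 1) << (i + j)
    # column c (< n) holds c+1 partial-product bits; expected value of each is 1/4
    correction = sum((c + 1) << c for c in range(k)) // 4
    return kept + correction
```

With k = 0 no columns are dropped and the exact product is returned; as k grows, array cells (and their switching power) are removed at the cost of a bounded rounding error, which is the power/accuracy trade the paper quantifies.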
Integrated analysis of power and performance for pipelined microprocessors
 IEEE Transactions on Computers
, 2004
Sub-Nanosecond Arithmetic
, 1991
Abstract

Cited by 21 (1 self)
The Stanford Nanosecond Arithmetic Project (SNAP) is targeted at realizing an arithmetic processor with performance approximately an order of magnitude faster than currently available technology. The realization of SNAP is predicated on an interdisciplinary approach and effort spanning research in algorithms, data representation, CAD, circuits and devices, and packaging. SNAP is visualized as an arithmetic coprocessor implemented on an active substrate containing several chips, each of which realizes a particular arithmetic function. This year's report highlights recent results in the area of wave pipelining. We have fabricated a number of prototype dies implementing a multiplier slice; cycle times below 5 ns were realized.
Automatic synthesis of compressor trees: reevaluating large counters
 in Design Automation and Test in Europe (DATE ’07)
Abstract

Cited by 11 (9 self)
Despite the progress of the last decades in electronic design automation, arithmetic circuits have received far less attention than other classes of digital circuits. Logic synthesisers, which play a fundamental role in design today, play a minor role for most arithmetic circuits, performing some local optimisations but hardly improving the overall structure of arithmetic components. Architectural optimisations have often been studied manually, and only for very common building blocks such as fast adders and multi-input adders have ad-hoc techniques been developed. A notable case is multi-input addition, which is the core of many circuits such as multipliers. The most common technique to implement multi-input addition is compressor trees, which are often composed of carry-save adders (based on (3:2) counters, i.e., full adders). A large body of literature exists on implementing compressor trees using large counters. However, all the large counters were built by using full and half adders recursively. In this paper we give some definite answers to issues related to the use of large counters. We present a general technique to implement large counters whose performance is much better than that of counters composed of full and half adders. We also show that it is not always useful to use larger optimised counters; sometimes a combination of counters of various sizes gives the best performance. Our results show a 15% improvement in critical path delay, and in some cases hardware area is also reduced by using our counters.
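The full-adder-based baseline this paper re-evaluates can be sketched behaviourally: a (3:2) counter turns three bits of one column into a sum bit and a carry for the next column, and a carry-save tree repeatedly applies this bitwise across whole words until only two operands remain. The sketch below models that baseline, not the paper's improved counters.

```python
def counter_3_2(a, b, c):
    """A (3:2) counter, i.e. a full adder: three input bits of one column
    become a sum bit (same column) and a carry bit (next column up)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def carry_save_reduce(operands):
    """Reduce a multi-input addition to two operands using carry-save stages
    built from (3:2) counters applied bitwise across whole words; a single
    carry-propagate addition finishes the sum."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        s = a ^ b ^ c                               # sum bits, all columns at once
        carry = ((a & b) | (a & c) | (b & c)) << 1  # carries shift one column up
        ops += [s, carry]
    return sum(ops)
```

Each carry-save stage has constant delay regardless of word length, which is why the structure of the tree (how many counter levels the operand count forces) dominates the critical path the paper optimises.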
A Compact High-Speed (31,5) Parallel Counter Circuit Based on Capacitive Threshold-Logic Gates
 IEEE Journal of Solid-State Circuits
, 1996
Abstract

Cited by 9 (0 self)
A novel high-speed circuit implementation of the (31,5) parallel counter (i.e., population counter) based on capacitive threshold logic (CTL) is presented. The circuit consists of 20 threshold logic gates arranged in two stages, i.e., the parallel counter described here has an effective logic depth of two. The charge-based CTL gates are essentially dynamic circuits which require a periodic refresh or precharge cycle, but unlike conventional dynamic CMOS gates, the circuit can be operated in synchronous as well as in asynchronous mode. The counter circuit is implemented using conventional 1.2 µm double-poly CMOS technology, and it occupies a silicon area of about 0.08 mm². Extensive post-layout simulations indicate that the circuit has a typical input-to-output propagation delay of less than 3 ns, and the test circuit is shown to operate reliably when consecutive 31-bit input vectors are applied at a rate of up to 16 Mvectors/s. With its demonstrated data processing capability of abou...
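The function computed by such a counter is symmetric (it depends only on the input sum), which is exactly what threshold-logic gates evaluate. A behavioural sketch: each output bit of the 5-bit count can be expressed as a difference-sum of threshold comparisons on the population. This models the logical function only, not the paper's 20-gate, two-stage CTL circuit.

```python
def threshold_gate(n, t):
    """Model of a threshold-logic gate: fires when the input sum reaches t."""
    return 1 if n >= t else 0

def popcount31(bits):
    """Behavioural model of a (31,5) parallel counter: 31 equally weighted
    input bits reduce to a 5-bit population count (LSB first).  Bit j of the
    count is 1 exactly when n mod 2^(j+1) >= 2^j, i.e. inside the windows
    [m*2w + w, m*2w + 2w - 1] with w = 2^j; each window is the difference of
    two threshold-gate outputs."""
    assert len(bits) == 31
    n = sum(bits)
    out = []
    for j in range(5):
        w = 1 << j
        bit = sum(threshold_gate(n, m + w) - threshold_gate(n, m + 2 * w)
                  for m in range(0, 32, 2 * w))
        out.append(bit)
    return out
```

Because every output is a fixed combination of threshold tests on the same sum, the hardware depth can stay constant (here, two gate stages) no matter how many inputs the counter takes.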
Minimizing Energy Dissipation in High-Speed Multipliers
 Proc. IEEE Symp. on Low Power Electronics and Design
, 1997
Abstract

Cited by 9 (0 self)
This paper presents a new two-gate-delay implementation of the Booth encoder and partial product generator, which eliminates the unnecessary glitches associated with the Booth multiplier. In addition, modified signed/unsigned (MSU) and modified sign-generate (MSG) algorithms, especially suitable for signed/unsigned multipliers, were developed in order to reduce the compression level needed in the Wallace tree, and hence reduce the multiplier hardware. Using these features reduces the multiplier array energy dissipation by about 30% and increases speed by about 10%.
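For readers unfamiliar with the encoder being optimised here: radix-4 Booth recoding scans overlapping 3-bit windows of the multiplier and emits digits in {−2, −1, 0, 1, 2}, so an n-bit multiply needs only n/2 partial products (0, ±X or ±2X). A behavioural sketch of the recoding and product assembly (not the paper's glitch-free gate-level design):

```python
def booth_radix4_digits(y, n=16):
    """Radix-4 Booth recoding of the raw n-bit two's-complement pattern y into
    n/2 digits in {-2,-1,0,1,2}: digit i = y[2i-1] + y[2i] - 2*y[2i+1],
    with y[-1] taken as 0."""
    bit = lambda k: (y >> k) & 1 if k >= 0 else 0
    return [bit(2*i - 1) + bit(2*i) - 2 * bit(2*i + 1) for i in range(n // 2)]

def booth_multiply(x, y, n=16):
    """Sum the Booth-selected partial products x * d * 4^i.  The digit set
    already encodes y's two's-complement value, so negative multipliers need
    no separate handling."""
    digits = booth_radix4_digits(y, n)
    return sum(d * x << (2 * i) for i, d in enumerate(digits))
```

Each digit drives one row of the partial-product array; glitches arise when the encoder's select signals settle at different times, which is the hazard the paper's two-gate-delay encoder removes.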
Design and Implementation of a 16 by 16 Low-Power Two's Complement Multiplier
 in Proc. 2000 IEEE Int. Symp. Circuits and Systems
, 2000
Abstract

Cited by 8 (0 self)
This paper describes the design and implementation of a high-speed, low-power 16 by 16 two's complement parallel multiplier. The multiplier uses optimized radix-4 Booth encoders to generate the partial products, and an array of strategically placed (3,2), (5,3), and (7,4) counters to reduce the partial products to sum and carry vectors. The more significant bits of the product are computed from left to right using a modified Ercegovac-Lang converter. An implementation of the multiplier in 0.25 µm static CMOS technology has an area of 0.126 mm², a measured delay of 4.39 ns, and an average power dissipation of 0.110 mW/MHz at 2.5 Volts and 100 °C.
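The (3,2), (5,3) and (7,4) counters mentioned above are all instances of one idea: n equally weighted bits compress into a k-bit binary count. A minimal behavioural model (the placement strategy that makes them effective in the array is the paper's contribution, not modelled here):

```python
def parallel_counter(bits, out_bits):
    """Behavioural model of a generic (n, k) parallel counter: n input bits of
    equal weight reduce to a k-bit binary count, LSB first.  (3,2), (5,3) and
    (7,4) are the instances used in the multiplier's reduction array."""
    n = sum(bits)
    assert n < (1 << out_bits), "count must fit in the output width"
    return [(n >> j) & 1 for j in range(out_bits)]
```

Larger counters swallow more bits per level of the array, trading gate complexity for fewer reduction stages on the critical path.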
Integer multiplication with overflow detection or saturation
 in IEEE Transactions on Computers, Issue 7, July 2000, pp. 681–691