Results 1–10 of 28
Optimizing pipelines for power and performance
 in International Symposium on Microarchitecture (MICRO-35), Nov. 2002. Selected as one of the four Best IBM Research Papers in Computer Science, Electrical Engineering and Math published in 2002
, 2002
Abstract

Cited by 41 (3 self)
During the concept phase and definition of next-generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive the optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation-based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.
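The trade-off this abstract describes can be sketched with a toy analytical model: deeper pipelines shorten the clock period but add latch overhead, hazard-induced CPI, and clock/latch power. All constants below (total logic depth, latch overhead, hazard penalty, power growth) are illustrative assumptions, not the paper's fitted parameters, and the BIPS³/W-style objective is one common choice of power-performance metric:

```python
# Toy sweep over pipeline depth illustrating the kind of power/performance
# optimum the paper derives analytically. Constants are assumptions.
TOTAL_LOGIC_FO4 = 180.0   # assumed total logic depth of the pipeline, in FO4
LATCH_FO4 = 3.0           # assumed latch/clock overhead per stage, in FO4
BASE_CPI = 1.0
HAZARD_PENALTY = 0.03     # assumed extra CPI per stage from hazards/flushes

def metrics(stages):
    cycle_fo4 = TOTAL_LOGIC_FO4 / stages + LATCH_FO4   # clock period in FO4
    cpi = BASE_CPI + HAZARD_PENALTY * stages           # deeper pipe -> more stalls
    perf = 1.0 / (cycle_fo4 * cpi)                     # instructions per FO4 of time
    power = 1.0 + 0.15 * stages                        # latch/clock power grows with depth
    return cycle_fo4, perf, perf**3 / power            # BIPS^3/W-style metric

best = max(range(4, 40), key=lambda s: metrics(s)[2])
print(f"optimal depth: {best} stages, cycle = {metrics(best)[0]:.1f} FO4")
```

Sweeping a performance-only objective instead (maximizing `perf`) pushes the optimum toward deeper, faster pipelines, which is exactly the gap between CPI-centric and power-aware definition studies that the abstract highlights.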
The Sum-Absolute-Difference Motion Estimation Accelerator
 In Proceedings of the 24th Euromicro Conference
Abstract

Cited by 23 (15 self)
In this paper we investigate the Sum Absolute Difference (SAD) operation, an operation frequently used by a number of algorithms for digital motion estimation. For this operation, we propose a single vector instruction that can be performed (in hardware) on an entire block of data in parallel. We investigate possible implementations for such an instruction. Assuming a machine cycle comparable to the cycle of a two-cycle multiply, we show that for a block of 16×1 or 16×16, the SAD operation can be performed in 3 or 4 machine cycles, respectively. The proposed implementation operates as follows: first, we determine in parallel which of the operands is the smallest in each pair of operands. Second, we compute the absolute value of the difference of each pair by subtracting the smallest value from the largest, and finally we compute the accumulation. The operations associated with the second and third steps are performed in parallel, resulting in a multiply-accumulate type of operation. Our approach also covers the Mean Absolute Difference (MAD) operation, excluding the shifting (division) operation.
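The three-step evaluation described above (parallel compare, subtract the smaller from the larger, accumulate) can be modeled in software. This sketch is a behavioral reference for the operation, not the proposed vector instruction itself:

```python
def sad(block_a, block_b):
    """Sum of absolute differences, mirroring the paper's three steps:
    (1) compare each operand pair, (2) subtract the smaller from the
    larger, (3) accumulate. In hardware, steps 2-3 are fused like a
    multiply-accumulate; here they run sequentially per element."""
    total = 0
    for a, b in zip(block_a, block_b):
        lo, hi = (a, b) if a <= b else (b, a)  # step 1: find the smaller operand
        total += hi - lo                        # steps 2-3: |a - b|, then accumulate
    return total

print(sad([1, 5, 9], [4, 5, 2]))  # 3 + 0 + 7 = 10
```

In motion estimation, `block_a` would be a 16×1 or 16×16 slice of the current frame and `block_b` the candidate slice of the reference frame; the hardware performs all element comparisons of such a block in parallel.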
Integrated analysis of power and performance for pipelined microprocessors
 IEEE Transactions on Computers
, 2004
Abstract

Cited by 20 (8 self)
been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P.
Reduced Power Dissipation Through Truncated Multiplication
 in IEEE Alessandro Volta Memorial Workshop on Low Power Design
, 1999
Abstract

Cited by 19 (5 self)
Reducing the power dissipation of parallel multipliers is important in the design of digital signal processing systems. In many of these systems, the products of parallel multipliers are rounded to avoid growth in word size. The power dissipation and area of rounded parallel multipliers can be significantly reduced by a technique known as truncated multiplication. With this technique, the least significant columns of the multiplication matrix are not used. Instead, the carries generated by these columns are estimated. This estimate is added with the most significant columns to produce the rounded product. This paper presents the design and implementation of parallel truncated multipliers. Simulations indicate that truncated parallel multipliers dissipate between 29 and 40 percent less power than standard parallel multipliers for operand sizes of 16 and 32 bits.

1: Introduction

High-speed parallel multipliers are fundamental building blocks in digital signal processing systems [1]. In...
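As a rough behavioral sketch of the idea, the snippet below drops the k least significant partial-product columns and adds a fixed correction constant in their place. The constant-correction scheme is one common, simple variant; the paper's actual designs may estimate the dropped carries differently:

```python
def truncated_mul(a, b, n=8, k=4):
    """Toy n-bit truncated multiplication: form the partial-product matrix
    but skip the k least significant columns, then add a constant that
    estimates the average carry contribution of the dropped columns
    (a simple constant-correction scheme, assumed for illustration)."""
    total = 0
    for i in range(n):                            # partial-product rows (bits of b)
        if (b >> i) & 1:
            for j in range(n):                    # partial-product columns (bits of a)
                if (a >> j) & 1 and i + j >= k:   # keep only columns >= k
                    total += 1 << (i + j)
    correction = 1 << (k - 1)                     # estimated carries from dropped columns
    return (total + correction) >> k              # rounded product, k bits shorter

exact = lambda a, b, k=4: (a * b + (1 << (k - 1))) >> k  # full multiply, then round
print(truncated_mul(200, 100), exact(200, 100))           # both 1250 for this input
```

The power and area savings come from never building the adder cells for the dropped columns; the cost is a small bounded rounding error relative to the full rounded product.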
Subnanosecond Arithmetic
, 1991
Abstract

Cited by 18 (1 self)
The Stanford Nanosecond Arithmetic Project (SNAP) is targeted at realizing an arithmetic processor with performance approximately an order of magnitude faster than currently available technology. The realization of SNAP is predicated on an interdisciplinary approach and effort spanning research in algorithms, data representation, CAD, circuits and devices, and packaging. SNAP is visualized as an arithmetic coprocessor implemented on an active substrate containing several chips, each of which realizes a particular arithmetic function. This year's report highlights recent results in the area of wave pipelining. We have fabricated a number of prototype dies implementing a multiplier slice. Cycle times below 5 ns were realized.
A Compact High-Speed (31,5) Parallel Counter Circuit Based on Capacitive Threshold-Logic Gates
 IEEE Journal of Solid-State Circuits
, 1996
Abstract

Cited by 8 (0 self)
A novel high-speed circuit implementation of the (31,5) parallel counter (i.e., population counter) based on capacitive threshold logic (CTL) is presented. The circuit consists of 20 threshold logic gates arranged in two stages, i.e., the parallel counter described here has an effective logic depth of two. The charge-based CTL gates are essentially dynamic circuits which require a periodic refresh or precharge cycle, but unlike conventional dynamic CMOS gates, the circuit can be operated in synchronous as well as in asynchronous mode. The counter circuit is implemented using conventional 1.2 µm double-poly CMOS technology, and it occupies a silicon area of about 0.08 mm². Extensive post-layout simulations indicate that the circuit has a typical input-to-output propagation delay of less than 3 ns, and the test circuit is shown to operate reliably when consecutive 31-bit input vectors are applied at a rate of up to 16 Mvectors/s. With its demonstrated data processing capability of abou...
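Functionally, a (31,5) parallel counter simply emits the 5-bit population count of its 31 inputs; the CTL circuit achieves this in two gate levels, while the sketch below only models the input/output behavior:

```python
def popcount31(x):
    """Software model of a (31,5) parallel counter: count the ones among
    31 input bits and emit the result on 5 output bits. The CTL circuit
    computes this with an effective logic depth of two; this model just
    counts bits to define the expected function."""
    assert 0 <= x < (1 << 31), "input must fit in 31 bits"
    count = sum((x >> i) & 1 for i in range(31))
    return count & 0x1F  # 5 output bits suffice: the maximum count is 31

print(popcount31((1 << 31) - 1))  # all 31 inputs high -> 31
```

Such wide population counters matter in practice as building blocks for partial-product reduction in multipliers, where a column of 31 bits can be compressed in one counter delay.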
Automatic synthesis of compressor trees: reevaluating large counters
 Design Automation and Test in Europe (DATE ’07)
Abstract

Cited by 8 (7 self)
Despite the progress of the last decades in electronic design automation, arithmetic circuits have always received far less attention than other classes of digital circuits. Logic synthesisers, which play a fundamental role in design today, play only a minor role on most arithmetic circuits, performing some local optimisations but hardly improving the overall structure of arithmetic components. Architectural optimisations have often been studied manually, and only for very common building blocks, such as fast adders and multi-input adders, have ad-hoc techniques been developed. A notable case is multi-input addition, which is the core of many circuits such as multipliers. The most common technique to implement multi-input addition is using compressor trees, which are often composed of carry-save adders (based on (3:2) counters, i.e., full adders). A large body of literature exists on implementing compressor trees using large counters. However, all these large counters were built by using full and half adders recursively. In this paper we give some definite answers to issues related to the use of large counters. We present a general technique to implement large counters whose performance is much better than that of counters composed of full and half adders. We also show that it is not always useful to use larger optimised counters; sometimes a combination of counters of various sizes gives the best performance. Our results show a 15% improvement in critical path delay. In some cases, hardware area is even reduced by using our counters.
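The carry-save compression the abstract refers to can be sketched as follows: a (3:2) counter row compresses three addends into a sum word and a carry word with no carry propagation, and a compressor tree applies this repeatedly until only two operands remain. This is a generic Wallace-style reduction for illustration, not the paper's optimized large counters:

```python
def csa(a, b, c):
    """(3:2) counter (full-adder row) applied to whole words: compress
    three addends into a sum word and a carry word. No carries propagate
    across bit positions, so the delay is one full-adder regardless of width."""
    return a ^ b ^ c, (a & b | a & c | b & c) << 1

def multi_add(operands):
    """Wallace-style compressor tree: repeatedly (3:2)-compress until two
    operands remain, then perform the single carry-propagate addition."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = csa(ops.pop(), ops.pop(), ops.pop())
        ops += [s, c]
    return sum(ops)  # final carry-propagate adder

print(multi_add([13, 7, 5, 9, 2]))  # 36
```

The paper's point is that replacing these recursive full/half-adder layers with directly synthesized larger counters can shorten the tree's critical path, which this behavioral model does not capture.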
(7,2) Counters and Multiplication with Threshold Logic
 IEEE 30th Asilomar Conference on Signals, Systems and Computers
, 1996
Abstract

Cited by 7 (6 self)
In this paper we propose new, threshold-logic based, (7,2) counters. In particular, we show that (7,2) counters can be implemented with threshold logic gates in three levels of gates with explicit computation of the outputs. We then improve the delay by showing that (7,2) counters can be designed with two levels of gates and implicit computation of the sum. Further, we investigate multiplication schemes using such counters, in combination with Kautz's networks for symmetric Boolean functions. Using a 32×32 direct multiplication scheme based on (7,2) implicit-output-computation counters and Kautz's networks, we show that our scheme outperforms known proposals for multiplication using threshold logic in terms of area requirements.
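A threshold-logic gate outputs 1 when a weighted sum of its inputs reaches a threshold, and the output bits of a bit-counting counter are symmetric functions that such gates express compactly. The sketch below shows one illustrative case, the weight-4 bit of the population count of seven inputs; it is a generic example of the gate model, not the paper's two- or three-level counter design:

```python
def tl_gate(inputs, weights, threshold):
    """Threshold-logic gate: output 1 iff the weighted sum of the inputs
    reaches the threshold. This is the basic gate model underlying the
    counters described in the abstract."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def count7_bit2(bits7):
    """Weight-4 bit of the 3-bit population count of seven input bits:
    it is 1 iff at least 4 of the 7 inputs are 1, i.e., a single
    threshold gate with unit weights and threshold 4."""
    return tl_gate(bits7, [1] * 7, 4)

print(count7_bit2([1, 1, 0, 1, 1, 0, 0]))  # four ones -> 1
```

Because every output bit of a bit-count is symmetric in its inputs, small networks of such gates (the Kautz construction mentioned above is one) realize counters with very few logic levels.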
Integer Multiplication with Overflow Detection or Saturation
 IEEE Transactions on Computers
, 2000
Abstract

Cited by 7 (2 self)
High-speed multiplication is frequently used in general-purpose and application-specific computer systems. These systems often support integer multiplication, where two n-bit integers are multiplied to produce a 2n-bit product. To prevent growth in word length, processors typically return the n least significant bits of the product and a flag that indicates whether or not overflow has occurred. Alternatively, some processors saturate results that overflow to the most positive or most negative representable number. This paper presents efficient methods for performing unsigned or two's complement integer multiplication with overflow detection or saturation. These methods have significantly less area and delay than conventional methods for integer multiplication with overflow detection or saturation.
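Behaviorally, the operations described above amount to the following. Note that this model computes the full 2n-bit product and compares it against the representable range, which is exactly the naive approach whose area and delay the paper's methods improve upon:

```python
def mul_sat_signed(a, b, n=16):
    """n-bit two's complement multiply with overflow detection and
    saturation: compute the full product, flag overflow when it does not
    fit in n bits, and also return the value clamped to the representable
    range. A behavioral reference, not the paper's efficient hardware."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1   # representable range
    p = a * b
    overflow = not (lo <= p <= hi)                 # would the result wrap?
    saturated = max(lo, min(hi, p))                # clamp to most pos/neg value
    return p & ((1 << n) - 1), overflow, saturated  # (low n bits, flag, clamped)

print(mul_sat_signed(300, 300))  # (24464, True, 32767)
```

A processor would return either the truncated low bits plus the overflow flag, or the saturated value, depending on the instruction; both behaviors fall out of the same overflow test.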
Design and Implementation of a 16 by 16 Low-Power Two's Complement Multiplier
 in Proc. 2000 IEEE Int. Symp. Circuits and Systems
, 2000
Abstract

Cited by 7 (0 self)
This paper describes the design and implementation of a high-speed, low-power 16 by 16 two's complement parallel multiplier. The multiplier uses optimized radix-4 Booth encoders to generate the partial products, and an array of strategically placed (3,2), (5,3), and (7,4) counters to reduce the partial products to sum and carry vectors. The more significant bits of the product are computed from left to right using a modified Ercegovac-Lang converter. An implementation of the multiplier in 0.25 µm static CMOS technology has an area of 0.126 mm², a measured delay of 4.39 ns, and an average power dissipation of 0.110 mW/MHz at 2.5 V and 100 °C.
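The radix-4 Booth recoding used by the encoders can be modeled behaviorally: each overlapping 3-bit window of the multiplier yields a digit in {-2, -1, 0, 1, 2}, roughly halving the number of partial products compared with bit-at-a-time multiplication. This is a generic recoding sketch, not the paper's optimized encoder circuit:

```python
def booth4_digits(b, n=16):
    """Radix-4 Booth recoding of an n-bit two's complement multiplier.
    Each digit is -2*b[i+1] + b[i] + b[i-1] over overlapping 3-bit
    windows (with an implicit b[-1] = 0), giving digits in {-2..2}."""
    b &= (1 << n) - 1
    digits, prev = [], 0          # prev holds b[i-1]
    for i in range(0, n, 2):
        b_i = (b >> i) & 1
        b_i1 = (b >> (i + 1)) & 1
        digits.append(-2 * b_i1 + b_i + prev)
        prev = b_i1
    return digits                 # n/2 digits instead of n partial products

def booth4_mul(a, b, n=16):
    """Multiply by summing digit * a * 4^k partial products."""
    return sum(d * a * 4**k for k, d in enumerate(booth4_digits(b, n)))

print(booth4_mul(123, 45))  # 5535
```

In the hardware described above, these n/2 partial products are what the (3,2), (5,3), and (7,4) counter array then reduces to the final sum and carry vectors.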