Results 1–10 of 17
Reduced Power Dissipation Through Truncated Multiplication
 in IEEE Alessandro Volta Memorial Workshop on Low Power Design
, 1999
Abstract

Cited by 19 (5 self)
Reducing the power dissipation of parallel multipliers is important in the design of digital signal processing systems. In many of these systems, the products of parallel multipliers are rounded to avoid growth in word size. The power dissipation and area of rounded parallel multipliers can be significantly reduced by a technique known as truncated multiplication. With this technique, the least significant columns of the multiplication matrix are not used. Instead, the carries generated by these columns are estimated. This estimate is added with the most significant columns to produce the rounded product. This paper presents the design and implementation of parallel truncated multipliers. Simulations indicate that truncated parallel multipliers dissipate between 29 and 40 percent less power than standard parallel multipliers for operand sizes of 16 and 32 bits. 1: Introduction High-speed parallel multipliers are fundamental building blocks in digital signal processing systems [1]. In...
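The column-dropping idea in this abstract can be modeled in a few lines. The sketch below is illustrative only: the function name is invented, and the constant correction is the simplest possible carry estimate, not the paper's scheme. It keeps only the partial-product bits in columns n and above and adds a rounding constant in place of the discarded columns:

```python
def truncated_multiply(a, b, n):
    """Rounded n-bit product of two n-bit operands, keeping only the
    most significant columns of the partial-product matrix."""
    total = 0
    for i in range(n):          # bit i of a
        for j in range(n):      # bit j of b
            if i + j >= n and (a >> i) & 1 and (b >> j) & 1:
                total += 1 << (i + j)      # columns n..2n-2 only
    correction = 1 << (n - 1)  # crude constant estimate of the dropped carries
    return (total + correction) >> n       # rounded upper half of the product
```

A full-width multiplier would form all n² partial-product bits and round afterwards; here the n(n+1)/2 least significant bits are never generated, which is where the area and power savings come from, at the cost of a small bounded rounding error.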
Bit-Serial Multipliers and Squarers
 IEEE Transactions on Computers
, 1994
Abstract

Cited by 16 (1 self)
Reprinted from IEEE Transactions on Computers, 43(12):1445–1450, December 1994. Copyright © 1994 by IEEE. Traditional bit-serial multipliers present one or more clock cycles of data latency. In some situations, it is desirable to obtain the output after only a combinational delay, as in serial adders and subtracters. A serial multiplier and a squarer with no latency cycles are presented here. Both accept unsigned or sign-extended two's complement numbers and produce an arbitrarily long output. They are fully modular and thus good candidates for introduction in VLSI libraries. 1 Introduction Bit-serial arithmetic is often used in parallel systems with high connectivity to reduce the wiring down to a reasonable level. When multiplications are required, a typical choice is a serial-parallel multiplier. In this device, one factor is stored in parallel, while the other is entered serially. However, this scheme is not always possible, as, for instance, when both factors are input seriall...
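The serial-parallel multiplier this abstract contrasts itself with can be modeled at cycle level. The sketch below is an assumption-laden illustration (names are invented, and it exhibits exactly the data latency the paper's design avoids): one factor is held in parallel while the other streams in LSB-first, emitting one product bit per clock.

```python
def serial_parallel_multiply(a_bits, b):
    """Cycle-level model of a classic serial-parallel multiplier:
    b is held in parallel, the bits of the other factor arrive
    LSB-first, and one product bit is emitted per clock cycle."""
    acc = 0
    out = []
    for bit in a_bits:        # one clock per serial input bit
        acc += b if bit else 0   # parallel addend, gated by the incoming bit
        out.append(acc & 1)   # serial output: LSB of the accumulator
        acc >>= 1             # shift for the next, more significant, cycle
    return out, acc           # low product bits (serial), high half (parallel)
```

After the input stream ends, the low half of the product has been emitted serially and the accumulator holds the high half; the full product is sum(out[i] << i) + (acc << len(a_bits)).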
A Compact High-Speed (31,5) Parallel Counter Circuit Based on Capacitive Threshold-Logic Gates
 IEEE Journal of Solid-State Circuits
, 1996
Abstract

Cited by 8 (0 self)
A novel high-speed circuit implementation of the (31,5) parallel counter (i.e., population counter) based on capacitive threshold logic (CTL) is presented. The circuit consists of 20 threshold logic gates arranged in two stages, i.e., the parallel counter described here has an effective logic depth of two. The charge-based CTL gates are essentially dynamic circuits which require a periodic refresh or precharge cycle, but unlike conventional dynamic CMOS gates, the circuit can be operated in synchronous as well as in asynchronous mode. The counter circuit is implemented using conventional 1.2 µm double-poly CMOS technology, and it occupies a silicon area of about 0.08 mm². Extensive post-layout simulations indicate that the circuit has a typical input-to-output propagation delay of less than 3 ns, and the test circuit is shown to operate reliably when consecutive 31-bit input vectors are applied at a rate of up to 16 Mvectors/s. With its demonstrated data processing capability of abou...
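A (31,5) counter outputs the population count of 31 bits as a 5-bit number. One way to see how such a counter decomposes into a two-stage threshold circuit, sketched below, is that each output bit is a union of weight windows, and each window is the difference of two threshold gates. This is a generic construction for illustration, not the paper's 20-gate CTL design.

```python
def threshold(bits, t):
    """Unit-weight threshold gate: fires when at least t inputs are 1."""
    return 1 if sum(bits) >= t else 0

def popcount_bit(bits, j):
    """Bit j of the population count as a two-stage threshold circuit:
    it is 1 exactly when the count lies in a window
    [m*2^(j+1) + 2^j, (m+1)*2^(j+1)), and each window is the
    difference of two first-stage threshold gates."""
    n = len(bits)
    result = 0
    m = 0
    while m * 2 ** (j + 1) + 2 ** j <= n:
        lo = m * 2 ** (j + 1) + 2 ** j
        hi = (m + 1) * 2 ** (j + 1)
        result += threshold(bits, lo) - threshold(bits, hi)
        m += 1
    return result
```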
Automatic synthesis of compressor trees: reevaluating large counters
 Design Automation and Test in Europe (DATE ’07)
Abstract

Cited by 8 (7 self)
Despite the progress of the last decades in electronic design automation, arithmetic circuits have always received far less attention than other classes of digital circuits. Logic synthesisers, which play a fundamental role in design today, play a minor role on most arithmetic circuits, performing some local optimisations but hardly improving the overall structure of arithmetic components. Architectural optimisations have often been studied manually, and only for very common building blocks such as fast adders and multi-input adders have ad-hoc techniques been developed. A notable case is multi-input addition, which is at the core of many circuits such as multipliers. The most common technique to implement multi-input addition is compressor trees, which are often composed of carry-save adders (based on (3:2) counters, i.e., full adders). A large body of literature exists on implementing compressor trees using large counters. However, all such large counters were built by using full and half adders recursively. In this paper we give some definite answers to issues related to the use of large counters. We present a general technique to implement large counters whose performance is much better than that of counters composed of full and half adders. We also show that it is not always useful to use larger optimised counters, and sometimes a combination of counters of various sizes gives the best performance. Our results show a 15% improvement in critical path delay. In some cases hardware area is even reduced by using our counters.
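The carry-save reduction that this abstract takes as its starting point can be sketched behaviorally: groups of three addends pass through a column of (3:2) counters (full adders), producing a sum word and a carry word, level by level, until only two words remain. This is a sketch of the classic technique, not the paper's synthesized counters.

```python
def compressor_tree(operands):
    """Carry-save compressor tree built from (3:2) counters (full adders).

    Groups of three addends are reduced to a sum word and a carry word
    until only two words remain; a single carry-propagate addition then
    yields the total."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        while len(ops) >= 3:
            a, b, c = ops.pop(), ops.pop(), ops.pop()
            nxt.append(a ^ b ^ c)                           # per-column sum bits
            nxt.append(((a & b) | (b & c) | (a & c)) << 1)  # carries shift one column left
        ops = nxt + ops                                     # carry leftovers forward
    return sum(ops)   # final carry-propagate addition
```

Each reduction level is carry-free (bitwise logic only); the single slow carry-propagate addition is deferred to the very end, which is why compressor trees dominate multiplier design.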
Design and Implementation of a 16 by 16 Low-Power Two's Complement Multiplier
 in Proc. 2000 IEEE Int. Symp. Circuits and Systems
, 2000
Abstract

Cited by 7 (0 self)
This paper describes the design and implementation of a high-speed, low-power 16 by 16 two's complement parallel multiplier. The multiplier uses optimized radix-4 Booth encoders to generate the partial products, and an array of strategically placed (3,2), (5,3), and (7,4) counters to reduce the partial products to sum and carry vectors. The more significant bits of the product are computed from left to right using a modified Ercegovac-Lang converter. An implementation of the multiplier in 0.25 µm static CMOS technology has an area of 0.126 mm², a measured delay of 4.39 ns, and an average power dissipation of 0.110 mW/MHz at 2.5 Volts and 100 °C.
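The radix-4 Booth recoding step mentioned here can be modeled behaviorally. The sketch below (names invented; it models only the encoding, not the paper's optimized encoder circuit) shows why only n/2 partial products are needed: each overlapping 3-bit window of the multiplier selects a digit in {-2, -1, 0, +1, +2}, all of which are cheap multiples of the multiplicand.

```python
def booth_radix4_partial_products(a, b, n):
    """Radix-4 Booth recoding of an n-bit two's-complement multiplier b
    (n even): each overlapping 3-bit window of b selects a digit in
    {-2, -1, 0, +1, +2}, so only n/2 partial products of the
    multiplicand a are generated instead of n."""
    ext = (b << 1) & ((1 << (n + 1)) - 1)   # implicit 0 appended below the LSB
    products = []
    for i in range(0, n, 2):
        w = (ext >> i) & 0b111              # bits b[i+1], b[i], b[i-1]
        digit = (w & 1) + ((w >> 1) & 1) - 2 * (w >> 2)   # b[i-1] + b[i] - 2*b[i+1]
        products.append((digit * a) << i)
    return products
```

Summing the returned partial products reproduces a times the signed value of b, which is exactly the invariant the counter array of the paper then exploits to reduce them to sum and carry vectors.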
Efficient Hamming Weight Comparators for Binary Vectors Based on Accumulative and Up/Down Parallel Counters
Abstract

Cited by 4 (4 self)
Abstract—New counting-based methods for comparing the Hamming weight of a binary vector with a constant, as well as comparing the Hamming weights of two input vectors, are proposed. It is shown that the proposed comparators are faster and simpler, both in the asymptotic sense and for moderate vector lengths, compared with the best available fully digital designs. These speed and cost advantages result from more efficient population counting, as well as the merger of counting and comparison operations, via accumulative and up/down parallel counters. Index Terms—Column compression, comparator, Hamming distance, multi-operand addition, parallel counter, population count.
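The up/down-counting idea behind merging the count and compare operations can be sketched in a few lines (a behavioral illustration, not the paper's circuit): each 1 in one vector steps a single counter up, each 1 in the other steps it down, and only the sign of the final value is examined, so the two population counts are never formed separately.

```python
def compare_hamming_weights(x_bits, y_bits):
    """Up/down-counter view of Hamming-weight comparison."""
    delta = 0
    for xb, yb in zip(x_bits, y_bits):
        delta += xb - yb                  # up for a 1 in x, down for a 1 in y
    return (delta > 0) - (delta < 0)      # +1: x heavier, 0: equal, -1: y heavier
```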
Depth-Efficient Threshold Circuits for Multiplication and Symmetric Function Computation
 Proc. Int’l Computing and Combinatorics Conf., LNCS
, 1996
Abstract

Cited by 2 (2 self)
The multiplication operation and the computation of symmetric functions are fundamental problems in arithmetic and algebraic computation. We describe unit-weight threshold circuits to perform the multiplication of two n-bit integers, which have fan-in k, edge complexity O(n^{2+1/d}), and depth O(log d + log n / log k), for any fixed integer d > 0. For a given fan-in, our constructions have considerably smaller depth (or edge complexity) than the best previous circuits of similar edge complexity (or depth, respectively). Similar results are also shown for the iterated addition operation and the computation of symmetric functions. In particular, we propose a unit-weight threshold circuit to compute the sum of m n-bit numbers that has fan-in k, edge complexity O(nm^{1+1/d}), and depth O(log d + log m / log k + log n / log k). 1 Introduction The delay required to perform multiplication and iterated addition is crucial to the performance of many computationally intensive applications. T...
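Any symmetric Boolean function is determined by its value at each possible input weight, which is why such functions are natural targets for threshold circuits. The sketch below shows the generic depth-2 construction (first stage: unit-weight threshold gates detecting each weight level; second stage: a ±1-weighted combination), not the depth- and edge-optimized circuits of this paper.

```python
def threshold_gate(bits, t):
    """Unit-weight threshold gate: 1 iff at least t inputs are 1."""
    return 1 if sum(bits) >= t else 0

def symmetric_function(bits, spectrum):
    """Evaluate a symmetric Boolean function given by its 'spectrum'
    (its value at each input weight 0..n) as a depth-2 circuit: the
    first stage detects weight >= w, and the second combines the
    window differences with +/-1 weights."""
    n = len(bits)
    return sum(spectrum[w] * (threshold_gate(bits, w) - threshold_gate(bits, w + 1))
               for w in range(n + 1))
```

For example, majority and parity both fall out of the same construction just by changing the spectrum table.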
Efficient designs for multiinput counters
 Proc. 33rd Asilomar Conf. Signals, Systems, and Computers
, 1999
Abstract

Cited by 1 (1 self)
A multi-input counter, or accumulative parallel counter, represents a true generalization of a sequential counter in that it incorporates the memory feature of an ordinary counter; i.e., it adds the sum of its inputs to a stored value. In this paper, we present efficient designs for simple multi-input counters and their modular versions, which keep the accumulated count modulo an arbitrary constant. 1 Introduction In its simplest form, a counter is a sequential circuit that stores an integer value and can increment (and, in the case of up/down counters, also decrement) it by 1 upon the receipt of a special “enable” or “count” signal. Counters are among the most widely used components in digital systems, with applications in computer systems, communication equipment, scientific instruments, and industrial process control, to name a few. A vast variety of counter designs have been proposed in the literature [3, 6, 12], patented [1, 2, 4], and/or used in practice. Although techniques are available for designing high-speed counters with conventional number representations [5], the speeds that can be achieved are limited by the requirement for carry propagation. Even with advanced designs based on redundant number representations [6] or hierarchical incrementation using small blocks at the right end and increasingly wider blocks toward the left [11, 15], speeds will be limited by the basic switching characteristics of components and, perhaps, by the need for final conversion of the redundant count into a conventional binary number. Even when no conversion is needed, counting the number of 1s among many thousands of bits by feeding them sequentially to a high-speed counter would imply delays well in excess of several microseconds, and such a delay is unacceptable in certain applications. Some form of parallelism in handling the large number of input bits is needed to achieve higher speeds.
In fact, there are indications that the reverse of the process discussed above may be viable in counter design: In...
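The behavior the abstract describes, adding the population count of the inputs to a stored value, optionally modulo an arbitrary constant, can be captured in a small model. This is a functional sketch with invented names, not one of the paper's circuit designs.

```python
class AccumulativeCounter:
    """Behavioral model of a multi-input (accumulative parallel)
    counter: each step adds the number of 1s among the input bits to a
    stored value, optionally modulo an arbitrary constant."""

    def __init__(self, modulus=None):
        self.count = 0
        self.modulus = modulus

    def step(self, input_bits):
        self.count += sum(input_bits)      # add the ones among the inputs
        if self.modulus is not None:
            self.count %= self.modulus     # modular version keeps count mod m
        return self.count
```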
Optimaldepth threshold circuits for multiplication and related problems
 Proc. 33rd Asilomar Conf. Signals, Systems, and Computers
, 1999
Abstract

Cited by 1 (1 self)
Multiplication is one of the most fundamental operations in arithmetic and algebraic computations. In this paper, we present depth-optimal circuits for performing multiplication, multi-operand addition, and symmetric function evaluation with small size and restricted fan-in. In particular, we show that the product of two n-bit numbers can be computed using a unit-weight threshold circuit of fan-in k, depth 3 log_k n + O(log d), and edge complexity O(n^{2+1/d} log(d + 1)), for any integer d > 0. All the circuits proposed in this paper have constant depth when log_k n is a constant and are depth-optimal within small constant factors for any fan-in k.
The amalgam compiler infrastructure
, 2004
Abstract

Cited by 1 (0 self)
To my mother and father, for their love, support, and guidance. ACKNOWLEDGMENTS First and foremost, I would like to thank my adviser, Professor Nick Carter, for his guidance and support during this endeavor. Without his encouragement and ability to help me step back and look at the big picture, this thesis would not have been possible. Next, I would like to thank Derek Gottlieb and Josh Walstrom for the numerous stimulating technical discussions and for the simulation infrastructure used to gather the results for this thesis. To both Lee Baugh and Brian Greskamp, thank you for the innumerable discussions on the intermediate program representation that were invaluable to this thesis. Thanks also to Chi-Wei Wang and Chris Grier for your contributions to the benchmark suite. Many thanks to my brother, Steve, who spent far too many hours reviewing the text of this thesis. I also wish to thank my parents. Your years of support and encouragement have not gone unnoticed. Lastly, to my love, Tara, thank you for your unconditional love and daily support.