Results 1–10 of 15
Floating Point Unit Generation and Evaluation for FPGAs
Proc. IEEE Symp. on FPGAs for Custom Computing Machines, 2003
Abstract

Cited by 26 (3 self)
With gate counts approaching ten million gates, FPGAs are quickly becoming suitable for major floating point computations. However, to date, few comprehensive tools to allow for floating point unit tradeoffs have been developed. Most commercial or academic floating point libraries provide only a small number of floating point modules with fixed parameters of bitwidth, area, and speed. With this limitation, user designs must be modified to match the available units. The balance between FPGA floating point unit resources...
A Case Study in Formal Verification of Register-Transfer Logic with ACL2: The Floating Point Adder of the AMD Athlon
Abstract

Cited by 21 (2 self)
As an alternative to commercial hardware description languages, AMD has developed an RTL language for microprocessor designs that is simple enough to admit a clear semantic definition, providing a basis for formal verification. We describe a mechanical proof system for designs represented in this language, consisting of a translator to the ACL2 logical programming language and a methodology for verifying properties of the resulting programs using the ACL2 prover. As an illustration, we present a proof of IEEE compliance of the floating-point adder of the AMD Athlon processor.

1 Introduction

The formal hardware verification effort at AMD has emphasized theorem proving using ACL2 [3], and has focused on the elementary floating-point operations. One of the challenges of our earlier work was to construct accurate formal models of the targeted circuit designs. These included the division and square root operations of the AMD-K5 processor [4, 6], which were implemented in microcode, a...
Achieving Typical Delays in Synchronous Systems via Timing Error Toleration
2000
Abstract

Cited by 20 (6 self)
This paper introduces a hardware method of improving the performance of any synchronous digital system. We exploit the well-known observation that typical delays in synchronous systems are much less than the worst-case delays usually designed to, typically by factors of two or three or more. Our proposed family of hardware solutions employs timing error toleration (TIMERRTOL) to take advantage of this characteristic. Briefly, TIMERRTOL works by operating the system at speeds corresponding to typical delays, detecting when timing errors occur, and then allocating more time for the signals to settle to their correct values. The reference paths in the circuitry operate at lower speeds so as to always exhibit correct values (worst-case delays). The nominal speedups of the solutions are the same as the ratio of worst-case to typical delays for the application system. The increases in cost and power dissipation are reasonable. We present the basic designs for a family of three solutions, and...
Uniprocessor Performance Enhancement Through Adaptive Clock Frequency Control
2003
Abstract

Cited by 15 (6 self)
This paper proposes a Timing Error Avoidance technique (TEAtime) to realize typical delays using standard synchronous design methodologies. The extra cost is very small, while the performance gains are substantial. The technique is applicable to any synchronous digital system. Correct results are ensured if the design guidelines are followed. Neither the base cycle time nor the cycle count is affected by TEAtime. It is also easy to modify current designs to take advantage of TEAtime. In order to demonstrate TEAtime's capabilities and correct operation, we implemented a simple CPU and memory on a Xilinx FPGA (Field Programmable Gate Array) and ran it under various operating conditions. Over a wide range of temperatures TEAtime demonstrated performance improvements of about 34% over the baseline machine's worst-case specified performance. TEAtime adapted automatically to changing conditions, always stabilizing to a steady operating clock frequency.

The remainder of this paper is organized as follows. Related work is reviewed in Section II. In Section III the basic ideas of timing error avoidance are presented, using our test CPU as a case study. Our experimental methodology is described in Section IV, with the experimental results presented in Section V. We conclude in Section VI.

II. RELATED WORK

There has been prior work somewhat similar to ours, but nothing that encompasses all of the attributes of our technique, not to mention actually demonstrating its functioning and characteristics with a real prototype. The closest work we are aware of is [10]. In this work a microcontroller has been modified so that it can self-tune its clock for "maximum" frequency. It does this by periodically pausing computation for up to 68 cycle...
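The feedback loop this abstract describes can be illustrated with a toy software model. Everything here is a simplifying assumption for illustration: the real TEAtime prototype uses a hardware tracking-logic replica of the critical path, not a numeric delay model, and the 8 ns delay and 1 MHz step are invented.

```python
def adapt_clock(freq_mhz: float, tracker_ok, step: float = 1.0) -> float:
    """One control step of the adaptive loop: speed up while the tracking
    logic still settles within a clock cycle, back off when it does not."""
    return freq_mhz + step if tracker_ok(freq_mhz) else freq_mhz - step

# Toy run: model a critical path that settles in 8 ns, so the clock
# should hover just below 125 MHz (the frequency whose period is 8 ns).
tracker_ok = lambda f: 1e3 / f > 8.0   # does the path beat the period?
freq = 100.0
for _ in range(50):
    freq = adapt_clock(freq, tracker_ok)
```

In this toy run the frequency climbs and then oscillates within one step of the modeled limit, mirroring the paper's observation that the prototype stabilizes to a steady operating clock frequency.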
Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform
EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems, 2002
Abstract

Cited by 11 (3 self)
To enable floating-point (FP) signal processing applications in low-power mobile devices, we propose lightweight floating-point arithmetic. It offers a wider range of precision/power/speed/area tradeoffs, but is wrapped in forms that hide the complexity of the underlying implementations from both multimedia software designers and hardware designers. Libraries implemented in C and Verilog provide flexible and robust floating-point units with variable bitwidth formats, multiple rounding modes, and other features. This solution bridges the design gap between software and hardware, and accelerates the design cycle from algorithm to chip by avoiding the translation to fixed-point arithmetic. We demonstrate the effectiveness of the proposed scheme using the inverse discrete cosine transform (IDCT), in the context of video coding, as an example. Further, we implement the lightweight floating-point IDCT in hardware and demonstrate the power and area reduction.
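The variable-bitwidth idea can be sketched in software. The helper below is a minimal sketch under assumptions of our own (round-to-nearest on the significand, no subnormals, flush-to-zero underflow); it is not the paper's library, but it shows the kind of precision knob such a library exposes.

```python
import math

def quantize(x: float, exp_bits: int = 8, frac_bits: int = 16) -> float:
    """Round x to a reduced floating-point format with exp_bits of
    exponent and frac_bits of significand fraction (round-to-nearest,
    no subnormals: small values flush to zero, large ones saturate)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    bias = (1 << (exp_bits - 1)) - 1
    if e - 1 < -bias:
        return math.copysign(0.0, x)       # underflow: flush to zero
    if e - 1 > bias:
        return math.copysign(math.inf, x)  # overflow: saturate
    scaled = round(m * (1 << frac_bits))   # keep frac_bits of the significand
    return math.ldexp(scaled, e - frac_bits)
```

Sweeping `frac_bits` over an algorithm such as the IDCT is one way to explore the precision/area tradeoff the abstract refers to before committing to a hardware format.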
A comparison of three rounding algorithms for IEEE floating-point multiplication
1998
Abstract

Cited by 10 (2 self)
A new IEEE compliant floating-point rounding algorithm for computing the rounded product from a carry-save representation of the product is presented. The new rounding algorithm is compared with the rounding algorithms of Yu and Zyner [23] and of Quach et al. [18]. For each rounding algorithm, a logical description and a block diagram are given and the latency is analyzed. We conclude that the new rounding algorithm is the fastest rounding algorithm, provided that an injection (which depends only on the rounding mode and the sign) can be added in during the reduction of the partial products into a carry-save encoded digit string. In double precision the latency of the new rounding algorithm is 12 logic levels, compared to 14 logic levels in the algorithm of Quach et al. and 16 logic levels in the algorithm of Yu and Zyner.

1. Introduction

Every modern microprocessor includes a floating-point (FP) multiplier that complies with the IEEE 754 Standard [9]. The latency of the FP multiplier...
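The injection idea can be illustrated on plain integers: adding a precomputed constant folds round-to-nearest into the addition itself, leaving only a tie fix after truncation. This is a simplified sketch; the paper adds the injection during partial-product reduction and handles all IEEE rounding modes and the carry-save form, none of which appears here.

```python
def round_nearest_even(p: int, k: int) -> int:
    """Drop the low k bits of the non-negative integer p with
    round-to-nearest-even, folding the rounding into a single add
    of a precomputed injection constant."""
    inj = 1 << (k - 1)             # injection: half an ulp of the result
    q = (p + inj) >> k             # one addition, then truncation
    if p & ((1 << k) - 1) == inj:  # exact tie: force the result even
        q &= ~1
    return q
```

For example, dropping one bit of 5 (2.5) and of 7 (3.5) gives 2 and 4 respectively, the two tie cases resolving to the even neighbor.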
An IEEE compliant floating-point adder that conforms with the pipelined packet-forwarding paradigm
2000
Abstract

Cited by 6 (0 self)
This paper presents a floating-point addition algorithm and adder pipeline design employing a packet forwarding pipeline paradigm. The packet forwarding format and the proposed algorithms constitute a new paradigm for handling data hazards in deeply pipelined floating-point pipelines. The addition and rounding algorithms employ a four-stage execution phase pipeline, with each stage suitable for implementation in a short clock period, assuming about fifteen logic levels per cycle. The first two cycles are related to the addition proper and are the focus of this paper. The last two cycles perform the rounding and have been covered in a paper by Nielsen and Matula [8]. The addition algorithm accepts one operand in a standard binary floating-point format at the start of cycle one. The second operand is represented in the packet forwarding floating-point format; namely, it is divided into four parts: the sign bit, the exponent string, the principal part of the significand, and the carry-round packet. T...
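The four-part operand split named in the abstract can be captured as a small data structure. The field widths and the `significand()` resolution step below are illustrative assumptions, not the paper's actual encoding, which the abstract does not specify in full.

```python
from dataclasses import dataclass

@dataclass
class ForwardedOperand:
    """The four packets of the packet-forwarding format named above.
    Widths and the resolution step are guesses for illustration only."""
    sign: int           # sign bit
    exponent: int       # exponent string
    principal_sig: int  # principal part of the significand
    carry_round: int    # late carry-round packet

    def significand(self) -> int:
        # Resolve the low-order bits once the carry-round packet arrives.
        return self.principal_sig + self.carry_round
```

The point of the split is that a consumer can begin working on the principal part before the carry-round packet of a forwarded result is known, which is how the format tolerates data hazards in a deep pipeline.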
Leading-One Prediction Scheme For Latency Improvement In Single Datapath Floating-Point Adders
Proc. Int. Conf. on Computer Design, 1998
Abstract

Cited by 3 (2 self)
This paper describes the design of a Leading-One Predictor (LOP) for floating-point addition, with an exact determination of the shift amount required. Leading-one prediction is a technique that reduces the latency of the operation by determining the position of the leading one in the adder output in parallel with the actual addition. Previous LOP proposals produce a shift amount which might be in error by one position, so this error has to be corrected after the addition terminates, increasing the critical path. Our design incorporates a concurrent detection of this error so that the amount of shift is corrected before the actual shift, without increasing the latency. The scheme presented here is applicable to the common case of a single datapath floating-point addition in which the output of the adder is always positive. This latter property simplifies both the adder and the LOP, but requires a comparison of the magnitudes before the addition. The scheme has been extended to...
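For reference, the normalization step that a LOP accelerates looks like this in software. This is a sequential sketch only: the hardware predictor derives the same shift amount from the operands in parallel with the addition, and the paper's contribution is detecting its possible off-by-one error before the shift rather than after.

```python
def leading_one_position(x: int, width: int) -> int:
    """Bit distance from the MSB to the leading one of a width-bit
    value (returns width if x is zero)."""
    for i in range(width):
        if x & (1 << (width - 1 - i)):
            return i
    return width

def normalize(sig: int, width: int):
    """Left-shift so the leading one lands in the MSB, returning the
    normalized significand and the shift amount."""
    shift = leading_one_position(sig, width)
    if shift == width:
        return 0, shift
    return (sig << shift) & ((1 << width) - 1), shift
```

Doing this search only after the adder produces its result would put it on the critical path, which is exactly what leading-one prediction avoids.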
An operand-optimized asynchronous IEEE 754 double-precision floating-point adder
Proc. IEEE International Symposium on Asynchronous Circuits and Systems, 2010
Abstract

Cited by 3 (3 self)
We present the design and implementation of an asynchronous high-performance IEEE 754 compliant double-precision floating-point adder (FPA). We provide a detailed breakdown of the power consumption of the FPA datapath, and use it to motivate a number of different data-dependent optimizations for energy efficiency. Our baseline asynchronous FPA has a throughput of 2.15 GHz while consuming 69.3 pJ per operation in a 65nm bulk process. For the same set of non-zero operands, our optimizations improve the FPA's energy efficiency to 30.2 pJ per operation while preserving average throughput, a 56.7% reduction in energy relative to the baseline design. To our knowledge, this is the first detailed design of a high-performance asynchronous double-precision floating-point adder.

Keywords: floating-point arithmetic; asynchronous logic circuits; very-large-scale integration; pipeline processing.
Reduced Latency IEEE Floating-Point Standard Adder Architectures
Abstract

Cited by 3 (0 self)
The design and implementation of a double precision floating-point IEEE-754 standard adder is described which uses “flagged prefix addition” to merge rounding with the significand addition. The floating-point adder is implemented in 0.5 µm CMOS, measures 1.8 mm², has a 3-cycle latency, and implements all rounding modes. A modified version of this floating-point adder can perform accumulation in 2 cycles with a small amount of extra hardware, for use in a parallel processor node. This is achieved by feeding back the previous unnormalised but correctly rounded result together with the normalisation distance. A 2-cycle latency floating-point adder architecture with potentially the same cycle time, which also employs flagged prefix addition, is described. It also incorporates a fast prediction scheme for the true subtraction of significands with an exponent difference of 1, with one less adder.

Key Words: floating-point, adder, arithmetic, VLSI.
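The core trick behind merging rounding into the addition can be shown with integers: producing both the sum and the incremented sum lets the rounding decision become a late select instead of a second carry-propagate addition. This is only a software analogue; in real flagged prefix addition the incremented sum falls out of flag bits in the prefix network, not a second adder as the straightforward code below suggests.

```python
def sum_and_sum_plus_one(a: int, b: int, width: int):
    """Produce a+b and a+b+1 together (mod 2**width). In hardware the
    second value comes from flag bits added to the prefix network,
    so no extra carry-propagate adder is needed."""
    mask = (1 << width) - 1
    return (a + b) & mask, (a + b + 1) & mask

def rounded_sum(a: int, b: int, round_up: bool, width: int = 53) -> int:
    """Late-select the rounded significand sum from the two candidates."""
    s, s1 = sum_and_sum_plus_one(a, b, width)
    return s1 if round_up else s
```

Because the rounding increment is resolved by a multiplexer rather than another addition, the round step disappears from the latency-critical carry chain, which is what enables the 3-cycle (and proposed 2-cycle) latencies above.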