Floating Point Unit Generation and Evaluation for FPGAs
- Proc. IEEE Symp. on FPGAs for Custom Computing Machines
, 2003
Cited by 22 (3 self)
With gate counts approaching ten million gates, FPGAs are quickly becoming suitable for major floating point computations. However, to date, few comprehensive tools to allow for floating point unit tradeoffs have been developed. Most commercial or academic floating point libraries provide only a small number of floating point modules with fixed parameters of bit-width, area, and speed. With this limitation, user designs must be modified to meet the available units. The balance between FPGA floating point unit resources
A Case Study in Formal Verification of Register-Transfer Logic with ACL2: The Floating Point Adder of the AMD Athlon
Cited by 19 (2 self)
As an alternative to commercial hardware description languages, AMD has developed an RTL language for microprocessor designs that is simple enough to admit a clear semantic definition, providing a basis for formal verification. We describe a mechanical proof system for designs represented in this language, consisting of a translator to the ACL2 logical programming language and a methodology for verifying properties of the resulting programs using the ACL2 prover. As an illustration, we present a proof of IEEE compliance of the floating-point adder of the AMD Athlon processor. 1 Introduction The formal hardware verification effort at AMD has emphasized theorem proving using ACL2 [3], and has focused on the elementary floating-point operations. One of the challenges of our earlier work was to construct accurate formal models of the targeted circuit designs. These included the division and square root operations of the AMD-K5 processor [4, 6], which were implemented in microcode, a...
Achieving Typical Delays in Synchronous Systems via Timing Error Toleration
, 2000
Cited by 15 (3 self)
This paper introduces a hardware method of improving the performance of any synchronous digital system. We exploit the well-known observation that typical delays in synchronous systems are much less than the worst-case delays usually designed to, typically by factors of two or three or more. Our proposed family of hardware solutions employs timing error toleration (TIMERRTOL) to take advantage of this characteristic. Briefly, TIMERRTOL works by operating the system at speeds corresponding to typical delays, detecting when timing errors occur, and then allocating more time for the signals to settle to their correct values. The reference paths in the circuitry operate at lower speeds so as to always exhibit correct values (worst-case delays). The nominal speedups of the solutions are the same as the ratio of worst-case to typical delays for the application system. The increases in cost and power dissipation are reasonable. We present the basic designs for a family of three solutions, and...
Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform
- EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems
, 2002
Cited by 10 (3 self)
To enable floating-point (FP) signal processing applications in low-power mobile devices, we propose lightweight floating-point arithmetic. It offers a wider range of precision/power/speed/area tradeoffs, but is wrapped in forms that hide the complexity of the underlying implementations from both multimedia software designers and hardware designers. Libraries implemented in C and Verilog provide flexible and robust floating-point units with variable bit-width formats, multiple rounding modes and other features. This solution bridges the design gap between software and hardware, and accelerates the design cycle from algorithm to chip by avoiding the translation to fixed-point arithmetic. We demonstrate the effectiveness of the proposed scheme using the inverse discrete cosine transform (IDCT), in the context of video coding, as an example. Further, we implement lightweight floating-point IDCT into hardware and demonstrate the power and area reduction.
A comparison of three rounding algorithms for IEEE floating-point multiplication
, 1998
Cited by 9 (2 self)
A new IEEE compliant floating-point rounding algorithm for computing the rounded product from a carry-save representation of the product is presented. The new rounding algorithm is compared with the rounding algorithms of Yu and Zyner [23] and of Quach et al. [18]. For each rounding algorithm, a logical description and a block diagram are given and the latency is analyzed. We conclude that the new rounding algorithm is the fastest rounding algorithm, provided that an injection (which depends only on the rounding mode and the sign) can be added in during the reduction of the partial products into a carry-save encoded digit string. In double precision the latency of the new rounding algorithm is 12 logic levels compared to 14 logic levels in the algorithm of Quach et al., and 16 logic levels in the algorithm of Yu and Zyner. 1. Introduction Every modern microprocessor includes a floating-point (FP) multiplier that complies with the IEEE 754 Standard [9]. The latency of the FP multiplier...
Uniprocessor Performance Enhancement Through Adaptive Clock Frequency Control
, 2003
Cited by 8 (3 self)
This paper proposes a Timing Error Avoidance technique (TEAtime) to realize typical delays using standard synchronous design methodologies. The extra cost is very small, while the performance gains are substantial. The technique is applicable to any synchronous digital system. Correct results are ensured if the design guidelines are followed. Neither the base cycle time nor the cycle count is affected by TEAtime. It is also easy to modify current designs to take advantage of TEAtime. In order to demonstrate TEAtime's capabilities and correct operation, we implemented a simple CPU and memory on a Xilinx FPGA (Field Programmable Gate Array) and ran it under various operating conditions. Over a wide range of temperatures TEAtime demonstrated performance improvements of about 34% over the baseline machine's worst-case specified performance. TEAtime adapted automatically to changing conditions, always stabilizing to a steady operating clock frequency. The remainder of this paper is organized as follows. Related work is reviewed in Section II. In Section III the basic ideas of timing error avoidance are presented, using our test CPU as a case study. Our experimental methodology is described in Section IV, with the experimental results presented in Section V. We conclude in Section VI. II. RELATED WORK There has been prior work somewhat similar to ours, but nothing that encompasses all of the attributes of our technique, not to mention actually demonstrating its functioning and characteristics with a real prototype. The closest work we are aware of is [10]. In this work a microcontroller has been modified so that it can self-tune its clock for "maximum" frequency. It does this by periodically pausing computation for up to 68 cycle...
An IEEE compliant floating-point adder that conforms with the pipelined packet-forwarding paradigm
, 2000
Cited by 5 (0 self)
This paper presents a floating point addition algorithm and adder pipeline design employing a packet forwarding pipeline paradigm. The packet forwarding format and the proposed algorithms constitute a new paradigm for handling data hazards in deeply pipelined floating point pipelines. The addition and rounding algorithms employ a four stage execution phase pipeline with each stage suitable for implementation in a short clock period, assuming about fifteen logic levels per cycle. The first two cycles are related to addition proper and are the focus of this paper. The last two cycles perform the rounding and have been covered in a paper by Nielsen and Matula [8]. The addition algorithm accepts one operand in a standard binary floating point format at the start of cycle one. The second operand is represented in the packet forwarding floating point format, namely, it is divided into four parts: the sign bit, the exponent string, the principal part of the significand, and the carry-round packet. T...
Leading-One Prediction Scheme For Latency Improvement In Single Datapath Floating-Point Adders
- Proc. Int. Conf. on Computer Design
, 1998
Cited by 3 (2 self)
This paper describes the design of a Leading-one Predictor (LOP) for floating-point addition, with an exact determination of the shift amount required. Leading-one prediction is a technique that reduces the latency of the operation by determining the position of the leading one in the adder output in parallel with the actual addition. Previous LOP proposals produce a shift amount which might be in error by one position, so that this error has to be corrected after the addition terminates, increasing the critical path. Our design incorporates a concurrent detection of this error so that the amount of shift is corrected before the actual shift, without increasing the latency. The scheme presented here is applicable to the common case of a single datapath floating-point addition in which the output of the adder is always positive. This latter property simplifies both the adder and the LOP but requires a comparison of the magnitudes before the addition. The scheme has been extended to...
Floating-Point Unit in standard cell design with 116 bit wide dataflow
- Proceedings of the 14th IEEE Symposium on Computer Arithmetic
, 1999
Cited by 2 (0 self)
The floating-point unit of a S/390 CMOS microprocessor is described. It contains a 116 bit fraction dataflow for addition and subtraction and a 64 bit-wide multiplier. Besides the register array, there are no other dataflow macros used; it is fully designed with standard cell books and is placed flat with a timing driven placement algorithm. This design method allows more 'irregular' structures than usually found in custom designs. An overview of the floating-point unit is given and some interesting design items are shown: a 120 bit-wide truecomplement adder with precounting of leading zero digits, a signed multiplier with bit-optimized Wallace tree, intensive forwarding in source equal target cases and the checking method. 1. Introduction This paper describes the floating-point unit of a high performance microprocessor, where the microprocessor is optimized for commercial workloads. There are two slightly different design points described herein. The first is the floating-point un...
Time and Area Optimization in Processor Architecture
- In Proceedings of ARCS'97
, 1997
Cited by 1 (1 self)
For specified program behavior and clocking overhead, there is an optimum cycle time. This can be improved somewhat by using wave pipelining, but program unpredictability ultimately limits performance by restricting both cycle time and instruction level parallelism. Algorithm and application implementation should be based on understanding of program behavior, CAD tools, and technology. System on a chip can be realized as die potential increases. This system die then consists of collecting a variety of functional implementations and chip. These include core processor, floating point unit, signal processors, cache, message compression and encryption, etc. Functional implementations involve selecting particular algorithms so that total application execution time is minimized under the constraints of fixed die area. Underlying all improvements in processor architecture are fundamental notions of the optimum use of time and space. In silicon CMOS technologies, the notion of optimum cost-- p...

