Results 1–10 of 15
An Analysis Of Division Algorithms And Implementations
IEEE Transactions on Computers, 1995
Cited by 55 (8 self)

Abstract:
Floating-point division is generally regarded as a low-frequency, high-latency operation in typical floating-point applications. However, the increasing emphasis on high-performance graphics and the industry-wide usage of performance benchmarks force processor designers to pay close attention to all aspects of floating-point computation. Many algorithms are suitable for implementing division in hardware. This paper presents four major classes of algorithms in a unified framework: digit recurrence, functional iteration, very high radix, and variable latency. Digit recurrence algorithms, the most common of which is SRT, use subtraction as the fundamental operator, and they converge to a quotient linearly. Division by functional iteration converges to a quotient quadratically using multiplication. Very high radix division algorithms are similar to digit recurrence algorithms, but they incorporate multiplication to reduce the latency. Variable latency division algorithms reduce the...
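The functional-iteration class mentioned above can be illustrated with the classic Newton-Raphson reciprocal iteration, whose error roughly squares each step. The sketch below is an illustrative software model under the assumption that the divisor has been scaled into [1, 2) (as a hardware divider would do with a floating-point significand); the seed constant and function names are hypothetical, not taken from the paper.

```python
def newton_reciprocal(d, iters=5):
    """Approximate 1/d by functional iteration (Newton-Raphson).

    Each step x_{k+1} = x_k * (2 - d * x_k) roughly doubles the number
    of correct bits, i.e. convergence is quadratic. Assumes d has been
    scaled into [1, 2).
    """
    x = 1.0 / 1.5  # crude seed; hardware would use a small lookup table
    for _ in range(iters):
        x = x * (2.0 - d * x)  # error recurrence: e_{k+1} = d * e_k^2
    return x

def divide(a, d, iters=5):
    # A full division a/d then costs just one extra multiplication.
    return a * newton_reciprocal(d, iters)
```

With five iterations even the crude seed reaches double-precision accuracy, which is why functional iteration trades the linear per-bit progress of SRT for a handful of full-width multiplies.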
Implementation of Single Precision Floating Point Square Root on FPGAs
Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, 1997
Cited by 25 (2 self)

Abstract:
The square root operation is hard to implement on FPGAs because of the complexity of the algorithms. In this paper, we present a non-restoring square root algorithm and two very simple single-precision floating-point square root implementations on FPGAs based on this algorithm. One is a low-cost iterative implementation that uses a traditional adder/subtractor; its operation latency is 25 clock cycles and its issue rate is 24 clock cycles. The other is a high-throughput pipelined implementation that uses multiple adder/subtractors; its operation latency is 15 clock cycles and its issue rate is one clock cycle. This means that the pipelined implementation is capable of accepting a square root instruction on every clock cycle.
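The non-restoring idea referred to above can be sketched as a bit-serial integer model: one root bit is developed per iteration, and a negative partial remainder is not "restored" but compensated in the following step. This is an illustrative software model of the general technique, not the paper's FPGA circuit; the function name and bit-width parameter are assumptions.

```python
def nonrestoring_sqrt(d, n=32):
    """Non-restoring square root of an n-bit unsigned integer d (n even).

    Returns (root, remainder) with d == root*root + remainder.
    """
    q = 0  # root bits developed so far
    r = 0  # signed partial remainder
    for i in range(n // 2 - 1, -1, -1):
        bits = (d >> (2 * i)) & 3   # next two operand bits
        r = (r << 2) + bits         # shift them into the remainder
        if r >= 0:
            r -= (q << 2) | 1       # try root bit 1: subtract 4q + 1
        else:
            r += (q << 2) | 3       # no restore: compensate with 4q + 3
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:                       # single final correction step
        r += (q << 1) | 1
    return q, r
```

Because each step is one add or subtract, the loop maps directly onto the single adder/subtractor of the iterative variant, or unrolls into the pipelined one.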
Formal verification of square root algorithms
Formal Methods in System Design, 2003
Cited by 9 (1 self)

Abstract:
We discuss the formal verification of some low-level mathematical software for the Intel® Itanium® architecture. A number of important algorithms have been proven correct using the HOL Light theorem prover. After briefly surveying some of our formal verification work, we discuss in more detail the verification of a square root algorithm, which helps to illustrate why some features of HOL Light, in particular programmability, make it especially suitable for these applications.

1. Overview

The Intel® Itanium® architecture is a new 64-bit architecture jointly developed by Intel and Hewlett-Packard, implemented in the Itanium® processor family (IPF). Among the software supplied by Intel to support IPF processors are some optimized mathematical functions to supplement or replace less efficient generic libraries. Naturally, the correctness of the algorithms used in such software is always a major concern. This is particularly so for division, square root and certain transcendental function kernels, which are intimately tied to the basic architecture. First, in IA-32 compatibility mode, these algorithms are used by hardware instructions like fptan and fdiv. Second, while in "native" mode division and square root are implemented in software, typical users are likely to see them as part of the basic architecture. The formal verification of some of the division algorithms is described by Harrison (2000b), and a representative verification of a transcendental function by Harrison (2000a). In this paper we complete the picture by considering a square root algorithm. Division, transcendental functions and square roots all have quite distinctive features, and their formal verifications differ widely from each other. The present proofs have a number of interesting features, and show how important some theorem prover features, in particular programmability, really are. The formal verifications are conducted using the freely available HOL Light prover (Harrison, 1996). HOL Light is a version of HOL (Gordon and Melham, 1993), itself a descendant of Edinburgh LCF.
Integer Division Using Reciprocals
In Proceedings of the Tenth Symposium on Computer Arithmetic, 1991
Cited by 6 (0 self)

Abstract:
As logic density increases, more and more functionality is moving into hardware. Several years ago, it was uncommon to find more than minimal support in a processor for integer multiplication and division. Now, several processors have multipliers included within the central processing unit on one integrated circuit [8, 12]. Integer division, due to its iterative nature, benefits much less when implemented directly in hardware and is difficult to pipeline. By using a reciprocal approximation, integer division can be synthesized from a multiply followed by a shift. Without carefully selecting the reciprocal, however, the quotient obtained often suffers from off-by-one errors, requiring a correction step. This paper describes the design decisions we made when architecting integer division for a new 64-bit machine. The result is a fast and economical scheme for computing both unsigned and signed integer quotients that guarantees an exact answer without any correction. The reciprocal comput...
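The multiply-and-shift idea can be made concrete with one safe (if not minimal) choice of reciprocal: taking the ceiling of 2^64/d makes the quotient exact for all 32-bit operands with no correction step, because the approximation error stays below 1/d. This is a sketch of the general technique under that assumption, not the paper's exact 64-bit scheme; the function names are hypothetical.

```python
def precompute_reciprocal(d):
    """Magic constant m = ceil(2**64 / d) for an unsigned divisor d.

    With this choice, (n * m) >> 64 equals n // d exactly for all
    0 <= n < 2**32 and 1 <= d < 2**32: the error term n*(m - 2**64/d)/2**64
    is below 2**-32, smaller than the 1/d gap to the next integer.
    """
    assert 1 <= d < 2**32
    return -(-(1 << 64) // d)  # ceiling division in pure Python

def div_by_mul(n, m):
    # Division becomes a widening multiply followed by a shift.
    return (n * m) >> 64
```

In hardware or in compiled code the precomputation is amortized: the constant is built once per divisor (e.g. by the compiler for division by a constant), and each division afterwards costs one multiply and one shift.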
Faster floating-point square root for integer processors
Cited by 5 (3 self)

Abstract:
This paper presents work in progress on fast and accurate floating-point arithmetic software for ST200-based embedded systems. We show how to use some key architectural features to design codes that achieve correct rounding to nearest without sacrificing efficiency. This is illustrated with the square root function, whose implementation given here is over 35% faster than the previously best one for such systems.
Automating Custom-Precision Function Evaluation for Embedded Processors
Cited by 4 (2 self)

Abstract:
Due to resource and power constraints, embedded processors often cannot afford dedicated floating-point units. For instance, the IBM PowerPC processor embedded in Xilinx Virtex-II Pro FPGAs only supports emulated floating-point arithmetic, which leads to slow operation when floating-point arithmetic is desired. This paper presents a customizable mathematical library using fixed-point arithmetic for elementary function evaluation. We approximate functions via polynomial or rational approximations, depending on the user-defined accuracy requirements. The data representations for the inputs and outputs are compatible with IEEE single-precision and double-precision floating-point formats. Results show that our 32-bit polynomial method achieves over 80 times speedup over the single-precision mathematical library from Xilinx, while our 64-bit polynomial method achieves over 30 times speedup.
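The core of such a library is polynomial evaluation in fixed point, where Horner's rule turns each degree into one integer multiply, shift, and add, operations an FPU-less core handles cheaply. The sketch below is a toy model in an assumed Q16.16 format with a hand-picked degree-2 Taylor approximation of exp(x) near 0; the paper's actual formats, coefficients, and function names differ.

```python
FRAC_BITS = 16  # assumed Q16.16 fixed-point format

def to_fix(x):
    """Convert a float to Q16.16."""
    return int(round(x * (1 << FRAC_BITS)))

def horner_fixed(coeffs, x):
    """Evaluate a polynomial by Horner's rule, entirely in fixed point.

    coeffs are Q16.16 constants, highest degree first; each loop step
    is one integer multiply, one shift, and one add.
    """
    acc = 0
    for c in coeffs:
        acc = ((acc * x) >> FRAC_BITS) + c
    return acc

# Toy degree-2 approximation of exp(x) near 0: 1 + x + x^2/2
coeffs = [to_fix(0.5), to_fix(1.0), to_fix(1.0)]
y = horner_fixed(coeffs, to_fix(0.5)) / (1 << FRAC_BITS)  # ~1.625 vs exp(0.5)=1.6487
```

A real generator would pick the degree, coefficient values, and intermediate widths automatically from the user's accuracy requirement, which is precisely the automation the paper describes.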
Fast IEEE Rounding for Division by Functional Iteration
1996
Cited by 4 (0 self)

Abstract:
A class of high-performance division algorithms is functional iteration. Division by functional iteration uses multiplication as the fundamental operator. The main advantage of division by functional iteration is quadratic convergence to the quotient. However, unlike non-restoring division algorithms such as SRT division, functional iteration does not directly provide a final remainder. This makes fast and exact rounding difficult. This paper clarifies the methodology for correct IEEE-compliant rounding for quadratically converging division algorithms. It proposes an extension to previously reported techniques of using extended precision in the computation to reduce the frequency of back multiplications required to obtain the final remainder. Further, a technique applicable to all IEEE rounding modes is presented which replaces the final subtraction for remainder computation with very simple combinational logic. Key words and phrases: division, Goldschmidt's algorithm, IEEE rounding, Newton-Raphson, variable latency.
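The rounding difficulty described above comes from how Goldschmidt's algorithm works: numerator and denominator are multiplied by the same factor each step, so the quotient estimate converges quadratically but no remainder falls out as a by-product. The sketch below is an illustrative floating-point model assuming the divisor has been scaled into [0.5, 1); it is not the paper's hardware scheme.

```python
def goldschmidt(a, b, iters=5):
    """Goldschmidt division: scale numerator and denominator by the
    same factor F = 2 - D each step, driving D toward 1.

    Assumes 0 < b, prescaled into [0.5, 1); convergence is quadratic.
    """
    N, D = a, b
    for _ in range(iters):
        F = 2.0 - D
        N *= F   # N converges to a/b ...
        D *= F   # ... as D converges to 1
    return N

# No remainder is produced; to decide rounding one must 'back multiply':
# the sign of a - q*b tells whether the estimate q is above or below
# the exact quotient. Reducing how often this extra multiply is needed
# is what the paper's extended-precision technique targets.
q = goldschmidt(1.0, 0.75)   # ~ 4/3
r = 1.0 - q * 0.75           # back-multiplication residual
```

Contrast this with the non-restoring sketch for square root elsewhere on this page, where an exact remainder is available at every step for free.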
Floating-point verification
International Journal of Man-Machine Studies, 1995
Cited by 3 (0 self)

Abstract:
This paper surveys the application of formal verification techniques to hardware in general, and to floating-point hardware in particular. A specific challenge is to connect the usual mathematical view of continuous arithmetic operations with the discrete world in a credible and verifiable way.
A new binary floating-point division algorithm and its software implementation on the ST231 processor
2009
Cited by 3 (1 self)

Abstract:
This paper deals with the design and implementation of low-latency software for binary floating-point division with correct rounding to nearest. The approach we present here targets a VLIW integer processor of the ST200 family and is based on fast and accurate programs for evaluating some particular bivariate polynomials. We start by giving approximation and evaluation error conditions that are sufficient to ensure correct rounding. Then we describe the heuristics used to generate such evaluation programs, as well as those used to automatically validate their accuracy. Finally, we propose, for the binary32 format, a complete C implementation of the resulting division algorithm. With the ST200 compiler, and compared to previous implementations, the speedup observed with our approach is by a factor of almost 1.8.
Parallel-Array Implementations of a Non-Restoring Square Root Algorithm
1997
Cited by 1 (0 self)

Abstract:
In this paper, we present a parallel-array implementation of a new non-restoring square root algorithm (PASQRT). The carry-save adder (CSA) is used in the parallel array. The PASQRT has several features unlike other implementations. First, it does not use a redundant representation for the square root result. Second, each iteration generates an exact resulting value. Next, it does not require any conversion on the inputs of the CSA. And last, a precise remainder can be obtained immediately. Furthermore, we present an improved version, a root-select parallel-array implementation (RS-PASQRT), for fast result value generation. The RS-PASQRT is capable of achieving a speedup ratio of up to about 150% over the PASQRT. The simplicity of the implementations indicates that the proposed approach is an alternative to consider when designing a fully pipelined square root unit.