Results 1 - 10
of
40
A proven correctly rounded logarithm in double-precision
- In Real Numbers and Computers, Schloss Dagstuhl
, 2004
"... Abstract. This article is a case study in the implementation of a portable, proven and efficient correctly rounded elementary function in double-precision. We describe the methodology used to achieve these goals in the crlibm library. There are two novel aspects to this approach. The first is the pr ..."
Abstract
-
Cited by 16 (8 self)
- Add to MetaCart
Abstract. This article is a case study in the implementation of a portable, proven and efficient correctly rounded elementary function in double-precision. We describe the methodology used to achieve these goals in the crlibm library. There are two novel aspects to this approach. The first is the proof framework, and in general the techniques used to balance performance and provability. The second is the introduction of processor-specific optimization to get performance equivalent to the best current mathematical libraries, while trying to minimize the proof work. The implementation of the natural logarithm is detailed to illustrate these questions. Mathematics Subject Classification. 26-04, 65D15, 65Y99. 1.
Formal verification of IA-64 division algorithms
- Proceedings, Theorem Proving in Higher Order Logics (TPHOLs), LNCS 1869
, 2000
"... Abstract. The IA-64 architecture defers floating point and integer division to software. To ensure correctness and maximum efficiency, Intel provides a number of recommended algorithms which can be called as subroutines or inlined by compilers and assembly language programmers. All these algorithms ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
Abstract. The IA-64 architecture defers floating point and integer division to software. To ensure correctness and maximum efficiency, Intel provides a number of recommended algorithms which can be called as subroutines or inlined by compilers and assembly language programmers. All these algorithms have been subjected to formal verification using the HOL Light theorem prover. As well as improving our level of confidence in the algorithms, the formal verification process has led to a better understanding of the underlying theory, allowing some significant efficiency improvements. 1
Towards the post-ultimate libm
, 2005
"... This article presents advances on the subject of correctly rounded elementary functions since the publication of the libultim mathematical library developed by Ziv at IBM. This library showed that the average performance and memory overhead of correct rounding could be made negligible. However, the ..."
Abstract
-
Cited by 12 (7 self)
- Add to MetaCart
This article presents advances on the subject of correctly rounded elementary functions since the publication of the libultim mathematical library developed by Ziv at IBM. This library showed that the average performance and memory overhead of correct rounding could be made negligible. However, the worst-case overhead was still a factor 1000 or more. It is shown here that, with current processor technology, this worst-case overhead can be kept within a factor of 2 to 10 of current best libms. This low overhead has very positive consequences on the techniques for implementing and proving correctly rounded functions, which are also studied. These results lift the last technical obstacles to a generalisation of (at least some) correctly rounded double precision elementary functions.
A parameterized floating-point exponential function for FPGAs
- IN IEEE INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY (FPT’05
, 2005
"... A parameterized floating-point exponential operator is presented. In single-precision, it uses a small fraction of the FPGA’s resources and has a smaller latency than its software equivalent on a high-end processor, and ten times the throughput in pipelined version. Previous work had shown that FPGA ..."
Abstract
-
Cited by 11 (5 self)
- Add to MetaCart
A parameterized floating-point exponential operator is presented. In single-precision, it uses a small fraction of the FPGA’s resources and has a smaller latency than its software equivalent on a high-end processor, and ten times the throughput in pipelined version. Previous work had shown that FPGAs could use massive parallelism to balance the poor performance of their basic floating-point operators compared to the equivalent in processors. As this work shows, when evaluating an elementary function, the flexibility of FPGAs provides much better performance than the processor witout even resorting to parallelism.
Theorems on efficient argument reductions
- Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH16
, 2003
"... A commonly used argument reduction technique in elementary function computations begins with two positive floating point numbers α and γ that approximate (usually irrational but not necessarily) numbers 1/C and C, e.g., C = 2π for trigonometric functions and ln 2 for e x. Given an argument to the fu ..."
Abstract
-
Cited by 10 (4 self)
- Add to MetaCart
A commonly used argument reduction technique in elementary function computations begins with two positive floating point numbers α and γ that approximate (usually irrational but not necessarily) numbers 1/C and C, e.g., C = 2π for trigonometric functions and ln 2 for e x. Given an argument to the function of interest it extracts z as defined by xα = z + ς with z = k2 −N and |ς | ≤ 2 −N−1, where k, N are integers and N ≥ 0 is preselected, and then computes u = x − zγ. Usually zγ takes more bits than the working precision provides for storing its significand, and thus exact x − zγ may not be represented exactly by a floating point number of the same precision. This will cause performance penalty when the working precision is the highest available on the underlying hardware and thus considerable extra work is needed to get all the bits of x − zγ right. This paper presents theorems that show under mild conditions that can be easily met on today’s computer hardware and still allow α ≈ 1/C and γ ≈ C to almost the full working precision, x − zγ is a floating point number of the same precision. An algorithmic procedure based on the theorems is obtained. The results will enhance performance, in particular on machines that has hardware support for fusedmultiply-add (fma) instruction(s). 1
Some Functions Computable with a Fused-mac
- in "Proc. 17th IEEE Symposium on Computer Arithmetic (ARITH-17), Cape Cod
, 2005
"... The fused multiply accumulate instruction (fused-mac) that is available on some current processors such as the Power PC or the Itanium eases some calculations. We give examples of some floating-point functions (such as ulp(x) or Nextafter(x, y)), or some useful tests, that are easily computable usin ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
The fused multiply accumulate instruction (fused-mac) that is available on some current processors such as the Power PC or the Itanium eases some calculations. We give examples of some floating-point functions (such as ulp(x) or Nextafter(x, y)), or some useful tests, that are easily computable using a fused-mac. Then, we show that, with rounding to the nearest, the error of a fused-mac instruction is exactly representable as the sum of two floating-point numbers. We give an algorithm that computes that error.
Parameterized floating-point logarithm and exponential functions for FPGAs
- IN MICROPROCESSORS AND MICROSYSTEMS. ELSEVIER
, 2007
"... As FPGAs are increasingly being used for floating-point computing, the feasibility of a library of floating-point elementary functions for FPGAs is discussed. An initial implementation of such a library contains parameterized operators for the logarithm and exponential functions. In single precision ..."
Abstract
-
Cited by 8 (3 self)
- Add to MetaCart
As FPGAs are increasingly being used for floating-point computing, the feasibility of a library of floating-point elementary functions for FPGAs is discussed. An initial implementation of such a library contains parameterized operators for the logarithm and exponential functions. In single precision, those operators use a small fraction of the FPGA’s resources, have a smaller latency than their software equivalent on a high-end processor, and provide about ten times the throughput in pipelined version. Previous work had shown that FPGAs could use massive parallelism to balance the poor performance of their basic floating-point operators compared to the equivalent in processors. As this work shows, when evaluating an elementary function, the flexibility of FPGAs provides much better performance than the processor without even resorting to parallelism. The presented library is freely available
A parameterizable floating-point logarithm operator for FPGAs
- IN 39TH ASILOMAR CONFERENCE ON SIGNALS, SYSTEMS & COMPUTERS. IEEE SIGNAL PROCESSING SOCIETY
, 2005
"... As FPGAs are increasingly being used for floatingpoint computing, a parameterized floating-point logarithm operator is presented. In single precision, this operator uses a small fraction of the FPGA’s resources, has a smaller latency than its software equivalent on a high-end processor, and provid ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
As FPGAs are increasingly being used for floatingpoint computing, a parameterized floating-point logarithm operator is presented. In single precision, this operator uses a small fraction of the FPGA’s resources, has a smaller latency than its software equivalent on a high-end processor, and provides about ten times the throughput in pipelined version. Previous work had shown that FPGAs could use massive parallelism to balance the poor performance of their basic floating-point operators compared to the equivalent in processors. As this work shows, when evaluating an elementary function, the flexibility of FPGAs provides much better performance than the processor without even resorting to parallelism. The presented operator is freely available from
Correctly rounded multiplication by arbitrary precision constants
- IEEE Symposium on Computer Arithmetic, Research Report, n o 5354, INRIA
, 2005
"... Abstract—We introduce an algorithm for multiplying a floating-point number x by a constant C that is not exactly representable in floating-point arithmetic. Our algorithm uses a multiplication and a fused multiply and add instruction. Such instructions are available in some modern processors such as ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
Abstract—We introduce an algorithm for multiplying a floating-point number x by a constant C that is not exactly representable in floating-point arithmetic. Our algorithm uses a multiplication and a fused multiply and add instruction. Such instructions are available in some modern processors such as the IBM Power PC and the Intel/HP Itanium. We give three methods for checking whether, for a given value of C and a given floating-point format, our algorithm returns a correctly rounded result for any x. When it does not, some of our methods return all of the values x for which the algorithm fails. The three methods are complementary: The first two do not always allow one to conclude, yet they are simple enough to be used at compile time, while the third one always either proves that our algorithm returns a correctly rounded result for any x or gives all of the counterexamples. We generalize our study to the case where a wider internal format is used for the intermediate calculations, which gives a fourth method. Our programs and some additional information (such as the case where an arbitrary nonbinary even radix is used), as well as examples of runs of our programs, can be downloaded from
Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance
- IEEE Transactions on Computers
, 2004
"... We present techniques for accelerating the floating-point computation of x=y when y is known before x. The proposed algorithms are oriented toward architectures with available fused-mac operations. The goal is to get exactly the same result as with usual division with rounding to nearest. It is kn ..."
Abstract
-
Cited by 5 (3 self)
- Add to MetaCart
We present techniques for accelerating the floating-point computation of x=y when y is known before x. The proposed algorithms are oriented toward architectures with available fused-mac operations. The goal is to get exactly the same result as with usual division with rounding to nearest. It is known that the advanced computation of 1=y allows performing correctly rounded division in one multiplication plus two fused-macs. We show algorithms that reduce this latency to one multiplication and one fused-mac. This is achieved if a precision of at least 1 bits is available, where n is the number of mantissa bits in the target format, or if y satisfies some properties that can be easily checked at compile-time. This requires a double-word approximation of 1=y (we also show how to get it). These techniques can be used by compilers to accelerate some numerical programs without loss of accuracy.

