A Machine-Checked Theory of Floating Point Arithmetic
, 1999
Cited by 31 (5 self)
Intel is applying formal verification to various pieces of mathematical software used in Merced, the first implementation of the new IA-64 architecture. This paper discusses the development of a generic floating point library giving definitions of the fundamental terms and containing formal proofs of important lemmas. We also briefly describe how this has been used in the verification effort so far. 1 Introduction IA-64 is a new 64-bit computer architecture jointly developed by Hewlett-Packard and Intel, and the forthcoming Merced chip from Intel will be its first silicon implementation. To avoid some of the limitations of traditional architectures, IA-64 incorporates a unique combination of features, including an instruction format encoding parallelism explicitly, instruction predication, and speculative/advanced loads [4]. Nevertheless, it also offers full upwards-compatibility with IA-32 (x86) code. IA-64 incorporates a number of floating point operations, the centerpi...
Design Issues In High Performance Floating Point Arithmetic Units
, 1996
Cited by 18 (3 self)
In recent years computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to implementation of the floating point unit, or FPU. Special-purpose applications, such as high-performance graphics rendering systems, have placed further demands on processors. High-speed floating point hardware is a requirement to meet these increasing demands. This work examines the state-of-the-art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. The design space of FP dividers is large, comprising five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table lookup, and variable latency. While division is an infrequent operation...
Formal verification of IA-64 division algorithms
 Proceedings, Theorem Proving in Higher Order Logics (TPHOLs), LNCS 1869
, 2000
Cited by 18 (4 self)
Abstract. The IA-64 architecture defers floating point and integer division to software. To ensure correctness and maximum efficiency, Intel provides a number of recommended algorithms which can be called as subroutines or inlined by compilers and assembly language programmers. All these algorithms have been subjected to formal verification using the HOL Light theorem prover. As well as improving our level of confidence in the algorithms, the formal verification process has led to a better understanding of the underlying theory, allowing some significant efficiency improvements.
System-Level Power Consumption Modeling and Tradeoff Analysis Techniques for Superscalar Processor Design
, 1997
Cited by 14 (0 self)
High-level decisions in high-performance processors are often decoupled from their ultimate impact on power usage. For example, superscalar hardware and high degrees of pipelining are excellent sources for high parallelism. They often result in higher power usage. This problem is further complicated by the usage patterns of each unit in the processor. The usage patterns are determined by the programs the system executes, and ultimately by the applications the processor is targeted towards. This paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user applications. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is base...
Some functions computable with a fused-mac
 in Proceedings of the 17th Symposium on Computer Arithmetic, P. Montuschi and E. Schwarz, Eds., Cape Cod
, 2005
Cited by 10 (3 self)
The fused multiply accumulate instruction (fused-mac) that is available on some current processors such as the Power PC or the Itanium eases some calculations. We give examples of some floating-point functions (such as ulp(x) or Nextafter(x, y)), or some useful tests, that are easily computable using a fused-mac. Then, we show that, with rounding to nearest, the error of a fused-mac instruction is exactly representable as the sum of two floating-point numbers. We give an algorithm that computes that error.
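The claim that the rounding error of a fused-mac is a sum of two floating-point numbers can be checked on a single input with exact rational arithmetic. The sketch below is not the algorithm from the paper, which is not reproduced in this listing. It emulates a fused-mac with Python's `fractions.Fraction` (an assumption made so the example runs on any Python 3), and the helper names `fma` and `fma_error_as_two_floats` are hypothetical.

```python
from fractions import Fraction


def fma(a: float, x: float, y: float) -> float:
    """Emulated fused-mac: a*x + y computed exactly, then rounded once
    to the nearest double (ties to even)."""
    return float(Fraction(a) * Fraction(x) + Fraction(y))


def fma_error_as_two_floats(a: float, x: float, y: float):
    """Return (r, hi, lo, exact) where r is the fused-mac result and
    hi + lo is a candidate two-float representation of its error."""
    r = fma(a, x, y)
    # Exact rounding error of the fused-mac, as a rational number.
    err = Fraction(a) * Fraction(x) + Fraction(y) - Fraction(r)
    hi = float(err)                 # leading part of the error
    lo = float(err - Fraction(hi))  # what is left after removing hi
    exact = Fraction(hi) + Fraction(lo) == err
    return r, hi, lo, exact


# Illustrative input chosen so the error needs two doubles:
# (1 + 2^-52)^2 + 2^-160 = 1 + 2^-51 + 2^-104 + 2^-160.
a = x = 1.0 + 2.0 ** -52
y = 2.0 ** -160
r, hi, lo, exact = fma_error_as_two_floats(a, x, y)
```

For this input the error is 2^-104 + 2^-160, which spans more than 53 bits and therefore cannot be held in one double, while the pair `(hi, lo)` captures it exactly. The example ignores underflow and overflow.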
Correctly rounded multiplication by arbitrary precision constants
 IEEE Symposium on Computer Arithmetic, Research Report, no. 5354, INRIA
, 2005
Cited by 8 (1 self)
Abstract. We introduce an algorithm for multiplying a floating-point number x by a constant C that is not exactly representable in floating-point arithmetic. Our algorithm uses a multiplication and a fused multiply and add instruction. Such instructions are available in some modern processors such as the IBM Power PC and the Intel/HP Itanium. We give three methods for checking whether, for a given value of C and a given floating-point format, our algorithm returns a correctly rounded result for any x. When it does not, some of our methods return all of the values x for which the algorithm fails. The three methods are complementary: The first two do not always allow one to conclude, yet they are simple enough to be used at compile time, while the third one always either proves that our algorithm returns a correctly rounded result for any x or gives all of the counterexamples. We generalize our study to the case where a wider internal format is used for the intermediate calculations, which gives a fourth method. Our programs and some additional information (such as the case where an arbitrary non-binary even radix is used), as well as examples of runs of our programs, can be downloaded from
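The abstract above only says that the algorithm uses one multiplication and one fused multiply and add. A minimal sketch of one arrangement consistent with that description follows. It assumes C is split into a leading double `c_hi` and a correction `c_lo`, with the product `c_lo * x` fed into the fused operation; this split, the emulated `fma`, and all function names are assumptions for illustration, not details taken from the listing.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def split_constant(c: Fraction):
    """Split an exact constant into a leading double and a correction."""
    c_hi = float(c)                   # C rounded to nearest
    c_lo = float(c - Fraction(c_hi))  # rounded remainder C - c_hi
    return c_hi, c_lo


def mul_by_constant(x: float, c_hi: float, c_lo: float) -> float:
    """One rounded multiplication followed by one fused multiply-add."""
    u1 = c_lo * x            # ordinary multiplication (one rounding)
    return fma(c_hi, x, u1)  # c_hi*x + u1 with a single rounding


def is_correctly_rounded(x: float, c: Fraction) -> bool:
    """Brute-force check for one x against the exactly rounded product."""
    c_hi, c_lo = split_constant(c)
    return mul_by_constant(x, c_hi, c_lo) == float(c * Fraction(x))


# Illustrative constant: 1/3 is not representable in binary floating point.
C = Fraction(1, 3)
C_HI, C_LO = split_constant(C)
```

`is_correctly_rounded` tests one input at a time, which is far weaker than the checking methods the abstract describes; those decide the question for every x of a given format.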
A technique to determine power-efficient, high-performance superscalar processors
 In Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences
, 1995
Cited by 7 (0 self)
Processor performance advances are increasingly inhibited by limitations in thermal power dissipation. Part of the problem is the lack of architectural power estimates before implementation. Although high-performance designs exist that dissipate low power, the method for finding these designs has been through trial-and-error. This paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user benchmarks. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: increasing the duplication in superscalar execution units without excessive power consumption. Results are presented from runs using simulated annealing to maximize processor performance subject to power and area constraints. The major contributions of this paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
Accelerating correctly rounded floating-point division when the divisor is known in advance
 IEEE Transactions on Computers
, 2004
Cited by 6 (4 self)
optimization. We present techniques for accelerating the floating-point computation of x/y when y is known before x. The proposed algorithms are oriented towards architectures with available fused-MAC operations. The goal is to get exactly the same result as with usual division with rounding to nearest. These techniques can be used by compilers to accelerate some numerical programs without loss of accuracy. 1 Motivation of this research We wish to provide methods for accelerating floating-point divisions of the form x/y, when y is known before x, either at compile time or at run time. We assume that a fused multiply-accumulator is available, and that division is done in software (this happens for instance on RS/6000, PowerPC or Itanium architectures). The computed result must be the correctly rounded result. A naive approach consists in computing the reciprocal of y (with rounding to nearest), and then, once x is available, multiplying the obtained result by x. It is well known
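The naive approach described in this abstract can be tried directly in double precision. The sketch below shows it disagreeing with true division on one input, followed by a classical residual-correction step built from two fused multiply-adds. The correction is a standard technique and is not claimed to be the algorithm proposed in the paper; the emulated `fma` and the function names are assumptions for illustration.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def naive_div(x: float, y: float) -> float:
    """Multiply x by the rounded reciprocal of y (two roundings)."""
    z = 1.0 / y  # can be precomputed when y is known in advance
    return x * z


def corrected_div(x: float, y: float) -> float:
    """Naive quotient refined by one residual-correction step."""
    z = 1.0 / y
    q = x * z            # first approximation of x / y
    r = fma(-q, y, x)    # residual x - q*y, rounded once
    return fma(r, z, q)  # q + r*z, rounded once


def first_naive_failure(y: float, limit: int = 1000):
    """Smallest integer-valued x below limit where the naive quotient
    differs from the hardware division x / y, or None if there is none."""
    for n in range(1, limit):
        x = float(n)
        if naive_div(x, y) != x / y:
            return x
    return None
```

With x = y = 49.0, `naive_div` returns a value just below 1.0 while `49.0 / 49.0` is exactly 1.0, and `corrected_div` recovers 1.0. The correction step is not guaranteed here to give a correctly rounded quotient for every pair (x, y).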
Tradeoff of n-Select Square Root Implementations Cost/Performance
Cited by 2 (0 self)
Hardware square-root units require large numbers of gates even for iterative implementations. In this paper, we present four low-cost high-performance fully-pipelined n-select implementations (nS-Root) based on a non-restoring-remainder square root algorithm. The nS-Root uses a parallel array of carry-save adders (CSAs). For a square root bit calculation, a CSA is used once. This means that the calculations can be fully pipelined. It also uses the n-way root-select technique to speed up the square root calculation. The cost/performance evaluation shows that n=2 or n=2.5 is a suitable solution for designing a high-speed fully pipelined square root unit while keeping the cost low.
Formally Verified Argument Reduction with a Fused-Multiply-Add
 no. 8, 2009, p. 1139-1145, http://arxiv.org/abs/0708.3722. Activity Report INRIA 2009
Cited by 1 (0 self)
Abstract. The Cody & Waite argument reduction technique works perfectly for reasonably large arguments, but as the input grows there are no bits left to approximate the constant with enough accuracy. Under mild assumptions, we show that the result computed with a fused-multiply-add provides a fully accurate result for many possible values of the input with a constant almost accurate to the full working precision. We also present an algorithm for a fully accurate second reduction step to reach double full accuracy (all the significand bits of two numbers are significant) even in the worst cases of argument reduction. Our work recalls the common algorithms and presents proofs of correctness. All the proofs are formally verified using the Coq automatic proof checker. Index Terms: Argument reduction, fma, formal proof, Coq.
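A two-constant reduction in the style of Cody & Waite, with each step done by a fused multiply-add, can be sketched as follows. The choice of pi/2 as the constant, the split into `C1` and `C2`, the emulated `fma`, and the function name `reduce_half_pi` are all assumptions for illustration; the paper's own algorithms, hypotheses, and Coq proofs are not reproduced in this listing.

```python
from fractions import Fraction

# pi/2 as an exact rational built from 50 decimal digits of pi.
HALF_PI = Fraction("3.14159265358979323846264338327950288419716939937510") / 2


def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Leading part of the constant, close to full working precision,
# and the rounded remainder that C1 leaves out.
C1 = float(HALF_PI)
C2 = float(HALF_PI - Fraction(C1))


def reduce_half_pi(x: float):
    """Return (k, r) with x approximately equal to k*(pi/2) + r."""
    k = round(x / C1)
    kf = float(k)
    r = fma(-kf, C1, x)  # first step: x - k*C1 with one rounding
    r = fma(-kf, C2, r)  # second step: subtract the k*C2 correction
    return k, r
```

For example, `reduce_half_pi(10.0)` gives k = 6 and a remainder close to 10 - 3*pi, so sin(10) can be recovered as -sin(r). The sketch makes no accuracy claim for large inputs, which is the regime the abstract is concerned with.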