A Machine-Checked Theory of Floating Point Arithmetic
, 1999
Cited by 27 (5 self)
Intel is applying formal verification to various pieces of mathematical software used in Merced, the first implementation of the new IA-64 architecture. This paper discusses the development of a generic floating point library giving definitions of the fundamental terms and containing formal proofs of important lemmas. We also briefly describe how this has been used in the verification effort so far. 1 Introduction IA-64 is a new 64-bit computer architecture jointly developed by Hewlett-Packard and Intel, and the forthcoming Merced chip from Intel will be its first silicon implementation. To avoid some of the limitations of traditional architectures, IA-64 incorporates a unique combination of features, including an instruction format encoding parallelism explicitly, instruction predication, and speculative/advanced loads [4]. Nevertheless, it also offers full upwards-compatibility with IA-32 (x86) code. IA-64 incorporates a number of floating point operations, the centerpi...
Design Issues In High Performance Floating Point Arithmetic Units
, 1996
Cited by 17 (3 self)
In recent years computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to implementation of the floating point unit, or FPU. Special purpose applications, such as high performance graphics rendering systems, have placed further demands on processors. High speed floating point hardware is a requirement to meet these increasing demands. This work examines the state-of-the-art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. The design space of FP dividers is large, comprising five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table look-up, and variable latency. While division is an infrequent operation...
Formal verification of IA-64 division algorithms
- Proceedings, Theorem Proving in Higher Order Logics (TPHOLs), LNCS 1869
, 2000
Cited by 15 (4 self)
Abstract. The IA-64 architecture defers floating point and integer division to software. To ensure correctness and maximum efficiency, Intel provides a number of recommended algorithms which can be called as subroutines or inlined by compilers and assembly language programmers. All these algorithms have been subjected to formal verification using the HOL Light theorem prover. As well as improving our level of confidence in the algorithms, the formal verification process has led to a better understanding of the underlying theory, allowing some significant efficiency improvements.
System-Level Power Consumption Modeling and Tradeoff Analysis Techniques for Superscalar Processor Design
, 1997
Cited by 12 (0 self)
High-level decisions in high-performance processors are often decoupled from their ultimate impact on power usage. For example, superscalar hardware and high degrees of pipelining are excellent sources for high parallelism. They often result in higher power usage. This problem is further complicated by the usage patterns of each unit in the processor. The usage patterns are determined by the programs the system executes, and ultimately by the applications the processor is targeted towards. This paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user applications. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is base...
Some Functions Computable with a Fused-mac
- In Proc. 17th IEEE Symposium on Computer Arithmetic (ARITH-17), Cape Cod
, 2005
Cited by 9 (3 self)
The fused multiply accumulate instruction (fused-mac) that is available on some current processors such as the Power PC or the Itanium eases some calculations. We give examples of some floating-point functions (such as ulp(x) or Nextafter(x, y)), or some useful tests, that are easily computable using a fused-mac. Then, we show that, with rounding to the nearest, the error of a fused-mac instruction is exactly representable as the sum of two floating-point numbers. We give an algorithm that computes that error.
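The representability claim can be checked numerically. The sketch below is not the paper's algorithm: it emulates a fused-mac with exact rational arithmetic (the helper names `fma` and `fma_error` are hypothetical) and then splits the exact error into two doubles, assuming round-to-nearest and no overflow or underflow.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Emulate a fused-mac: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Converting a Fraction to float rounds to nearest in CPython.
    return float(exact)


def fma_error(a: float, b: float, c: float):
    """Return (r, e1, e2) where r = fma(a, b, c) and r + e1 + e2 == a*b + c.

    Assumption: no overflow or underflow occurs in any step.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    r = float(exact)
    err = exact - Fraction(r)        # exact rounding error of the fused-mac
    e1 = float(err)                  # leading part of the error
    e2 = float(err - Fraction(e1))   # remaining part of the error
    # The three doubles together must reproduce the exact value.
    assert Fraction(r) + Fraction(e1) + Fraction(e2) == exact
    return r, e1, e2


if __name__ == "__main__":
    x = 1 + 2**-30
    print(fma_error(x, x, 2**-120))
```

In this example the error spans more than 53 bits, so a single double cannot hold it, while two doubles can.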
Correctly rounded multiplication by arbitrary precision constants
- IEEE Symposium on Computer Arithmetic, Research Report no. 5354, INRIA
, 2005
Cited by 7 (1 self)
Abstract—We introduce an algorithm for multiplying a floating-point number x by a constant C that is not exactly representable in floating-point arithmetic. Our algorithm uses a multiplication and a fused multiply and add instruction. Such instructions are available in some modern processors such as the IBM Power PC and the Intel/HP Itanium. We give three methods for checking whether, for a given value of C and a given floating-point format, our algorithm returns a correctly rounded result for any x. When it does not, some of our methods return all of the values x for which the algorithm fails. The three methods are complementary: The first two do not always allow one to conclude, yet they are simple enough to be used at compile time, while the third one always either proves that our algorithm returns a correctly rounded result for any x or gives all of the counterexamples. We generalize our study to the case where a wider internal format is used for the intermediate calculations, which gives a fourth method. Our programs and some additional information (such as the case where an arbitrary nonbinary even radix is used), as well as examples of runs of our programs, can be downloaded from
A technique to determine power-efficient, high-performance superscalar processors
- In Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences
, 1995
Cited by 6 (0 self)
Processor performance advances are increasingly inhibited by limitations in thermal power dissipation. Part of the problem is the lack of architectural power estimates before implementation. Although high-performance designs exist that dissipate low power, the method for finding these designs has been through trial-and-error. This paper presents systematic techniques to find low-power, high-performance superscalar processors tailored to specific user benchmarks. The model of power is novel because it separates power into architectural and technology components. The architectural component is found via trace-driven simulation, which also produces performance estimates. An example technology model is presented that estimates the technology component, along with critical delay time and real estate usage. This model is based on case studies of actual designs. It is used to solve an important problem: increasing the duplication in superscalar execution units without excessive power consumption. Results are presented from runs using simulated annealing to maximize processor performance subject to power and area constraints. The major contributions of this paper are the separation of architectural and technology components of dynamic power, the use of trace-driven simulation for architectural power measurement, and the use of a near-optimal search to tailor a processor design to a benchmark.
Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance
- IEEE Transactions on Computers
, 2004
Cited by 5 (3 self)
We present techniques for accelerating the floating-point computation of x/y when y is known before x. The proposed algorithms are oriented toward architectures with available fused-mac operations. The goal is to get exactly the same result as with usual division with rounding to nearest. It is known that the advance computation of 1/y allows performing correctly rounded division in one multiplication plus two fused-macs. We show algorithms that reduce this latency to one multiplication and one fused-mac. This is achieved if a precision of at least n+1 bits is available, where n is the number of mantissa bits in the target format, or if y satisfies some properties that can be easily checked at compile-time. This requires a double-word approximation of 1/y (we also show how to get it). These techniques can be used by compilers to accelerate some numerical programs without loss of accuracy.
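For orientation, the baseline of "one multiplication plus two fused-macs" might look like the sketch below. The abstract does not spell out the instruction sequence, so this uses the classical remainder-correction step as an assumption. The names `fma` and `make_divider` are hypothetical, the fused-mac is emulated with exact rational arithmetic, and the result is correctly rounded only under conditions on y that are not reproduced here.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Emulate a fused-mac: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def make_divider(y: float):
    """Precompute z ~ 1/y once and return a function computing x / y.

    Assumed sequence (classical correction step, not taken from the paper):
      q  = RN(x * z)       one multiplication
      r  = RN(x - q * y)   first fused-mac (remainder)
      q' = RN(q + r * z)   second fused-mac (corrected quotient)
    """
    z = 1.0 / y  # reciprocal rounded to nearest, computed in advance

    def divide(x: float) -> float:
        q = x * z
        r = fma(-q, y, x)
        return fma(r, z, q)

    return divide


if __name__ == "__main__":
    divide_by_3 = make_divider(3.0)
    print(divide_by_3(1.0) == 1.0 / 3.0)
    print(divide_by_3(6.0) == 2.0)
```

The reciprocal is paid for once, so the per-division cost is three dependent operations; the paper's contribution is reducing that to two.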
Tradeoff of n-Select Square Root Implementations Cost/Performance
Hardware square-root units require large numbers of gates even for iterative implementations. In this paper, we present four low-cost high-performance fully pipelined n-select implementations (nS-Root) based on a non-restoring-remainder square root algorithm. The nS-Root uses a parallel array of carry-save adders (CSAs). For a square root bit calculation, a CSA is used once. This means that the calculations can be fully pipelined. It also uses the n-way root-select technique to speed up the square root calculation. The cost/performance evaluation shows that n=2 or n=2.5 is a suitable solution for designing a high-speed fully pipelined square root unit while keeping the cost low.
Exact and Approximated Error of the FMA
, 2011
The fused multiply accumulate-add (FMA) instruction, specified by the IEEE 754-2008 Standard for Floating-Point Arithmetic, eases some calculations, and is already available on some current processors such as the Power PC or the Itanium. We first extend an earlier work on the computation of the exact error of an FMA (by giving more general conditions and providing a formal proof). Then, we present a new algorithm that computes an approximation to the error of an FMA, and provide error bounds and a formal proof for that algorithm.

