A Zero-Overhead Self-Timed 160-ns 54-b CMOS Divider
 IEEE Journal of Solid-State Circuits
, 1991
Cited by 55 (3 self)
This paper describes the design of a custom integrated circuit for the arithmetic operation of division. The chip uses self-timing to avoid the need for high-speed clocks, and directly concatenates precharged function blocks without latches. Internal stages form a ring that cycles without any external signaling. The self-timed control introduces no serial overhead, making the total chip latency equal to just the combinational logic delays of the data elements. The ring’s data path uses embedded completion encoding and generates the mantissa of a 54-b (floating-point IEEE double-precision) result. Fabricated in 1.2-μm CMOS, the ring occupies 7 mm² and generates a quotient and done indication in 45 to 160 ns, depending on the particular data operands.
An Analysis Of Division Algorithms And Implementations
 IEEE Transactions on Computers
, 1995
Cited by 53 (8 self)
Floating-point division is generally regarded as a low frequency, high latency operation in typical floating-point applications. However, the increasing emphasis on high performance graphics and the industry-wide usage of performance benchmarks forces processor designers to pay close attention to all aspects of floating-point computation. Many algorithms are suitable for implementing division in hardware. This paper presents four major classes of algorithms in a unified framework, namely digit recurrence, functional iteration, very high radix, and variable latency. Digit recurrence algorithms, the most common of which is SRT, use subtraction as the fundamental operator, and they converge to a quotient linearly. Division by functional iteration converges to a quotient quadratically using multiplication. Very high radix division algorithms are similar to digit recurrence algorithms, but they incorporate multiplication to reduce the latency. Variable latency division algorithms reduce the...
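The contrast between linear and quadratic convergence in the abstract above can be illustrated with a small sketch. It is not taken from the paper: restoring division stands in for the digit recurrence class (SRT is a more refined member of that class), and the Newton-Raphson reciprocal iteration stands in for functional iteration. The function names and the use of exact `Fraction` arithmetic are assumptions made for illustration.

```python
from fractions import Fraction


def restoring_divide(n, d, bits):
    """Digit recurrence (radix 2, restoring): one quotient bit per step.

    Assumes integers with 0 <= n < d. Returns an integer q such that
    n/d is approximated by q / 2**bits (truncated).
    """
    q = 0
    r = n
    for _ in range(bits):
        r *= 2              # shift the partial remainder left
        q *= 2              # make room for the next quotient bit
        if r >= d:          # trial subtraction succeeds
            r -= d
            q += 1
    return q


def newton_reciprocal_errors(d, x0, steps):
    """Functional iteration: x <- x * (2 - d*x) converges to 1/d.

    Returns the absolute error |1/d - x| after each step, computed
    exactly so the quadratic error law is visible.
    """
    x = Fraction(x0)
    d = Fraction(d)
    errors = []
    for _ in range(steps):
        x = x * (2 - d * x)  # two multiplications per step
        errors.append(abs(1 / d - x))
    return errors


if __name__ == "__main__":
    print(restoring_divide(1, 3, 8))  # 8 steps yield 8 quotient bits
    for err in newton_reciprocal_errors(Fraction(3, 2), Fraction(1, 2), 4):
        print(float(err))
```

In this sketch each restoring step adds exactly one quotient bit, while each Newton-Raphson step maps an error e to d·e², so the number of correct bits roughly doubles per step.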
Accelerating Multi-Media Processing by Implementing Memoing in Multiplication and Division Units
 International Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
Cited by 27 (5 self)
This paper proposes a technique that enables performing multi-cycle (multiplication, division, square-root...) computations in a single cycle. The technique is based on the notion of memoing: saving the input and output of previous calculations and using the output if the input is encountered again. This technique is especially suitable for Multi-Media (MM) processing. In MM applications the local entropy of the data tends to be low, which results in repeated operations on the same datum. The inputs and outputs of assembly level operations are stored in cache-like lookup tables and accessed in parallel to the conventional computation. A successful lookup gives the result of a multi-cycle computation in a single cycle, and a failed lookup doesn't necessitate a penalty in computation time. Results of simulations have shown that on the average, for a modestly sized memo-table, about 40% of the floating point multiplications and 50% of the floating point divisions, in Multi-Media applications, can be avoided by using the values within the memo-table, leading to an average computational speedup of more than 20%.
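The memoing idea described above can be sketched in software as a small direct-mapped table keyed by the operand pair. This is only an analogy under stated assumptions: the class name `MemoTable`, the table size, and the hash-based indexing are hypothetical and not taken from the paper, and software performs the lookup before the computation rather than in parallel with it as the hardware does.

```python
class MemoTable:
    """Direct-mapped table mapping an operand pair to a stored quotient."""

    def __init__(self, entries=32):
        self.entries = entries
        self.slots = [None] * entries  # each slot holds (a, b, result) or None
        self.hits = 0
        self.misses = 0

    def divide(self, a, b):
        index = hash((a, b)) % self.entries
        slot = self.slots[index]
        # Hit: the stored operands match, so reuse the stored result.
        if slot is not None and slot[0] == a and slot[1] == b:
            self.hits += 1
            return slot[2]
        # Miss: compute normally and overwrite whatever occupied the slot.
        self.misses += 1
        result = a / b
        self.slots[index] = (a, b, result)
        return result


if __name__ == "__main__":
    table = MemoTable(entries=8)
    pixels = [6.0, 6.0, 6.0, 9.0, 9.0, 6.0]  # low local entropy: many repeats
    scaled = [table.divide(p, 3.0) for p in pixels]
    print(scaled, table.hits, table.misses)
```

Repeated operand pairs, as in the low-entropy data the abstract mentions, are what make the hit count grow; a table with more entries reduces conflicts between distinct pairs that map to the same slot.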
Design Issues in Division and Other Floating-Point Operations
 IEEE Transactions on Computers
, 1997
Cited by 23 (7 self)
Floating-point division is generally regarded as a low frequency, high latency operation in typical floating-point applications. However, in the worst case, a high latency hardware floating-point divider can contribute an additional 0.50 CPI to a system executing SPECfp92 applications. This paper presents the system performance impact of floating-point division latency for varying instruction issue rates. It also examines the performance implications of shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units. Using a system level study as a basis, it is shown how typical floating-point applications can guide the designer in making implementation decisions and tradeoffs.
Verification of All Circuits in a Floating-Point Unit Using Word-Level Model Checking
 In Proceedings of the Formal Methods on Computer-Aided Design
, 1996
Cited by 23 (7 self)
This paper presents the formal verification of all subcircuits in a floating-point arithmetic unit (FPU) from an Intel microprocessor using a word-level model checker. This work represents the first large-scale application of word-level model checking techniques. The FPU can perform addition, subtraction, multiplication, square root, division, remainder, and rounding operations; verifying such a broad range of functionality required coupling the model checker with a number of other techniques, such as property decomposition, property-specific model extraction, and latch removal. We will illustrate our verification techniques using the Weitek WTL3170/3171 Sparc floating point coprocessor as an example. The principal contribution of this paper is a practical verification methodology explaining what techniques to apply (and where to apply them) when verifying floating-point arithmetic circuits. We have applied our methods to the floating-point unit of a state-of-the-art Intel microprocesso...
Bit-Level Analysis of an SRT Divider Circuit
 In Proceedings of the 33rd Design Automation Conference, pages 661-665, Las Vegas, NV
, 1995
Cited by 23 (0 self)
It is impractical to verify multiplier or divider circuits entirely at the bit-level using ordered Binary Decision Diagrams (BDDs), because the BDD representations for these functions grow exponentially with the word size. It is possible, however, to analyze individual stages of these circuits using BDDs. Such analysis can be helpful when implementing complex arithmetic algorithms. As a demonstration, we show that Intel could have used BDDs to detect erroneous lookup table entries in the Pentium(TM) floating point divider. Going beyond verification, we show that bit-level analysis can be used to generate a correct version of the table.
Anatomy of the Pentium Bug
 In TAPSOFT’95: Theory and Practice of Software Development
, 1995
Cited by 19 (0 self)
The Pentium computer chip’s division algorithm relies on a table from which five entries were inadvertently omitted, with the result that 1738 single precision dividend-divisor pairs yield relative errors whose most significant bit is uniformly distributed from the 14th to the 23rd (least significant) bit. This corresponds to a rate of one error every 40 billion random single precision divisions. The same general pattern appears at double precision, with an error rate of one in every 9 billion divisions or 75 minutes of division time. These rates assume randomly distributed data. The distribution of the faulty pairs themselves, however, is far from random, with the effect that if the data is so nonrandom as to be just the constant 1, then random calculations started from that constant produce a division error once every few minutes, and these errors will sometimes propagate many more steps. A much higher rate yet is obtained when dividing small (< 100) integers “bruised” by subtracting one millionth, where every 400 divisions will see a relative error of at least one in a million. The software engineering implications of the bug include the observations that the method of exercising reachable components cannot detect reachable components mistakenly believed unreachable, and that hand-checked proofs build false confidence.
Design Issues In High Performance Floating Point Arithmetic Units
, 1996
Cited by 17 (3 self)
In recent years computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to implementation of the floating point unit, or FPU. Special purpose applications, such as high performance graphics rendering systems, have placed further demands on processors. High speed floating point hardware is a requirement to meet these increasing demands. This work examines the state-of-the-art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. The design space of FP dividers is large, comprising five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table lookup, and variable latency. While division is an infrequent operation...
Modular Verification of SRT Division
, 1996
Cited by 16 (1 self)
We describe a formal specification and mechanized verification in PVS of the general theory of SRT division along with a specific hardware realization of the algorithm. The specification demonstrates how attributes of the PVS language (in particular, predicate subtypes) allow the general theory to be developed in a readable manner that is similar to textbook presentations, while the PVS table construct allows direct specification of the implementation's quotient lookup table. Verification of the derivations in the SRT theory and for the data path and lookup table of the implementation are highly automated and performed for arbitrary, but finite precision; in addition, the theory is verified for general radix, while the implementation is specialized to radix 4. The effectiveness of the automation stems from the tight integration in PVS of rewriting with decision procedures for equality, linear arithmetic over integers and rationals, and propositional logic. This example demonstrates t...
SRT Division Architectures and Implementations
 In Proc. 13th IEEE Symp. Computer Arithmetic
, 1997
Cited by 11 (2 self)
SRT dividers are common in modern floating point units. Higher division performance is achieved by retiring more quotient bits in each cycle. Previous research has shown that realistic stages are limited to radix-2 and radix-4. Higher radix dividers are therefore formed by a combination of low-radix stages. In this paper, we present an analysis of the effects of radix-2 and radix-4 SRT divider architectures and circuit families on divider area and performance. We show the performance and area results for a wide variety of divider architectures and implementations. We conclude that divider performance is only weakly sensitive to reasonable choices of architecture but significantly improved by aggressive circuit techniques.
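For readers unfamiliar with the radix-2 stage mentioned above, the following is a minimal sketch of the textbook radix-2 SRT recurrence with quotient digits drawn from {-1, 0, 1}. It is not the architecture evaluated in the paper. It assumes a normalized divisor with 1/2 <= d < 1 and an initial remainder with |x| <= d, and it compares the exact shifted remainder against ±1/2, whereas hardware inspects only a few leading bits of a redundant remainder. The function names are hypothetical.

```python
from fractions import Fraction


def srt_radix2(x, d, steps):
    """Radix-2 SRT recurrence: w <- 2*w - q*d with q in {-1, 0, 1}.

    Assumes 1/2 <= d < 1 and |x| <= d. Returns the list of signed
    quotient digits and the final partial remainder.
    """
    w = Fraction(x)
    d = Fraction(d)
    half = Fraction(1, 2)
    digits = []
    for _ in range(steps):
        shifted = 2 * w
        if shifted >= half:
            q = 1
        elif shifted < -half:
            q = -1
        else:
            q = 0            # no add or subtract needed in this step
        w = shifted - q * d
        digits.append(q)
    return digits, w


def digits_to_value(digits):
    """Convert signed digits q_1, q_2, ... to the sum of q_j * 2**-j."""
    return sum(Fraction(q, 2 ** (j + 1)) for j, q in enumerate(digits))


if __name__ == "__main__":
    x, d = Fraction(1, 4), Fraction(3, 4)
    digits, rem = srt_radix2(x, d, 10)
    print(digits, float(digits_to_value(digits)), float(x / d))
```

Each step retires one quotient digit, and the identity x = d·Q + w·2^(-steps) holds exactly, where Q is the value of the digit string. A radix-4 stage retires two quotient bits per step at the cost of a more complex digit selection.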