Results 1-10 of 15
An Analysis Of Division Algorithms And Implementations
IEEE Transactions on Computers, 1995
Cited by 53 (8 self)
Floating-point division is generally regarded as a low-frequency, high-latency operation in typical floating-point applications. However, the increasing emphasis on high-performance graphics and the industry-wide usage of performance benchmarks forces processor designers to pay close attention to all aspects of floating-point computation. Many algorithms are suitable for implementing division in hardware. This paper presents four major classes of algorithms in a unified framework, namely digit recurrence, functional iteration, very high radix, and variable latency. Digit recurrence algorithms, the most common of which is SRT, use subtraction as the fundamental operator, and they converge to a quotient linearly. Division by functional iteration converges to a quotient quadratically using multiplication. Very high radix division algorithms are similar to digit recurrence algorithms, but they incorporate multiplication to reduce the latency. Variable latency division algorithms reduce the...
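The contrast between linear and quadratic convergence described in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, the digit recurrence shown is plain radix-2 restoring division rather than SRT, and the Newton-Raphson seed is the common textbook linear approximation for a divisor normalized to [1/2, 1). Exact rationals are used so the error at each step is visible.

```python
from fractions import Fraction


def digit_recurrence_divide(n, d, bits):
    """Radix-2 restoring division: one quotient bit per step (linear convergence).

    Assumes 0 <= n < d, so the quotient is a fraction in [0, 1).
    """
    r = Fraction(n)
    q = Fraction(0)
    for i in range(1, bits + 1):
        r *= 2                       # shift the partial remainder left
        if r >= d:                   # trial subtraction succeeds
            r -= d
            q += Fraction(1, 2**i)   # this quotient bit is 1
    return q


def newton_raphson_divide(n, d, iterations):
    """Functional iteration: x <- x * (2 - d*x) converges to 1/d.

    Assumes d is normalized to [1/2, 1). The relative error is squared
    on every iteration (quadratic convergence).
    """
    x = Fraction(48, 17) - Fraction(32, 17) * d   # textbook linear seed (assumption)
    for _ in range(iterations):
        x = x * (2 - d * x)          # two multiplications per iteration
    return n * x


if __name__ == "__main__":
    n, d = Fraction(1, 2), Fraction(3, 4)
    for bits in (4, 8, 16):
        err = n / d - digit_recurrence_divide(n, d, bits)
        print("digit recurrence, steps:", bits, "error:", float(err))
    for its in (1, 2, 3):
        err = n / d - newton_raphson_divide(n, d, its)
        print("Newton-Raphson, iterations:", its, "error:", float(err))
```

In this sketch the digit recurrence needs one step per quotient bit, while the number of correct bits in the functional iteration roughly doubles per iteration, at the cost of multiplications instead of subtractions.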
Telescopic Units: A New Paradigm for Performance Optimization of VLSI Designs
IEEE Trans. Computer-Aided Design, 1998
Cited by 25 (1 self)
This paper introduces a novel optimization paradigm for increasing the throughput of digital systems. The basic idea consists of transforming fixed-latency units into variable-latency ones that run with a faster clock cycle. The transformation is fully automatic and can be used in conjunction with traditional design techniques to improve the overall performance of speed-critical units. In addition, we introduce procedures for reducing the area overhead of the modified units, and we formulate an algorithm for automatically restructuring the controllers of the data paths in which variable-latency units have been introduced. Results, obtained on a large set of benchmark circuits, show an average throughput improvement exceeding 27%, at the price of a modest area increase (less than 8% on average).
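A back-of-the-envelope model may help show why a variable-latency unit with a faster clock can raise throughput. The sketch below is an assumption-laden illustration, not the paper's method: it supposes that a unit either finishes in one fast cycle or needs exactly one extra cycle, and the function name and the numbers in the usage example are hypothetical.

```python
def variable_latency_speedup(t_fixed, t_fast, p_slow):
    """Throughput ratio of a variable-latency unit over a fixed-latency one.

    t_fixed: clock period of the fixed-latency unit (set by worst-case delay)
    t_fast:  shorter clock period of the variable-latency unit
    p_slow:  fraction of operations that need one extra cycle (assumed model)
    """
    if not (0.0 <= p_slow <= 1.0):
        raise ValueError("p_slow must be a probability")
    average_time = t_fast * (1.0 + p_slow)   # expected time per operation
    return t_fixed / average_time


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    print(variable_latency_speedup(10.0, 7.0, 0.1))
```

Under this model the faster clock pays off only while the fraction of two-cycle operations stays small: with `p_slow = 1.0` the unit is slower than the fixed-latency design whenever `t_fast` exceeds half of `t_fixed`.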
Design Issues In High Performance Floating Point Arithmetic Units
1996
Cited by 17 (3 self)
In recent years computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to implementation of the floating point unit, or FPU. Special purpose applications, such as high performance graphics rendering systems, have placed further demands on processors. High speed floating point hardware is a requirement to meet these increasing demands. This work examines the state of the art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. The design space of FP dividers is large, comprising five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table lookup, and variable latency. While division is an infrequent operation...
Modular Verification of SRT Division
1996
Cited by 16 (1 self)
We describe a formal specification and mechanized verification in PVS of the general theory of SRT division along with a specific hardware realization of the algorithm. The specification demonstrates how attributes of the PVS language (in particular, predicate subtypes) allow the general theory to be developed in a readable manner that is similar to textbook presentations, while the PVS table construct allows direct specification of the implementation's quotient lookup table. Verification of the derivations in the SRT theory and for the data path and lookup table of the implementation are highly automated and performed for arbitrary, but finite precision; in addition, the theory is verified for general radix, while the implementation is specialized to radix 4. The effectiveness of the automation stems from the tight integration in PVS of rewriting with decision procedures for equality, linear arithmetic over integers and rationals, and propositional logic. This example demonstrates t...
SRT Division Architectures and Implementations
In Proc. 13th IEEE Symp. Computer Arithmetic, 1997
Cited by 11 (2 self)
SRT dividers are common in modern floating point units. Higher division performance is achieved by retiring more quotient bits in each cycle. Previous research has shown that realistic stages are limited to radix-2 and radix-4. Higher radix dividers are therefore formed by a combination of low-radix stages. In this paper, we present an analysis of the effects of radix-2 and radix-4 SRT divider architectures and circuit families on divider area and performance. We show the performance and area results for a wide variety of divider architectures and implementations. We conclude that divider performance is only weakly sensitive to reasonable choices of architecture but significantly improved by aggressive circuit techniques.
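For readers unfamiliar with what a single radix-2 SRT stage computes, the following is a minimal software sketch of the textbook recurrence with the redundant digit set {-1, 0, 1}. It is not drawn from the paper: the function name, the selection thresholds of plus and minus 1/2, and the operand ranges are standard textbook assumptions, and a hardware stage would inspect only a few leading bits of a carry-save remainder rather than compare exact values.

```python
from fractions import Fraction


def srt_radix2(n, d, steps):
    """Radix-2 SRT division sketch retiring one signed quotient digit per step.

    Assumes 1/2 <= d < 1 and |n| <= d (textbook normalization).
    Returns (q, w) such that n == q*d + w / 2**steps exactly.
    """
    if not (Fraction(1, 2) <= d < 1 and abs(n) <= d):
        raise ValueError("operands outside the assumed range")
    half = Fraction(1, 2)
    w = Fraction(n)                  # partial remainder
    q = Fraction(0)
    for j in range(1, steps + 1):
        shifted = 2 * w
        # Quotient digit selection: the overlap between the regions is what
        # lets hardware decide from a short, imprecise remainder estimate.
        if shifted >= half:
            digit = 1
        elif shifted < -half:
            digit = -1
        else:
            digit = 0
        w = shifted - digit * d      # remainder stays within [-d, d]
        q += Fraction(digit, 2**j)
    return q, w


if __name__ == "__main__":
    n, d = Fraction(1, 2), Fraction(3, 4)
    q, w = srt_radix2(n, d, 20)
    print(float(q), float(n / d - q))
```

A radix-4 stage follows the same recurrence with a multiplier of 4 and a larger digit set, which is why it retires two quotient bits per step but needs a more complex selection table.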
Automatic synthesis of large telescopic units based on near-minimum timed supersetting
IEEE Trans. on Comp., 1999
Cited by 10 (0 self)
In high-performance systems, variable-latency units are often employed to improve the average throughput when the worst-case delay exceeds the cycle time. Traditionally, units of this type have been hand-designed. In this paper, we propose a technique for the automatic synthesis of variable-latency units that is applicable to large datapath modules. We define and study an optimization problem, timed supersetting, whose solution is at the kernel of the procedure for automatic generation of variable-latency units. We contribute a new algorithm for solving timed supersetting in the most difficult case, that is, when the timing behavior of the circuit is expressed through an accurate delay model. The proposed solution overcomes the computational limitations of previous approaches and its robustness is experimentally demonstrated by obtaining high-throughput, variable-latency implementations for all the largest circuits in the ISCAS '85 and ISCAS '89 benchmark suites, as well as for some realistic, high-performance arithmetic units. Index Terms: Logic synthesis, timing analysis, throughput optimization.
A Variable Latency Pipelined Floating-Point Adder
In Proc. Euro-Par'96 Parallel Processing, volume LNCS 1124, 1996
Cited by 7 (2 self)
Addition is the most frequent floating-point operation in modern microprocessors. Due to its complex shift-add-shift-round dataflow, floating-point addition can have a long latency. To achieve maximum system performance, it is necessary to design the floating-point adder to have minimum latency, while still providing maximum throughput. This paper proposes a new floating-point addition algorithm which exploits the ability of dynamically-scheduled processors to utilize functional units which complete in variable time. By recognizing that certain operand combinations do not require all of the steps in the complex addition dataflow, the average latency is reduced. Simulation on SPECfp92 applications demonstrates that a speedup in average addition latency of 1.33 can be achieved using this algorithm while maintaining single cycle throughput. 1 Introduction. Floating-point (FP) addition and subtraction are very frequent floating-point operations. Together, they account for ov...
Fast IEEE Rounding for Division by Functional Iteration
1996
Cited by 3 (0 self)
A class of high performance division algorithms is functional iteration. Division by functional iteration uses multiplication as the fundamental operator. The main advantage of division by functional iteration is quadratic convergence to the quotient. However, unlike nonrestoring division algorithms such as SRT division, functional iteration does not directly provide a final remainder. This makes fast and exact rounding difficult. This paper clarifies the methodology for correct IEEE compliant rounding for quadratically-converging division algorithms. It proposes an extension to previously reported techniques of using extended precision in the computation to reduce the frequency of back multiplications required to obtain the final remainder. Further, a technique applicable to all IEEE rounding modes is presented which replaces the final subtraction for remainder computation with very simple combinational logic. Key Words and Phrases: Division, Goldschmidt's algorithm, IEEE rounding, Newton-Raphson, variable latency
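The role of the final remainder in rounding can be sketched as follows. This is only an illustration of the general back-multiplication idea the abstract refers to, not the paper's technique for avoiding it: the function name is hypothetical, only round-to-nearest-even is shown, the result is rounded to `p` fractional bits rather than to an IEEE format, and the approximate quotient is assumed to be within 2^-(p+1) of the true one.

```python
import math
from fractions import Fraction


def round_quotient_nearest(n, d, q_approx, p):
    """Round n/d to p fractional bits using the sign of a back-multiplied remainder.

    Assumes d > 0 and |q_approx - n/d| < 2**-(p+1); q_approx stands in for
    the output of a functional-iteration divider, which supplies no remainder.
    """
    scale = 2**p
    k = math.floor(q_approx * scale)          # candidate, truncated to p bits
    midpoint = Fraction(2 * k + 1, 2 * scale)  # halfway to the next p-bit value
    remainder = n - midpoint * d              # back multiplication and subtraction
    if remainder > 0:                         # true quotient lies above the midpoint
        k += 1
    elif remainder == 0 and k % 2 == 1:       # exact tie: round to even
        k += 1
    return Fraction(k, scale)


if __name__ == "__main__":
    n, d = Fraction(1, 2), Fraction(3, 4)
    q_approx = n / d + Fraction(1, 100)       # deliberately inexact quotient
    print(round_quotient_nearest(n, d, q_approx, 4))
```

Only the sign (and zero test) of the remainder is used, which is why carrying extra precision in the iteration can often make the back multiplication unnecessary: when the approximate quotient is far enough from a midpoint, the rounding direction is already determined.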
High-Radix Floating-Point Division Algorithms for Embedded VLIW Integer Processors
In Proc. 17th World Congress on Scientific Computation, Applied Mathematics and Simulation, IMACS, 2005
Cited by 3 (0 self)
This work presents floating-point division algorithms and implementations for embedded VLIW integer processors. On those processors, there is no hardware floating-point unit, for cost reasons. But, for portability and/or accuracy reasons, a software floating-point emulation layer is sometimes useful. In this paper, we focus on high-radix digit-recurrence algorithms for floating-point division on integer VLIW processors. Our algorithms are targeted for the ST200 processor from STMicroelectronics. Index Terms: computer arithmetic, floating-point arithmetic, division, digit-recurrence algorithm, SRT algorithm, high-radix algorithm, integer processor, embedded processor, VLIW processor.
A new binary floating-point division algorithm and its software implementation on the ST321 processor
2009
Cited by 3 (1 self)
This paper deals with the design and implementation of low latency software for binary floating-point division with correct rounding to nearest. The approach we present here targets a VLIW integer processor of the ST200 family, and is based on fast and accurate programs for evaluating some particular bivariate polynomials. We start by giving approximation and evaluation error conditions that are sufficient to ensure correct rounding. Then we describe the heuristics used to generate such evaluation programs, as well as those used to automatically validate their accuracy. Finally, we propose, for the binary32 format, a complete C implementation of the resulting division algorithm. With the ST200 compiler and compared to previous implementations, the speedup observed with our approach is by a factor of almost 1.8.