An Analysis Of Division Algorithms And Implementations
 IEEE Transactions on Computers
, 1995
"... Floatingpoint division is generally regarded as a low frequency, high latency operation in typical floatingpoint applications. However, the increasing emphasis on high performance graphics and the industrywide usage of performance benchmarks forces processor designers to pay close attention to al ..."
Abstract

Cited by 53 (8 self)
 Add to MetaCart
Floating-point division is generally regarded as a low-frequency, high-latency operation in typical floating-point applications. However, the increasing emphasis on high-performance graphics and the industry-wide usage of performance benchmarks force processor designers to pay close attention to all aspects of floating-point computation. Many algorithms are suitable for implementing division in hardware. This paper presents four major classes of algorithms in a unified framework, namely digit recurrence, functional iteration, very high radix, and variable latency. Digit recurrence algorithms, the most common of which is SRT, use subtraction as the fundamental operator, and they converge to a quotient linearly. Division by functional iteration converges to a quotient quadratically using multiplication. Very high radix division algorithms are similar to digit recurrence algorithms, but they incorporate multiplication to reduce the latency. Variable latency division algorithms reduce the...
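The quadratic convergence of functional iteration is easy to demonstrate. Below is a minimal Python sketch (illustrative only, not an implementation from the paper): the Newton-Raphson recurrence x ← x(2 − d·x) refines a reciprocal estimate, roughly doubling the number of correct bits per step.

```python
import math

def nr_divide(a, b, iterations=4):
    """Divide a by b (b > 0) via Newton-Raphson iteration on 1/b.

    Each step x = x * (2 - m * x) squares the relative error, so the
    number of correct bits roughly doubles per iteration (quadratic
    convergence), using only multiplies and subtracts after the seed.
    """
    m, e = math.frexp(b)                 # b = m * 2**e with m in [0.5, 1)
    x = 48.0 / 17.0 - 32.0 / 17.0 * m    # linear seed, relative error <= 1/17
    for _ in range(iterations):
        x = x * (2.0 - m * x)            # Newton-Raphson step
    return a * x * 2.0 ** -e             # 1/b = (1/m) * 2**-e
```

With the 1/17 seed error, four iterations reach roughly double-precision accuracy; a digit-recurrence divider would instead need one subtraction per quotient bit.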
The Symmetric Table Addition Method for Accurate Function Approximation
 Journal of VLSI Signal Processing
, 1999
"... . This paper presents a highspeed method for computing elementary functions using parallel table lookups and multioperand addition. Increasing the number of tables and inputs to the multioperand adder significantly reduces the amount of memory required. Symmetry and leading zeros in the table co ..."
Abstract

Cited by 33 (2 self)
 Add to MetaCart
This paper presents a high-speed method for computing elementary functions using parallel table lookups and multi-operand addition. Increasing the number of tables and inputs to the multi-operand adder significantly reduces the amount of memory required. Symmetry and leading zeros in the table coefficients are used to reduce the amount of memory even further. This method has a closed-form solution for the table entries and can be applied to any differentiable function. For 24-bit operands, this method requires two to three orders of magnitude less memory than conventional table lookups.

Keywords: Elementary functions, table lookups, approximations, multi-operand addition, computer arithmetic, hardware design.

1. Introduction
Elementary function approximations are important in scientific computing, computer graphics, and digital signal processing applications. In the systolic array implementation of Cholesky decomposition, presented in [1], 30% of the cells approximate reciprocals...
Symmetric Bipartite Tables for Accurate Function Approximation
 Proceedings of the 13th IEEE Symposium on Computer Arithmetic. IEEE Computer
, 1997
"... This paper presents a methodology for designing bipartite tables for accurate function approximation. Bipartite tables use two parallel table lookups to obtain a carrysave (borrowsave) function approximation. A carry propagate adder can then convert this approximation to a two's complement number ..."
Abstract

Cited by 24 (2 self)
 Add to MetaCart
This paper presents a methodology for designing bipartite tables for accurate function approximation. Bipartite tables use two parallel table lookups to obtain a carry-save (borrow-save) function approximation. A carry-propagate adder can then convert this approximation to a two's complement number, or the approximation can be directly Booth encoded. Our method for designing bipartite tables, called the Symmetric Bipartite Table Method, utilizes symmetry in the table entries to reduce the overall memory requirements. It has several advantages over previous bipartite table methods in that it (1) provides a closed-form solution for the table entries, (2) has tight bounds on the maximum absolute error, (3) requires smaller table lookups to achieve a given accuracy, and (4) can be applied to a wide range of functions. Compared to conventional table lookups, the symmetric bipartite tables presented in this paper are 15.0 to 41.7 times smaller when the operand size is 16 bits and 99.1 to 273...
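The bipartite idea can be sketched numerically. The Python fragment below is an illustration under simplifying assumptions, not the paper's method or its error bounds: a 12-bit operand for f(x) = 1/x on [1, 2) is split into three 4-bit fields, one table holds the function at segment midpoints, the other a first-order correction, and their sum achieves accuracy comparable to a full 4096-entry table using only 512 entries.

```python
# Example split chosen for illustration; the paper derives its own
# field widths, table contents, and error bounds.
N0 = N1 = N2 = 4
s0, s1, s2 = 2.0 ** -4, 2.0 ** -8, 2.0 ** -12
mid2 = (2 ** N2 - 1) / 2.0

f = lambda x: 1.0 / x
fp = lambda x: -1.0 / (x * x)        # derivative of f

# Table a0: f evaluated with the low field fixed at its midpoint.
a0 = [[f(1.0 + i0 * s0 + i1 * s1 + mid2 * s2)
       for i1 in range(2 ** N1)] for i0 in range(2 ** N0)]

# Table a1: linear correction using the slope at the centre of the
# high-field segment, so one row covers every middle-field value.
# Entries are antisymmetric in i2 -- the symmetry the method exploits
# to halve this table in hardware.
a1 = [[fp(1.0 + (i0 + 0.5) * s0) * (i2 - mid2) * s2
       for i2 in range(2 ** N2)] for i0 in range(2 ** N0)]

def bipartite(i):
    """Two parallel table lookups plus one addition approximate f."""
    i0, i1, i2 = i >> 8, (i >> 4) & 0xF, i & 0xF
    return a0[i0][i1] + a1[i0][i2]

max_err = max(abs(bipartite(i) - f(1.0 + i * s2)) for i in range(2 ** 12))
```

Here the two tables hold 256 entries each, an eightfold reduction over the 4096-entry direct lookup, while the worst-case error stays near the quantization limit of the 12-bit input.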
Design Issues in Division and Other Floating-Point Operations
 IEEE Transactions on Computers
, 1997
"... Floatingpoint division is generally regarded as a low frequency, high latency operation in typical floatingpoint applications. However, in the worst case, a high latency hardware floatingpoint divider can contribute an additional 0.50 CPI to a system executing SPECfp92 applications. This paper ..."
Abstract

Cited by 23 (7 self)
 Add to MetaCart
Floating-point division is generally regarded as a low-frequency, high-latency operation in typical floating-point applications. However, in the worst case, a high-latency hardware floating-point divider can contribute an additional 0.50 CPI to a system executing SPECfp92 applications. This paper presents the system performance impact of floating-point division latency for varying instruction issue rates. It also examines the performance implications of shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units. Using a system-level study as a basis, it is shown how typical floating-point applications can guide the designer in making implementation decisions and tradeoffs.
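As a back-of-the-envelope companion to such a study, division's CPI contribution can be modeled as division frequency times unhidden stall cycles. The numbers below are illustrative only, not measurements from the paper.

```python
def excess_cpi(div_per_instr, div_latency, hidden=0.0):
    """First-order model of the CPI added by division stalls.

    div_per_instr -- divisions per dynamic instruction
    div_latency   -- divider latency in cycles
    hidden        -- average stall cycles overlapped with other work
    """
    return div_per_instr * max(div_latency - hidden, 0.0)

# Illustrative figures: one division per 200 instructions on a
# 60-cycle divider with no overlap adds 0.30 CPI; halving the
# latency, or hiding half the stall, halves that penalty.
penalty = excess_cpi(1.0 / 200.0, 60.0)
```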
Design Issues In High Performance Floating Point Arithmetic Units
, 1996
"... In recent years computer applications have increased in their computational complexity. The industrywide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to implementation of the floating point unit, or FPU. Special purpose applications, suc ..."
Abstract

Cited by 17 (3 self)
 Add to MetaCart
In recent years, computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay particular attention to the implementation of the floating-point unit, or FPU. Special-purpose applications, such as high-performance graphics rendering systems, have placed further demands on processors. High-speed floating-point hardware is a requirement to meet these increasing demands. This work examines the state of the art in FPU design and proposes techniques for improving the performance and the performance/area ratio of future FPUs. In recent FPUs, emphasis has been placed on designing ever-faster adders and multipliers, with division receiving less attention. The design space of FP dividers is large, comprising five different classes of division algorithms: digit recurrence, functional iteration, very high radix, table lookup, and variable latency. While division is an infrequent operation...
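Of the classes named above, digit recurrence is the simplest to sketch. The fragment below is a textbook radix-2 restoring divider, not code from this work: it retires one quotient bit per subtraction, which is exactly the linear convergence that motivates the faster alternatives.

```python
def restoring_divide(n, d, bits):
    """Radix-2 restoring division for integers 0 <= n < d.

    Produces the first `bits` fraction bits of n/d, one bit per
    iteration: shift the partial remainder, try a subtraction, and
    keep the result only if it stays non-negative.
    """
    q, r = 0, n
    for _ in range(bits):
        r <<= 1                 # shift partial remainder
        q <<= 1
        if r >= d:              # trial subtraction succeeds
            r -= d
            q |= 1              # quotient digit is 1
    return q                    # q / 2**bits approximates n / d
```

SRT dividers refine this loop with redundant quotient digits so each digit can be selected from a few leading remainder bits, but the cost per quotient bit remains one iteration.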
On Division And Reciprocal Caches
, 1995
"... Floatingpoint division is generally regarded as a high latency operation in typical floatingpoint applications. Many techniques exist for increasing division performance, often at the cost of increasing either chip area, cycle time, or both. This paper presents two methods for decreasing the laten ..."
Abstract

Cited by 10 (3 self)
 Add to MetaCart
Floating-point division is generally regarded as a high-latency operation in typical floating-point applications. Many techniques exist for increasing division performance, often at the cost of increasing chip area, cycle time, or both. This paper presents two methods for decreasing the latency of division. Using applications from the SPECfp92 and NAS benchmark suites, these methods are evaluated to determine their effects on overall system performance. The notion of recurring computation is presented, and it is shown how recurring division can be exploited using an additional, dedicated division cache. Additionally, for multiplication-based division algorithms, reciprocal caches can be utilized to store recurring reciprocals. Due to the similarity between the algorithms typically used to compute division and square root, the performance of square root caches is also investigated. Results show that reciprocal caches can achieve nearly a 2X reduction in effective division latency...
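The reciprocal-cache idea is easy to model in software. The sketch below is a toy direct-mapped cache with invented parameters, not the paper's design: it memoizes 1/b keyed by the divisor, so a hit replaces a full-latency divide with a multiply.

```python
class ReciprocalCache:
    """Toy direct-mapped reciprocal cache (sizes are illustrative)."""

    def __init__(self, lines=64):
        self.lines = lines
        self.table = {}              # line index -> (divisor tag, 1/divisor)
        self.hits = self.misses = 0

    def divide(self, a, b):
        idx = hash(b) % self.lines   # direct-mapped line selection
        tag, recip = self.table.get(idx, (None, None))
        if tag == b:
            self.hits += 1           # reuse the cached reciprocal
        else:
            self.misses += 1         # pay the full division latency once
            recip = 1.0 / b
            self.table[idx] = (b, recip)
        return a * recip             # division becomes a multiplication
```

Note that a·(1/b) can differ from a correctly rounded a/b in the last bit, so a real implementation must account for rounding when it substitutes a cached reciprocal.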
Design Issues in Floating-Point Division
, 1994
"... Floatingpoint division is generally regarded as a low frequency, high latency operation in typical floatingpoint applications. However, the increasing emphasis on high performance graphics and the industrywide usage of performance benchmarks, such as SPECmarks, forces processor designers to pay c ..."
Abstract

Cited by 9 (3 self)
 Add to MetaCart
Floating-point division is generally regarded as a low-frequency, high-latency operation in typical floating-point applications. However, the increasing emphasis on high-performance graphics and the industry-wide usage of performance benchmarks, such as SPECmarks, force processor designers to pay close attention to all aspects of floating-point computation. This paper presents the algorithms often utilized for floating-point division, and it also presents implementation alternatives available to designers. Using a system-level study as a basis, it is shown how typical floating-point applications can guide the designer in making implementation decisions and tradeoffs.
Reducing Division Latency with Reciprocal Caches
 Reliable Computing
, 1996
"... Introduction Floatingpoint division has received increasing attention in recent years. Division has a higher latency, or time required to perform a computation, than addition or multiplication. While division is an infrequent operation even in floatingpoint intensive applications, its high latenc ..."
Abstract

Cited by 8 (1 self)
 Add to MetaCart
Introduction
Floating-point division has received increasing attention in recent years. Division has a higher latency, or time required to perform a computation, than addition or multiplication. While division is an infrequent operation even in floating-point intensive applications, its high latency can result in significant system performance degradation [4]. Many methods for implementing high-performance division have appeared in the literature. However, any proposed division performance enhancement should be analyzed in terms of its possible silicon area and cycle time effects. Richardson [6] discusses the technique of result caching as a means of decreasing the latency of otherwise high-latency operations, such as division. Result caching is based on recurring or redundant computations that can be found in applications. Often, one or both of the input operands for a calculation are the same as those in a previous calculation. In matrix inversion, for example, each entry mu
The Setup for Triangle Rasterization
, 1996
"... Integrating the slope and setup calculations for triangles to the rasterizer offloads the host processor from intensive calculations and can significantly increase 3D system performance. The processing on the host is greatly reduced and much less data is passed from the host to the graphics subsyste ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
Integrating the slope and setup calculations for triangles into the rasterizer offloads the host processor from intensive calculations and can significantly increase 3D system performance. The processing on the host is greatly reduced and much less data is passed from the host to the graphics subsystem. A setup architecture handling generalized triangle meshes and computing all necessary parameters for a high-end raster pipeline to generate Gouraud-shaded, texture- and bump-mapped triangles is described, and its benefits on the final bandwidth are shown. To efficiently compute the slopes and color gradients for each triangle, some implementation aspects of division and multiplication pipelines are discussed.

The Setup for Triangle Rasterization
Anders Kugler
University of Tübingen, Computer Graphics Laboratory
Universität Tübingen, Wilhelm-Schickard-Institut für Informatik, Graphisch-Interaktive Systeme, Auf der Morgenstelle 10, D-72076 Tübingen, Germany
email: kugler@gris.unit...
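The per-triangle work the setup stage performs can be sketched as follows. This fragment (illustrative Python, not the paper's architecture) computes the screen-space gradients of one interpolated parameter from the plane through three vertices; a single reciprocal of the doubled signed area serves every parameter of the triangle, which is why the division pipeline matters here.

```python
def triangle_gradients(v0, v1, v2):
    """Gradients (dc/dx, dc/dy) of a parameter c over a triangle.

    Each vertex is (x, y, c).  Solving the plane through the three
    vertices needs one reciprocal of the doubled signed area; the same
    reciprocal is reused for colours, texture coordinates, depth, etc.
    """
    (x0, y0, c0), (x1, y1, c1), (x2, y2, c2) = v0, v1, v2
    area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    inv = 1.0 / area2                     # the one divide per triangle
    dcdx = ((c1 - c0) * (y2 - y0) - (c2 - c0) * (y1 - y0)) * inv
    dcdy = ((c2 - c0) * (x1 - x0) - (c1 - c0) * (x2 - x0)) * inv
    return dcdx, dcdy
```

For the plane c = x + 2y over the unit right triangle, the routine returns the gradients (1, 2), and the same `inv` would be multiplied into each additional per-vertex parameter.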
Parallel Square and Cube Computations
 In IEEE 34th Asilomar Conference on Signals, Systems and Computers
, 2000
"... Typically multipliers are used to compute the square and cube of an operand. A squaring unit can be used to compute the square of an operand faster and more efficiently than a multiplier. This paper proposes a parallel cubing unit that computes the cube of an operand 25 to 30% faster than can be com ..."
Abstract

Cited by 6 (1 self)
 Add to MetaCart
Typically, multipliers are used to compute the square and cube of an operand. A squaring unit can be used to compute the square of an operand faster and more efficiently than a multiplier. This paper proposes a parallel cubing unit that computes the cube of an operand 25 to 30% faster than can be computed using multipliers. Furthermore, the reduced squaring and cubing units are mathematically modeled, and the performance and area requirements are studied for operands up to 54 bits in length. The applicability of the proposed cubing circuit is discussed in relation to current Newton-Raphson and Taylor series function evaluation units.

1. Introduction
Iterative techniques such as Newton-Raphson iteration and Taylor series expansion can be used to compute the reciprocal, square root, inverse square root, and other elementary functions. Using higher-order function approximations decreases the number of iterations required to achieve a desired precision. Using fast and efficient parallel squ...
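One way to see why cheap squares matter: with the error term e = 1 − d·x, a third-order reciprocal update x ← x(1 + e + e²) triples the correct bits per step instead of doubling them, and the e² factor is exactly what a fast squaring unit provides. The sketch below (illustrative Python with a simple linear seed, not the paper's hardware) counts iterations for both orders.

```python
def reciprocal(d, order=2, tol=1e-12):
    """Reciprocal of d in [1, 2) by iterative refinement.

    order=2: x <- x * (1 + e), the usual quadratic Newton step.
    order=3: x <- x * (1 + e + e*e), a cubic step whose extra e*e
             term is what a dedicated squaring unit supplies cheaply.
    """
    assert 1.0 <= d < 2.0
    x = 24.0 / 17.0 - 8.0 / 17.0 * d     # linear seed, |e| <= 1/17
    steps = 0
    e = 1.0 - d * x
    while abs(e) > tol:
        x *= (1.0 + e) if order == 2 else (1.0 + e + e * e)
        steps += 1
        e = 1.0 - d * x                  # new e is e**2 (or e**3)
    return x, steps
```

Starting from the same seed, the cubic iteration reaches the tolerance in fewer steps than the quadratic one, which is the latency saving higher-order evaluation units aim for.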