Results 1 - 9 of 9
Modular Verification of SRT Division
, 1996
Cited by 12 (1 self)
We describe a formal specification and verification in PVS for the general theory of SRT division, and for the hardware design of a specific implementation. The specification demonstrates how attributes of the PVS language (in particular, predicate subtypes) allow the general theory to be developed in a readable manner that is similar to textbook presentations, while the PVS table construct allows direct specification of the implementation's quotient lookup table. Verification of the derivations in the SRT theory and of the data path and lookup table of the implementation is highly automated and performed for arbitrary but finite precision; in addition, the theory is verified for general radix, while the implementation is specialized to radix 4. The effectiveness of the automation derives from PVS's tight integration of rewriting with decision procedures for equality, linear arithmetic over integers and rationals, and propositional logic. This example demonstrates t...
On Division And Reciprocal Caches
, 1995
Cited by 11 (3 self)
Floating-point division is generally regarded as a high-latency operation in typical floating-point applications. Many techniques exist for increasing division performance, often at the cost of increasing either chip area, cycle time, or both. This paper presents two methods for decreasing the latency of division. Using applications from the SPECfp92 and NAS benchmark suites, these methods are evaluated to determine their effects on overall system performance. The notion of recurring computation is presented, and it is shown how recurring division can be exploited using an additional, dedicated division cache. Additionally, for multiplication-based division algorithms, reciprocal caches can be utilized to store recurring reciprocals. Due to the similarity between the algorithms typically used to compute division and square root, the performance of square root caches is also investigated. Results show that reciprocal caches can achieve nearly a 2X reduction in effective division latency...
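To make the reciprocal-cache idea concrete, the following is a minimal software sketch, not the hardware design evaluated in the paper. The class name `ReciprocalCache`, the capacity of two entries, and the least-recently-used replacement policy are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import OrderedDict


class ReciprocalCache:
    """Hypothetical model of a reciprocal cache (LRU replacement is an assumption)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # divisor -> stored reciprocal
        self.hits = 0
        self.misses = 0

    def reciprocal(self, divisor):
        if divisor in self.entries:
            # Recurring divisor: reuse the stored reciprocal.
            self.hits += 1
            self.entries.move_to_end(divisor)
            return self.entries[divisor]
        # Miss: compute the reciprocal (the slow path) and store it.
        self.misses += 1
        value = 1.0 / divisor
        self.entries[divisor] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value

    def divide(self, dividend, divisor):
        # Multiplication-based division: dividend * (1 / divisor).
        return dividend * self.reciprocal(divisor)


cache = ReciprocalCache(capacity=2)
quotients = [cache.divide(x, 4.0) for x in (1.0, 2.0, 3.0)]
print(quotients, cache.hits, cache.misses)
```

In this sketch only the first division by 4.0 pays for a reciprocal computation; the next two reuse the cached value. Note that multiplying by a rounded reciprocal is not guaranteed to match a correctly rounded quotient in the last bit, so a real design would need a final correction step.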
The Setup for Triangle Rasterization
, 1996
Cited by 9 (0 self)
Integrating the slope and setup calculations for triangles into the rasterizer offloads intensive calculations from the host processor and can significantly increase 3D system performance. The processing on the host is greatly reduced and much less data is passed from the host to the graphics subsystem. A setup architecture handling generalized triangle meshes and computing all necessary parameters for a high-end raster pipeline to generate Gouraud shaded, texture and bump-mapped triangles is described and its benefits on the final bandwidth are shown. To efficiently compute the slopes and color gradients for each triangle, some implementation aspects on division and multiplication pipelines are discussed.
Anders Kugler, University of Tübingen, Computer Graphics Laboratory (Universität Tübingen, Wilhelm-Schickard-Institut für Informatik, Graphisch-Interaktive Systeme, Auf der Morgenstelle 10, D-72076 Tübingen, Germany; email: kugler@gris.unit...)
Reducing Division Latency with Reciprocal Caches
 Reliable Computing
, 1996
Cited by 8 (1 self)
Introduction. Floating-point division has received increasing attention in recent years. Division has a higher latency, or time required to perform a computation, than addition or multiplication. While division is an infrequent operation even in floating-point intensive applications, its high latency can result in significant system performance degradation [4]. Many methods for implementing high performance division have appeared in the literature. However, any proposed division performance enhancement should be analyzed in terms of its possible silicon area and cycle time effects. Richardson [6] discusses the technique of result caching as a means of decreasing the latency of otherwise high-latency operations, such as division. Result caching is based on recurring or redundant computations that can be found in applications. Often, one or both of the input operands for a calculation are the same as those in a previous calculation. In matrix inversion, for example, each entry mu
Analysis of the impact of different methods for division/square root computation in the performance of a superscalar microprocessor
 Journal of Systems Architecture
, 2003
Architecture Evaluator's Work Bench and its Application to Microprocessor Floating Point Units
, 1995
Cited by 2 (2 self)
This paper introduces Architecture Evaluator's Workbench (AEWB), a high-level design space exploration methodology, and its application to floating point units (FPUs). In applying AEWB to FPUs, a metric for optimizing and comparing floating point unit implementations is developed. The metric, FUPA, incorporates four aspects of AEWB: latency, cost, technology, and profiles of target applications. FUPA models latency in terms of delay, cost in terms of area, and profile in terms of the percentage of different floating point operations. We utilize submicron device models, interconnect models, and actual microprocessor scaling data to develop models used to normalize both latency and area, enabling technology-independent comparison of implementations. This report also surveys most state-of-the-art microprocessors and compares them using FUPA. Finally, we correlate the FUPA results to reported SPECfp92 results, and demonstrate the effect of circuit density on FUPA implementations. ...
An Area/Performance Comparison of Subtractive and Multiplicative Divide/Square Root Implementations
 Proc. 12th IEEE Symp. Computer Arithmetic
, 1995
The implementations of division and square root in the FPUs of current microprocessors are based on one of two categories of algorithms. Multiplicative techniques, exemplified by the Newton-Raphson method and Goldschmidt's algorithm, share functionality with the floating-point multiplier. Subtractive methods, such as the many variations of radix-4 SRT, generally use dedicated, parallel hardware. These different approaches give rise to the distinct area and performance characteristics which are explored in this paper. Area comparisons are derived from measurements of commercial and academic hardware implementations. Representative divide/square root implementations are paired with typical add-multiply structures and simulated, using data from current microprocessor and arithmetic coprocessor designs, to obtain performance estimates. The results suggest that subtractive implementations offer a superior balance of area and performance, and stand to benefit most decisively from improvemen...
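As a rough illustration of the multiplicative category mentioned above, the sketch below applies the Newton-Raphson reciprocal iteration x_{k+1} = x_k (2 - d x_k), which uses only multiplies and subtracts. It is not taken from the paper: the function names, the linear initial estimate 48/17 - (32/17) d, the restriction of the divisor to [0.5, 1), and the default of five iterations are all assumptions chosen for illustration.

```python
def newton_raphson_reciprocal(d, iterations=5):
    """Approximate 1/d for a divisor assumed to be pre-scaled into [0.5, 1)."""
    if not 0.5 <= d < 1.0:
        raise ValueError("divisor must be scaled into [0.5, 1)")
    # Assumed initial estimate: a linear approximation of 1/d on [0.5, 1).
    x = 48.0 / 17.0 - (32.0 / 17.0) * d
    for _ in range(iterations):
        # Each step roughly squares the relative error (quadratic convergence).
        x = x * (2.0 - d * x)
    return x


def nr_divide(n, d, iterations=5):
    """Divide by multiplying the dividend with the iterated reciprocal."""
    return n * newton_raphson_reciprocal(d, iterations)


approx = nr_divide(1.0, 0.75)
print(approx)
```

Because every step is a multiply or a subtract, hardware built this way can reuse the floating-point multiplier, which is the sharing the abstract refers to. A subtractive SRT divider instead retires a fixed number of quotient bits per step in dedicated hardware.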
Area And Performance Tradeoffs In Floating-Point Divide And Square Root Implementations
, 1994
Peter Soderquist, School of Electrical Engineering, Cornell University, Ithaca, New York 14853 (email: pgs@cs.cornell.edu)
Miriam Leeser, Dept. of Electrical and Computer Engineering, Northeastern University, Boston, Massachusetts 02115 (email: mel@ece.neu.edu)
Abstract: Floating-point divide and square root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. Th...