Results 1 - 7 of 7
Modular Verification of SRT Division
, 1996
"... . We describe a formal specification and verification in PVS for the general theory of SRT division, and for the hardware design of a specific implementation. The specification demonstrates how attributes of the PVS language (in particular, predicate subtypes) allow the general theory to be deve ..."
Abstract

Cited by 11 (1 self)
We describe a formal specification and verification in PVS for the general theory of SRT division, and for the hardware design of a specific implementation. The specification demonstrates how attributes of the PVS language (in particular, predicate subtypes) allow the general theory to be developed in a readable manner that is similar to textbook presentations, while the PVS table construct allows direct specification of the implementation's quotient lookup table. Verification of the derivations in the SRT theory and for the data path and lookup table of the implementation is highly automated and performed for arbitrary, but finite precision; in addition, the theory is verified for general radix, while the implementation is specialized to radix 4. The effectiveness of the automation derives from PVS's tight integration of rewriting with decision procedures for equality, linear arithmetic over integers and rationals, and propositional logic. This example demonstrates t...
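The radix-r SRT recurrence this paper verifies can be sketched in a few lines (a hypothetical Python model; it uses ideal digit selection by rounding in place of the implementation's quotient lookup table, and all names are illustrative):

```python
def srt_divide(x, d, digits=10, r=4, a=2):
    """Radix-r SRT division sketch with redundant digit set {-a, ..., a}.

    Assumes scaled operands: 0.5 <= d < 1 and |x| <= (a / (r - 1)) * d,
    which keeps the partial remainder bounded at every step.
    """
    w = x          # partial remainder
    q = 0.0        # accumulated quotient
    weight = 1.0   # positional weight of the next quotient digit
    for _ in range(digits):
        qd = max(-a, min(a, round(r * w / d)))  # ideal digit selection
        w = r * w - qd * d                      # the SRT recurrence
        weight /= r
        q += qd * weight
    return q
```

With ten radix-4 digits, `srt_divide(0.3, 0.75)` converges to 0.4; a hardware implementation replaces the `round` call with a table lookup indexed by a few leading bits of `r * w` and `d`.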
On Division And Reciprocal Caches
, 1995
"... Floatingpoint division is generally regarded as a high latency operation in typical floatingpoint applications. Many techniques exist for increasing division performance, often at the cost of increasing either chip area, cycle time, or both. This paper presents two methods for decreasing the laten ..."
Abstract

Cited by 10 (3 self)
Floating-point division is generally regarded as a high-latency operation in typical floating-point applications. Many techniques exist for increasing division performance, often at the cost of increasing either chip area, cycle time, or both. This paper presents two methods for decreasing the latency of division. Using applications from the SPECfp92 and NAS benchmark suites, these methods are evaluated to determine their effects on overall system performance. The notion of recurring computation is presented, and it is shown how recurring division can be exploited using an additional, dedicated division cache. Additionally, for multiplication-based division algorithms, reciprocal caches can be utilized to store recurring reciprocals. Due to the similarity between the algorithms typically used to compute division and square root, the performance of square root caches is also investigated. Results show that reciprocal caches can achieve nearly a 2X reduction in effective division latency...
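The dedicated division cache described above can be modeled as a small memo table keyed on both input operands (a hypothetical sketch; the capacity, eviction policy, and key format are assumptions, not the paper's design):

```python
class DivisionCache:
    """Caches full division results, keyed on (dividend, divisor)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.table = {}
        self.hits = self.misses = 0

    def divide(self, x, d):
        key = (x, d)            # in hardware: the operands' significand bits
        if key in self.table:
            self.hits += 1      # recurring computation: reuse the old result
            return self.table[key]
        self.misses += 1
        q = x / d               # full-latency division
        if len(self.table) >= self.capacity:
            self.table.pop(next(iter(self.table)))  # evict the oldest entry
        self.table[key] = q
        return q
```

A hit returns in cache-access time rather than full divider latency; a miss pays the full latency and fills the cache for later reuse.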
Reducing Division Latency with Reciprocal Caches
 Reliable Computing
, 1996
"... Introduction Floatingpoint division has received increasing attention in recent years. Division has a higher latency, or time required to perform a computation, than addition or multiplication. While division is an infrequent operation even in floatingpoint intensive applications, its high latenc ..."
Abstract

Cited by 8 (1 self)
Floating-point division has received increasing attention in recent years. Division has a higher latency, or time required to perform a computation, than addition or multiplication. While division is an infrequent operation even in floating-point intensive applications, its high latency can result in significant system performance degradation [4]. Many methods for implementing high-performance division have appeared in the literature. However, any proposed division performance enhancement should be analyzed in terms of its possible silicon area and cycle time effects. Richardson [6] discusses the technique of result caching as a means of decreasing the latency of otherwise high-latency operations, such as division. Result caching is based on recurring or redundant computations that can be found in applications. Often, one or both of the input operands for a calculation are the same as those in a previous calculation. In matrix inversion, for example, each entry mu
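A reciprocal cache, in contrast to a full result cache, keys only on the divisor and stores its reciprocal, so any later division by the same value reduces to a multiplication (again a hypothetical sketch, not the paper's hardware design):

```python
class ReciprocalCache:
    """Caches reciprocals, keyed on the divisor alone."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.table = {}
        self.hits = self.misses = 0

    def divide(self, x, d):
        if d in self.table:
            self.hits += 1
            recip = self.table[d]
        else:
            self.misses += 1
            recip = 1.0 / d     # full-latency reciprocal computation
            if len(self.table) >= self.capacity:
                self.table.pop(next(iter(self.table)))
            self.table[d] = recip
        return x * recip        # on a hit, division costs one multiply
```

Because only the divisor forms the key, a whole matrix row divided by one pivot hits the cache after the first element, which is why recurring reciprocals are more common than recurring full divisions.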
The Setup for Triangle Rasterization
, 1996
"... Integrating the slope and setup calculations for triangles to the rasterizer offloads the host processor from intensive calculations and can significantly increase 3D system performance. The processing on the host is greatly reduced and much less data is passed from the host to the graphics subsyste ..."
Abstract

Cited by 6 (0 self)
Integrating the slope and setup calculations for triangles into the rasterizer offloads the host processor from intensive calculations and can significantly increase 3D system performance. The processing on the host is greatly reduced and much less data is passed from the host to the graphics subsystem. A setup architecture handling generalized triangle meshes and computing all necessary parameters for a high-end raster pipeline to generate Gouraud-shaded, texture- and bump-mapped triangles is described, and its benefits on the final bandwidth are shown. To efficiently compute the slopes and color gradients for each triangle, some implementation aspects of division and multiplication pipelines are discussed. Anders Kugler, University of Tübingen, Computer Graphics Laboratory, Wilhelm-Schickard-Institut für Informatik, Graphisch-Interaktive Systeme, Auf der Morgenstelle 10, D-72076 Tübingen, Germany, email: kugler@gris.unit...
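The per-triangle gradient setup referred to above amounts to solving a plane equation for each interpolated parameter, with a single shared division per triangle (an illustrative sketch; the function and variable names are hypothetical):

```python
def setup_gradients(v0, v1, v2):
    """Screen-space gradients of one interpolated parameter over a triangle.

    Each vertex is (x, y, c): screen position plus one parameter
    (a color channel, a texture coordinate, ...).
    """
    (x0, y0, c0), (x1, y1, c1), (x2, y2, c2) = v0, v1, v2
    # Twice the signed triangle area; the only division in the setup.
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    inv = 1.0 / det
    # Cramer's rule for the plane c(x, y) = c0 + dcdx*(x-x0) + dcdy*(y-y0).
    dcdx = ((c1 - c0) * (y2 - y0) - (c2 - c0) * (y1 - y0)) * inv
    dcdy = ((c2 - c0) * (x1 - x0) - (c1 - c0) * (x2 - x0)) * inv
    return dcdx, dcdy
```

The reciprocal `inv` is computed once and reused for every parameter of the triangle, which is why the cost and latency of the division pipeline dominate the setup stage.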
Architecture Evaluator's Work Bench and its Application to Microprocessor Floating Point Units
, 1995
"... This paper introduces Architecture Evaluator's Workbench(AEWB), a high level design space exploration methodology, and its application to floating point units(FPUs). In applying AEWB to FPUs, a metric for optimizing and comparing floating point unit implementation is developed. The metric  FUPA in ..."
Abstract

Cited by 2 (2 self)
This paper introduces the Architecture Evaluator's Workbench (AEWB), a high-level design space exploration methodology, and its application to floating point units (FPUs). In applying AEWB to FPUs, a metric for optimizing and comparing floating point unit implementations is developed. The metric, FUPA, incorporates four aspects of AEWB: latency, cost, technology, and profiles of target applications. FUPA models latency in terms of delay, cost in terms of area, and profile in terms of the percentage of different floating point operations. We utilize submicron device models, interconnect models, and actual microprocessor scaling data to develop models used to normalize both latency and area, enabling technology-independent comparison of implementations. This report also surveys most state-of-the-art microprocessors and compares them utilizing FUPA. Finally, we correlate the FUPA results to reported SPECfp92 results, and demonstrate the effect of circuit density on FUPA implementations. ...
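The exact FUPA formula is not given in this summary, but the profile weighting it describes can be illustrated as follows (a hypothetical sketch combining a profile-weighted latency with a normalized area; this is not the authors' actual normalization):

```python
def effective_latency(latency_cycles, profile):
    """Profile-weighted FP latency.

    profile maps each operation to its fraction of all FP operations
    (fractions sum to 1); latency_cycles maps each operation to its latency.
    """
    return sum(profile[op] * latency_cycles[op] for op in profile)

def fupa_like(latency_cycles, profile, normalized_area):
    """Lower is better: weighted latency scaled by technology-normalized area."""
    return effective_latency(latency_cycles, profile) * normalized_area
```

For example, an FPU with 3-cycle add/multiply and a 20-cycle divide, under a profile of 50% adds, 45% multiplies, and 5% divides, has an effective latency of 3.85 cycles; the area factor then trades that latency against silicon cost.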
Analysis of the Impact of Different Methods for Division/Square Root Computation in the Performance of a Superscalar Microprocessor
, 2002
"... An analysis of the impact of different methods for the doubleprecision computation of division and square root in the performance of a superscalar processor is presented in this paper. This analysis is carried out combining the SimpleScalar toolset, estimates of the latency and throughput of the co ..."
Abstract

Cited by 2 (1 self)
An analysis of the impact of different methods for the double-precision computation of division and square root on the performance of a superscalar processor is presented in this paper. This analysis is carried out by combining the SimpleScalar toolset, estimates of the latency and throughput of the compared methods, and a set of benchmarks with typical features of intensive computing applications. Simulation results show the importance of having an efficient unit for the computation of these operations, since changes in the density of division and square root below 1% lead to performance changes of around 20%.
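A back-of-the-envelope CPI model (hypothetical, and much cruder than the paper's SimpleScalar methodology) shows how a sub-1% division density can still swing performance by roughly 20%:

```python
def relative_speedup(base_cpi, div_freq, slow_latency, fast_latency):
    """Simple unpipelined model: every division stalls for its full latency.

    div_freq is the fraction of instructions that are divisions;
    returns how much faster the fast divider makes the whole program.
    """
    cpi_slow = base_cpi + div_freq * slow_latency
    cpi_fast = base_cpi + div_freq * fast_latency
    return cpi_slow / cpi_fast

# e.g. 0.5% divisions, 60-cycle vs 20-cycle divider:
# (1 + 0.005 * 60) / (1 + 0.005 * 20) = 1.3 / 1.1, about 18% faster overall
```

The model ignores overlap with independent instructions, so it overstates the stall cost, but it makes the lever visible: the rare operation's latency is multiplied into every benchmark's CPI.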