## The SNAP Project: Design of Floating Point Arithmetic Units (1997)

Venue: In Proceedings of Arith-13

Citations: 21 (2 self)

### BibTeX

@INPROCEEDINGS{Oberman97thesnap,

author = {Stuart F. Oberman and Hesham Al-Twaijry and Michael J. Flynn},

title = {The SNAP Project: Design of Floating Point Arithmetic Units},

booktitle = {In Proceedings of Arith-13},

year = {1997},

pages = {156--165},

publisher = {IEEE}

}

### Abstract

In recent years computer applications have increased in their computational complexity. The industry-wide usage of performance benchmarks, such as SPECmarks, and the popularity of 3D graphics applications force processor designers to pay particular attention to implementation of the floating point unit, or FPU. This paper presents results of the Stanford subnanosecond arithmetic processor (SNAP) research effort in the design of hardware for floating point addition, multiplication and division. We show that one cycle FP addition is achievable 32% of the time using a variable latency algorithm. For multiplication, a binary tree is often inferior to a Wallace-tree designed using an algorithmic layout approach for contemporary feature sizes (0.3 µm). Further, in most cases two-bit Booth encoding of the multiplier is preferable to non-Booth encoding for partial product generation. It appears that for division, optimum area-performance is achieved using functional iteration, ...
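To make the abstract's remark on two-bit Booth encoding concrete, the following is a minimal Python sketch of radix-4 (two-bit) Booth recoding. It is not taken from the paper; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the code models only the arithmetic, not the circuit. It illustrates that recoding a `width`-bit multiplier yields about `width / 2` partial products, each a multiple of the multiplicand in {-2, -1, 0, 1, 2}.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digit i has weight 4**i.
    `width` is the operand width in bits (an illustrative parameter).
    """
    if width % 2:
        width += 1  # sign-extend to an even number of bits
    # Masking a Python int yields its two's-complement bit pattern.
    bits = multiplier & ((1 << width) - 1)
    digits = []
    prev = 0  # implicit 0 to the right of the least significant bit
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int) -> int:
    """Multiply by summing one shifted partial product per Booth digit."""
    digits = booth_radix4_digits(multiplier, width)
    partial_products = [(d * multiplicand) << (2 * i) for i, d in enumerate(digits)]
    return sum(partial_products)
```

For an unsigned 53-bit significand, passing `width=54` (one extra zero bit on top) gives 27 partial products, where a non-Booth scheme would generate one per multiplier bit.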

### Citations

750 | ATOM: a system for building customized program analysis tools
- Srivastava, Eustace
- 1994
Citation Context ... [Figure 2. One, two, or three cycle variable latency adder] ...tion system [6]. ATOM was used to instrument 10 applications from the SPECfp92 [7] benchmark suite which were then executed on a DEC Alpha 3000/500 workstation. The benchmarks used the standard input data sets. All ...

239 | A suggestion for a fast multiplier
- Wallace
- 1964
Citation Context ...cts and how the partial products are added together to produce the final result. Research on multiplier design has included techniques for partial product generation [9] and partial product reduction [10], [11], [12], [13], [14]. Most previous analyses of the partial product reduction trees use as the basis for their design a simple compressor delay model where the delay from each input of a compresso...
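The simple equal-delay compressor model mentioned in this context can be sketched in Python as follows. This is an illustrative model, not the paper's algorithm: the function name `carry_save_reduce` is hypothetical, operands are assumed to be non-negative integers, and every 3:2 counter level is assumed to cost the same delay. It reduces a list of partial products to two rows with 3:2 counters in the style of a Wallace tree and counts the number of counter levels.

```python
def carry_save_reduce(rows: list[int]) -> tuple[int, int, int]:
    """Reduce non-negative partial products to two rows with 3:2 counters.

    Returns (sum_row, carry_row, levels), where `levels` counts counter
    stages under the assumption that every stage has equal delay.
    """
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        nxt = []
        full = len(rows) - len(rows) % 3  # rows that form complete triples
        for i in range(0, full, 3):
            a, b, c = rows[i:i + 3]
            nxt.append(a ^ b ^ c)                           # bitwise sum
            nxt.append(((a & b) | (a & c) | (b & c)) << 1)  # bitwise carry
        nxt.extend(rows[full:])  # leftover rows pass through to the next level
        rows = nxt
        levels += 1
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1], levels
```

A final carry-propagate addition of the two returned rows gives the product. Under this model, 27 partial products need 7 levels and 53 need 9, which is the kind of level count that a delay model with equal input-to-output delays produces.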

117 | Some schemes for parallel multipliers
- Dadda
- 1965
Citation Context ...d how the partial products are added together to produce the final result. Research on multiplier design has included techniques for partial product generation [9] and partial product reduction [10], [11], [12], [13], [14]. Most previous analyses of the partial product reduction trees use as the basis for their design a simple compressor delay model where the delay from each input of a compressor to e...

70 | A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach
- Oklobdzija, Villeger, et al.
- 1996
Citation Context ...hat takes into account the interconnect delay due to counter placement and the different path delays, is extremely useful. We have implemented such an algorithm, based upon the approach of Oklobdzija [15]. The algorithm is essentially the same as that proposed by Oklobdzija, but it also takes into account interconnect delay due to counter placement and the different path delays. Our algorithm uses a c...

57 | Faithful bipartite ROM reciprocal tables
- Sarma, Matula
- 1995
Citation Context ...s required for a divider using functional iteration is directly coupled to the accuracy of the initial approximation. Special tables are typically used to obtain a very accurate initial approximation [16], [17]. The challenge in table design is to maximize the approximation accuracy while minimizing its total size. As a result, this continues to be a very important active area of research. A VLA techn...
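The coupling between initial approximation accuracy and iteration count can be illustrated with a small Python sketch. It is a sketch under stated assumptions rather than the paper's design: it uses Newton-Raphson iteration as one form of functional iteration, exact rational arithmetic via `fractions.Fraction`, and an idealized table that returns the exact reciprocal of each interval midpoint (real tables, such as the bipartite tables cited here, store truncated entries). The names `initial_reciprocal` and `iterations_needed` are hypothetical.

```python
from fractions import Fraction


def initial_reciprocal(d: Fraction, index_bits: int) -> Fraction:
    """Idealized table lookup for 1/d, with d in [1, 2).

    The table is indexed by the top `index_bits` fraction bits of d and
    returns the exact reciprocal of the interval midpoint (an assumption).
    """
    scale = 1 << index_bits
    idx = int((d - 1) * scale)  # interval containing d
    midpoint = 1 + Fraction(2 * idx + 1, 2 * scale)
    return 1 / midpoint


def iterations_needed(d: Fraction, index_bits: int, target_bits: int = 53):
    """Count Newton-Raphson steps until the relative error is below 2**-target_bits."""
    x = initial_reciprocal(d, index_bits)
    tolerance = Fraction(1, 1 << target_bits)
    steps = 0
    while abs(1 - d * x) >= tolerance:
        x = x * (2 - d * x)  # relative error squares on every step
        steps += 1
    return steps, x
```

Because the relative error squares on each step, every doubling of the table's accuracy in bits removes roughly one iteration. For the divisor 1, this model needs 4 steps with a 4-bit index, 3 steps with an 8-bit index, and 2 steps with a 13-bit index to reach 53 bits.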

28 | Exploiting trivial and redundant computation
- Richardson
- 1993
Citation Context ...tinues to be a very important active area of research. A VLA technique that can be applied to functional iteration is the use of reciprocal caches. The use of result caches is discussed by Richardson [18]. The use of a small reciprocal cache for an integer divider is discussed in [19]. It has been shown that a reciprocal cache is an efficient technique of reducing average floating point division laten...
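A reciprocal cache of the kind described in this context can be modelled in a few lines of Python. This is a hedged behavioural sketch only: the class name `ReciprocalCache`, the direct-mapped organization, the indexing by high mantissa bits, and the cycle counts `hit_cycles` and `miss_cycles` are all illustrative assumptions, not values or structures from the paper. Divisors are assumed to be nonzero.

```python
import struct


class ReciprocalCache:
    """Direct-mapped cache of divisor reciprocals (illustrative model)."""

    def __init__(self, index_bits: int = 6, hit_cycles: int = 2, miss_cycles: int = 10):
        self.index_bits = index_bits
        self.hit_cycles = hit_cycles    # hypothetical latency: one multiply
        self.miss_cycles = miss_cycles  # hypothetical latency: full iteration
        self.entries = [None] * (1 << index_bits)

    def _key_and_index(self, divisor: float) -> tuple[int, int]:
        # Use the raw IEEE 754 double bit pattern as the tag.
        bits = struct.unpack(">Q", struct.pack(">d", divisor))[0]
        # Index by the top mantissa bits (an assumed, simple hash).
        index = (bits >> (52 - self.index_bits)) & ((1 << self.index_bits) - 1)
        return bits, index

    def divide(self, dividend: float, divisor: float) -> tuple[float, int]:
        """Return (quotient estimate, modelled latency in cycles)."""
        key, index = self._key_and_index(divisor)
        entry = self.entries[index]
        if entry is not None and entry[0] == key:
            # Hit: reuse the stored reciprocal, one multiplication remains.
            return dividend * entry[1], self.hit_cycles
        # Miss: compute the reciprocal (stands in for functional iteration).
        reciprocal = 1.0 / divisor
        self.entries[index] = (key, reciprocal)
        return dividend * reciprocal, self.miss_cycles
```

For example, dividing `1.0` and then `3.0` by the same divisor `4.0` misses on the first call and hits on the second. The product `dividend * reciprocal` is not an exactly rounded quotient in general, so the model captures the latency effect rather than the rounding behaviour.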

22 | On the design of high performance digital arithmetic units
- Farmwald
- 1981
Citation Context ...tial large aligning right shift is required. This allows the aligning right shift and the normalizing left-shift to be mutually exclusive, with only one such shift ever appearing on the critical path [5]. Another optimization made in this algorithm reduces the number of serial operations. In a straightforward implementation of the addition dataflow, rounding would be implemented by a separate series ...

15 | A 1,000,000 transistor microprocessor
- Kohn, Fu
- 1989
Citation Context ...umes double precision operands. A block diagram of a state-of-the-art FP adder is shown in Fig. 1. Adders similar to this architecture have been implemented in several commercial microprocessors [2], [3], [4]. This architecture exploits many aspects of the FP addition dataflow. It implements the significand datapath in ...

14 | Rounding quadratically converging algorithms for division and square root
- Schwarz
- 1995
Citation Context ...er to produce exactly rounded quotients. A VLA technique that can be used to reduce the latency penalty involves keeping several extra guard bits in an appropriately biased pre-rounded result. Schwarz [20] proposes using 1 additional guard bit in the pre-rounded result. In this way, the back-multiplication and subtraction is only required in half of the cases on average. An extension to this technique ...

13 | Efficient initial approximation and fast converging methods for division and square root
- Ito, Takagi, et al.
- 1995
Citation Context ...ired for a divider using functional iteration is directly coupled to the accuracy of the initial approximation. Special tables are typically used to obtain a very accurate initial approximation [16], [17]. The challenge in table design is to maximize the approximation accuracy while minimizing its total size. As a result, this continues to be a very important active area of research. A VLA technique t...

10 | High speed arithmetic in binary computers
- McSorley
- 1961
Citation Context ...ow they generate the partial products and how the partial products are added together to produce the final result. Research on multiplier design has included techniques for partial product generation [9] and partial product reduction [10], [11], [12], [13], [14]. Most previous analyses of the partial product reduction trees use as the basis for their design a simple compressor delay model where the d...

8 | 4-2 carry-save adder implementation using send circuits
- Shen, Weinberger
- 1978
Citation Context ...the partial products are added together to produce the final result. Research on multiplier design has included techniques for partial product generation [9] and partial product reduction [10], [11], [12], [13], [14]. Most previous analyses of the partial product reduction trees use as the basis for their design a simple compressor delay model where the delay from each input of a compressor to each ou...

8 | A pipelined 64X64b iterative array multiplier, in Digest of Technical Papers
- Santoro, Horowitz
- 1988
Citation Context ...rtial products are added together to produce the final result. Research on multiplier design has included techniques for partial product generation [9] and partial product reduction [10], [11], [12], [13], [14]. Most previous analyses of the partial product reduction trees use as the basis for their design a simple compressor delay model where the delay from each input of a compressor to each output i...

5 | UltraSPARC: The Next Generation Superscalar 64-bit SPARC
- Greenley, et al.
- 1995
Citation Context ...s assumes double precision operands. A block diagram of a state-of-the-art FP adder is shown in Fig. 1. Adders similar to this architecture have been implemented in several commercial microprocessors [2], [3], [4]. This architecture exploits many aspects of the FP addition dataflow. It implements the significand datapath in ...

5 | An analysis of floating-point addition
- Sweeney
- 1965
Citation Context ...of the operations are in the FAR path and require three cycles, while 43% are in the CLOSE path and require at most two cycles. A comparison with a different study of floating point addition operands [8] on a much different architecture using different applications provides validation for these results. In that study over 30 years ago, six problems were traced on an IBM 704, tracking the aligning and...

2 | A 4.4 ns CMOS 54*54-b multiplier using pass-transistor multiplexor
- Ohkubo, et al.
- 1995
Citation Context ...products are added together to produce the final result. Research on multiplier design has included techniques for partial product generation [9] and partial product reduction [10], [11], [12], [13], [14]. Most previous analyses of the partial product reduction trees use as the basis for their design a simple compressor delay model where the delay from each input of a compressor to each output is equa...

1 | A dual-execution pipelined floating-point CMOS processor
- Kowaleski, et al.
- 1996
Citation Context ...double precision operands. A block diagram of a state-of-the-art FP adder is shown in Fig. 1. Adders similar to this architecture have been implemented in several commercial microprocessors [2], [3], [4]. This architecture exploits many aspects of the FP addition dataflow. It implements the significand datapath in ...

1 | The design of a 64-bit integer multiplier/divider unit
- Eisig, et al.
- 1993
Citation Context ...e applied to functional iteration is the use of reciprocal caches. The use of result caches is discussed by Richardson [18]. The use of a small reciprocal cache for an integer divider is discussed in [19]. It has been shown that a reciprocal cache is an efficient technique of reducing average floating point division latency when implementing division by functional iteration [S9]. This technique uses t...

1 | An analysis of floating-point addition
- Release
- 1965
Citation Context ... [Figure 2. One, two, or three cycle variable latency adder] ...tion system [6]. ATOM was used to instrument 10 applications from the SPECfp92 [7] benchmark suite which were then executed on a DEC Alpha 3000/500 workstation. The benchmarks used the standard input data sets. All double precision floating point addition and subtraction operations...
