Results 1 – 10 of 25
Table-based polynomials for fast hardware function evaluation
 16th IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP’05)
, 2005
Abstract

Cited by 35 (9 self)
Many general table-based methods for the evaluation in hardware of elementary functions have been published. The bipartite and multipartite methods implement a first-order approximation of the function using only table lookups and additions. Recently, a single-multiplier second-order method of similar inspiration has also been published. This paper extends such methods to approximations of arbitrary order, using adders, small multipliers, and very small ad-hoc powering units. We obtain implementations that are both smaller and faster than previously published approaches. This paper also deals with the FPGA implementation of such methods. Previous work has consistently shown that increasing the approximation degree leads to designs that are not only smaller but also faster, as the reduction of the table size meant a reduction of its lookup time, which compensated for the addition and multiplication time. The experiments in this paper suggest that this still holds when going from order 2 to order 3, but no longer for higher-order approximations, where a trade-off appears.
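The scheme this abstract describes — a segment index taken from the input's high bits, then a low-degree polynomial in the remaining low bits — can be sketched in software. The Taylor-based coefficients below are only a stand-in for the tabulated minimax fits such hardware actually stores; all names are illustrative.

```python
import math

def build_tables(f, df, d2f, n_seg, lo, hi):
    """One quadratic per segment (degree-2 Taylor about the midpoint),
    standing in for per-segment minimax coefficient tables."""
    step = (hi - lo) / n_seg
    mids = [lo + (s + 0.5) * step for s in range(n_seg)]
    return [(f(m), df(m), d2f(m) / 2.0) for m in mids]

def evaluate(tables, x, lo, hi):
    n_seg = len(tables)
    step = (hi - lo) / n_seg
    s = min(int((x - lo) / step), n_seg - 1)   # segment index: high bits of x
    dx = x - (lo + (s + 0.5) * step)           # offset: low bits of x
    c0, c1, c2 = tables[s]
    return c0 + dx * (c1 + dx * c2)            # Horner: additions + small multiplies
```

Raising the polynomial order shrinks the table (fewer segments reach a target accuracy) at the cost of more multiplications — the trade-off the paper quantifies in hardware.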
A compact and accurate Gaussian variate generator
 IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
, 2008
Abstract

Cited by 16 (10 self)
Abstract—A compact, fast, and accurate realization of a digital Gaussian variate generator (GVG) based on the Box–Muller algorithm is presented. The proposed GVG has a faster Gaussian sample generation rate and higher tail accuracy with a lower hardware cost than published designs. The GVG design can be readily configured to achieve arbitrary tail accuracy (i.e., with a proposed 16-bit datapath up to 15 times the standard deviation) with only small variations in hardware utilization, and without degrading the output sample rate. Polynomial curve fitting is utilized along with a hybrid (i.e., combination of logarithmic and uniform) segmentation and a scaling scheme to maintain accuracy. A typical instantiation of the proposed GVG occupies only 534 configurable slices, two on-chip block memories, and three dedicated multipliers of the Xilinx Virtex-II XC2V4000-6 field-programmable gate array (FPGA) and operates at 248 MHz, generating 496 million Gaussian variates (GVs) per second within a range of 6.66σ. To accurately achieve a range of 9.4σ, the GVG uses 852 configurable slices, three block memories, and three on-chip dedicated multipliers of the same FPGA while still operating at 248 MHz, generating 496 million GVs per second. The core area and performance of a GVG implemented in a 90-nm CMOS technology are also given. The statistical characteristics of the GVG are evaluated and confirmed using multiple standard statistical goodness-of-fit tests. Index Terms—Box–Muller (BM) algorithm, field-programmable gate array (FPGA), Gaussian noise generator (GNG), low bit-error rate simulation, random number generation.
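For reference, the Box–Muller transform this paper maps to hardware is, in software, just a logarithm, a square root, and a sin/cos pair. The sketch below is purely illustrative and is not the paper's fixed-point datapath; the function names are made up.

```python
import math
import random

def box_muller(u1, u2):
    """Map two uniforms (u1 in (0,1], u2 in [0,1)) to two independent
    standard Gaussian variates."""
    r = math.sqrt(-2.0 * math.log(u1))
    t = 2.0 * math.pi * u2
    return r * math.cos(t), r * math.sin(t)

def gaussian_stream(n, seed=0):
    """Generate n Gaussian samples from a seeded uniform source."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        out.extend(box_muller(1.0 - rng.random(), rng.random()))  # 1-u keeps u1 in (0,1]
    return out[:n]
```

Tail accuracy hinges on evaluating -2·ln(u1) for u1 near zero, which is why segmentation schemes like the paper's concentrate precision in that region.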
Numerical Function Generators Using LUT Cascades
Abstract

Cited by 9 (1 self)
Abstract—This paper proposes an architecture and a synthesis method for high-speed computation of fixed-point numerical functions such as trigonometric, logarithmic, sigmoidal, square root, and combinations of these functions. Our architecture is based on the lookup table (LUT) cascade, which results in a significant reduction in circuit complexity compared to traditional approaches. It is also suitable for automatic synthesis, and we show a synthesis method that converts a Matlab-like specification into an LUT cascade design. Experimental results show the efficiency of our approach as implemented on a field-programmable gate array (FPGA). Index Terms—LUT cascades, numerical function generators (NFGs), non-uniform segmentation, automatic synthesis, FPGA implementation.
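A LUT cascade can be prototyped directly in software: the "rails" between cells are equivalence classes of input prefixes, where two prefixes share a rail iff the function agrees on every suffix completion. The sketch below is a generic construction under that textbook definition, not the paper's synthesis flow; all names are illustrative.

```python
def build_cascade(f, n_bits, group):
    """Build LUT-cascade stages for f over n_bits-bit inputs, consuming
    `group` bits per stage. Each stage maps (rail, bits) -> rail."""
    assert n_bits % group == 0
    stages = []
    reps = {0: 0}                        # rail -> representative prefix value
    for i in range(n_bits // group):
        rem = n_bits - (i + 1) * group   # input bits still to come
        sig2rail, stage, new_reps = {}, {}, {}
        for rail, pref in reps.items():
            for bits in range(1 << group):
                val = (pref << group) | bits
                # signature: f over every completion of this prefix
                sig = tuple(f((val << rem) | s) for s in range(1 << rem))
                if sig not in sig2rail:
                    sig2rail[sig] = len(sig2rail)
                    new_reps[sig2rail[sig]] = val
                stage[(rail, bits)] = sig2rail[sig]
        stages.append(stage)
        reps = new_reps
    outputs = {rail: f(val) for rail, val in reps.items()}
    return stages, outputs

def eval_cascade(stages, outputs, x, n_bits, group):
    """Walk the cascade from the most significant bit group down."""
    rail, mask = 0, (1 << group) - 1
    for i, stage in enumerate(stages):
        bits = (x >> (n_bits - (i + 1) * group)) & mask
        rail = stage[(rail, bits)]
    return outputs[rail]
```

The number of rails between stages is what determines each cell's memory width, which is why functions with few prefix equivalence classes compress so well in this architecture.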
A Flexible Architecture for Precise Gamma Correction
, 1998
Abstract

Cited by 4 (0 self)
Abstract—We present a flexible hardware architecture for precise gamma correction via piecewise linear polynomial approximations. Arbitrary gamma values, input bit widths, and output bit widths are supported. The gamma correction curve is segmented via a combination of uniform segments and segments whose sizes vary by powers of two. This segmentation method minimizes the number of segments required, while providing an efficient way of indexing the polynomial coefficients. The outputs are guaranteed to be accurate to one unit in the last place through an analytical bit-width analysis methodology. Hardware realizations of various gamma correction designs are demonstrated on a Xilinx Virtex-4 field-programmable gate array (FPGA). A pipelined 12-bit input/8-bit output design on an XC4VLX100-12 FPGA occupies 146 slices and one digital signal processing slice. It is capable of performing 378 million gamma correction operations per second. Index Terms—Displays, field-programmable gate arrays (FPGAs), fixed-point arithmetic, video signal processing.
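The hybrid segmentation idea — power-of-two segment widths where the gamma curve is steep, uniform widths elsewhere — can be illustrated with a software piecewise-linear sketch. The boundary choices and names below are hypothetical, not the paper's optimized segmentation.

```python
def build_segments(gamma):
    """Piecewise-linear fit of y = x**(1/gamma) on [0, 1]: power-of-two
    widths near 0 (where the curve is steep), uniform widths on [1/8, 1]."""
    f = lambda x: x ** (1.0 / gamma)
    bounds = [0.0, 1 / 64, 1 / 32, 1 / 16, 1 / 8]
    bounds += [1 / 8 + k * (7 / 8) / 16 for k in range(1, 17)]
    # per segment: (start, end, value at start, secant slope)
    return [(a, b, f(a), (f(b) - f(a)) / (b - a))
            for a, b in zip(bounds, bounds[1:])]

def gamma_correct(segments, x):
    """Evaluate the piecewise-linear approximation at x in [0, 1]."""
    for a, b, y0, slope in segments:
        if x <= b:
            return y0 + slope * (x - a)
    a, b, y0, slope = segments[-1]
    return y0 + slope * (x - a)
```

In hardware the power-of-two widths let the segment index be derived from a leading-zero count plus a few bits, which is what makes coefficient indexing cheap.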
Design method for numerical function generators using recursive segmentation and EVBDDs
 IEICE Trans. Fundamentals
, 2007
Abstract

Cited by 4 (1 self)
This paper focuses on numerical function generators (NFGs) based on kth-order polynomial approximations. We show that increasing the polynomial order k significantly reduces the NFG’s memory size. However, larger k requires more logic elements and multipliers. To quantify this trade-off, we introduce the FPGA utilization measure, and then determine the optimum polynomial order k. Experimental results show that: 1) for low accuracies (up to 17 bits), 1st-order polynomial approximations produce the most efficient implementations; and 2) for higher accuracies (18 to 24 bits), 2nd-order polynomial approximations produce the most efficient implementations.
A New Hardware-Efficient Inversion-Based Random Number Generator for Non-Uniform Distributions
 in Reconfigurable Computing and FPGAs (ReConFig), 2010 International Conference on
, 2010
Abstract

Cited by 4 (3 self)
Abstract—For numerous computationally complex applications, like financial modelling and Monte Carlo simulations, the fast generation of high-quality non-uniform random numbers (RNs) is essential. The implementation of such generators in FPGA-based accelerators has therefore become a very active research field. In this paper we present a novel approach to create RNs for different distributions based on an efficient transformation of floating-point inputs. For the Gaussian distribution we can reduce the number of slices needed by up to 48% compared to the state-of-the-art while achieving a higher output precision in the tail region. Our architecture produces samples up to 8.37σ and achieves 381 MHz. We also present a comprehensive testing methodology based on stochastic analysis and verification in practical applications.
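The inversion method itself is simple: feed uniform samples through the target distribution's inverse CDF. The sketch below uses the Python stdlib's NormalDist as a stand-in for a hardware ICDF approximation; it illustrates the principle only, and the names are made up.

```python
import math
import random
from statistics import NormalDist

def inversion_samples(n, inv_cdf, seed=0):
    """Draw n samples by inverting the target CDF at uniform points."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u = rng.random()
        if 0.0 < u < 1.0:          # inverse CDFs are only defined on (0, 1)
            out.append(inv_cdf(u))
    return out

# the same datapath serves any distribution by swapping the ICDF:
gauss = inversion_samples(10000, NormalDist().inv_cdf)
expo = inversion_samples(10000, lambda u: -math.log(1.0 - u))  # exponential ICDF
```

One uniform in, one sample out, and only the ICDF changes per distribution — the flexibility the abstract highlights over distribution-specific generators.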
Numerical function generators using edge-valued binary decision diagrams
 in ASP-DAC 2007
Abstract

Cited by 4 (0 self)
Abstract—In this paper, we introduce the edge-valued binary decision diagram (EVBDD) to reduce the memory and delay in numerical function generators (NFGs). An NFG realizes a function, such as a trigonometric, logarithmic, square root, or reciprocal function, in hardware. NFGs are important in, for example, digital signal applications, where high speed and accuracy are necessary. We use the EVBDD to produce a fast and compact segment index encoder (SIE) that is a key component in our NFG. We compare our approach with NFG designs based on multi-terminal BDDs (MTBDDs), and show that the EVBDD produces SIEs that have, on average, only 7% of the memory and 40% of the delay of those designed using MTBDDs. Therefore, our NFGs based on EVBDDs have, on average, only 38% of the memory and 59% of the delay of NFGs based on MTBDDs.
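Functionally, a segment index encoder maps an input word to the index of the non-uniform segment containing it; a sorted-boundary search is the direct software analogue of the EVBDD/MTBDD hardware structures compared here. The boundary values below are made up for illustration.

```python
import bisect

# hypothetical non-uniform segment start points over a 6-bit input word
boundaries = [0, 8, 12, 14, 15, 16, 24, 32, 48]

def segment_index(x):
    """Index of the segment whose half-open range [b_i, b_{i+1}) holds x."""
    return bisect.bisect_right(boundaries, x) - 1
```

In hardware this lookup is exactly what dominates NFG delay for non-uniform segmentations, which is why encoding it compactly (EVBDD vs MTBDD) matters.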
Adaptive range reduction for hardware function evaluation
 In Proc. IEEE Int’l Conf. on Field-Programmable Technology
, 2004
Abstract

Cited by 3 (1 self)
Function evaluation f(x) typically consists of range reduction and the actual function evaluation on a small interval. In this paper, we investigate optimization of range reduction given the range and precision of x and f(x). For every function evaluation there exists a convenient interval, such as [0, π/2) for sin(x). The adaptive range reduction method, which we propose in this work, involves deciding whether range reduction can be used effectively for a particular design. The decision depends on the function being evaluated, the precision, and optimization metrics such as area, latency, and throughput. In addition, the input and output ranges have an impact on the preferable function evaluation method, such as polynomial, table-based, or combinations of the two. We explore this vast design space of adaptive range reduction for fixed-point sin(x), log(x) and √x accurate to one unit in the last place using MATLAB and ASC, A Stream Compiler. These tools enable us to study over 1000 designs, resulting in over 40 million Xilinx equivalent circuit gates, in a few hours’ time. The final objective is to progress towards a fully automated library that provides optimal function evaluation hardware units given input/output range and precision.
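The convenient-interval idea for sin(x) can be written down directly: reduce the argument to [0, π/2) plus a quadrant tag, then call a core evaluator that only needs to cover the small interval. In the sketch below, math.sin stands in for the table/polynomial core unit the paper would synthesize; the function name is illustrative.

```python
import math

def sin_range_reduced(x, core=math.sin):
    """sin(x) via range reduction: fold x into [0, pi/2) and fix up
    the result from the quadrant index."""
    k = math.floor(x / (math.pi / 2))   # quadrant count
    r = x - k * (math.pi / 2)           # reduced argument in [0, pi/2)
    q = k % 4
    if q == 0:
        return core(r)                  # sin(r)
    if q == 1:
        return core(math.pi / 2 - r)    # cos(r)
    if q == 2:
        return -core(r)                 # -sin(r)
    return -core(math.pi / 2 - r)       # -cos(r)
```

Whether this reduction step is worth its area and latency for a given precision and metric is exactly the per-design decision the adaptive method automates.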
Hardware Implementation Trade-Offs of Polynomial Approximations and Interpolations
 IEEE TRANSACTIONS ON COMPUTERS
, 2008
Abstract

Cited by 3 (0 self)
Abstract—This paper examines the hardware implementation trade-offs when evaluating functions via piecewise polynomial approximations and interpolations for precisions of up to 24 bits. In polynomial approximations, polynomials are evaluated using stored coefficients. Polynomial interpolations, however, require the coefficients to be computed on the fly from stored function values. Although it is known that interpolations require less memory than approximations, at the expense of additional computations, the trade-offs in memory, area, delay, and power consumption between the two approaches have not been examined in detail. This work quantitatively analyzes these trade-offs for optimized approximations and interpolations across different functions and target precisions. Hardware architectures for degree-1 and degree-2 approximations and interpolations are described. The results show that the extent of memory savings realized by using interpolation is significantly lower than what is commonly believed. Furthermore, experimental results on a field-programmable gate array (FPGA) show that, for high output precision, degree-1 interpolations offer considerable area and power savings over degree-1 approximations, but similar savings are not realized when degree-2 interpolations and approximations are compared. The availability of both interpolation-based and approximation-based designs offers a richer set of design trade-offs than what is available using either interpolation or approximation alone. Index Terms—Algorithms implemented in hardware, interpolation, approximation, VLSI systems.
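The degree-1 contrast the paper quantifies can be sketched directly: approximation stores a (value, slope) pair per segment, while interpolation stores one function value per boundary and derives the slope at evaluation time. The names below are illustrative, and both sketches use plain secant slopes rather than the paper's optimized coefficients.

```python
import math

def approx_tables(f, n, lo, hi):
    """Approximation: two stored words per segment (value, slope)."""
    h = (hi - lo) / n
    return [(f(lo + i * h), (f(lo + (i + 1) * h) - f(lo + i * h)) / h)
            for i in range(n)]

def interp_table(f, n, lo, hi):
    """Interpolation: n+1 stored function values only."""
    h = (hi - lo) / n
    return [f(lo + i * h) for i in range(n + 1)]

def eval_approx(tab, x, lo, hi):
    n = len(tab)
    h = (hi - lo) / n
    i = min(int((x - lo) / h), n - 1)
    c0, c1 = tab[i]                       # both coefficients read from memory
    return c0 + c1 * (x - (lo + i * h))

def eval_interp(tab, x, lo, hi):
    n = len(tab) - 1
    h = (hi - lo) / n
    i = min(int((x - lo) / h), n - 1)
    slope = (tab[i + 1] - tab[i]) / h     # slope computed on the fly
    return tab[i] + slope * (x - (lo + i * h))
```

The naive word count is 2n versus n+1 — roughly a factor of two — but the paper's point is that after per-entry bit-width optimization the realized savings are considerably smaller than this count suggests.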
Dinechin, “Second order function approximation using a single multiplication on FPGAs”
 Proc. Inter. Conf. on Field Programmable Logic and Applications (FPL’04)
, 2004
Abstract

Cited by 3 (0 self)
Abstract. This paper presents a new scheme for the hardware evaluation of elementary functions, based on a piecewise second-order minimax approximation. The novelty is that this evaluation requires only one small rectangular multiplication. Therefore the resulting architecture combines a small table size, thanks to second-order evaluation, with a short critical path: consisting of one table lookup, the rectangular multiplication, and one addition, the critical path is shorter than that of a plain first-order evaluation. Synthesis results for several functions show that this method outperforms all previously published methods in both area and speed for precisions ranging from 12 to 24 bits and over.