Complex Square Root with Operand Prescaling
- in Journal of VLSI Signal Processing
, 2006
Cited by 8 (4 self)
We propose a radix-r digit-recurrence algorithm for complex square root. The operand is prescaled to allow the selection of square-root digits by rounding of the residual. This leads to a simple hardware implementation. Moreover, the use of a digit-recurrence approach allows correct rounding of the result. The algorithm, which is compatible with complex division, and its design are described at a high level. We also give rough comparisons of its latency and cost with respect to implementations based on standard floating-point instructions, as used in software routines for complex square root.
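For orientation, the software baseline that the abstract compares against can be sketched as follows. This is a minimal Python sketch of one common textbook formulation built from real square root, division, and hypot; it is not the paper's digit-recurrence algorithm, and the function name `complex_sqrt` is purely illustrative.

```python
import math


def complex_sqrt(a, b):
    """Principal square root of a + b*i, returned as a (real, imag) tuple.

    Illustrative software routine only (an assumption, not the paper's
    hardware method): it uses one hypot, one square root and one division.
    """
    if a == 0.0 and b == 0.0:
        return (0.0, b)
    # t = sqrt((|a| + |z|) / 2) is the larger of the two result components.
    t = math.sqrt((abs(a) + math.hypot(a, b)) / 2.0)
    if a >= 0.0:
        # Real part is the large component; imaginary part follows from b = 2*re*im.
        return (t, b / (2.0 * t))
    # For a < 0 the imaginary part is the large component and takes the sign of b.
    return (abs(b) / (2.0 * t), math.copysign(t, b))
```

A call such as `complex_sqrt(-3.0, 4.0)` follows the second branch, which avoids the cancellation that the first formula would suffer when the real part is negative.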
Implementation of near Shannon Limit error-correcting codes using reconfigurable hardware
- Proc. IEEE Symp. on Field-Prog. Cust. Comput. Mach
, 2000
Cited by 7 (0 self)
Error correcting codes (ECCs) are widely used in digital communications. Recently, new types of ECCs have been proposed which permit error-free data transmission over noisy channels at rates which approach the Shannon capacity. For wireless communication, these new codes allow more data to be carried in the same spectrum, lower transmission power, and higher data security and compression. One new type of ECC, referred to as "Turbo Codes," has received a lot of attention, but is computationally expensive to decode and difficult to realize in hardware. Low Density Parity Check Codes (LDPCs), another ECC, also provide near Shannon limit error correction ability. However, LDPCs use a decoding scheme which is much more amenable to hardware implementation. This paper will first present an overview of these coding schemes, then discuss the issues involved in building an LDPC decoder using reconfigurable hardware. We present a hypothetical LDPC implementation using a commercial FPGA, which will give an idea of future research issues and performance gains.
High Performance Rotation Architectures Based On Radix-4 Cordic Algorithm
, 1997
Cited by 7 (5 self)
Traditionally, CORDIC algorithms have employed radix-2 in the first n/2 microrotations (n is the precision in bits) in order to preserve a constant scale factor. In this work we will present a full radix-4 CORDIC algorithm in rotation mode and circular coordinates and its corresponding selection function, and we will propose an efficient technique for the compensation of the non-constant scale factor. Three radix-4 CORDIC architectures are implemented: a) a word-serial architecture based on the zero-skipping technique; b) a pipelined architecture; and c) an application-specific architecture (the angles are known beforehand). The first two are general-purpose implementations in redundant arithmetic (carry-save), whereas the last one is a simplification of the first two. The proposed architectures are time and/or area efficient when compared with existing CORDIC architectures. 1. Introduction The CORDIC (COordinate Rotation DIgital Computer) algorithm was introduced by Volder [...
Mechanizing Verification of Arithmetic Circuits: SRT Division
- In Proc. 17th FSTTCS, volume 1346 of LNCS
, 1997
Cited by 6 (1 self)
The use of a rewrite-based theorem prover for verifying properties of arithmetic circuits is discussed. A prover such as Rewrite Rule Laboratory (RRL) can be used effectively for establishing number-theoretic properties of adders, multipliers and dividers. Since verification of adders and multipliers has been discussed elsewhere in earlier papers, the focus in this paper is on a divider circuit. An SRT division circuit similar to the one used in the Intel Pentium processor is mechanically verified using RRL. The number-theoretic correctness of the division circuit is established from its equational specification. The proof is generated automatically, and follows easily using the inference procedures for contextual rewriting and a decision procedure for the quantifier-free theory of numbers (Presburger arithmetic) already implemented in RRL. Additional enhancements to rewrite-based provers such as RRL that would further facilitate verifying properties of circuits with stru...
The Setup for Triangle Rasterization
, 1996
Cited by 6 (0 self)
Integrating the slope and setup calculations for triangles into the rasterizer relieves the host processor of intensive calculations and can significantly increase 3D system performance. The processing on the host is greatly reduced and much less data is passed from the host to the graphics subsystem. A setup architecture handling generalized triangle meshes and computing all necessary parameters for a high-end raster pipeline to generate Gouraud-shaded, texture- and bump-mapped triangles is described, and its benefits on the final bandwidth are shown. To efficiently compute the slopes and color gradients for each triangle, some implementation aspects of division and multiplication pipelines are discussed. The Setup for Triangle Rasterization Anders Kugler University of Tübingen - Computer Graphics Laboratory (1) (1) Universität Tübingen Wilhelm-Schickard-Institut für Informatik Graphisch-Interaktive Systeme Auf der Morgenstelle 10 D-72076 Tübingen - Germany email: kugler@gris.uni-t...
RN-coding of numbers: definition and some properties
- in Proceedings of the 17th IMACS World Congress on Scientific Computation, Applied Mathematics and Simulation
, 2004
Cited by 6 (3 self)
We define RN-codings as radix-signed representations of numbers for which rounding to the nearest is always identical to truncation. After giving characterizations of such representations, we investigate some of their properties, and we suggest algorithms for conversion to and from these codings.
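The defining property (truncation equals rounding to nearest) can be checked on a small radix-2 example. The sketch below assumes that the classical Booth recoding of a non-negative integer, with digits in {-1, 0, 1}, is a suitable instance of such a coding; the abstract itself does not name a conversion algorithm, and the helper names are hypothetical.

```python
def booth_recode(x):
    """Radix-2 signed digits of x, least significant first, each in {-1, 0, 1}.

    Uses the Booth rule d[i] = bit[i-1] - bit[i] (with bit[-1] = 0).
    Chosen here as an illustrative assumption, not taken from the paper.
    """
    if x < 0:
        raise ValueError("this sketch handles non-negative integers only")
    n = x.bit_length() + 1  # one extra position so the top bit is 0
    bits = [(x >> i) & 1 for i in range(n)]
    digits = []
    for i in range(n):
        lower = bits[i - 1] if i > 0 else 0
        digits.append(lower - bits[i])
    return digits


def digits_value(digits):
    """Integer represented by a signed-digit string (least significant first)."""
    return sum(d << i for i, d in enumerate(digits))


def truncate_at(digits, k):
    """Drop the k least significant digits and return the remaining value."""
    return sum(d << i for i, d in enumerate(digits) if i >= k)
```

For every cut position k, `truncate_at(booth_recode(x), k)` is a multiple of 2**k that lies within half of 2**k of x, so the truncated value is a nearest multiple (ties can go either way).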
2-D DCT Using On-Line Arithmetic
- In International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
, 1995
Cited by 5 (0 self)
We present a VLSI architecture for the evaluation of the (8x8)-point 2-D DCT with on-line arithmetic. The utilization of on-line arithmetic, in combination with an algorithm based on FCT and matrix multiplication, reduces the total hardware while maintaining a data rate and a latency similar to approaches based on distributed or parallel arithmetic. The architecture has been integrated in a chip using a 1 µm CMOS technology, occupying an area of 56.7 mm². 1. INTRODUCTION The two-dimensional Discrete Cosine Transform is considered an efficient technique for image compression and is being utilized as standard in several applications, including video compression, storing and transmission of still images (JPEG) and moving pictures (MPEG) and HDTV. Since direct implementation of the 2-D DCT of an NxN real matrix is computationally intensive, it is usually implemented by means of the row-column decomposition technique (separated 2-D DCT), in which the N-point 1-D DCT of each column of...
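The row-column decomposition mentioned in the introduction can be illustrated in a few lines. This is a plain floating-point Python sketch that assumes the orthonormal DCT-II definition; it says nothing about the on-line arithmetic or the FCT-based algorithm of the paper, and the function names are illustrative.

```python
import math


def dct_1d(v):
    """Orthonormal N-point DCT-II of a sequence (direct O(N^2) evaluation)."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d_row_column(block):
    """Separable 2-D DCT: 1-D DCT of every row, then of every column."""
    rows = [dct_1d(r) for r in block]
    # zip(*rows) walks the columns of the row-transformed block.
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # Transpose back so that result[u][v] has u as the vertical frequency.
    return [list(r) for r in zip(*cols)]
```

For an 8x8 block this replaces one 2-D transform with sixteen 8-point 1-D transforms, which is the reason the decomposition is attractive in hardware.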
Unified Mixed Radix 2-4 Redundant Cordic Processor
, 1996
Cited by 5 (2 self)
We present a unified mixed radix CORDIC algorithm with carry-save arithmetic and constant scale factor. The pipelined architecture of the processor is determined by a unique sequence of microrotations for the two modes of operation (rotation and vectoring) in circular and hyperbolic coordinates. The combination of radix-2 and radix-4 microrotations allows us to reduce the latency and size of the pipeline significantly. The unified algorithm is based on the correcting microrotation method, which we have extended to the vectoring mode in hyperbolic coordinates. We have also generalized the use of radix-4 microrotations to the two operation modes and coordinate systems. Index Terms: Unified CORDIC algorithm, redundant arithmetic, pipelined design, high speed processor. I INTRODUCTION CORDIC is an iterative algorithm for carrying out rotations using only addition and shift operations [7] [12] [13]. The basic iteration (microrotation-extension) is [13] x i+1 = x i + moe i 2 ...
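As background for the truncated iteration above, a conventional radix-2 CORDIC in rotation mode and circular coordinates can be sketched as follows. This is the textbook non-redundant form, written under the assumption of floating-point arithmetic for readability; it is not the mixed radix-2/radix-4 carry-save algorithm of the paper, and `cordic_rotate` is a hypothetical name.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians with radix-2 CORDIC microrotations.

    Converges for |angle| up to roughly 1.74 rad. Multiplying by 2**-i
    stands in for the right shift a hardware datapath would use.
    """
    z = angle
    gain = 1.0
    for i in range(iterations):
        sigma = 1.0 if z >= 0.0 else -1.0   # direction of this microrotation
        factor = math.ldexp(1.0, -i)        # 2**-i
        x, y = x - sigma * y * factor, y + sigma * x * factor
        z -= sigma * math.atan(factor)      # elementary angle, a table lookup in hardware
        gain *= math.sqrt(1.0 + factor * factor)
    # Every microrotation stretches the vector; compensate the constant scale factor.
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, theta)` returns approximations of the cosine and sine of `theta`, which makes the role of the scale factor compensation easy to see.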
Multiprecision Division on an 8-Bit Processor
- in Proc. 13th IEEE Symp. Computer Arithmetic, IEEE CS
, 1997
Cited by 5 (5 self)
Small processors can be especially useful in massively parallel architectures. This paper considers multiprecision division algorithms on an 8-bit processor (the Kestrel processor, currently in fabrication) that includes a small amount of memory and an 8-bit multiplier. We evaluate several variations of the Newton-Raphson reciprocal approximation methods for use with division. Our final single-precision algorithm requires 41 cycles to divide two 24-bit numbers to produce a 26-bit result. The double-precision version requires 98 cycles to divide two 53-bit numbers to produce a 55-bit result. This low cycle count is the result of several techniques including low-precision arithmetic, early introduction of dividends, and simple yet good initial reciprocal estimates. 1. Introduction This paper presents a study of division on an 8-bit processor. It is motivated by the Kestrel architecture, an 8-bit parallel processor tuned to sequence analysis [8]. The word size is a natural choice for seq...
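The Newton-Raphson reciprocal approach named in the abstract can be sketched in floating point as below. This is only the generic iteration x <- x * (2 - d * x) with a standard linear initial estimate; the multiprecision 8-bit implementation, the cycle counts, and the initial estimates used on Kestrel are not reproduced here, and the function names are illustrative.

```python
import math


def nr_reciprocal(d, iterations=4):
    """Approximate 1/d with Newton-Raphson iterations (generic sketch)."""
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(d))                # abs(d) = m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m      # linear estimate, relative error <= 1/17
    for _ in range(iterations):
        x = x * (2.0 - m * x)                # relative error is squared each pass
    return math.copysign(math.ldexp(x, -e), d)


def nr_divide(a, b):
    """Divide by multiplying with the approximated reciprocal."""
    return a * nr_reciprocal(b)
```

Because the error squares on every pass, a better initial estimate directly removes iterations, which is consistent with the abstract's emphasis on simple yet good initial reciprocal estimates.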
Accelerating Correctly Rounded Floating-Point Division when the Divisor Is Known in Advance
- IEEE Transactions on Computers
, 2004
Cited by 5 (3 self)
We present techniques for accelerating the floating-point computation of x/y when y is known before x. The proposed algorithms are oriented toward architectures with available fused-mac operations. The goal is to get exactly the same result as with usual division with rounding to nearest. It is known that the advanced computation of 1/y allows performing correctly rounded division in one multiplication plus two fused-macs. We show algorithms that reduce this latency to one multiplication and one fused-mac. This is achieved if a precision of at least n+1 bits is available, where n is the number of mantissa bits in the target format, or if y satisfies some properties that can be easily checked at compile-time. This requires a double-word approximation of 1/y (we also show how to get it). These techniques can be used by compilers to accelerate some numerical programs without loss of accuracy.
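A small experiment shows why multiplying by a precomputed reciprocal is not enough on its own, and what a residual correction with fused-macs looks like. The abstract gives no formulas, so the three-step sequence below is an assumption based on the usual residual-correction scheme; the fused-mac is emulated exactly with rationals, and all names are hypothetical.

```python
from fractions import Fraction


def fma(a, b, c):
    """a*b + c with a single final rounding, emulated with exact rationals."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def naive_quotient(x, z):
    """x times the precomputed reciprocal z = 1/y: two roundings in total."""
    return x * z


def corrected_quotient(x, y, z):
    """One multiplication plus two fused-macs (assumed sequence, see lead-in)."""
    q = x * z             # first approximation of x / y
    r = fma(-q, y, x)     # residual x - q*y, rounded once
    return fma(r, z, q)   # approximation corrected by the residual


z = 1.0 / 10.0
print(naive_quotient(3.0, z) == 3.0 / 10.0)            # False: off by one ulp
print(corrected_quotient(3.0, 10.0, z) == 3.0 / 10.0)  # True for this input
```

The example only demonstrates the effect on one input pair; it makes no claim about the conditions under which the shorter one-multiplication, one-fused-mac sequence of the paper is valid.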

