Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors
Abstract - Cited by 4 (4 self)
... up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing, which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying loop interchange, but many conflict misses remain if the filter length exceeds the cache associativity. We propose two methods to reduce the number of conflict misses, which provide an additional performance improvement of up to a factor of 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4. We present MMX and SSE implementations of horizontal and vertical filtering that provide maximum speedups of 3.39 and 6.72, respectively. Index Terms: cache, discrete wavelet transform, memory hierarchy, multimedia extensions, SIMD.
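The loop interchange mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the row-major flat-list layout, the periodic boundary extension, and the decimation by two are all assumptions made for illustration. Python is used for brevity, so the sketch shows only the change in traversal order; the cache benefit itself would appear in a compiled language with row-major arrays.

```python
def vertical_filter_column_order(img, height, width, taps):
    """Straightforward order: finish one column before starting the next.

    Consecutive reads are `width` elements apart, which is the access
    pattern that causes many cache misses in a row-major array.
    """
    out = [0.0] * ((height // 2) * width)
    for j in range(width):                    # column index outermost
        for i in range(height // 2):          # decimated output row
            acc = 0.0
            for k, h in enumerate(taps):
                row = (2 * i + k) % height    # periodic extension (assumption)
                acc += h * img[row * width + j]
            out[i * width + j] = acc
    return out


def vertical_filter_interchanged(img, height, width, taps):
    """Loop-interchanged order: the column index is the innermost loop.

    Each input row is now read with unit stride, but len(taps) input rows
    are live at the same time for every output row.
    """
    out = [0.0] * ((height // 2) * width)
    for i in range(height // 2):
        for k, h in enumerate(taps):
            base = ((2 * i + k) % height) * width
            for j in range(width):            # unit-stride inner loop
                out[i * width + j] += h * img[base + j]
    return out
```

Both functions compute the same coefficients; only the traversal order differs. Because the interchanged version keeps one input row per filter tap in flight, it is consistent with the abstract's remark that conflict misses remain when the filter length exceeds the cache associativity.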
A Comparison of Two SIMD Implementations of the 2D Discrete Wavelet Transform
Abstract
There are generally two algorithms for traversing an image to implement the 2D Discrete Wavelet Transform (DWT): the Row-Column Wavelet Transform (RCWT) and the Line-Based Wavelet Transform (LBWT). In the RCWT algorithm, the 2D DWT is divided into two 1D DWTs: horizontal and vertical filtering. The horizontal filtering phase processes the rows of the original image and stores the wavelet coefficients in an auxiliary matrix. Thereafter, the vertical filtering phase processes the columns of the auxiliary matrix and stores the results back in the original matrix. In the LBWT algorithm, vertical filtering starts as soon as a sufficient number of rows, as determined by the filter length, has been horizontally processed. In this paper, we answer the following questions. First, which implementation is easier to vectorize using SIMD instructions? Second, which SIMD implementation delivers higher performance? Our initial results for Daubechies' transform with four coefficients show that the SIMD implementation of the LBWT algorithm is more complicated than that of the RCWT algorithm, but it is 1.60 times faster for an image of size 4096 × 4096.
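The RCWT traversal described in this abstract can be sketched as follows for the four-coefficient Daubechies filter. This is a scalar illustration, not the paper's SIMD code: the names `dwt_1d` and `rcwt_2d`, the periodic boundary extension, the low-half/high-half output layout, and the restriction to even dimensions of at least 4 are assumptions.

```python
import math

# Daubechies-4 analysis filters (standard orthonormal coefficients).
_S3 = math.sqrt(3.0)
_D = 4.0 * math.sqrt(2.0)
LOW = [(1 + _S3) / _D, (3 + _S3) / _D, (3 - _S3) / _D, (1 - _S3) / _D]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def dwt_1d(signal):
    """One level of the 1D DWT: low-pass half first, then high-pass half."""
    n = len(signal)
    half = n // 2
    out = [0.0] * n
    for i in range(half):
        lo = hi = 0.0
        for k in range(4):
            x = signal[(2 * i + k) % n]       # periodic extension (assumption)
            lo += LOW[k] * x
            hi += HIGH[k] * x
        out[i] = lo
        out[half + i] = hi
    return out


def rcwt_2d(image):
    """Row-column transform: filter all rows, then all columns."""
    rows, cols = len(image), len(image[0])
    # Horizontal phase: rows of the image go into an auxiliary matrix.
    aux = [dwt_1d(row) for row in image]
    # Vertical phase: columns of the auxiliary matrix go back into the image.
    for j in range(cols):
        column = dwt_1d([aux[i][j] for i in range(rows)])
        for i in range(rows):
            image[i][j] = column[i]
    return image
```

A line-based variant would interleave the two phases, starting the vertical filter once enough rows (four, for this filter) have been horizontally processed, rather than waiting for the whole auxiliary matrix.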

