## Improving the Memory Behavior of Vertical Filtering in the Discrete Wavelet Transform (2006)

Venue: Proc. 3rd ACM Int. Conf. on Computing Frontiers

Citations: 5 (4 self)

### BibTeX

@INPROCEEDINGS{Shahbahrami06improvingthe,
    author = {Asadollah Shahbahrami and Ben Juurlink and Stamatis Vassiliadis},
    title = {Improving the Memory Behavior of Vertical Filtering in the Discrete Wavelet Transform},
    booktitle = {In Proc. 3rd ACM Int. Conf. on Computing Frontiers},
    year = {2006},
    pages = {253--260}
}

### Abstract

The discrete wavelet transform (DWT) is used in several image and video compression standards, in particular JPEG2000. A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well known that a straightforward implementation of vertical filtering (assuming a row-major layout) induces many cache misses, due to a lack of spatial locality. This can be avoided by interchanging the loops. This paper shows, however, that the resulting implementation suffers significantly from 64K aliasing, which occurs in the Pentium 4 when two data blocks are accessed whose addresses are a multiple of 64K apart, and we propose two techniques to avoid it. In addition, if the filter length is longer than four, the number of ways of the L1 data cache of the Pentium 4 is insufficient to avoid cache conflict misses. Consequently, we propose two methods for reducing conflict misses. Although experimental results have been collected on the Pentium 4, the techniques are general and can be applied to other processors with different cache organizations as well. The proposed techniques improve the performance of vertical filtering compared to already optimized baseline implementations by a factor of 3.11 for the (5, 3) lifting scheme, 3.11 for Daubechies’ transform with four coefficients, and 1.99 for the Cohen, Daubechies, and Feauveau 9/7 transform.
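The loop interchange and the 64K aliasing condition mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a row-major image stored in a flat list, computes only the predict (high-pass) step of the (5, 3) lifting scheme, and uses symmetric extension at the bottom edge. All function names, the 4-byte element size, and the padding amount in the usage note are hypothetical choices made for the example.

```python
from math import gcd

ALIAS_BYTES = 64 * 1024  # 64K aliasing distance described in the abstract


def _mirror_below(r, rows):
    """Row index below r, using symmetric extension at the bottom edge (assumption)."""
    return r + 1 if r + 1 < rows else r - 1


def predict_column_order(img, rows, cols):
    """Straightforward order: walk down one column at a time (stride of cols elements)."""
    out = [0] * ((rows // 2) * cols)
    for c in range(cols):
        for r in range(1, rows, 2):
            above = img[(r - 1) * cols + c]
            below = img[_mirror_below(r, rows) * cols + c]
            out[(r // 2) * cols + c] = img[r * cols + c] - (above + below) // 2
    return out


def predict_row_order(img, rows, cols):
    """Interchanged loops: the inner loop moves along a row, so accesses are contiguous."""
    out = [0] * ((rows // 2) * cols)
    for r in range(1, rows, 2):
        below_r = _mirror_below(r, rows)
        for c in range(cols):
            above = img[(r - 1) * cols + c]
            below = img[below_r * cols + c]
            out[(r // 2) * cols + c] = img[r * cols + c] - (above + below) // 2
    return out


def aliasing_row_gap(cols, elem_size=4):
    """Smallest number of rows k > 0 such that k row strides is a multiple of 64K bytes."""
    stride = cols * elem_size
    return ALIAS_BYTES // gcd(stride, ALIAS_BYTES)
```

Both functions produce the same coefficients; only the memory access order differs. With 4-byte elements, `aliasing_row_gap(8192)` is 2, so the rows above and below the current row sit exactly 64K apart. Padding each row by a few elements (for example a width of 8208) pushes that gap to 1024 rows. Padding is shown here only as a generic remedy and is not necessarily one of the two techniques the paper proposes.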

### Citations

1653 | Orthonormal bases of compactly supported wavelets
- Daubechies
- 1988
Citation Context ...ferent filters are considered in this paper, namely the (5, 3) lifting scheme [9, 21], Daubechies’ transform with four coefficients [22] (Daub-4), and the Cohen, Daubechies and Feauveau 9/7 transform [8] (CDF-9/7). The reasons for considering these filters are (1) the lifting and CDF-9/7 transforms are included in Part 1 of the JPEG2000 standard [15], and (2) these transforms have been considered in ...

462 | Factoring wavelet transforms into lifting steps
- Daubechies, Sweldens
- 1998
Citation Context ... therefore, focus on the implementation of the 2D DWT on general-purpose processors, in particular the Pentium 4. Three different filters are considered in this paper, namely the (5, 3) lifting scheme [9, 21], Daubechies’ transform with four coefficients [22] (Daub-4), and the Cohen, Daubechies and Feauveau 9/7 transform [8] (CDF-9/7). The reasons for considering these filters are (1) the lifting and CDF-...

449 | The lifting scheme: a custom-design construction of biorthogonal wavelets
- Sweldens
- 1996
Citation Context ... therefore, focus on the implementation of the 2D DWT on general-purpose processors, in particular the Pentium 4. Three different filters are considered in this paper, namely the (5, 3) lifting scheme [9, 21], Daubechies’ transform with four coefficients [22] (Daub-4), and the Cohen, Daubechies and Feauveau 9/7 transform [8] (CDF-9/7). The reasons for considering these filters are (1) the lifting and CDF-...

240 | Wavelets for Computer Graphics Theory and Applications
- Salesin
- 1996
Citation Context ...requency bands each contain N/2 samples. With the correct choice of filters, this operation is reversible. This process decomposes the original image into two sub-bands: the lower and the higher band [20]. This transform can be extended to multiple dimensions by using separable filters. A 2D DWT can be performed by first performing a 1D DWT on each row (horizontal filtering) of the image followed by a...

70 | An overview of JPEG-2000
- Marcellin, Gormish, et al.
- 2000
Citation Context ...Keywords: Discrete Wavelet Transform, memory hierarchy, cache, performance. 1. INTRODUCTION The wavelet transform is mainly used for image and video compression. Standards such as MPEG-4 and JPEG2000 [13, 15] are based on the 2D discrete wavelet transform (DWT). ...

67 | Computer Systems: A Programmer’s Perspective
- Bryant, O’Hallaron
- 2007
Citation Context ... Performance was measured using the IA-32 cycle counter [12]. Cycle counters provide a very precise tool for measuring the time that elapses between two different points in the execution of a program [2, 19]. In order to eliminate the effects of context switching and compulsory cache misses, the K-best measurement scheme and a warmed-up cache have been used [2]. That means the function is repeatedly (K ti...

36 | Line-based, reduced memory, wavelet image compression
- Chrysafis, Ortega
- 2000
Citation Context ...tput coefficient. Chaver et al. [4, 5] also considered the memory hierarchy issue but also vectorized the 2D DWT using an SIMD extension. They proposed combining aggregation with a line-based approach [7], which starts vertical filtering as soon as a sufficient number of lines (determined by the filter lengths) has been filtered horizontally. This approach reduces the amount of memory required. In add...

32 | JPEG 2000: the upcoming still image compression standard, Pattern Recogn
- Skodras, Christopoulos, et al.
- 2001
Citation Context ... The JPEG2000 compression standard has been created to provide higher compression ratios than JPEG. It can be very time-consuming, however. For example, simulation results presented in [18] show that JPEG2000 encoding can take up to 34 times longer than JPEG encoding. Furthermore, results presented in [3, 14] show that the DWT consumes 40 to 60% of the JPEG2000 encoding time. One way to...

13 | 2-D wavelet transform enhancement on general-purpose microprocessors: Memory hierarchy and SIMD parallelism exploitation
- Chaver, Tenllado, et al.
- 2002
Citation Context ...sition level is reported. Both strip-mining and recursive data layout do not remove the conflicts that may exist between the input coefficients needed to compute one output coefficient. Chaver et al. [4, 5] also considered the memory hierarchy issue but also vectorized the 2D DWT using an SIMD extension. They proposed combining aggregation with a line-based approach [7], which starts vertical filtering a...

13 | Vectorization of the 2D Wavelet Lifting Transform Using SIMD Extensions
- Chaver, Tenllado, et al.
- 2003
Citation Context ...r considering these filters are (1) the lifting and CDF-9/7 transforms are included in Part 1 of the JPEG2000 standard [15], and (2) these transforms have been considered in many recent papers (e.g., [1, 10, 16, 5, 22, 6]). A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well known that a straightforward implementation of vertical filtering (assuming a ro...

12 | Cache Issues with JPEG2000 Wavelet Lifting
- Meerwald, Norcen, et al.
- 2002
Citation Context ...can be very time-consuming, however. For example, simulation results presented in [18] show that JPEG2000 encoding can take up to 34 times longer than JPEG encoding. Furthermore, results presented in [3, 14] show that the DWT consumes 40 to 60% of the JPEG2000 encoding time. One way to reduce the execution time of the DWT is by developing special-purpose hardware. Programmable processors, however, are pr...

11 | An Overview of the JPEG2000
- Rabbani, Joshi
- 2010
Citation Context ...Keywords: Discrete Wavelet Transform, memory hierarchy, cache, performance. 1. INTRODUCTION The wavelet transform is mainly used for image and video compression. Standards such as MPEG-4 and JPEG2000 [13, 15] are based on the 2D discrete wavelet transform (DWT). ...

10 | Measuring Execution Time and Real-Time Performance
- Stewart
- 2002
Citation Context ... Performance was measured using the IA-32 cycle counter [12]. Cycle counters provide a very precise tool for measuring the time that elapses between two different points in the execution of a program [2, 19]. In order to eliminate the effects of context switching and compulsory cache misses, the K-best measurement scheme and a warmed-up cache have been used [2]. That means the function is repeatedly (K ti...

7 | Performance comparison of SIMD implementations of the discrete wavelet transform
- Shahbahrami, Juurlink, et al.
Citation Context ...etween the input coefficients needed to compute one output coefficient if the filter length is larger than the number of cache ways. Previous work has focused mainly on improving spatial locality. In [17] we considered various ways to implement the 2D DWT using MMX instructions. In this work we observed that the performance improvement provided by MMX varies significantly depending on the image size. ...

6 | A Memory System Supporting the Efficient SIMD Computation of the Two Dimensional DWT
- Trenas, Lopez, et al.
- 1998
Citation Context ...on general-purpose processors, in particular the Pentium 4. Three different filters are considered in this paper, namely the (5, 3) lifting scheme [9, 21], Daubechies’ transform with four coefficients [22] (Daub-4), and the Cohen, Daubechies and Feauveau 9/7 transform [8] (CDF-9/7). The reasons for considering these filters are (1) the lifting and CDF-9/7 transforms are included in Part 1 of the JPEG20...

5 | Cache-efficient wavelet lifting in JPEG 2000, Multimedia and Expo
- Chatterjee, Brooks
- 2002
Citation Context ...can be very time-consuming, however. For example, simulation results presented in [18] show that JPEG2000 encoding can take up to 34 times longer than JPEG encoding. Furthermore, results presented in [3, 14] show that the DWT consumes 40 to 60% of the JPEG2000 encoding time. One way to reduce the execution time of the DWT is by developing special-purpose hardware. Programmable processors, however, are pr...

4 | Reducing 3D Wavelet Transform Execution Time Using Blocking and the Streaming SIMD Extensions
- Bernabé, García, et al.
Citation Context ...r considering these filters are (1) the lifting and CDF-9/7 transforms are included in Part 1 of the JPEG2000 standard [15], and (2) these transforms have been considered in many recent papers (e.g., [1, 10, 16, 5, 22, 6]). A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well known that a straightforward implementation of vertical filtering (assuming a ro...

2 | The Parallel Algorithm of 2-D Discrete Wavelet Transform
- He, Zhang
- 2003
Citation Context ...r considering these filters are (1) the lifting and CDF-9/7 transforms are included in Part 1 of the JPEG2000 standard [15], and (2) these transforms have been considered in many recent papers (e.g., [1, 10, 16, 5, 22, 6]). A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well known that a straightforward implementation of vertical filtering (assuming a ro...

1 | Embedded Vector Processor Architecture for Real-Time Wavelet Video Compression
- Shafer
- 2004