Integral histogram: A fast way to extract histograms in cartesian spaces
- in Proc. IEEE Conf. on Computer Vision and Pattern Recognition
, 2005
Cited by 87 (6 self)
We present a novel method, which we refer to as an integral histogram, to compute the histograms of all possible target regions in a Cartesian data space. Our method has three distinct advantages: 1- It is computationally superior to the conventional approach. The integral histogram method makes it possible to employ even an exhaustive search process in real time, which was impractical before. 2- It can be extended to higher data dimensions, uniform and non-uniform bin formations, and multiple target scales without sacrificing its computational advantages. 3- It enables the description of high-level histogram features. We exploit the spatial arrangement of data points, and recursively propagate an aggregated histogram by starting from the origin and traversing through the remaining points along either a scan-line or a wave-front. At each step, we update a single bin using the values of the integral histogram at the previously visited neighboring data points. After the integral histogram is propagated, the histogram of any target region can be computed easily by using simple arithmetic operations.
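The scan-line propagation and the region query described in this abstract can be illustrated with a minimal 2D sketch. It assumes the image is a list of rows whose pixels are already quantized to bin indices; the function names `integral_histogram` and `region_histogram` are hypothetical and are not taken from the paper.

```python
def integral_histogram(image, num_bins):
    """Return H where H[y][x][b] counts pixels of bin b in rows < y and columns < x."""
    height, width = len(image), len(image[0])
    # One extra row and column of zeros so region queries need no edge cases.
    H = [[[0] * num_bins for _ in range(width + 1)] for _ in range(height + 1)]
    for y in range(height):          # scan-line traversal starting at the origin
        for x in range(width):
            cell = H[y + 1][x + 1]
            up, left, diag = H[y][x + 1], H[y + 1][x], H[y][x]
            for b in range(num_bins):
                # Combine the previously visited neighbors.
                cell[b] = up[b] + left[b] - diag[b]
            # Only the bin of the current pixel gains a count.
            cell[image[y][x]] += 1
    return H


def region_histogram(H, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1 (four lookups per bin)."""
    num_bins = len(H[0][0])
    return [
        H[bottom][right][b] - H[top][right][b] - H[bottom][left][b] + H[top][left][b]
        for b in range(num_bins)
    ]


image = [[0, 1, 1],
         [2, 1, 0],
         [0, 0, 2]]
H = integral_histogram(image, num_bins=3)
print(region_histogram(H, 0, 0, 3, 3))  # whole image
print(region_histogram(H, 1, 1, 3, 3))  # lower-right 2x2 block
```

Once `H` is built, the cost of a region query depends only on the number of bins, not on the region size, which is what makes an exhaustive search over candidate regions affordable.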
ARCHITECTURE-AWARE CLASSICAL TAYLOR SHIFT BY 1
, 2005
Cited by 15 (2 self)
We present algorithms that outperform straightforward implementations of classical Taylor shift by 1. For input polynomials of low degrees a method of the SACLIB library is faster than straightforward implementations by a factor of at least 2; for higher degrees we develop a method that is faster than straightforward implementations by a factor of up to 7. Our Taylor shift algorithm requires more word additions than straightforward implementations but it reduces the number of cycles per word addition by reducing memory traffic and the number of carry computations. The introduction of signed digits, suspended normalization, radix reduction, and delayed carry propagation enables our algorithm to take advantage of the technique of register tiling which is commonly used by optimizing compilers. While our algorithm is written in a high-level language, it depends on several parameters that can be tuned to the underlying architecture.
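For context, the straightforward classical Taylor shift by 1 that serves as the baseline here can be sketched as follows. This is only the textbook addition scheme, not the tuned algorithm of the paper; the function name `taylor_shift_by_1` and the low-to-high coefficient order are assumptions made for illustration.

```python
def taylor_shift_by_1(coeffs):
    """Given coeffs[i] as the coefficient of x**i in p(x), return those of p(x + 1).

    Uses only additions: n * (n + 1) / 2 of them for a polynomial of degree n.
    """
    a = list(coeffs)
    n = len(a) - 1
    for i in range(n):
        # Each pass adds every coefficient into its lower neighbor,
        # sweeping from the high-degree end down to position i.
        for j in range(n - 1, i - 1, -1):
            a[j] += a[j + 1]
    return a


# p(x) = 2x^2 - 3x + 5, so p(x + 1) = 2x^2 + x + 4
print(taylor_shift_by_1([5, -3, 2]))
```

Python integers are arbitrary precision, so the carry handling that the paper optimizes at the machine-word level is hidden inside each `+=` in this sketch.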
The Etree Library: A System for Manipulating Large Octrees on Disk
, 2003
Cited by 12 (11 self)
This report describes a library, called the etree library, that allows C programmers to manipulate large octrees stored on disk. Octrees are stored as a sequence of fixed-size octant records sorted by a locational code order that is equivalent to a preorder traversal of the tree and a Z-order traversal through the domain. The sorted records are indexed by a conventional file-resident B-tree index and queried using fixed-length locational code keys. A schema can be defined to make an etree portable across different platforms. The etree library provides functions for creating, modifying, and searching octrees, including efficient mechanisms for appending octants and iterating over octants in Z-order. The library is the foundation for a larger research effort aimed at enabling scientists and engineers to solve large physical simulations on their desktop systems by recasting the simulation process to work directly on large etrees stored on disk.
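The idea of a key whose sort order follows a Z-order traversal can be illustrated with a generic Morton code, which interleaves the bits of the three coordinates. This is a sketch of the general technique only: the function name `morton3` and the 10-bit coordinate width are assumptions, and the actual key layout used by the etree library (which is a C library) is not reproduced here.

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y and z into one integer key.

    Bit i of x lands at position 3*i, bit i of y at 3*i + 1, bit i of z at 3*i + 2.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


# Sorting the eight cells of a 2x2x2 block by key visits them in Z-order.
cells = [(x, y, z) for z in range(2) for y in range(2) for x in range(2)]
for cell in sorted(cells, key=lambda c: morton3(*c)):
    print(cell, morton3(*cell))
```

Because the key is a fixed-length integer, records sorted by it can be stored in an ordinary ordered index such as a B-tree, which matches the storage approach the abstract describes.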
Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform
- in Proc. 16th IEEE Int. Conf. on Application Specific Systems Architectures and Processors (ASAP)
, 2005
Cited by 7 (6 self)
This paper focuses on SIMD implementations of the 2D discrete wavelet transform (DWT). The transforms considered are Daubechies' real-to-real method of four coefficients (Daub-4) and the integer-to-integer (5, 3) lifting scheme. Daub-4 is implemented using SSE and the lifting scheme using MMX, and their performance is compared to C implementations on a Pentium 4 processor. The MMX implementation of the lifting scheme is up to 4.0x faster than the corresponding C program for a 1-level 2D DWT, while the SSE implementation of Daub-4 is up to 2.6x faster than the C version. It is shown that for some image sizes, the performance is significantly hampered by the so-called 64K aliasing problem, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart. It is also shown that for the (5, 3) lifting scheme, a 12-bit word size is sufficient for a 5-level decomposition of the 2D DWT for images of up to 10 bits per pixel.
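As background for the integer-to-integer (5, 3) lifting scheme mentioned above, here is a scalar 1D sketch of one decomposition level with its inverse. It assumes an even-length integer signal and symmetric extension at the borders; the function names are hypothetical, and the paper's MMX implementation is not reproduced.

```python
def lift53_forward(x):
    """One level of the reversible (5, 3) lifting transform on an even-length list."""
    assert len(x) >= 2 and len(x) % 2 == 0
    even, odd = x[0::2], x[1::2]
    half = len(even)
    # Predict step: each odd sample minus the floor-average of its even neighbors.
    d = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # symmetric extension
        d.append(odd[i] - ((even[i] + right) >> 1))
    # Update step: each even sample plus a rounded quarter of the adjacent details.
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]                # symmetric extension
        s.append(even[i] + ((left + d[i] + 2) >> 2))
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward exactly by reversing the two lifting steps."""
    half = len(s)
    even = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        even.append(s[i] - ((left + d[i] + 2) >> 2))
    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        x.append(even[i])
        x.append(d[i] + ((even[i] + right) >> 1))
    return x


signal = [12, 15, 9, 200, 201, 3, 0, 1023]
low, high = lift53_forward(signal)
print(low, high)
print(lift53_inverse(low, high) == signal)
```

Every step uses only integer additions and shifts, which is why the scheme maps well onto packed integer SIMD instructions and why the required word size can be bounded as the abstract reports.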
An adaptive orthogonal frequency division multiplexing baseband modem for wideband wireless channels, master's thesis
, 2006
Cited by 5 (0 self)
This thesis shows the design of an Orthogonal Frequency Division Multiplexing baseband modem with Frequency Adaptive Modulation protocol for a wideband indoor wireless channel. The baseband modem is implemented on a Field Programmable Gate Array and uses 294,939 2-input NAND gates with a clock frequency of 128 MHz. The Frequency Adaptive Modulation algorithm is 6% of the entire baseband modem, which means that it is of low complexity. The baseband modem is then integrated with an RF Front End. The maximum transmit power of the RF Front End is 7.5 dBm. This prototype takes 128 MHz of bandwidth and divides it into 128 1-MHz bins. The carrier frequency is at 5.25 GHz. Measurements are taken with this prototype to investigate the concept of Frequency Adaptive Modulation. With a target uncoded Bit Error Rate of 10^-3, it is found that at distances of 1.0 m to 10.8 m, the data rate varies from 355 Mbps to 10 Mbps. The average data rate of this system is 2.57 times the average data rate without Frequency Adaptive Modulation. The fact that a Rayleigh channel is decomposed into Gaussian
Improving the Memory Behavior of Vertical Filtering in the Discrete Wavelet Transform
- In Proc. 3rd ACM Int. Conf. on Computing Frontiers
, 2006
Cited by 4 (4 self)
The discrete wavelet transform (DWT) is used in several image and video compression standards, in particular JPEG2000. A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well known that a straightforward implementation of vertical filtering (assuming a row-major layout) induces many cache misses, due to lack of spatial locality. This can be avoided by interchanging the loops. This paper shows, however, that the resulting implementation suffers significantly from 64K aliasing, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart, and we propose two techniques to avoid it. In addition, if the filter length is longer than four, the number of ways of the L1 data cache of the Pentium 4 is insufficient to avoid cache conflict misses. Consequently, we propose two methods for reducing conflict misses. Although experimental results have been collected on the Pentium 4, the techniques are general and can be applied to other processors with different cache organizations as well. The proposed techniques improve the performance of vertical filtering compared to already optimized baseline implementations by a factor of 3.11 for the (5, 3) lifting scheme, 3.11 for Daubechies' transform of four coefficients, and by a factor of 1.99 for the Cohen, Daubechies, and Feauveau 9/7 transform.
Efficient Vectorization of the FIR Filter
- In Proc. 16th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC)
, 2005
Cited by 2 (2 self)
The Finite Impulse Response (FIR) filter is one of the most important digital signal processing (DSP) kernels. It performs filtering of speech signals in modern voice coders such as the ETSI GSM EFR/AMR or ITU G.729, as well as in many other signal processing areas. Many contemporary digital signal processors as well as general-purpose microprocessors employ SIMD instructions to exploit the data-level parallelism present in media kernels such as the FIR filter. An important question is, therefore, how can the FIR filter be effectively vectorized to exploit the SIMD capabilities of the architecture? In this paper, the performance of different methods for vectorizing the FIR filter, such as vectorizing the inner loop and vectorizing the outer loop, is compared using C programming and SIMD instructions. Additionally, we present another method to vectorize the FIR filter. It vectorizes the inner loop as well as the outer loop. All of these methods suffer from misaligned memory accesses. To overcome this problem we use four copies of the filter coefficients that are aligned to 8-byte boundaries in memory. Our results show that the MMX implementation that vectorizes the inner loop is up to 3.34 times faster than the corresponding C implementation. Furthermore, the MMX implementation that vectorizes the inner loop as well as the outer loop and avoids misaligned memory accesses is up to 2.2 times faster than the MMX implementation that only vectorizes the inner loop. Finally, the MMX implementation that vectorizes both loops and also avoids misaligned memory accesses is up to 1.69 times faster than the version that does not avoid misaligned memory accesses. Keywords: FIR, SIMD, multimedia applications, data reuse.
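The difference between the inner loop (over filter taps) and the outer loop (over output samples) can be made concrete with a scalar sketch. The second function mimics outer-loop vectorization by computing a group of adjacent outputs together, with a Python list standing in for a SIMD register. The function names, the lane count of 4, and the tap ordering (coefficients applied without reversal) are assumptions for illustration; this is not the paper's MMX code.

```python
def fir_scalar(x, h):
    """Reference FIR: y[n] = sum over k of h[k] * x[n + k], for fully overlapping windows."""
    n_out = len(x) - len(h) + 1
    return [sum(h[k] * x[n + k] for k in range(len(h))) for n in range(n_out)]


def fir_outer_lanes(x, h, lanes=4):
    """Same result, but `lanes` adjacent outputs are accumulated together.

    Each pass over k applies one coefficient to a group of neighboring inputs,
    which is the access pattern an outer-loop SIMD version would use.
    """
    n_out = len(x) - len(h) + 1
    y = [0] * n_out
    for n0 in range(0, n_out, lanes):
        width = min(lanes, n_out - n0)      # the last group may be partial
        acc = [0] * width                   # stands in for one SIMD accumulator
        for k in range(len(h)):
            for lane in range(width):
                acc[lane] += h[k] * x[n0 + lane + k]
        y[n0:n0 + width] = acc
    return y


x = [1, 2, 3, 4, 5, 6]
h = [1, 0, -1]
print(fir_scalar(x, h))
print(fir_outer_lanes(x, h))
```

In the grouped form the input window starts at `n0 + k`, which moves by one element per tap, so most loads do not fall on a vector-width boundary. That is the kind of misaligned access the abstract refers to.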
Term-Level Verification of a Pipelined CISC Microprocessor
, 2005
Cited by 2 (2 self)
By abstracting the details of the data representations and operations in a microprocessor, term-level verification can formally prove that a pipelined microprocessor faithfully implements its sequential, instruction-set architecture specification. Previous efforts in this area have focused on reduced instruction set computer (RISC) and very long instruction word (VLIW) processors. This work reports

