Results 1–10 of 75
Integral histogram: A fast way to extract histograms in Cartesian spaces
 in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2005
Abstract

Cited by 150 (15 self)
We present a novel method, which we refer to as an integral histogram, to compute the histograms of all possible target regions in a Cartesian data space. Our method has three distinct advantages: 1) It is computationally superior to the conventional approach. The integral histogram method makes it possible to employ even an exhaustive search process in real time, which was impractical before. 2) It can be extended to higher data dimensions, uniform and nonuniform bin formations, and multiple target scales without sacrificing its computational advantages. 3) It enables the description of high-level histogram features. We exploit the spatial arrangement of data points, and recursively propagate an aggregated histogram by starting from the origin and traversing through the remaining points along either a scanline or a wavefront. At each step, we update a single bin using the values of the integral histogram at the previously visited neighboring data points. After the integral histogram is propagated, the histogram of any target region can be computed easily using simple arithmetic operations.
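The scanline propagation and the four-lookup region query described in this abstract can be sketched as follows. This is an illustrative Python/NumPy sketch of the general technique, not the authors' implementation; the function names and the assumption that pixels are already mapped to bin indices are ours.

```python
import numpy as np

def integral_histogram(bins, nbins):
    # bins: 2-D array of per-pixel bin indices in [0, nbins).
    # H[y, x, b] counts pixels with bin b in the rectangle rows [0, y), cols [0, x).
    h, w = bins.shape
    H = np.zeros((h + 1, w + 1, nbins), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            # propagate from the previously visited neighbors (left, top, top-left)
            H[y + 1, x + 1] = H[y, x + 1] + H[y + 1, x] - H[y, x]
            H[y + 1, x + 1, bins[y, x]] += 1   # update a single bin per step
    return H

def region_hist(H, y0, x0, y1, x1):
    # histogram of rows y0..y1-1, cols x0..x1-1 via four table lookups
    return H[y1, x1] - H[y0, x1] - H[y1, x0] + H[y0, x0]
```

Once `H` is built, the histogram of any rectangle costs four vector additions/subtractions, which is what makes exhaustive search over all target regions feasible.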
ARCHITECTURE-AWARE CLASSICAL TAYLOR SHIFT BY 1
, 2005
Abstract

Cited by 17 (2 self)
We present algorithms that outperform straightforward implementations of classical Taylor shift by 1. For input polynomials of low degrees a method of the SACLIB library is faster than straightforward implementations by a factor of at least 2; for higher degrees we develop a method that is faster than straightforward implementations by a factor of up to 7. Our Taylor shift algorithm requires more word additions than straightforward implementations, but it reduces the number of cycles per word addition by reducing memory traffic and the number of carry computations. The introduction of signed digits, suspended normalization, radix reduction, and delayed carry propagation enables our algorithm to take advantage of the technique of register tiling, which is commonly used by optimizing compilers. While our algorithm is written in a high-level language, it depends on several parameters that can be tuned to the underlying architecture.
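For reference, the "classical" Taylor shift by 1 that the paper's optimized variants compete against is the Pascal-triangle scheme of repeated coefficient additions. A minimal Python sketch of that baseline (our naming; the paper's word-level, carry-aware versions refine exactly this addition pattern):

```python
def taylor_shift_1(coeffs):
    # coeffs[i] is the coefficient of x**i; returns the coefficients of p(x+1).
    # Classical scheme: n(n+1)/2 word-level additions arranged as in
    # Pascal's triangle (equivalently, repeated synthetic division by x-1... shifted).
    a = list(coeffs)
    n = len(a)
    for k in range(n - 1):
        for i in range(n - 2, k - 1, -1):
            a[i] += a[i + 1]   # every cost in the paper counts these additions
    return a
```

For example, shifting p(x) = x^2 yields (x+1)^2 = x^2 + 2x + 1.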
The Etree Library: A System for Manipulating Large Octrees on Disk
, 2003
Abstract

Cited by 17 (15 self)
This report describes a library, called the etree library, that allows C programmers to manipulate large octrees stored on disk. Octrees are stored as a sequence of fixed-size octant records sorted by a locational code order that is equivalent to a preorder traversal of the tree and a Z-order traversal through the domain. The sorted records are indexed by a conventional file-resident B-tree index and queried using fixed-length locational code keys. A schema can be defined to make an etree portable across different platforms. The etree library provides functions for creating, modifying, and searching octrees, including efficient mechanisms for appending octants and iterating over octants in Z-order. The library is the foundation for a larger research effort aimed at enabling scientists and engineers to solve large physical simulations on their desktop systems by recasting the simulation process to work directly on large etrees stored on disk.
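The locational codes the abstract refers to are, in essence, Morton (Z-order) codes: coordinate bits interleaved so that numeric order on keys matches a preorder/Z-order traversal. A toy Python sketch of the idea (our simplified key layout, not the etree library's actual on-disk key format):

```python
def locational_code(x, y, z, level, max_level=5):
    # Interleave the x, y, z coordinate bits (Morton / Z-order) from most
    # significant to least, then append the octant level in the low byte,
    # so that sorting codes numerically visits octants in Z-order.
    code = 0
    for b in range(max_level - 1, -1, -1):
        code = (code << 3) | (((x >> b) & 1) << 2) \
                           | (((y >> b) & 1) << 1) \
                           | ((z >> b) & 1)
    return (code << 8) | level
```

Fixed-length keys like this are what let the records be indexed by an ordinary B-tree and iterated sequentially in Z-order.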
Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform
 in Proc. 16th IEEE Int. Conf. on Application-Specific Systems, Architectures and Processors (ASAP), 2005
Abstract

Cited by 7 (6 self)
This paper focuses on SIMD implementations of the 2D discrete wavelet transform (DWT). The transforms considered are Daubechies' real-to-real method of four coefficients (Daub4) and the integer-to-integer (5, 3) lifting scheme. Daub4 is implemented using SSE and the lifting scheme using MMX, and their performance is compared to C implementations on a Pentium 4 processor. The MMX implementation of the lifting scheme is up to 4.0x faster than the corresponding C program for a 1-level 2D DWT, while the SSE implementation of Daub4 is up to 2.6x faster than the C version. It is shown that for some image sizes, the performance is significantly hampered by the so-called 64K aliasing problem, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart. It is also shown that for the (5, 3) lifting scheme, a 12-bit word size is sufficient for a 5-level decomposition of the 2D DWT for images of up to 10 bits per pixel.
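The integer-to-integer (5, 3) lifting scheme mentioned here consists of one predict step and one update step per level, using only integer shifts and additions (which is what makes it attractive for MMX). A scalar Python sketch of one 1-D level with its exact inverse; the symmetric boundary handling here is a simplifying assumption, not necessarily the paper's:

```python
def lifting53_forward(x):
    # One level of the integer-to-integer (5, 3) lifting DWT on an
    # even-length 1-D signal. s: low-pass subband, d: high-pass subband.
    n, half = len(x), len(x) // 2
    d = [0] * half
    s = [0] * half
    for i in range(half):                              # predict step
        right = x[2*i + 2] if 2*i + 2 < n else x[n - 2]  # symmetric extension
        d[i] = x[2*i + 1] - (x[2*i] + right) // 2
    for i in range(half):                              # update step
        left = d[i - 1] if i > 0 else d[0]               # symmetric extension
        s[i] = x[2*i] + (left + d[i] + 2) // 4
    return s, d

def lifting53_inverse(s, d):
    # Undo the lifting steps in reverse order; reconstruction is exact.
    half = len(s)
    n = 2 * half
    x = [0] * n
    for i in range(half):                              # undo update
        left = d[i - 1] if i > 0 else d[0]
        x[2*i] = s[i] - (left + d[i] + 2) // 4
    for i in range(half):                              # undo predict
        right = x[2*i + 2] if 2*i + 2 < n else x[n - 2]
        x[2*i + 1] = d[i] + (x[2*i] + right) // 2
    return x
```

Because every step is an integer add/subtract of a shared prediction, the transform is losslessly invertible, which is also why a bounded word size (12 bits in the paper's analysis) suffices.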
Toddler: Detecting Performance Problems via Similar Memory-Access Patterns
Abstract

Cited by 7 (0 self)
Performance bugs are programming errors that create significant performance degradation. While developers often use automated oracles for detecting functional bugs, detecting performance bugs usually requires time-consuming, manual analysis of execution profiles. The human effort for performance analysis limits the number of performance tests analyzed and enables performance bugs to easily escape to production. Unfortunately, while profilers can successfully localize slow-executing code, profilers cannot be effectively used as automated oracles. This paper presents TODDLER, a novel automated oracle for performance bugs, which enables testing for performance bugs to use the well-established and automated process of testing for functional bugs. TODDLER reports code loops whose computation has repetitive and partially similar memory-access patterns across loop iterations. Such repetitive work is likely unnecessary and can be done faster. We implement TODDLER for Java and evaluate it on 9 popular Java codebases. Our experiments with 11 previously known, real-world performance bugs show that TODDLER finds these bugs with a higher accuracy than the standard Java profiler. Using TODDLER, we also found 42 new bugs in six Java projects: Ant, Google Core Libraries, JUnit, Apache Collections, JDK, and JFreeChart. Based on our bug reports, developers so far fixed 10 bugs and confirmed 6 more as real bugs.
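The oracle's core signal, repetitive memory-access patterns across loop iterations, can be illustrated with a toy metric. This is our own simplified stand-in for TODDLER's instrumented analysis, not its actual algorithm, which compares partially similar (not only identical) access sequences:

```python
def repetitive_fraction(traces):
    # traces[i] = tuple of memory locations read during loop iteration i
    # (a toy stand-in for an instrumented read trace). Returns the fraction
    # of consecutive iteration pairs whose reads are identical; a value
    # near 1.0 suggests the loop repeats work it could cache or hoist.
    if len(traces) < 2:
        return 0.0
    same = sum(1 for a, b in zip(traces, traces[1:]) if a == b)
    return same / (len(traces) - 1)
```

A nested loop that re-scans the same list on every outer iteration would score near 1.0 under such a metric, flagging it as a candidate performance bug.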
Learning Conditional Abstractions
Abstract

Cited by 5 (4 self)
Abstraction is central to formal verification. In term-level abstraction, the design is abstracted using a fragment of first-order logic with background theories, such as the theory of uninterpreted functions with equality. The main challenge in using term-level abstraction is determining which components to abstract and under what conditions. In this paper, we present an automatic technique to conditionally abstract register transfer level (RTL) hardware designs to the term level. Our layered approach combines random simulation and machine learning inside a counterexample-guided abstraction refinement (CEGAR) loop. First, random simulation is used to determine modules that are candidates for abstraction. Next, machine learning is used on the resulting simulation traces to generate candidate conditions under which those modules can be abstracted. Finally, a verifier is invoked. If spurious counterexamples arise, we refine the abstraction by performing a further iteration of random simulation and machine learning. We present an experimental evaluation on processor designs.
Improving the Memory Behavior of Vertical Filtering in the Discrete Wavelet Transform
 in Proc. 3rd ACM Int. Conf. on Computing Frontiers, 2006
Abstract

Cited by 5 (4 self)
The discrete wavelet transform (DWT) is used in several image and video compression standards, in particular JPEG2000. A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well-known that a straightforward implementation of vertical filtering (assuming a row-major layout) induces many cache misses, due to lack of spatial locality. This can be avoided by interchanging the loops. This paper shows, however, that the resulting implementation suffers significantly from 64K aliasing, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart, and we propose two techniques to avoid it. In addition, if the filter length is longer than four, the number of ways of the L1 data cache of the Pentium 4 is insufficient to avoid cache conflict misses. Consequently, we propose two methods for reducing conflict misses. Although experimental results have been collected on the Pentium 4, the techniques are general and can be applied to other processors with different cache organizations as well. The proposed techniques improve the performance of vertical filtering compared to already optimized baseline implementations by a factor of 3.11 for the (5, 3) lifting scheme, 3.11 for Daubechies' transform of four coefficients, and 1.99 for the Cohen, Daubechies, and Feauveau 9/7 transform.
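One standard way to break the 64K-aliasing pattern during column filtering is to pad the row stride so that vertically adjacent pixels are never a multiple of 64K apart in memory. A Python/NumPy sketch of that padding idea (an illustration of the general technique under our assumptions, not necessarily one of the paper's two proposed methods):

```python
import numpy as np

def padded_image(h, w, dtype=np.float32, bank=65536):
    # Allocate rows with extra pad elements so that the byte distance
    # between vertically adjacent pixels (the row stride) is not a
    # multiple of 64K, avoiding Pentium 4-style 64K-aliasing conflicts
    # when a vertical filter walks down a column.
    itemsize = np.dtype(dtype).itemsize
    pad = 0
    while ((w + pad) * itemsize) % bank == 0:
        pad += 1
    buf = np.zeros((h, w + pad), dtype=dtype)
    return buf[:, :w]   # logical h x w view; stride is (w + pad) * itemsize
```

For power-of-two image widths (exactly the sizes where 64K aliasing bites), the returned view has the same logical shape but a conflict-free stride.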
Efficient Vectorization of the FIR Filter
 in Proc. 16th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC), 2005
Abstract

Cited by 5 (2 self)
The Finite Impulse Response (FIR) filter is one of the most important digital signal processing (DSP) kernels. It performs filtering of speech signals in modern voice coders such as the ETSI GSM EFR/AMR or ITU G.729, as well as in many other signal processing areas. Many contemporary digital signal processors as well as general-purpose microprocessors employ SIMD instructions to exploit the data-level parallelism present in media kernels such as the FIR filter. An important question, therefore, is how the FIR filter can be effectively vectorized to exploit the SIMD capabilities of the architecture. In this paper the performance of different methods for vectorizing the FIR filter, such as vectorizing the inner loop and vectorizing the outer loop using C programming and SIMD instructions, is compared. Additionally, we present another method that vectorizes the inner loop as well as the outer loop. All of these methods suffer from misaligned memory accesses. To overcome this problem we use four copies of the filter coefficients that are aligned to 8-byte boundaries in memory. Our results show that the MMX implementation that vectorizes the inner loop is up to 3.34 times faster than the corresponding C implementation. Furthermore, the MMX implementation that vectorizes the inner loop as well as the outer loop and avoids misaligned memory accesses is up to 2.2 times faster than the MMX implementation that only vectorizes the inner loop. Finally, the MMX implementation that vectorizes both loops and also avoids misaligned memory accesses is up to 1.69 times faster than the version that does not avoid misaligned memory accesses. Keywords: FIR, SIMD, multimedia applications, data reuse.
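As a reference point for the vectorization strategies discussed, the scalar FIR kernel has two nested loops, either of which can be mapped onto SIMD lanes. A minimal Python sketch of that scalar baseline (illustrative only; the paper's implementations are in C with MMX intrinsics):

```python
def fir(x, h):
    # Direct-form FIR: y[n] = sum over k of h[k] * x[n - k], for n >= len(h) - 1.
    # The inner loop (over taps k) and the outer loop (over output samples n)
    # are the two candidate loops for SIMD vectorization.
    taps = len(h)
    return [sum(h[k] * x[n - k] for k in range(taps))
            for n in range(taps - 1, len(x))]
```

Vectorizing the inner loop parallelizes the per-output dot product; vectorizing the outer loop computes several outputs at once, which shifts where the misaligned loads of `x` occur, hence the paper's trick of keeping multiple aligned copies of the coefficients.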
Term-Level Verification of a Pipelined CISC Microprocessor
, 2005
Abstract

Cited by 5 (2 self)
By abstracting the details of the data representations and operations in a microprocessor, term-level verification can formally prove that a pipelined microprocessor faithfully implements its sequential, instruction-set architecture specification. Previous efforts in this area have focused on reduced instruction set computer (RISC) and very long instruction word (VLIW) processors. This work reports