Integral histogram: A fast way to extract histograms in cartesian spaces
 in Proc. IEEE Conf. on Computer Vision and Pattern Recognition
, 2005
Abstract

Cited by 144 (14 self)
We present a novel method, which we refer to as an integral histogram, to compute the histograms of all possible target regions in a Cartesian data space. Our method has three distinct advantages: (1) It is computationally superior to the conventional approach. The integral histogram method makes it possible to employ even an exhaustive search process in real time, which was impractical before. (2) It can be extended to higher data dimensions, uniform and nonuniform bin formations, and multiple target scales without sacrificing its computational advantages. (3) It enables the description of high-level histogram features. We exploit the spatial arrangement of data points, and recursively propagate an aggregated histogram by starting from the origin and traversing through the remaining points along either a scanline or a wavefront. At each step, we update a single bin using the values of the integral histogram at the previously visited neighboring data points. After the integral histogram is propagated, the histogram of any target region can be computed easily by using simple arithmetic operations.
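The scanline propagation and the region query described in this abstract can be sketched roughly as follows. This is a minimal illustration for a 2D image whose pixels are already quantized to bin indices, not the authors' implementation; the function names `integral_histogram` and `region_histogram` and the zero-padded layout are assumptions made for the example.

```python
def integral_histogram(image, num_bins):
    """Return H where H[y][x] is the histogram of image rows < y and columns < x.

    `image` is a list of rows of bin indices in range(num_bins) (assumed layout).
    """
    height, width = len(image), len(image[0])
    # One extra zero row and column, so region queries need no bounds checks.
    H = [[[0] * num_bins for _ in range(width + 1)] for _ in range(height + 1)]
    for y in range(height):          # scanline order, starting at the origin
        for x in range(width):
            cell = H[y + 1][x + 1]
            up, left, diag = H[y][x + 1], H[y + 1][x], H[y][x]
            # Combine the three previously visited neighbours.
            for b in range(num_bins):
                cell[b] = up[b] + left[b] - diag[b]
            # Only the bin of the current pixel is incremented.
            cell[image[y][x]] += 1
    return H


def region_histogram(H, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1."""
    num_bins = len(H[0][0])
    return [
        H[bottom][right][b] - H[top][right][b]
        - H[bottom][left][b] + H[top][left][b]
        for b in range(num_bins)
    ]
```

Once `H` is built, each region query costs four lookups per bin regardless of the region size, which is what makes an exhaustive search over many candidate regions affordable.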
Architecture-Aware Classical Taylor Shift by 1
, 2005
Abstract

Cited by 17 (2 self)
We present algorithms that outperform straightforward implementations of classical Taylor shift by 1. For input polynomials of low degrees a method of the SACLIB library is faster than straightforward implementations by a factor of at least 2; for higher degrees we develop a method that is faster than straightforward implementations by a factor of up to 7. Our Taylor shift algorithm requires more word additions than straightforward implementations, but it reduces the number of cycles per word addition by reducing memory traffic and the number of carry computations. The introduction of signed digits, suspended normalization, radix reduction, and delayed carry propagation enables our algorithm to take advantage of the technique of register tiling, which is commonly used by optimizing compilers. While our algorithm is written in a high-level language, it depends on several parameters that can be tuned to the underlying architecture.
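For orientation, the kind of straightforward baseline that the abstract compares against can be sketched as below: the coefficients of p(x + 1) are obtained from those of p(x) using additions only. This is a generic textbook version relying on Python's arbitrary-precision integers, not the tuned algorithm of the paper; the name `taylor_shift_by_1` and the low-to-high coefficient order are assumptions made for the example.

```python
def taylor_shift_by_1(coeffs):
    """Return the coefficients of p(x + 1), given those of p(x).

    `coeffs[k]` is the coefficient of x**k (assumed ordering).
    Uses n(n + 1)/2 additions for a polynomial of degree n.
    """
    a = list(coeffs)
    n = len(a) - 1
    for i in range(n):
        # Each pass adds every coefficient into its lower neighbour,
        # sweeping from the high-degree end down to position i.
        for j in range(n - 1, i - 1, -1):
            a[j] += a[j + 1]
    return a
```

For example, `taylor_shift_by_1([0, 0, 1])` represents shifting x² and yields the coefficients of x² + 2x + 1.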
The Etree Library: A System for Manipulating Large Octrees on Disk
, 2003
Abstract

Cited by 17 (15 self)
This report describes a library, called the etree library, that allows C programmers to manipulate large octrees stored on disk. Octrees are stored as a sequence of fixed-size octant records sorted by a locational code order that is equivalent to a preorder traversal of the tree and a Z-order traversal through the domain. The sorted records are indexed by a conventional file-resident B-tree index and queried using fixed-length locational code keys. A schema can be defined to make an etree portable across different platforms. The etree library provides functions for creating, modifying, and searching octrees, including efficient mechanisms for appending octants and iterating over octants in Z-order. The library is the foundation for a larger research effort aimed at enabling scientists and engineers to solve large physical simulations on their desktop systems by recasting the simulation process to work directly on large etrees stored on disk.
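As a rough illustration of how a fixed-length key can induce a Z-order traversal, the sketch below interleaves the bits of three integer coordinates (a Morton code). The abstract does not spell out the etree library's actual key layout, so the function name `locational_key`, the bit order, and the default of 10 bits per axis are all assumptions made for the example.

```python
def locational_key(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one integer key.

    Bit i of x lands at position 3*i, of y at 3*i + 1, of z at 3*i + 2
    (an assumed convention; other orders give mirrored Z-curves).
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


# Sorting octant coordinates by key visits them in Z-order.
octants = [(x, y, z) for z in range(2) for y in range(2) for x in range(2)]
z_ordered = sorted(octants, key=lambda p: locational_key(*p))
```

Because the keys are plain integers of fixed width, they can serve as sort keys in an ordinary ordered index such as a B-tree.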
Performance Comparison of SIMD Implementations of the Discrete Wavelet Transform
 in Proc. 16th IEEE Int. Conf. on Application Specific Systems Architectures and Processors (ASAP)
, 2005
Abstract

Cited by 7 (6 self)
This paper focuses on SIMD implementations of the 2D discrete wavelet transform (DWT). The transforms considered are Daubechies' real-to-real method of four coefficients (Daub-4) and the integer-to-integer (5, 3) lifting scheme. Daub-4 is implemented using SSE and the lifting scheme using MMX, and their performance is compared to C implementations on a Pentium 4 processor. The MMX implementation of the lifting scheme is up to 4.0x faster than the corresponding C program for a 1-level 2D DWT, while the SSE implementation of Daub-4 is up to 2.6x faster than the C version. It is shown that for some image sizes, the performance is significantly hampered by the so-called 64K aliasing problem, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart. It is also shown that for the (5, 3) lifting scheme, a 12-bit word size is sufficient for a 5-level decomposition of the 2D DWT for images of up to 10 bits per pixel.
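The integer-to-integer (5, 3) lifting scheme mentioned here can be sketched in scalar form as follows. This is a plain one-dimensional, one-level version using the commonly published predict and update steps with mirrored boundaries, not the MMX implementation from the paper; the names `forward_53` and `inverse_53` and the restriction to even-length input are assumptions made for the example.

```python
def forward_53(x):
    """One level of the integer (5, 3) lifting transform (even-length input).

    Returns (s, d): the low-pass and high-pass coefficients.
    """
    even, odd = x[0::2], x[1::2]
    half = len(odd)
    # Predict: odd sample minus the floored average of its even neighbours.
    d = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # mirror at right edge
        d.append(odd[i] - ((even[i] + right) >> 1))
    # Update: even sample plus a rounded quarter of the neighbouring details.
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]                # mirror at left edge
        s.append(even[i] + ((left + d[i] + 2) >> 2))
    return s, d


def inverse_53(s, d):
    """Undo forward_53 exactly by reversing the two lifting steps."""
    half = len(s)
    even = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        even.append(s[i] - ((left + d[i] + 2) >> 2))
    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        x.append(even[i])
        x.append(d[i] + ((even[i] + right) >> 1))
    return x
```

Because every step only adds or subtracts an integer that the inverse can recompute, the transform is exactly reversible, and the growth of the coefficient range per level is what determines the word size needed.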
Learning Conditional Abstractions
Abstract

Cited by 5 (4 self)
Abstraction is central to formal verification. In term-level abstraction, the design is abstracted using a fragment of first-order logic with background theories, such as the theory of uninterpreted functions with equality. The main challenge in using term-level abstraction is determining what components to abstract and under what conditions. In this paper, we present an automatic technique to conditionally abstract register transfer level (RTL) hardware designs to the term level. Our approach is a layered approach that combines random simulation and machine learning inside a counterexample-guided abstraction refinement (CEGAR) loop. First, random simulation is used to determine modules that are candidates for abstraction. Next, machine learning is used on the resulting simulation traces to generate candidate conditions under which those modules can be abstracted. Finally, a verifier is invoked. If spurious counterexamples arise, we refine the abstraction by performing a further iteration of random simulation and machine learning. We present an experimental evaluation on processor designs.
An adaptive orthogonal frequency division multiplexing baseband modem for wideband wireless channels, master's thesis
, 2006
Abstract

Cited by 5 (0 self)
This thesis shows the design of an Orthogonal Frequency Division Multiplexing baseband modem with a Frequency Adaptive Modulation protocol for a wideband indoor wireless channel. The baseband modem is implemented on a Field Programmable Gate Array and uses 294,939 2-input NAND gates with a clock frequency of 128 MHz. The Frequency Adaptive Modulation algorithm is 6% of the entire baseband modem, which means that it is of low complexity. The baseband modem is then integrated with an RF Front End. The maximum transmit power of the RF Front End is 7.5 dBm. This prototype takes 128 MHz of bandwidth and divides it into 128 1-MHz bins. The carrier frequency is at 5.25 GHz. Measurements are taken with this prototype to investigate the concept of Frequency Adaptive Modulation. With a target uncoded Bit Error Rate of 10^-3, it is found that at distances of 1.0 m to 10.8 m, the data rate varies from 355 Mbps to 10 Mbps. The average data rate of this system is 2.57 times the average data rate without Frequency Adaptive Modulation. The fact that a Rayleigh channel is decomposed into Gaussian
Toddler: Detecting Performance Problems via Similar Memory-Access Patterns
Abstract

Cited by 4 (0 self)
Performance bugs are programming errors that create significant performance degradation. While developers often use automated oracles for detecting functional bugs, detecting performance bugs usually requires time-consuming, manual analysis of execution profiles. The human effort for performance analysis limits the number of performance tests analyzed and enables performance bugs to easily escape to production. Unfortunately, while profilers can successfully localize slow-executing code, profilers cannot be effectively used as automated oracles. This paper presents TODDLER, a novel automated oracle for performance bugs, which enables testing for performance bugs to use the well-established and automated process of testing for functional bugs. TODDLER reports code loops whose computation has repetitive and partially similar memory-access patterns across loop iterations. Such repetitive work is likely unnecessary and can be done faster. We implement TODDLER for Java and evaluate it on 9 popular Java codebases. Our experiments with 11 previously known, real-world performance bugs show that TODDLER finds these bugs with a higher accuracy than the standard Java profiler. Using TODDLER, we also found 42 new bugs in six Java projects: Ant, Google Core Libraries, JUnit, Apache Collections, JDK, and JFreeChart. Based on our bug reports, developers so far fixed 10 bugs and confirmed 6 more as real bugs.
ATLAS: Automatic Term-Level Abstraction of RTL Designs
Abstract

Cited by 4 (4 self)
Abstraction plays a central role in formal verification. Term-level abstraction is a technique for abstracting word-level terms, functional blocks with uninterpreted functions, and memories with a suitable theory of memories. A major challenge for any abstraction technique is to determine what components can be safely abstracted. We present an automatic technique for term-level abstraction of hardware designs, in the context of equivalence and refinement checking problems. Our approach is hybrid, involving a combination of random simulation and static analysis. We use random simulation to identify functional blocks that are suitable for abstraction with uninterpreted functions. Static analysis is then used to compute conditions under which such function abstraction is performed. The generated term-level abstractions are verified using techniques based on Boolean satisfiability (SAT) and satisfiability modulo theories (SMT). We demonstrate our approach for verifying processor designs, interface logic, and low-power designs. We present experimental evidence that our approach is efficient and that the resulting term-level models are easier to verify even when the abstracted designs generate larger SAT problems.
Improving the Memory Behavior of Vertical Filtering in the Discrete Wavelet Transform
 In Proc. 3rd ACM Int. Conf. on Computing Frontiers
, 2006
Abstract

Cited by 4 (4 self)
The discrete wavelet transform (DWT) is used in several image and video compression standards, in particular JPEG2000. A 2D DWT consists of horizontal filtering along the rows followed by vertical filtering along the columns. It is well known that a straightforward implementation of vertical filtering (assuming a row-major layout) induces many cache misses, due to lack of spatial locality. This can be avoided by interchanging the loops. This paper shows, however, that the resulting implementation suffers significantly from 64K aliasing, which occurs in the Pentium 4 when two data blocks are accessed that are a multiple of 64K apart, and we propose two techniques to avoid it. In addition, if the filter length is longer than four, the number of ways of the L1 data cache of the Pentium 4 is insufficient to avoid cache conflict misses. Consequently, we propose two methods for reducing conflict misses. Although experimental results have been collected on the Pentium 4, the techniques are general and can be applied to other processors with different cache organizations as well. The proposed techniques improve the performance of vertical filtering compared to already optimized baseline implementations by a factor of 3.11 for the (5, 3) lifting scheme, 3.11 for Daubechies' transform of four coefficients, and by a factor of 1.99 for the Cohen, Daubechies, and Feauveau 9/7 transform.
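The loop interchange mentioned in this abstract can be illustrated with a generic vertical FIR filter over a row-major image stored as a flat list. The sketch only shows the two loop orders and that they compute the same result; Python lists will not reproduce the cache behavior, and neither the function names nor the simple filter are taken from the paper.

```python
def vertical_filter_by_column(img, width, height, taps):
    """Straightforward order: walk down each column in turn.

    Consecutive reads are `width` elements apart in a row-major layout.
    """
    rows_out = height - len(taps) + 1
    out = [0] * (width * rows_out)
    for x in range(width):
        for y in range(rows_out):
            acc = 0
            for k, t in enumerate(taps):
                acc += t * img[(y + k) * width + x]
            out[y * width + x] = acc
    return out


def vertical_filter_interchanged(img, width, height, taps):
    """Interchanged order: the innermost loop runs along a row.

    Consecutive reads and writes touch adjacent elements.
    """
    rows_out = height - len(taps) + 1
    out = [0] * (width * rows_out)
    for y in range(rows_out):
        for k, t in enumerate(taps):
            src = (y + k) * width
            dst = y * width
            for x in range(width):
                out[dst + x] += t * img[src + x]
    return out
```

In the interchanged version each output row reads `len(taps)` input rows that sit a fixed number of bytes apart, which is the access pattern where, according to the abstract, 64K aliasing and conflict misses can appear for unlucky row sizes.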