Results 1  10
of
333
Good features to track
, 1994
"... No featurebased vision system can work unless good features can be identified and tracked from frame to frame. Although tracking itself is by and large a solved problem, selecting features that can be tracked well and correspond to physical points in the world is still hard. We propose a feature se ..."
Abstract

Cited by 1519 (13 self)
 Add to MetaCart
No featurebased vision system can work unless good features can be identified and tracked from frame to frame. Although tracking itself is by and large a solved problem, selecting features that can be tracked well and correspond to physical points in the world is still hard. We propose a feature selection criterion that is optimal by construction because it is based on how the tracker works, and a feature monitoring method that can detect occlusions, disocclusions, and features that do not correspond to points in the world. These methods are based on a new tracking algorithm that extends previous NewtonRaphson style search methods to work under affine image transformations. We test performance with several simulations and experiments.
The design and implementation of FFTW3
 Proceedings of the IEEE
, 2005
"... FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our cu ..."
Abstract

Cited by 414 (3 self)
 Add to MetaCart
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with handoptimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for realdata DFTs of prime size, a new way of implementing DFTs by means of machinespecific singleinstruction, multipledata (SIMD) instructions, and how a specialpurpose compiler can derive optimized implementations of the discrete cosine and sine transforms automatically from a DFT algorithm. Keywords—Adaptive software, cosine transform, fast Fourier transform (FFT), Fourier transform, Hartley transform, I/O tensor.
SPIRAL: Code Generation for DSP Transforms
 PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
, 2005
"... Abstract — Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performancecritical domain of linear digital sig ..."
Abstract

Cited by 144 (31 self)
 Add to MetaCart
Abstract — Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performancecritical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem, and exploits the domainspecific mathematical structure of transform algorithms to implement a feedbackdriven optimizer. Similar to a human expert, for a specified transform, SPIRAL “intelligently ” generates and explores algorithmic and implementation choices to find the best match to the computer’s microarchitecture. The “intelligence” is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high performance code for a broad set of DSP transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human tuned transform library code. Index Terms — library generation, code optimization, adaptation, automatic performance tuning, high performance computing, linear signal transform, discrete Fourier transform, FFT, discrete cosine transform, wavelet, filter, search, learning, genetic and evolutionary algorithm, Markov decision process I.
The Uniform Memory Hierarchy Model of Computation
 Algorithmica
, 1992
"... The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performancerelevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use im ..."
Abstract

Cited by 113 (9 self)
 Add to MetaCart
The Uniform Memory Hierarchy (UMH) model introduced in this paper captures performancerelevant aspects of the hierarchical nature of computer memory. It is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved datamovement strategies. A sequential computer's memory is modelled as a sequence hM 0 ; M 1 ; :::i of increasingly large memory modules. Computation takes place in M 0 . Thus, M 0 might model a computer's central processor, while M 1 might be cache memory, M 2 main memory, and so on. For each module M U , a bus B U connects it with the next larger module M U+1 . All buses may be active simultaneously. Data is transferred along a bus in fixedsized blocks. The size of these blocks, the time required to transfer a block, and the number of blocks that fit in a module are larger for modules farther from the processor. The UMH model is parameterized by the rate at which the blocksizes i...
FFTs for the 2SphereImprovements and Variations
 JOURNAL OF FOURIER ANALYSIS AND APPLICATIONS
, 2003
"... Earlier work by Driscoll and Healy [18] has produced an efficient algorithm for computing the Fourier transform of bandlimited functions on the 2sphere. In this article we present a reformulation and variation of the original algorithm which results in a greatly improved inverse transform, and co ..."
Abstract

Cited by 107 (3 self)
 Add to MetaCart
Earlier work by Driscoll and Healy [18] has produced an efficient algorithm for computing the Fourier transform of bandlimited functions on the 2sphere. In this article we present a reformulation and variation of the original algorithm which results in a greatly improved inverse transform, and consequent improved convolution algorithm for such functions. All require at most O(N log2 N)operations where N is the number of sample points. We also address implementation considerations and give heuristics for allowing reliable and computationally efficient floating point implementations of slightly modified algorithms. These claims are supported by extensive numerical experiments from our implementation in C on DEC, HP, SGI and Linux Pentium platforms. These results indicate that variations of the algorithm are both reliable and efficient for a large range of useful problem sizes. Performance appears to be architecturedependent. The article concludes with a brief discussion of a few potential applications.
BSPlib: The BSP Programming Library
, 1998
"... BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming ..."
Abstract

Cited by 84 (6 self)
 Add to MetaCart
BSPlib is a small communications library for bulk synchronous parallel (BSP) programming which consists of only 20 basic operations. This paper presents the full definition of BSPlib in C, motivates the design of its basic operations, and gives examples of their use. The library enables programming in two distinct styles: direct remote memory access using put or get operations, and bulk synchronous message passing. Currently, implementations of BSPlib exist for a variety of modern architectures, including massively parallel computers with distributed memory, shared memory multiprocessors, and networks of workstations. BSPlib has been used in several scientific and industrial applications; this paper briefly describes applications in benchmarking, Fast Fourier Transforms, sorting, and molecular dynamics.
SPL: A Language and Compiler for DSP Algorithms
, 2001
"... We discuss the design and implementation of a compiler that translates formulas representing signal processing transforms into ecient C or Fortran programs. The formulas are represented in a language that we call SPL, an acronym from Signal Processing Language. The compiler is a component of the SPI ..."
Abstract

Cited by 83 (11 self)
 Add to MetaCart
We discuss the design and implementation of a compiler that translates formulas representing signal processing transforms into ecient C or Fortran programs. The formulas are represented in a language that we call SPL, an acronym from Signal Processing Language. The compiler is a component of the SPIRAL system which makes use of formula transformations and intelligent search strategies to automatically generate optimized digital signal processing (DSP) libraries. After a discussion of the translation and optimization techniques implemented in the compiler, we use SPL formulations of the fast Fourier transform (FFT) to evaluate the compiler. Our results show that SPIRAL, which can be used to implement many classes of algorithms, produces programs that perform as well as \hardwired" systems like FFTW.
Working Sets, Cache Sizes, and Node Granularity Issues for LargeScale Multiprocessors
 In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
"... The distribution of resources among processors, memory and caches is a crucial question faced by designers of largescale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a s ..."
Abstract

Cited by 73 (4 self)
 Add to MetaCart
The distribution of resources among processors, memory and caches is a crucial question faced by designers of largescale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for costeffectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of largescale parallel scientific applications. We first show that all the applications have a hierarchy of welldefined perprocessor working sets, whose size, performance impact and scaling characteristics can help determine how large diffkrent levels of a multiprocessor 's cache hierarchy should be. Then, we use these working sets together with certain other imporant characteristics of the applications such as communication to computation ratios, concurrency, and load balancing behavioto reflect upon the broader question of the granularity of processing nodes in highperformance multiprocessors.