Results 11–20 of 98
New Applications of the Sound Description Interchange Format
 Proc. ICMC98, Ann Arbor
, 1998
Abstract

Cited by 30 (4 self)
This paper describes the goals and design of SDIF and its standard frame types, followed by a review of recent SDIF work at CNMAT, IRCAM, and IUA.
A Modified Split-Radix FFT With Fewer Arithmetic Operations
, 2007
Abstract

Cited by 24 (5 self)
Recent results by Van Buskirk et al. have broken the record set by Yavne in 1968 for the lowest exact count of real additions and multiplications to compute a power-of-two discrete Fourier transform (DFT). Here, we present a simple recursive modification of the split-radix algorithm that computes the DFT with asymptotically about 6% fewer operations than Yavne, matching the count achieved by Van Buskirk’s program-generation framework. We also discuss the application of our algorithm to real-data and real-symmetric (discrete cosine) transforms, where we are again able to achieve lower arithmetic counts than previously published algorithms.
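The "about 6%" figure can be checked from leading terms alone. A hedged sketch: the leading coefficient of Yavne's split-radix count is taken as 4·N·lg N real operations and the modified algorithm's as (34/9)·N·lg N (values taken at face value from the literature; lower-order O(N) terms are dropped, so only the asymptotic ratio is meaningful):

```python
# Leading-term comparison of the two split-radix operation counts.
# Assumption (hedged): Yavne's count grows as 4*N*lg(N) real operations
# and the modified algorithm's as (34/9)*N*lg(N); O(N) terms ignored.

YAVNE_COEFF = 4.0        # leading coefficient of Yavne's 1968 count
MODIFIED_COEFF = 34 / 9  # leading coefficient of the modified count

saving = 1 - MODIFIED_COEFF / YAVNE_COEFF
print(f"asymptotic saving = {saving:.4f}")  # exactly 1/18, about 5.6%
```

The ratio (34/9)/4 = 17/18 gives an asymptotic saving of exactly 1/18 ≈ 5.6%, consistent with the abstract's "about 6% fewer operations".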
A Multilevel Algorithm for Solving a Boundary Integral Equation
 Micro. Opt. Tech. Lett
, 1994
Abstract

Cited by 23 (12 self)
In the solution of an integral equation using the Conjugate Gradient (CG) method, the most expensive part is the matrix-vector multiplication, requiring O(N^2) floating point operations. The Fast Multipole Method (FMM) reduces this to O(N^1.5). In this paper, we apply a multilevel algorithm to this problem and show that the complexity of a matrix-vector multiplication is proportional to N(log N)^2. Published in Micro. Opt. Tech. Lett., Vol. 7, No. 10, pp. 466–470, July 1994.
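The three operation counts quoted above can be compared numerically. This is a hedged illustration only: all hidden constants are set to 1, so the printed values show growth rates, not real operation counts.

```python
from math import log2

# Growth-rate comparison for one matrix-vector product:
# direct O(N^2), single-level FMM O(N^1.5), multilevel O(N (log N)^2).
# Constant factors are assumed to be 1 (illustration only).

def direct_ops(n: int) -> float:
    return float(n) ** 2

def fmm_ops(n: int) -> float:
    return float(n) ** 1.5

def multilevel_ops(n: int) -> float:
    return n * log2(n) ** 2

for n in (10**3, 10**4, 10**5, 10**6):
    print(f"N={n:>7}:  N^2={direct_ops(n):.2e}  "
          f"N^1.5={fmm_ops(n):.2e}  N(log N)^2={multilevel_ops(n):.2e}")
```

At N = 10^6 the multilevel count is roughly three orders of magnitude below N^1.5 and six below N^2, which is the point of the paper's complexity claim.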
Oblivious algorithms for multicores and network of processors
, 2009
Abstract

Cited by 22 (6 self)
We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multilevel caching model for multicores, and we propose a multicore-oblivious approach to algorithms and schedulers for HM. We instantiate this approach with provably efficient multicore-oblivious algorithms for matrix and prefix sum computations, FFT, the Gaussian Elimination paradigm (which represents an important class of computations including Floyd-Warshall’s all-pairs shortest paths, Gaussian Elimination and LU decomposition without pivoting), sorting, list ranking, Euler tours and connected components. We then use the network-oblivious framework proposed earlier as an oblivious framework for a network of processors, and we present provably efficient network-oblivious algorithms for sorting, the Gaussian Elimination paradigm, list ranking, Euler tours and connected components. Many of these network-oblivious algorithms also perform efficiently when executed on the Decomposable-BSP.
A Tutorial on Lava: A Hardware Description and Verification System
, 2000
Abstract

Cited by 21 (6 self)
Contents: 1 Introduction; 2 Getting Started (2.1 Your First Circuit, 2.2 The Lava Interpreter, 2.3 Your Second Circuit, 2.4 Generating VHDL, 2.5 Exercises); 3 Bigger Circuits (3.1 Recursion over Lists, 3.2 Connection Patterns, 3.3 Arithmetic, 3.4 Compilation, 3.5 Exercises); 4 Verification (4.1 Simple Properties, 4.2 Quanti...
Scheduling Threads for Low Space Requirement and Good Locality
 In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA)
, 1999
Abstract

Cited by 19 (1 self)
The running time and memory requirement of a parallel program with dynamic, lightweight threads depend heavily on the underlying thread scheduler. In this paper, we present a simple, asynchronous, space-efficient scheduling algorithm for shared memory machines that combines the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. For a nested-parallel program with depth D and serial space requirement S_1, we show that the expected space requirement is S_1 + O(K · p · D) on p processors. Here, K is a user-adjustable runtime parameter, which provides a tradeoff between running time and space requirement. Our algorithm achieves good locality and low scheduling overheads by automatically increasing the granularity of the work scheduled on each processor. We have implemented the new scheduling algorithm in the context of a native, user-level implementation of POSIX standard threads, or Pthreads, and evaluated its p...
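The quoted bound S_1 + O(K · p · D) can be made concrete. A hedged illustration: the hidden constant is treated as 1, and the values of S_1, p, and D below are hypothetical, chosen only to show how the tuning parameter K trades extra space for (per the paper) lower scheduling overhead.

```python
# Illustration of the expected-space bound S_1 + K*p*D (unit constant).
# S1, P, D below are made-up example values, not from the paper.

def space_bound(s1: int, k: int, p: int, d: int) -> int:
    """Expected-space bound S_1 + K*p*D, with the O(.) constant taken as 1."""
    return s1 + k * p * d

S1, P, D = 1_000_000, 64, 100  # hypothetical serial space, processors, depth
for K in (1, 8, 64):
    print(f"K={K:>2}: expected space <= {space_bound(S1, K, P, D):,}")
```

Even at K = 64 the additive term K·p·D stays well below 2·S_1 for these values, which is why the bound is attractive compared with schedulers whose space grows multiplicatively with p.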
An Adaptive Software Library for Fast Fourier Transforms
 In Proceedings of the International Conference on Supercomputing
, 2000
Abstract

Cited by 18 (2 self)
In this paper we present an adaptive and portable software library for the fast Fourier transform (FFT). The library consists of a number of composable blocks of code called codelets, each computing a part of the transform. The actual FFT algorithm used by the code is determined at runtime by selecting the fastest strategy among all possible strategies, given available codelets, for a given transform size. We also present an efficient automatic method of generating the library modules by using a special-purpose compiler. The code generator is written in C and it generates a library of C codelets. The code generator is shown to be flexible and extensible, and the entire library can be generated in a matter of seconds. We have evaluated the library for performance on the IBM SP2, SGI 2000, HP Exemplar and Intel Pentium systems. We use the results from these evaluations to build performance models for the FFT library on different platforms. The library is shown to be portable, adaptive and efficient.
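The runtime-selection idea can be sketched in miniature. This is not the paper's planner or API: `dft_naive` and `fft_radix2` are stand-ins for two "codelets" computing the same transform, and `plan` simply times each once per size and caches the winner.

```python
import cmath
import time

def dft_naive(x):
    """Direct O(n^2) DFT -- one stand-in 'codelet'."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT -- another 'codelet' (n a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def _time_once(f, x):
    t0 = time.perf_counter()
    f(x)
    return time.perf_counter() - t0

_plans = {}

def plan(n):
    """Pick the fastest codelet for size n by measuring, then cache the choice."""
    if n not in _plans:
        probe = [complex(i % 7, 0) for i in range(n)]
        _plans[n] = min((dft_naive, fft_radix2),
                        key=lambda f: min(_time_once(f, probe) for _ in range(3)))
    return _plans[n]

X = plan(64)([complex(i, 0) for i in range(64)])
```

Both codelets agree on the result; which one the plan picks depends on the machine and the transform size, which is exactly the adaptivity the abstract describes.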
Portable High-Performance Programs
, 1999
Abstract

Cited by 17 (0 self)
Methods for efficient, high quality volume resampling in the frequency domain
 In Proceedings of IEEE Visualization (2004)
Abstract

Cited by 15 (2 self)
Resampling is a frequent task in visualization and medical imaging. It occurs whenever images or volumes are magnified, rotated, translated, or warped. Resampling is also an integral procedure in the registration of multimodal datasets, such as CT, PET, and MRI, in the correction of motion artifacts in MRI, and in the alignment of temporal volume sequences in fMRI. It is well known that the quality of the resampling result depends heavily on the quality of the interpolation filter used. However, high-quality filters are rarely employed in practice due to their large spatial extents. In this paper, we explore a new resampling technique that operates in the frequency domain, where high-quality filtering is feasible. Further, unlike previous methods of this kind, our technique is not limited to integer-ratio scaling factors, but can resample image and volume datasets at any rate. This would usually require the application of slow Discrete Fourier Transforms (DFT) to return the data to the spatial domain. We studied two methods that successfully avoid these delays: the chirp-z transform and the FFTW package. We also outline techniques to avoid the ringing artifacts that may occur with frequency-domain filtering. Thus, our method can achieve high-quality interpolation at speeds that are usually associated with spatial filters of far lower quality.
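The chirp-z transform mentioned above rests on Bluestein's identity jk = (j² + k² − (k−j)²)/2, which rewrites the DFT as a pre-chirp, a convolution with a chirp, and a post-chirp. The sketch below is educational, not the paper's implementation: the convolution is done with a naive O(n²) loop, whereas production code evaluates it with power-of-two FFTs (which is what makes arbitrary-rate evaluation fast).

```python
import cmath

def dft_naive(x):
    """Reference O(n^2) DFT for comparison."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def bluestein_dft(x):
    """DFT via the chirp-z (Bluestein) factorization.

    Pre-multiply by a chirp, convolve with the conjugate chirp, then
    post-multiply by the chirp again. The convolution is naive here;
    real implementations use zero-padded power-of-two FFTs.
    """
    n = len(x)
    a = [x[m] * cmath.exp(-1j * cmath.pi * m * m / n) for m in range(n)]
    out = []
    for k in range(n):
        conv = sum(a[m] * cmath.exp(1j * cmath.pi * (k - m) ** 2 / n)
                   for m in range(n))
        out.append(cmath.exp(-1j * cmath.pi * k * k / n) * conv)
    return out
```

Because the transform length n appears only in the chirp exponents, the same factorization evaluates the spectrum at non-integer frequency spacings, which is what frees the resampler from integer-ratio scaling factors.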
Cache-oblivious algorithms (Extended Abstract)
 In Proc. 40th Annual Symposium on Foundations of Computer Science
, 1999
Abstract

Cited by 12 (1 self)
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L, where Z = Ω(L^2), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels, and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
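The transpose algorithm's divide-and-conquer structure, as described in the abstract, can be sketched as follows (the Python rendering and the small-block cutoff are my own; the point is that no cache parameter appears anywhere in the code):

```python
def co_transpose(A, B, ri=0, ci=0, m=None, n=None, cutoff=16):
    """Cache-obliviously write A's transpose into B (B[j][i] = A[i][j]).

    Recursively halves the larger dimension of the current submatrix.
    No cache size or line length is referenced, yet the recursion
    eventually reaches blocks that fit in every level of the hierarchy,
    giving the Theta(1 + mn/L) miss bound under the tall-cache assumption.
    """
    if m is None:
        m, n = len(A), len(A[0])
    if m * n <= cutoff:                      # small block: transpose directly
        for i in range(ri, ri + m):
            for j in range(ci, ci + n):
                B[j][i] = A[i][j]
    elif m >= n:                             # split the rows in half
        h = m // 2
        co_transpose(A, B, ri, ci, h, n, cutoff)
        co_transpose(A, B, ri + h, ci, m - h, n, cutoff)
    else:                                    # split the columns in half
        h = n // 2
        co_transpose(A, B, ri, ci, m, h, cutoff)
        co_transpose(A, B, ri, ci + h, m, n - h, cutoff)

A = [[10 * i + j for j in range(7)] for i in range(5)]   # 5 x 7 example
B = [[0] * 5 for _ in range(7)]                          # 7 x 5 output
co_transpose(A, B)
```

The same "split the larger dimension" recursion is what makes the algorithm oblivious: tuning happens implicitly, because some recursion level matches each cache level.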