Results 1 - 10 of 50
Algorithms for Parallel Memory I: Two-Level Memories
, 1992
Cited by 236 (31 self)
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonopti...
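As a rough illustration of the parallel block-transfer cost model described above, the following sketch counts the I/Os needed to scan N records when consecutive blocks are striped round-robin over P disks. The function names `stripe` and `scan_ios` and the round-robin layout are assumptions made for illustration; this shows only the cost model for a scan, not the paper's sorting algorithms.

```python
import math


def stripe(records, p, b):
    """Assign consecutive blocks of b records to p disks in round-robin order."""
    disks = [[] for _ in range(p)]
    for start in range(0, len(records), b):
        block_index = start // b
        disks[block_index % p].append(records[start:start + b])
    return disks


def scan_ios(disks):
    """One parallel I/O transfers at most one block per disk, so a full scan
    needs as many I/Os as the fullest disk holds blocks."""
    return max((len(d) for d in disks), default=0)


# Hypothetical sizes: 100 records, P = 4 disks, B = 8 records per block.
disks = stripe(list(range(100)), p=4, b=8)
ios = scan_ios(disks)
```

With these sizes the scan cost equals ceil(N / (P * B)), which is the best a scan can do in this model.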
A Fast Fourier Transform Compiler
, 1999
Cited by 154 (5 self)
The FFTW library for computing the discrete Fourier transform (DFT) has gained wide acceptance in both academia and industry, because it provides excellent performance on a variety of machines (even competitive with or faster than equivalent libraries supplied by vendors). In FFTW, most of the performance-critical code was generated automatically by a special-purpose compiler, called genfft, that outputs C code. Written in Objective Caml, genfft can produce DFT programs for any input length, and it can specialize the DFT program for the common case where the input data are real instead of complex. Unexpectedly, genfft “discovered” algorithms that were previously unknown, and it was able to reduce the arithmetic complexity of some other existing algorithms. This paper describes the internals of this special-purpose compiler in some detail, and it argues that a specialized compiler is a valuable tool.
Cache-Oblivious B-Trees
, 2000
Cited by 133 (22 self)
This paper presents two dynamic search trees attaining near-optimal performance on any hierarchical memory. The data structures are independent of the parameters of the memory hierarchy, e.g., the number of memory levels, the block-transfer size at each level, and the relative speeds of memory levels. The performance is analyzed in terms of the number of memory transfers between two memory levels with an arbitrary block-transfer size of $B$; this analysis can then be applied to every adjacent pair of levels in a multilevel memory hierarchy. Both search trees match the optimal search bound of $\Theta(1 + \log_{B+1} N)$ memory transfers. This bound is also achieved by the classic B-tree data structure on a two-level memory hierarchy with a known block-transfer size $B$. The first search tree supports insertions and deletions in $\Theta(1 + \log_{B+1} N)$ amortized memory transfers, which matches the B-tree's worst-case bounds. The second search tree supports scanning $S$ consecutive elements optimally in $\Theta(1 + S/B)$ memory transfers and supports insertions and deletions in $\Theta(1 + \log_{B+1} N + \frac{\log^2 N}{B})$ amortized memory transfers, matching the performance of the B-tree for $B = \Omega(\log N \log\log N)$.
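To give a feel for the search bound quoted above, the sketch below evaluates its leading term, 1 + log base (B+1) of N, and compares it with the 1 + log base 2 of N probes of a plain binary search that gets no benefit from blocking. The function names and the sample values of N and B are assumptions chosen for illustration, and constant factors hidden by the Theta notation are ignored.

```python
import math


def search_transfers(n, b):
    """Leading term 1 + log_{b+1}(n) of the search bound (constants ignored)."""
    return 1 + math.log(n, b + 1)


def binary_search_probes(n):
    """Leading term 1 + log_2(n) for a search that touches one element per probe."""
    return 1 + math.log2(n)


# Hypothetical sizes: N = 2^30 keys, blocks of B = 1023 keys.
n, b = 1024 ** 3, 1023
blocked = search_transfers(n, b)
unblocked = binary_search_probes(n)
```

For these sample values the blocked search needs about 4 transfers where the unblocked one needs about 31 probes, a gap of roughly a factor of log base 2 of (B+1).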
External-Memory Computational Geometry
, 1993
Cited by 121 (20 self)
In this paper, we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory, and we use these techniques to develop optimal and practical algorithms for a number of important large-scale problems. We discuss our algorithms primarily in the context of single-processor/single-disk machines, a domain in which they are not only the first known optimal results but also of tremendous practical value. Our methods also produce the first known optimal algorithms for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal both in terms of I/O cost and internal computation.
GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management
, 2006
Cited by 104 (10 self)
We present a new algorithm, GPUTeraSort, to sort billion-record wide-key databases using a graphics processing unit (GPU). Our algorithm uses the data and task parallelism on the GPU to perform memory-intensive and compute-intensive tasks while the CPU is used to perform I/O and resource management. We therefore exploit both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface and achieve higher memory bandwidth than purely CPU-based algorithms. GPUTeraSort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk, and (2) read, merge, write. It also pipelines disk transfers and achieves near-peak I/O performance. We have tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with a $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60GB for a penny; the best reported PennySort price-performance. These results suggest that a GPU coprocessor can significantly improve performance on large data processing tasks.
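The two-phase structure described above (generate sorted runs, then merge them) can be sketched on the CPU alone. The following is a minimal external merge sort, assuming integer records, one record per text line, and an in-memory `sorted` call standing in for the GPU sort; all function names are hypothetical and none of this is GPUTeraSort's actual code.

```python
import heapq
import os
import tempfile


def _write_run(sorted_chunk, tmpdir, index):
    """Write one sorted run to its own file and return the file path."""
    path = os.path.join(tmpdir, f"run_{index}.txt")
    with open(path, "w") as f:
        for rec in sorted_chunk:
            f.write(f"{rec}\n")
    return path


def _read_run(path):
    """Stream the records of one run back from disk, one at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def generate_runs(records, run_size, tmpdir):
    """Phase 1: read, sort each chunk in memory (stand-in for the GPU), write runs."""
    paths, chunk = [], []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == run_size:
            paths.append(_write_run(sorted(chunk), tmpdir, len(paths)))
            chunk = []
    if chunk:
        paths.append(_write_run(sorted(chunk), tmpdir, len(paths)))
    return paths


def two_phase_sort(records, run_size):
    """Phase 2: read all runs and merge them with a k-way heap merge."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = generate_runs(records, run_size, tmpdir)
        return list(heapq.merge(*(_read_run(p) for p in paths)))


result = two_phase_sort([5, 3, 9, 1, 7, 2, 8], run_size=3)
```

The sketch runs the two phases one after the other; overlapping disk transfers with sorting, as the abstract describes, would need additional buffering and concurrency.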
Cache-Oblivious Algorithms
, 1999
Cited by 78 (1 self)
This thesis presents "cache-oblivious" algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cache-line length, need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobi-style multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size $Z$ and cache-line length $L$, where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. The cache complexity of computing n ...
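A common way to obtain a cache-oblivious matrix transpose is divide and conquer: split the longer dimension in half and recurse until the submatrix is small. The sketch below follows that general idea under stated assumptions and is not taken from the thesis; the function names are hypothetical, and `base` is a fixed recursion cutoff that only limits call overhead, not a parameter tuned to any cache. Python lists of lists do not model memory layout faithfully, so this shows the recursion structure only.

```python
def _transpose_block(a, b, r0, r1, c0, c1, base):
    """Write the transpose of a[r0:r1][c0:c1] into b, halving the longer side."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        _transpose_block(a, b, r0, mid, c0, c1, base)
        _transpose_block(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        _transpose_block(a, b, r0, r1, c0, mid, base)
        _transpose_block(a, b, r0, r1, mid, c1, base)


def cache_oblivious_transpose(a, base=4):
    """Return the transpose of the m x n matrix a as a new n x m list of lists."""
    m = len(a)
    n = len(a[0]) if m else 0
    b = [[None] * m for _ in range(n)]
    _transpose_block(a, b, 0, m, 0, n, base)
    return b


matrix = [[i * 7 + j for j in range(7)] for i in range(5)]
transposed = cache_oblivious_transpose(matrix)
```

Because each level of the recursion halves a dimension, some level's subproblems fit in any given cache without the code ever naming that cache's size.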
Cache-oblivious priority queue and graph algorithm applications
 In Proc. 34th Annual ACM Symposium on Theory of Computing
, 2002
Cited by 64 (10 self)
In this paper we develop an optimal cache-oblivious priority queue data structure, supporting insertion, deletion, and delete-min operations in $O(\frac{1}{B} \log_{M/B} \frac{N}{B})$ amortized memory transfers, where $M$ and $B$ are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cache-oblivious data structure, $M$ and $B$ are not used in the description of the structure. The bounds match the bounds of several previously developed external-memory (cache-aware) priority queue data structures, which all rely crucially on knowledge about $M$ and $B$. Priority queues are a critical component in many of the best known external-memory graph algorithms, and using our cache-oblivious priority queue we develop several cache-oblivious graph algorithms.
Faster Integer Multiplication
 STOC'07
, 2007
Cited by 43 (0 self)
For more than 35 years, the fastest known method for integer multiplication has been the Schönhage-Strassen algorithm running in time $O(n \log n \log\log n)$. Under certain restrictive conditions there is a corresponding $\Omega(n \log n)$ lower bound. The prevailing conjecture has always been that the complexity of an optimal algorithm is $\Theta(n \log n)$. We present a major step towards closing the gap from above by presenting an algorithm running in time $n \log n \, 2^{O(\log^* n)}$. The main result is for boolean circuits as well as for multitape Turing machines, but it has consequences to other models of computation as well.
Efficient External-Memory Data Structures and Applications
, 1996
Cited by 38 (12 self)
In this thesis we study the Input/Output (I/O) complexity of large-scale problems arising, e.g., in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/O-efficient algorithms for them. A general theme in our work is to design I/O-efficient algorithms through the design of I/O-efficient data structures. One of our philosophies is to try to isolate all the I/O-specific parts of an algorithm in the data structures, that is, to try to design I/O algorithms from internal memory algorithms by exchanging the data structures used in internal memory with their external memory counterparts. The results in the thesis include a technique for transforming an internal memory tree data structure into an external data structure which can be used in a batched dynamic setting, that is, a setting where we for example do not require that the result of a search operation is returned immediately. Using this technique we develop batched dynamic external versions of the (one-dimensional) range tree and the segment tree and we develop an external priority queue. Following our general philosophy we show how these structures can be used in standard internal memory sorting algorithms