Results 1 - 10 of 25
Cache-oblivious algorithms
, 1999
Cited by 90 (1 self)
This thesis presents "cache-oblivious" algorithms that use asymptotically optimal amounts of work, and move data asymptotically optimally among multiple levels of cache. An algorithm is cache oblivious if no program variables dependent on hardware configuration parameters, such as cache size and cache-line length, need to be tuned to minimize the number of cache misses. We show that the ordinary algorithms for matrix transposition, matrix multiplication, sorting, and Jacobi-style multipass filtering are not cache optimal. We present algorithms for rectangular matrix transposition, FFT, sorting, and multipass filters, which are asymptotically optimal on computers with multiple levels of caches. For a cache with size $Z$ and cache-line length $L$, where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. The cache complexity of computing $n$ time steps of a Jacobi-style multipass filter on an array of size $n$ is $\Theta(1 + n/L + n^2/ZL)$. We also give a $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix
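The divide-and-conquer idea behind a cache-oblivious transpose can be illustrated with a short sketch. This is not code from the thesis: the function name `cache_oblivious_transpose` and the cutoff `SMALL` are assumptions made for illustration, and Python lists do not model cache lines, so the sketch shows only the recursion structure. The cutoff is a fixed constant that limits recursion overhead; it is not tuned to any cache size or line length.

```python
SMALL = 2  # hypothetical fixed cutoff; limits recursion depth, not tuned to hardware


def cache_oblivious_transpose(a):
    """Return the transpose of a rectangular list-of-lists matrix `a`.

    The larger dimension of the current submatrix is halved until the
    submatrix is tiny, so at some recursion depth the working set fits
    in each cache level without the code knowing any cache parameter.
    """
    m = len(a)
    n = len(a[0]) if m else 0
    b = [[None] * m for _ in range(n)]

    def rec(r0, r1, c0, c1):
        rows, cols = r1 - r0, c1 - c0
        if rows <= 0 or cols <= 0:
            return
        if rows <= SMALL and cols <= SMALL:
            # Base case: copy the small block element by element.
            for i in range(r0, r1):
                for j in range(c0, c1):
                    b[j][i] = a[i][j]
        elif rows >= cols:
            mid = r0 + rows // 2  # split the row range
            rec(r0, mid, c0, c1)
            rec(mid, r1, c0, c1)
        else:
            mid = c0 + cols // 2  # split the column range
            rec(r0, r1, c0, mid)
            rec(r0, r1, mid, c1)

    rec(0, m, 0, n)
    return b
```

For example, `cache_oblivious_transpose([[1, 2, 3], [4, 5, 6]])` yields a 3 x 2 matrix whose rows are the columns of the input.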
Algorithms for Parallel Memory II: Hierarchical Multilevel Memories
 ALGORITHMICA
, 1993
Cited by 68 (5 self)
In this paper we introduce parallel versions of two hierarchical memory models and give optimal algorithms in these models for sorting, FFT, and matrix multiplication. In our parallel models, there are P memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using l times the optimal running time is exponentially small in l(log l) log P.
Efficient External Memory Algorithms by Simulating Coarse-Grained Parallel Algorithms
, 2003
Cited by 46 (11 self)
External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posted as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing.
Efficient External-Memory Data Structures and Applications
, 1996
Cited by 37 (9 self)
In this thesis we study the Input/Output (I/O) complexity of large-scale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/O-efficient algorithms for them. A general theme in our work is to design I/O-efficient algorithms through the design of I/O-efficient data structures. One of our philosophies is to try to isolate all the I/O-specific parts of an algorithm in the data structures, that is, to try to design I/O algorithms from internal memory algorithms by exchanging the data structures used in internal memory with their external memory counterparts. The results in the thesis include a technique for transforming an internal memory tree data structure into an external data structure which can be used in a batched dynamic setting, that is, a setting where we for example do not require that the result of a search operation is returned immediately. Using this technique we develop batched dynamic external versions of the (one-dimensional) range tree and the segment tree and we develop an external priority queue. Following our general philosophy we show how these structures can be used in standard internal memory sorting algorithms
Massively parallel algorithms for private-cache chip multiprocessors
, 2008
Cited by 29 (5 self)
In this paper, we study massively parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that can scale to hundreds or even thousands of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way that cores are interconnected, for we assume that all interprocessor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. All algorithms in the paper are asymptotically optimal in terms of the parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache chip multiprocessors. [Regular paper submission to SPAA 2008, which may be considered for a normal track or the special track on multicore systems.]
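To make the prefix-sums building block concrete, here is a minimal sketch of the standard three-phase blocked scheme, run sequentially. It is not the paper's PEM algorithm: the function name `blocked_prefix_sums` and the parameter `num_blocks` (standing in for the number of processors) are assumptions for illustration, and each block is processed in a plain loop rather than on a separate core.

```python
from itertools import accumulate


def blocked_prefix_sums(values, num_blocks):
    """Inclusive prefix sums computed block by block.

    Phase 1 and phase 3 touch each block independently, so in a parallel
    setting each block could be handled by its own processor; only the
    short list of block totals in phase 2 is shared.
    """
    n = len(values)
    if n == 0:
        return []
    num_blocks = max(1, min(num_blocks, n))
    size = -(-n // num_blocks)  # ceiling division: elements per block
    blocks = [values[i:i + size] for i in range(0, n, size)]

    # Phase 1: local inclusive prefix sums inside every block.
    local = [list(accumulate(block)) for block in blocks]

    # Phase 2: exclusive prefix sums over the block totals.
    offsets = [0]
    for sums in local[:-1]:
        offsets.append(offsets[-1] + sums[-1])

    # Phase 3: shift every block by the total of all earlier blocks.
    return [x + off for sums, off in zip(local, offsets) for x in sums]
```

Calling `blocked_prefix_sums([1, 2, 3, 4, 5], 2)` splits the input into blocks of three and two elements and returns the same result as a single sequential scan.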
External-Memory Algorithms with Applications in Geographic Information Systems
, 1996
Heuristics for Scheduling I/O Operations
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1997
Cited by 23 (1 self)
The I/O bottleneck in parallel computer systems has recently begun receiving increasing interest. Most attention has focused on improving the performance of I/O devices using fairly low-level parallelism in techniques such as disk striping and interleaving. Widely applicable solutions, however, will require an integrated approach which addresses the problem at multiple system levels, including applications, systems software, and architecture. We propose that within the context of such an integrated approach, scheduling parallel I/O operations will become increasingly attractive and can potentially provide substantial performance benefits. We describe a simple I/O scheduling problem and present approximate algorithms for its solution. The costs of using these algorithms in terms of execution time, and the benefits in terms of reduced time to complete a batch of I/O operations, are compared with the situations in which no scheduling is used, and in which an optimal scheduling algorithm is used. The comparison is performed both theoretically and experimentally. We have found that, in exchange for a small execution time overhead, the approximate scheduling algorithms can provide substantial improvements in I/O completion times.
Markov Analysis of Multiple-Disk Prefetching Strategies for External Merging
 Theoretical Computer Science
, 1994
Cited by 19 (7 self)
Multiple-disk organizations can be used to improve the I/O performance of problems like external merging. Concurrency can be introduced by overlapping I/O requests at different disks and by prefetching additional blocks on each I/O operation. To support this prefetching, a memory cache is required. Markov models for two prefetching strategies are developed and analyzed. Closed-form expressions for the average parallelism obtainable for a given cache size and number of disks are derived for both prefetching strategies. These analytic results are confirmed by simulation. Keywords: Parallel I/O, Prefetching, Disk Cache, External Merging, Declustered Disks, Markov Chains. To appear in Theoretical Computer Science. Short version in 1992 Intl. Conf. on Parallel Processing. Partially supported by an NSF Graduate Research Fellowship while at the ECE Department, Rice University. Partially supported by NSF Research Initiation Award CCR 9010534. Partially supported by NSF and D...
Portable High-Performance Programs
, 1999
Cited by 19 (0 self)
Optimal Parallel Sorting in Multi-Level Storage
 IN PROCEEDINGS OF THE 5TH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS
, 1994
Cited by 19 (0 self)
We adapt the Sharesort algorithm of Cypher and Plaxton to run on various parallel models of multilevel storage, and analyze its resulting performance. Sharesort was originally defined in the context of sorting n records on an n-processor hypercubic network. In that context, it is not known whether Sharesort is asymptotically optimal. Nonetheless, we find that Sharesort achieves optimal time bounds for parallel sorting in multilevel storage, under a variety of models that have been defined in the literature.