Optimal Parallel Sorting in Multi-Level Storage
 IN PROCEEDINGS OF THE 5TH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS
, 1994
Cited by 17 (0 self)
We adapt the Sharesort algorithm of Cypher and Plaxton to run on various parallel models of multilevel storage, and analyze its resulting performance. Sharesort was originally defined in the context of sorting n records on an n-processor hypercubic network. In that context, it is not known whether Sharesort is asymptotically optimal. Nonetheless, we find that Sharesort achieves optimal time bounds for parallel sorting in multilevel storage, under a variety of models that have been defined in the literature.
Optimizing Parallel SPMD Programs
 In Languages and Compilers for Parallel Computing
, 1994
Cited by 15 (3 self)
We present compiler optimization techniques for explicitly parallel programs that communicate through a shared address space. The source programs are written in a single program multiple data (SPMD) style, and the machine target is a multiprocessor with physically distributed memory and hardware or software support for a single address space. Unlike sequential programs or data-parallel programs, SPMD programs require cycle detection, as defined by Shasha and Snir, to perform any kind of code motion on shared variable accesses. Cycle detection finds those accesses that, if reordered by either the hardware or software, could violate sequential consistency. We improve on Shasha and Snir's algorithm for cycle detection by providing a polynomial-time algorithm for SPMD programs, whereas their formulation leads to an algorithm that is exponential in the number of processors. Once cycles and local dependencies have been computed, we perform optimizations to overlap communication and computa...
Can Parallel Algorithms Enhance Serial Implementation? (Extended Abstract)
, 1996
Cited by 14 (4 self)
The broad thesis presented in this paper suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. It is too early to reach definite conclusions regarding the significance of this thesis. However, using some imagination, validity of the thesis and some arguments supporting it may lead to several far-reaching outcomes: (1) Reliance on "predictability of reference" in the design of computer systems will increase. (2) Parallel algorithms will be taught as part of the standard computer science and engineering undergraduate curriculum irrespective of whether (or when) parallel processing will become ubiquitous in the general-purpose computing world. (3) A strategic agenda for high-performance parallel computing: A multi-stage agenda, which in no stage compromises user-friendliness of the programmer's...
Rectilinear Steiner Tree Minimization on a Workstation
 In Proceedings of the DIMACS Workshop on Computational Support for Discrete Mathematics
, 1992
Cited by 13 (5 self)
We describe a series of optimizations to Dreyfus and Wagner's dynamic program for finding a Steiner minimal tree on a graph. Our interest is in finding rectilinear Steiner minimal trees on k pins, for which the Dreyfus and Wagner algorithm runs in $O(k^2 3^k)$ time. The original, unoptimized, code was hopelessly I/O-bound for k > 17, even on a workstation with 16 megabytes of main memory. Our optimized code runs twenty times faster than the original code. It is not I/O-bound even when run on a fast 8-megabyte workstation with a slow access path to a remote disk. Our most significant optimization technique was to reorder the computation, obtaining locality of reference at all levels of the memory hierarchy. We made some improvements on the Dreyfus-Wagner recurrences, for the rectilinear case. We developed a special-purpose technique for compressing the data in our disk files by a factor of nine. Finally, we found it necessary to repair a subtle flaw in random(), the 4.3bsd Unix ra...
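For orientation, the Dreyfus-Wagner dynamic program mentioned in this abstract can be sketched on a small general graph. This is a textbook-style illustration only: it includes none of the paper's rectilinear, compression, or I/O reordering optimizations. The function name `steiner_tree_cost`, the 0-based vertex numbering, and the `(u, v, weight)` edge format are assumptions made for the example.

```python
def steiner_tree_cost(n, edges, terminals):
    """Cost of a minimum Steiner tree; vertices are 0..n-1, edges are (u, v, weight)."""
    INF = float("inf")
    # Metric closure: all-pairs shortest paths by Floyd-Warshall.
    d = [[INF] * n for _ in range(n)]
    for v in range(n):
        d[v][v] = 0
    for u, v, w in edges:
        if w < d[u][v]:
            d[u][v] = d[v][u] = w
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]

    k = len(terminals)
    if k <= 1:
        return 0
    full = (1 << k) - 1
    # dp[mask][v]: cheapest tree spanning the terminals in mask plus vertex v.
    dp = [[INF] * n for _ in range(full + 1)]
    for i, t in enumerate(terminals):
        dp[1 << i] = list(d[t])
    for mask in range(1, full + 1):
        if mask & (mask - 1) == 0:
            continue  # single-terminal rows were initialised above
        # Merge step: join two subtrees covering a split of mask at vertex v.
        merged = [INF] * n
        sub = (mask - 1) & mask
        while sub:
            other = mask ^ sub
            for v in range(n):
                cost = dp[sub][v] + dp[other][v]
                if cost < merged[v]:
                    merged[v] = cost
            sub = (sub - 1) & mask
        # Extension step: reach v from the merge vertex u by a shortest path.
        for v in range(n):
            dp[mask][v] = min(merged[u] + d[u][v] for u in range(n))
    return min(dp[full])
```

On a star with centre 0 and three unit-weight spokes to terminals 1, 2 and 3, the sketch returns 3 even when direct terminal-to-terminal edges of weight 2 exist, because routing through the non-terminal centre is cheaper. The subset enumeration is the source of the $3^k$ factor quoted in the abstract.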
Towards a Model for Portable Parallel Performance: Exposing the Memory Hierarchy
, 1992
Cited by 12 (2 self)
The challenge of building a program that attains high performance on a variety of parallel computers is formidable. Actually, attaining high performance on a variety of sequential computers is challenging. Indeed, it's hard enough to get high performance on a single sequential computer. Constructing a high-performance program requires detailed knowledge of the computer's architectural features, its memory hierarchy in particular. This knowledge constitutes a detailed, albeit informal, model of computation against which the performance program is written. Similar characteristics must be considered in building a portable high-performance program but the appropriate details are elusive and often unavailable when the program is written. In order to support this type of programming, we call for a generic model. Such a model is parameterized by machine parameters. Judicious specification of these parameters results in a specific model that should capture the performance-relevant features...
Cache-oblivious algorithms (Extended Abstract)
 In Proc. 40th Annual Symposium on Foundations of Computer Science
, 1999
Cited by 12 (1 self)
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$ where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. We also give a $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$ cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
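The divide-and-conquer idea behind a cache-oblivious matrix transpose can be sketched as follows. This is a minimal illustration of the recursion shape, not the paper's implementation: Python lists of lists are not laid out contiguously, so the sketch shows the control structure rather than real cache behaviour. The names `transpose` and `transpose_recursive` are hypothetical, and the `base` cutoff is an assumption that only limits recursion overhead; it is not a parameter tuned to cache size.

```python
def transpose_recursive(a, b, r0, r1, c0, c1, base=4):
    """Write the transpose of the block a[r0:r1][c0:c1] into b."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                b[j][i] = a[i][j]
    elif rows >= cols:
        # Halve the longer dimension so blocks stay roughly square.
        mid = r0 + rows // 2
        transpose_recursive(a, b, r0, mid, c0, c1, base)
        transpose_recursive(a, b, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose_recursive(a, b, r0, r1, c0, mid, base)
        transpose_recursive(a, b, r0, r1, mid, c1, base)


def transpose(a, base=4):
    """Return the transpose of an m x n matrix given as a list of rows."""
    m = len(a)
    n = len(a[0]) if m else 0
    b = [[None] * m for _ in range(n)]
    transpose_recursive(a, b, 0, m, 0, n, base)
    return b
```

Because the recursion keeps halving the larger dimension, some level of the recursion produces blocks that fit in any given cache level, which is the intuition behind the bound quoted in the abstract.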
Cache oblivious algorithms
 Algorithms for Memory Hierarchies, LNCS 2625
, 2003
Cited by 12 (0 self)
The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data structures in this model involves not only an asymptotic analysis of the number of steps executed in terms of the input size, but also the movement of data optimally among the different levels of the memory hierarchy. This chapter is intended as an introduction to the “ideal-cache” model of [22] and techniques used to design cache oblivious algorithms. The chapter also presents some experimental insights and results. Part of this work was done while the author was visiting MPI-Saarbrücken. The...
Towards an Optimal Bit-Reversal Permutation Program
 In Proceedings of IEEE Foundations of Computer Science
, 1998
Cited by 11 (2 self)
The speed of many computations is limited not by the number of arithmetic operations but by the time it takes to move and rearrange data in the increasingly complicated memory hierarchies of modern computers. Array transpose and the bit-reversal permutation (trivial operations on a RAM) present non-trivial problems when designing highly tuned scientific library functions, particularly for the Fast Fourier Transform. We prove a precise bound for RoCol, a simple pebble-type game that is relevant to implementing these permutations. We use RoCol to give lower bounds on the amount of memory traffic in a computer with four levels of memory (registers, cache, TLB, and memory), taking into account such "messy" features as block moves and set-associative caches. The insights from this analysis lead to a bit-reversal algorithm whose performance is close to the theoretical minimum. Experiments show it performs significantly better than every program in a comprehensive study of 30 published algo...
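For readers unfamiliar with the operation, a naive bit-reversal permutation looks like the sketch below. It is the simple RAM-style baseline, not the memory-hierarchy-aware algorithm the paper develops; the function names are hypothetical and the input length is assumed to be a power of two.

```python
def bit_reverse_index(i, bits):
    """Return i with its lowest `bits` bits reversed."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversal_permute(x):
    """Return a copy of x with element i moved to the bit-reversed index of i."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = list(x)
    for i in range(n):
        j = bit_reverse_index(i, bits)
        if i < j:  # swap each pair once; fixed points stay in place
            out[i], out[j] = out[j], out[i]
    return out
```

For example, `bit_reversal_permute(list(range(8)))` yields `[0, 4, 2, 6, 1, 5, 3, 7]`. The scattered access pattern of `out[j]` is what makes the operation expensive on real memory hierarchies despite its trivial operation count.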
Efficient parallel algorithms for closest point problems
, 1994
Cited by 10 (1 self)
This dissertation develops and studies fast algorithms for solving closest point problems. Algorithms for such problems have applications in many areas including statistical classification, crystallography, data compression, and finite element analysis. In addition to a comprehensive empirical study of known sequential methods, I introduce new parallel algorithms for these problems that are both efficient and practical. I present a simple and flexible programming model for designing and analyzing parallel algorithms. Also, I describe fast parallel algorithms for nearest-neighbor searching and constructing Voronoi diagrams. Finally, I demonstrate that my algorithms actually obtain good performance on a wide variety of machine architectures. The key algorithmic ideas that I examine are exploiting spatial locality and random sampling. Spatial decomposition allows many concurrent threads to work independently of one another in local areas of a shared data structure. Random sampling provides a simple way to adaptively decompose irregular problems, and to balance workload among many threads. Used together, these techniques result in effective algorithms for a wide range of geometric problems. The key...
Predicting Performance on SMPs. A Case Study: The SGI Power Challenge
 IN PROCEEDINGS OF THE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2000)
, 2000
Cited by 9 (4 self)
In this work we study the issue of performance prediction on the SGI Power Challenge, a typical representative of the class of shared-memory Symmetric Multi-Processors. On such a platform, the cost of memory accesses varies depending on their locality and on contention among processors. By running a carefully designed suite of microbenchmarks, we provide quantitative evidence that the interaction with the memory hierarchy affects performance far more substantially than other phenomena related to contention. We also fit three cost functions based on variants of the BSP model, which do not account for the hierarchy, and a newly defined function F, expressed in terms of hardware counters, which captures both memory hierarchy and contention effects. We test the accuracy of all the functions on both synthetic and application benchmarks showing that, unlike the other functions, F achieves an excellent level of predictivity in all cases. Although hardware counters are only available at run time, we give evidence that function F can still be employed as a prediction tool by extrapolating values of the counters from pilot runs on small input sizes.
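The abstract does not give the three BSP variants or the function F, so the sketch below uses only the textbook BSP superstep cost, w + g·h + l, where w is the maximum local work, h the maximum number of words sent or received by any processor, g the per-word communication cost, and l the synchronization cost. The function names and the least-squares fitting helper are assumptions for illustration of how such parameters might be fitted from microbenchmark timings; they are not the paper's procedure.

```python
def bsp_superstep_cost(w, h, g, l):
    """Textbook BSP cost of one superstep: local work + communication + barrier."""
    return w + g * h + l


def bsp_program_cost(supersteps, g, l):
    """Total cost of a program given as a list of (w, h) pairs, one per superstep."""
    return sum(bsp_superstep_cost(w, h, g, l) for w, h in supersteps)


def fit_g_and_l(samples):
    """Least-squares fit of g and l from (h, measured_communication_time) pairs.

    Assumes at least two samples with distinct h values.
    """
    n = len(samples)
    mean_h = sum(h for h, _ in samples) / n
    mean_t = sum(t for _, t in samples) / n
    sxx = sum((h - mean_h) ** 2 for h, _ in samples)
    sxy = sum((h - mean_h) * (t - mean_t) for h, t in samples)
    g = sxy / sxx
    l = mean_t - g * mean_h
    return g, l
```

A model of this form has no term for cache or TLB behaviour, which is consistent with the abstract's remark that the BSP-based functions do not account for the memory hierarchy.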