Portable High-Performance Programs
, 1999
Cited by 16 (0 self)
Can Parallel Algorithms Enhance Serial Implementation? (Extended Abstract)
, 1996
Cited by 14 (4 self)
The broad thesis presented in this paper suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. It is too early to reach definite conclusions regarding the significance of this thesis. However, using some imagination, validity of the thesis and some arguments supporting it may lead to several far-reaching outcomes: (1) Reliance on "predictability of reference" in the design of computer systems will increase. (2) Parallel algorithms will be taught as part of the standard computer science and engineering undergraduate curriculum irrespective of whether (or when) parallel processing will become ubiquitous in the general-purpose computing world. (3) A strategic agenda for high-performance parallel computing: A multi-stage agenda, which in no stage compromises user-friendliness of the programmer's...
Rectilinear Steiner Tree Minimization on a Workstation
- In Proceedings of the DIMACS Workshop on Computational Support for Discrete Mathematics
, 1992
Cited by 13 (5 self)
We describe a series of optimizations to Dreyfus and Wagner's dynamic program for finding a Steiner minimal tree on a graph. Our interest is in finding rectilinear Steiner minimal trees on k pins, for which the Dreyfus and Wagner algorithm runs in $O(k^2 3^k)$ time. The original, unoptimized, code was hopelessly I/O-bound for k ? 17, even on a workstation with 16 megabytes of main memory. Our optimized code runs twenty times faster than the original code. It is not I/O-bound even when run on a fast 8-megabyte workstation with a slow access path to a remote disk. Our most significant optimization technique was to reorder the computation, obtaining locality of reference at all levels of the memory hierarchy. We made some improvements on the Dreyfus-Wagner recurrences, for the rectilinear case. We developed a special-purpose technique for compressing the data in our disk files by a factor of nine. Finally, we found it necessary to repair a subtle flaw in random(), the 4.3bsd Unix ra...
Towards a Model for Portable Parallel Performance: Exposing the Memory Hierarchy
, 1992
Cited by 12 (2 self)
The challenge of building a program that attains high performance on a variety of parallel computers is formidable. Actually, attaining high performance on a variety of sequential computers is challenging. Indeed, it's hard enough to get high performance on a single sequential computer. Constructing a high-performance program requires detailed knowledge of the computer's architectural features, its memory hierarchy in particular. This knowledge constitutes a detailed, albeit informal, model of computation against which the performance program is written. Similar characteristics must be considered in building a portable high-performance program but the appropriate details are elusive and often unavailable when the program is written. In order to support this type of programming, we call for a generic model. Such a model is parameterized by machine parameters. Judicious specification of these parameters results in a specific model that should capture the performance-relevant features...
Optimizing Parallel SPMD Programs
- In Languages and Compilers for Parallel Computing
, 1994
Cited by 12 (3 self)
We present compiler optimization techniques for explicitly parallel programs that communicate through a shared address space. The source programs are written in a single program multiple data (SPMD) style, and the machine target is a multiprocessor with physically distributed memory and hardware or software support for a single address space. Unlike sequential programs or data-parallel programs, SPMD programs require cycle detection, as defined by Shasha and Snir, to perform any kind of code motion on shared variable accesses. Cycle detection finds those accesses that, if reordered by either the hardware or software, could violate sequential consistency. We improve on Shasha and Snir's algorithm for cycle detection by providing a polynomial time algorithm for SPMD programs, whereas their formulation leads to an algorithm that is exponential in the number of processors. Once cycles and local dependencies have been computed, we perform optimizations to overlap communication and computa...
Cache oblivious algorithms
- Algorithms for Memory Hierarchies, LNCS 2625
, 2003
Cited by 11 (0 self)
The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data structures in this model involves not only an asymptotic analysis of the number of steps executed in terms of the input size, but also the movement of data optimally among the different levels of the memory hierarchy. This chapter is intended as an introduction to the “ideal-cache” model of [22] and techniques used to design cache oblivious algorithms. The chapter also presents some experimental insights and results. Part of this work was done while the author was visiting MPI-Saarbrücken.
Efficient parallel algorithms for closest point problems
, 1994
Cited by 10 (1 self)
This dissertation develops and studies fast algorithms for solving closest point problems. Algorithms for such problems have applications in many areas including statistical classification, crystallography, data compression, and finite element analysis. In addition to a comprehensive empirical study of known sequential methods, I introduce new parallel algorithms for these problems that are both efficient and practical. I present a simple and flexible programming model for designing and analyzing parallel algorithms. Also, I describe fast parallel algorithms for nearest-neighbor searching and constructing Voronoi diagrams. Finally, I demonstrate that my algorithms actually obtain good performance on a wide variety of machine architectures. The key algorithmic ideas that I examine are exploiting spatial locality and random sampling. Spatial decomposition allows many concurrent threads to work independently of one another in local areas of a shared data structure. Random sampling provides a simple way to adaptively decompose irregular problems, and to balance workload among many threads. Used together, these techniques result in effective algorithms for a wide range of geometric problems.
Cache-oblivious algorithms (Extended Abstract)
- In Proc. 40th Annual Symposium on Foundations of Computer Science
, 1999
Cited by 10 (1 self)
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$ where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. We also give an $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$ cache faults. We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
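As a rough illustration of the cache-oblivious idea in this abstract, the sketch below transposes a rectangular matrix by recursively halving its larger dimension, so no cache size or line length appears anywhere in the code. It is a minimal sketch, not the paper's own code: the names `transpose`, `transpose_recursive`, and the `base` cutoff are assumptions made here, and `base` only limits recursion overhead (it must be at least 1). Python lists of lists are not contiguous in memory, so the block shows the recursion pattern rather than real cache behaviour.

```python
def transpose_recursive(src, dst, r0, r1, c0, c1, base=16):
    """Write the transpose of the block src[r0:r1][c0:c1] into dst.

    After the call, dst[j][i] == src[i][j] for every (i, j) in the block.
    `base` is a hypothetical cutoff that only bounds recursion overhead;
    it is not a cache parameter.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy element by element.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        # Split the longer (row) dimension in half and recurse.
        mid = r0 + rows // 2
        transpose_recursive(src, dst, r0, mid, c0, c1, base)
        transpose_recursive(src, dst, mid, r1, c0, c1, base)
    else:
        # Split the longer (column) dimension in half and recurse.
        mid = c0 + cols // 2
        transpose_recursive(src, dst, r0, r1, c0, mid, base)
        transpose_recursive(src, dst, r0, r1, mid, c1, base)


def transpose(matrix, base=16):
    """Return a new n-by-m list of lists holding the transpose of an m-by-n matrix."""
    m = len(matrix)
    n = len(matrix[0]) if m else 0
    out = [[None] * m for _ in range(n)]
    transpose_recursive(matrix, out, 0, m, 0, n, base)
    return out
```

Calling `transpose([[1, 2, 3], [4, 5, 6]], base=1)` forces the recursion all the way down to single elements, which is the closest match to the idealized algorithm; a larger `base` trades some of that purity for fewer function calls.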
Predicting Performance on SMPs. A Case Study: The SGI Power Challenge
- In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 2000)
, 2000
Cited by 9 (4 self)
In this work we study the issue of performance prediction on the SGI-Power Challenge, a typical representative of the class of shared-memory Symmetric MultiProcessors. On such a platform, the cost of memory accesses varies depending on their locality and on contention among processors. By running a carefully designed suite of microbenchmarks, we provide quantitative evidence that the interaction with the memory hierarchy affects performance far more substantially than other phenomena related to contention. We also fit three cost functions based on variants of the BSP model, which do not account for the hierarchy, and a newly defined function F, expressed in terms of hardware counters, which captures both memory hierarchy and contention effects. We test the accuracy of all the functions on both synthetic and application benchmarks showing that, unlike the other functions, F achieves an excellent level of predictivity in all cases. Although hardware counters are only available at run time, we give evidence that function F can still be employed as a prediction tool by extrapolating values of the counters from pilot runs on small input sizes.
Space-Time Tradeoffs in Memory Hierarchies
, 1993
Cited by 9 (0 self)
The speed of CPUs is accelerating rapidly, outstripping that of peripheral storage devices and making it increasingly difficult to keep CPUs busy. Multilevel memory hierarchies, scaled to simulate single-level memories, are increasing in importance. In this paper we introduce the Memory Hierarchy Game, a multi-level pebble game simulating data movement in memory hierarchies for straight-line computations. This game provides a framework for deriving upper and lower bounds on computation time and the I/O time at each level in a memory hierarchy. We apply this framework to a representative set of problems including matrix multiplication and the Fourier transform. We also discuss conditions on hierarchies under which they act as fast flat memories.

