• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

The Influence of Caches on the Performance of Heaps

by Anthony LaMarca , Richard E. Ladner
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 42
Next 10 →

Cache-Conscious Structure Layout

by Trishul M. Chilimbi , Mark D. Hill, James R. Larus , 1999
"... Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary appro ..."
Abstract - Cited by 164 (8 self) - Add to MetaCart
Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and consequently, their performance. It explores two placement technique-lustering and colorinet improve cache performance by increasing a pointer structure’s spatial and temporal locality, and by reducing cache-conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies-cache-conscious reorganization and cacheconscious allocation--and describes two semi-automatic toolsccmorph and ccmalloc-that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefit-n most cases, significantly outperforming state-of-the-art prefetching.

The influence of caches on the performance of sorting

by Anthony LaMarca, Richard E. Ladner - IN PROCEEDINGS OF THE SEVENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS , 1997
"... We investigate the effect that caches have on the performance of sorting algorithms both experimentally and analytically. To address the performance problems that high cache miss penalties introduce we restructure mergesort, quicksort, and heapsort in order to improve their cache locality. For all t ..."
Abstract - Cited by 104 (3 self) - Add to MetaCart
We investigate the effect that caches have on the performance of sorting algorithms both experimentally and analytically. To address the performance problems that high cache miss penalties introduce we restructure mergesort, quicksort, and heapsort in order to improve their cache locality. For all three algorithms the improvementincache performance leads to a reduction in total execution time. We also investigate the performance of radix sort. Despite the extremely low instruction count incurred by this linear time sorting algorithm, its relatively poor cache performance results in worse overall performance than the e cient comparison based sorting algorithms. For each algorithm we provide an analysis that closely predicts the number of cache misses incurred by the algorithm.

Nonlinear Array Layouts for Hierarchical Memory Systems

by Siddhartha Chatterjee, Vibhor V. Jain, Alvin R. Lebeck, Shyam Mundhra, Mithuna Thottethodi , 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. ..."
Abstract - Cited by 67 (4 self) - Add to MetaCart
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2--5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.

Cache Oblivious Search Trees via Binary Trees of Small Height

by Gerth Stølting Brodal, Rolf Fagerberg, Riko Jacob, Rolf Fagerberg Riko Jacob - In Proc. ACM-SIAM Symp. on Discrete Algorithms , 2002
"... We propose a version of cache oblivious search trees which is simpler than the previous proposal of Bender, Demaine and Farach-Colton and has the same complexity bounds. In particular, our data structure avoids the use of weight balanced B-trees, and can be implemented as just a single array of ..."
Abstract - Cited by 61 (9 self) - Add to MetaCart
We propose a version of cache oblivious search trees which is simpler than the previous proposal of Bender, Demaine and Farach-Colton and has the same complexity bounds. In particular, our data structure avoids the use of weight balanced B-trees, and can be implemented as just a single array of data elements, without the use of pointers. The structure also improves space utilization.

Cache-oblivious priority queue and graph algorithm applications

by Lars Arge, Bryan Holland-minkley, Michael A. Bender, J. Ian Munro, Erik D. Demaine - In Proc. 34th Annual ACM Symposium on Theory of Computing , 2002
"... In this paper we develop an optimal cache-oblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in O ( 1 B logM/B N) amortized memory B transfers, where M and B are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hi ..."
Abstract - Cited by 56 (10 self) - Add to MetaCart
In this paper we develop an optimal cache-oblivious priority queue data structure, supporting insertion, deletion, and deletemin operations in O ( 1 B logM/B N) amortized memory B transfers, where M and B are the memory and block transfer sizes of any two consecutive levels of a multilevel memory hierarchy. In a cache-oblivious data structure, M and B are not used in the description of the structure. The bounds match the bounds of several previously developed external-memory (cache-aware) priority queue data structures, which all rely crucially on knowledge about M and B. Priority queues are a critical component in many of the best known external-memory graph algorithms, and using our cache-oblivious priority queue we develop several cacheoblivious graph algorithms.

A New Implementation and Detailed Study of Breakpoint Analysis

by Bernard M. E. Moret, Stacia Wyman, David A. Bader, Tandy Warnow, Mi Yan , 2001
"... ..."
Abstract - Cited by 47 (16 self) - Add to MetaCart
Abstract not found

Fast Priority Queues for Cached Memory

by Peter Sanders - ACM Journal of Experimental Algorithmics , 1999
"... This paper advocates the adaption of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithm ..."
Abstract - Cited by 46 (6 self) - Add to MetaCart
This paper advocates the adaption of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs

Towards A Discipline Of Experimental Algorithmics

by Bernard M. E. Moret
"... The last 20 years have seen enormous progress in the design of algorithms, but very little of it has been put into practice, even within academia; indeed, the gap between theory and practice has continuously widened over these years. Moreover, many of the recently developed algorithms are very hard ..."
Abstract - Cited by 33 (8 self) - Add to MetaCart
The last 20 years have seen enormous progress in the design of algorithms, but very little of it has been put into practice, even within academia; indeed, the gap between theory and practice has continuously widened over these years. Moreover, many of the recently developed algorithms are very hard to characterize theoretically and, as initially described, suffer from large running-time coefficients. Thus the algorithms and data structures community needs to return to implementation as the standard of value; we call such an approach Experimental Algorithmics. Experimental Algorithmics studies algorithms and data structures by joining experimental studies with the more traditional theoretical analyses. Experimentation with algorithms and data structures is proving indispensable in the assessment of heuristics for hard problems, in the design of test cases, in the characterization of asymptotic behavior of complex algorithms, in the comparison of competing designs for tractabl...

Optimizing Graph Algorithms for Improved Cache Performance

by Joon-sang Park, Michael Penner, Viktor K. Prasanna - In Proc. Int’l Parallel and Distributed Processing Symp. (IPDPS 2002), Fort Lauderdale, FL , 2002
"... In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cache-oblivious implementation of the Floyd-Warshall Algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the it ..."
Abstract - Cited by 27 (0 self) - Add to MetaCart
In this paper, we develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. We present a cache-oblivious implementation of the Floyd-Warshall Algorithm for the fundamental graph problem of all-pairs shortest paths by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of Ω(N 3 / C), where N and C are the problem size and cache size respectively. Experimental results show that this cache-oblivious implementation shows more than 6× improvement in real execution time over that of the iterative implementation with the usual row major data layout, on three state-of-the-art architectures. Secondly, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for Minimum Spanning Tree. For these algorithms, we demonstrate up to 2 × improvement in real execution time by using a simple cachefriendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show performance improvements of 2 ×- 3 × in real execution time by using the technique of making the algorithm initially work on sub-problems to generate a sub-optimal solution and then solving the whole problem using the sub-optimal solution as a starting point.

Industrial Applications of High-Performance Computing for Phylogeny Reconstruction

by David A. Bader, Bernard Moret, Lisa Vawter , 2001
"... Phylogenies (that is, tree-of-life relationships) derived from gene order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Real-world interest is strong in determining these relationships. For example, pharmaceutical companies may use phylogeny reconstru ..."
Abstract - Cited by 25 (3 self) - Add to MetaCart
Phylogenies (that is, tree-of-life relationships) derived from gene order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Real-world interest is strong in determining these relationships. For example, pharmaceutical companies may use phylogeny reconstruction in drug discovery for finding plants with similar gene production. Health organizations study the evolution and spread of viruses such as HIV to gain understanding of future outbreaks. And governments are interested in aiding the production of foodstuffs like rice, wheat, and corn, by understanding the genetic code. Yet very few techniques are available for such phylogenetic reconstructions. Appropriate tools for analyzing such data may help resolve some difficult phylogenetic reconstruction problems; indeed, this new source of data has been embraced by many biologists in their phylogenetic work. With the rapid accumulation of whole genome sequences for a wide diversity of taxa, phylogenetic reconstruction based on changes in gene order and gene content is showing promise, particularly for resolving deep (i.e., old) branches. However, reconstruction from gene-order data is even more computationally intensive than reconstruction from sequence data, particularly in groups with large numbers of genes and highly rearranged genomes. We have developed a software suite, GRAPPA, that extends the breakpoint analysis (BPAnalysis) method of Sankoff and Blanchette while running much faster: in a recent analysis of a collection of chloroplast data for species of Campanulaceae on a 512-processor Linux supercluster with Myrinet, we achieved a one-million-fold speedup over BPAnalysis. GRAPPA currently can use either breakpoint or inversion distance (computed exactly) for its computati...
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University