What Every Programmer Should Know About Memory (2007)

by Ulrich Drepper
Results 1 - 10 of 15

Investigating cache parameters of x86 family processors

by Vlastimil Babka - in Proceedings of the SPEC Benchmark Workshop 2009, ser. LNCS, 2009
Cited by 5 (4 self)
Abstract. The excellent performance of the contemporary x86 processors is partially due to the complexity of their memory architecture, which therefore plays a role in performance engineering efforts. Unfortunately, the detailed parameters of the memory architecture are often not easily available, which makes it difficult to design experiments and evaluate results when the memory architecture is involved. To remedy this lack of information, we present experiments that investigate detailed parameters of the memory architecture, focusing on such information that is typically not available elsewhere.

Cache Replacement Policies for Multicore Processors

by Avinatan Hassidim, 2009
Cited by 1 (0 self)
Almost all modern computers use multiple cores, and the number of cores is expected to increase as hardware prices go down and Moore’s law fails to hold. Most of the theoretical algorithmic work so far has focused on the setting where multiple cores are performing the same task. Indeed, one is tempted to assume that when the cores are independent, the current design performs well. This work refutes this assumption by showing that even when the cores run completely independent tasks, there exist dependencies arising from running on the same chip and using the same cache. These dependencies cause the standard caching algorithms to underperform. To address the new challenge, we revisit some aspects of the classical caching design. More specifically, we focus on the page replacement policy of the first cache shared between all the cores (usually the L2 cache). We make the simplifying assumption that since the cores are running independent tasks, they are accessing disjoint memory locations (in particular, this means that maintaining coherency is not an issue). We show that even under this simplifying assumption, the multicore case is fundamentally different from the single-core case. In particular: (1) LRU performs poorly, even with resource augmentation; (2) the offline version of the caching problem is NP-complete. Any attempt to design an efficient cache for a multicore machine in which the cores may access the same memory has to perform well also in this simpler setting. We provide some intuition as to what an efficient solution could look like by (1) partly characterizing the offline solution, showing that it is determined by the part of the cache which is devoted to each core at every timestep, and (2) presenting a PTAS for the offline problem, for some range of the parameters. In recent years, multicore caching has been the subject of extensive experimental research. The conclusions of some of these works are that LRU is inefficient in practice. The heuristics which they propose to replace it are based on dividing the cache between cores and handling each part independently. Our work can be seen as a theoretical explanation of the results of these experiments.
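
As a toy illustration of the claim that LRU can perform poorly on a shared cache even when the cores touch disjoint memory, the sketch below compares one shared LRU cache with a statically partitioned one. The two traces, the cache size of 4 lines, and the 3+1 split are assumptions chosen for illustration only; this is not the construction or the algorithm from the paper.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count the misses of an LRU cache holding `capacity` lines on `trace`."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return misses


def interleave(a, b):
    """Alternate accesses from two cores into one shared-cache trace."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out


ROUNDS = 100
# Hypothetical workloads on disjoint addresses (tagged by core name).
core_a = [("A", i % 3) for i in range(ROUNDS)]  # loops over 3 lines
core_b = [("B", i) for i in range(ROUNDS)]      # streams, never reuses a line

CAPACITY = 4
shared = lru_misses(interleave(core_a, core_b), CAPACITY)
# Static partition of the same 4 lines: 3 for core A, 1 for core B.
partitioned = lru_misses(core_a, 3) + lru_misses(core_b, 1)

print("shared LRU misses:     ", shared)
print("partitioned LRU misses:", partitioned)
```

On this trace the shared cache misses on all 200 accesses, because the streaming lines of core B keep pushing the three-line loop of core A out of the cache, while the partitioned cache misses 103 times.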

OPTIMIZING THE DOUBLE DESCRIPTION METHOD FOR NORMAL SURFACE ENUMERATION

by Benjamin A. Burton
Cited by 1 (1 self)
Abstract. Many key algorithms in 3-manifold topology involve the enumeration of normal surfaces, which is based upon the double description method for finding the vertices of a convex polytope. Typically we are only interested in a small subset of these vertices, thus opening the way for substantial optimization. Here we give an account of the vertex enumeration problem as it applies to normal surfaces and present new optimizations that yield strong improvements in both running time and memory consumption. The resulting algorithms are tested using the freely available software package Regina.

FP7-215013 D3.3: Resource Usage Modeling

by Vlastimil Babka, Lubomír Bulej, Johan Kraft, Peter Libič, Cristina Seceleanu, 2009
Project name: Contract number: Project deliverable: Author(s): Work package: Work package leader:

Evaluating Multi-Core Platforms for HPC . . .

by Alexander S. van Amesfoort, Ana L. Varbanescu, Henk J. Sips, Rob V. van Nieuwpoort, 2009
Multi-core platforms have proven themselves able to accelerate numerous HPC applications. But programming data-intensive applications on such platforms is a hard, and not yet solved, problem. Not only do modern processors favor compute-intensive code, they also have diverse architectures and incompatible programming models. And even after making a difficult platform choice, extensive programming effort must be invested with an uncertain performance outcome. By taking the plunge on an irregular, data-intensive application, we present an evaluation of three platform types, namely the generic multi-core CPU, the STI Cell/B.E., and the GPU. We evaluate these platforms in terms of application performance, programming effort, and cost. Although we do not select a clear winner, we do provide a list of guidelines to assist in platform choice and development of similar data-intensive applications.

State-of-the-art in heterogeneous computing

by Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, Olaf O. Storaasli, 2010
Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.

Faculty of Mathematics and Physics

by unknown authors, 2008
I would like to thank my supervisor, Doc. Ing. Petr Tůma, Dr., for his guidance, advice, and the time he spent helping me with this thesis. I would also like to thank all members of the Distributed Systems Research Group for their help and encouragement. Last but not least, I thank my whole family for their endless support. I declare that I have written this thesis on my own and listed all references used.

The Offline Scheduler for Embedded Vehicular Systems

by Raz Ben-yehuda, Yair Wiseman
Abstract—Nowadays, various means of transportation use Linux as the operating system for their embedded control systems; however, Linux uses all the processes in the hardware equally. This motivates some designers to develop an entirely new operating system for the vehicle, whereas a modification of Linux can produce the same or even a better product. This paper explains how this modification was designed and implemented in a commercial transportation company.

The Offline Scheduler for Embedded Transportation Systems

by Raz Ben-yehuda, Bitband Ltd
Abstract—Nowadays, various means of transportation use Linux as the operating system for their embedded control systems; however, Linux uses all the processes in the hardware equally. This motivates some designers to develop an entirely new operating system for the vehicle, whereas a modification of Linux can produce the same or even a better product. This paper explains how this modification was designed and implemented in a commercial transportation company.

Controlling Cache Utilization of HPC Applications

by Swann Perarnau, Marc Tchiboukdjian, Guillaume Huard, 2011
This paper discusses the use of software cache partitioning techniques to study and improve the cache behavior of HPC applications. Cache partitioning is traditionally considered a hardware/OS solution to shared-cache issues, particularly to the fairness of resource utilization between multiple processes. We believe that, in the HPC context of a single application being studied or optimized on the system, with a single thread per core, cache partitioning can be used in new and interesting ways. First, we propose an implementation of software cache partitioning using the well-known page coloring technique. This implementation differs from existing work by giving control of the partitioning to the application programmer. Developed on the most popular OS in HPC (Linux), this cache control scheme has low overhead in both memory and CPU while being simple to use. Second, we show how this user-controlled cache partitioning can lead to efficient measurements of the cache behavior of a parallel scientific visualization application. While existing works require expensive binary instrumentation of an application to obtain its working sets, our method only needs a few unmodified runs on the target platform. Finally, we discuss the use of our scheme to optimize memory-intensive applications by isolating each of their critical data structures into dedicated cache partitions. This isolation allows the analysis of each structure's cache requirements and leads to new and significant optimization strategies. To the best of our knowledge, no other existing tool enables such tuning of HPC applications.
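
The page coloring technique mentioned above can be sketched as follows. This is a minimal illustration of the general idea, not the implementation from the paper: it assumes a physically indexed, set-associative cache whose set index comes directly from the low-order physical address bits (no address hashing), and the function names and the 4 MiB, 16-way example cache are hypothetical.

```python
def num_colors(cache_size, associativity, page_size=4096):
    """Number of page colors: how many pages fit in one cache way."""
    way_size = cache_size // associativity
    return max(1, way_size // page_size)


def page_color(phys_addr, cache_size, associativity, page_size=4096):
    """Color of the physical page containing `phys_addr`.

    Pages with the same color compete for the same group of cache sets;
    pages with different colors never evict each other.
    """
    frame_number = phys_addr // page_size
    return frame_number % num_colors(cache_size, associativity, page_size)


def frames_with_colors(frame_numbers, allowed_colors, cache_size,
                       associativity, page_size=4096):
    """Keep only the page frames whose color is in `allowed_colors`.

    Backing a data structure exclusively with such frames confines it
    to the corresponding fraction of the cache.
    """
    n = num_colors(cache_size, associativity, page_size)
    return [f for f in frame_numbers if f % n in allowed_colors]


# Assumed example cache: 4 MiB, 16-way, with 4 KiB pages.
CACHE_SIZE = 4 * 2**20
WAYS = 16

print(num_colors(CACHE_SIZE, WAYS))
print(page_color(0x41000, CACHE_SIZE, WAYS))
print(frames_with_colors(range(130), {1}, CACHE_SIZE, WAYS))
```

With these assumed parameters one cache way spans 256 KiB, giving 64 colors, so restricting a data structure to a single color confines it to 1/64 of the cache.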