Results 1 
3 of
3
Cache Replacement Policies for Multicore Processors
, 2009
"... Almost all of the modern computers use multiple cores, and the number of cores is expected to increase as hardware prices go down, and Moore’s law fails to hold. Most of the theoretical algorithmic work so far has focused on the setting where multiple cores are performing the same task. Indeed, one ..."
Abstract

Cited by 2 (0 self)
 Add to MetaCart
Almost all of the modern computers use multiple cores, and the number of cores is expected to increase as hardware prices go down, and Moore’s law fails to hold. Most of the theoretical algorithmic work so far has focused on the setting where multiple cores are performing the same task. Indeed, one is tempted to assume that when the cores are independent then the current design performs well. This work infirms this assumption by showing that even when the cores run completely independent tasks, there exist dependencies arising from running on the same chip, and using the same cache. These dependencies cause the standard caching algorithms to underperform. To address the new challenge, we revisit some aspects of the classical caching design. More specifically, we focus on the page replacement policy of the first cache shared between all the cores (usually the L2 cache). We make the simplifying assumption that since the cores are running independent tasks, they are accessing disjoint memory locations (in particular this means that maintaining coherency is not an issue). We show, that even under this simplifying assumption, the multicore case is fundamentally different then the single core case. In particular 1. LRU performs poorly, even with resource augmentation. 2. The offline version of the caching problem is NP complete. Any attempt to design an efficient cache for a multicore machine in which the cores may access the same memory has to perform well also in this simpler setting. We provide some intuition to what an efficient solution could look like, by 1. Partly characterizing the offline solution, showing that it is determined by the part of the cache which is devoted to each core at every timestep. 2. Presenting a PTAS for the offline problem, for some range of the parameters. In the recent years, multicore caching was the subject of extensive experimental research. The conclusions of some of these works are that LRU is inefficient in practice. The heuristics which they propose to replace it are based on dividing the cache between cores, and handling each part independently. Our work can be seen as a theoretical explanation to the results of these experiments.
OntheSublinearProcessorGap for Parallel Architectures
"... Abstract. In the past, parallel algorithms were developed, for the most part, under the assumption that the number of processors is Θ(n) (where n is the size of the input) and that if in practice the actual number was smaller, this could be resolved using Brent’s Lemma to simulate the highly paralle ..."
Abstract
 Add to MetaCart
Abstract. In the past, parallel algorithms were developed, for the most part, under the assumption that the number of processors is Θ(n) (where n is the size of the input) and that if in practice the actual number was smaller, this could be resolved using Brent’s Lemma to simulate the highly parallel solution on a lowerdegree parallel architecture. In this paper, however, we argue that design and implementation issues of algorithms and architectures are significantly different—both in theory and in practice—between computational models with high and low degrees of parallelism. We report an observed gap in the behavior of a parallel architecture depending on the number of processors. This gap appears repeatedly in both empirical cases, when studying practical aspects of architecture design and program implementation as well as in theoretical instances when studying the behaviour of various parallel algorithms. It separates the performance, design and analysis of systems with a sublinear number of processors and systems with linearly many processors. More specifically we observe that systems with either logarithmically many cores or with O(n α)cores(withα<1) exhibit a qualitatively different behavior than a system with a linear number of cores on the size of the input, i.e., Θ(n).Theevidencewepresentsuggeststheexistence of a sharp theoretical gap between the classes of problems that can be efficiently parallelized with o(n) processors and with Θ(n) processors unless P = NC. 1
On the Sublinear Processor Gap for MultiCore Architectures
"... Abstract. In the past, parallel algorithms were developed, for the most part, under the assumption that the number of processors is Θ(n) and that if in practice the actual number was smaller, this could be resolved using Brent’s Lemma to simulate the highly parallel solution on a lowerdegree parall ..."
Abstract
 Add to MetaCart
Abstract. In the past, parallel algorithms were developed, for the most part, under the assumption that the number of processors is Θ(n) and that if in practice the actual number was smaller, this could be resolved using Brent’s Lemma to simulate the highly parallel solution on a lowerdegree parallel architecture. In this paper, however, we argue that design and implementation issues of algorithms and architectures are significantly different—both in theory and in practice—between computational models with high and low degrees of parallelism. We report an observed gap in the behavior of a CMP/parallel architecture depending on the number of processors. This gap appears repeatedly in both empirical cases, when studying practical aspects of architecture design and program implementation as well as in theoretical instances when studying the behaviour of various parallel algorithms. It separates the performance, design and analysis of systems with a sublinear number of processors and systems with linearly many processors. More specifically we observe that systems with either logarithmically many cores or with O(n α) cores (with α < 1) exhibit a qualitatively different behavior than a system with a linear number of cores on the size of the input, i.e. Θ(n). The evidence we present suggests the existence of a sharp theoretical gap between the classes of problems that can be efficiently parallelized with o(n) processors and with Θ(n) processors unless NC = P. 1