Results 1–4 of 4
Cache Replacement Policies for Multicore Processors, 2009
Cited by 7 (1 self)
Almost all modern computers use multiple cores, and the number of cores is expected to increase as hardware prices go down and Moore's law fails to hold. Most of the theoretical algorithmic work so far has focused on the setting where multiple cores are performing the same task. Indeed, one is tempted to assume that when the cores are independent, the current design performs well. This work refutes this assumption by showing that even when the cores run completely independent tasks, there exist dependencies arising from running on the same chip and using the same cache. These dependencies cause the standard caching algorithms to underperform. To address the new challenge, we revisit some aspects of the classical caching design. More specifically, we focus on the page replacement policy of the first cache shared between all the cores (usually the L2 cache). We make the simplifying assumption that since the cores are running independent tasks, they are accessing disjoint memory locations (in particular, this means that maintaining coherency is not an issue). We show that even under this simplifying assumption, the multicore case is fundamentally different than the single-core case. In particular: (1) LRU performs poorly, even with resource augmentation. (2) The offline version of the caching problem is NP-complete. Any attempt to design an efficient cache for a multicore machine in which the cores may access the same memory has to perform well also in this simpler setting. We provide some intuition as to what an efficient solution could look like, by (1) partly characterizing the offline solution, showing that it is determined by the part of the cache which is devoted to each core at every timestep, and (2) presenting a PTAS for the offline problem, for some range of the parameters. In recent years, multicore caching has been the subject of extensive experimental research. The conclusions of some of these works are that LRU is inefficient in practice. The heuristics which they propose to replace it are based on dividing the cache between cores and handling each part independently. Our work can be seen as a theoretical explanation of the results of these experiments.
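The cross-core dependency the abstract describes can be illustrated with a toy simulation (this is an invented sketch, not the paper's construction: the traces, cache sizes, and the 8/1 split are all assumptions chosen to make the effect visible). One core loops over a small hot working set while the other streams with no reuse; under a shared LRU cache the streaming core evicts the hot pages, while a static partition of the same frames protects them:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU page cache that counts misses over a request stream."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.frames = OrderedDict()
        self.misses = 0

    def access(self, page):
        if page in self.frames:
            self.frames.move_to_end(page)        # hit: page becomes most recent
        else:
            self.misses += 1
            if len(self.frames) >= self.capacity:
                self.frames.popitem(last=False)  # evict least recently used
            self.frames[page] = None

# Core A loops over an 8-page hot set; core B streams with no reuse.
# The cores access disjoint pages, as in the abstract's assumption.
trace_a = [("A", p) for p in list(range(8)) * 25]  # 200 requests
trace_b = [("B", p) for p in range(200)]           # 200 requests
interleaved = [r for pair in zip(trace_a, trace_b) for r in pair]

# One shared 9-frame LRU: B's streaming pages keep evicting A's hot set.
shared = LRUCache(9)
for req in interleaved:
    shared.access(req)

# Static partition of the same 9 frames: 8 for A, 1 for B.
part = {"A": LRUCache(8), "B": LRUCache(1)}
for core, page in interleaved:
    part[core].access((core, page))

shared_misses = shared.misses                      # every access misses: 400
partitioned_misses = part["A"].misses + part["B"].misses  # 8 cold + 200 = 208
print(shared_misses, partitioned_misses)
```

Under the shared cache, the reuse distance of each of A's pages (15 distinct pages) exceeds the 9 frames, so even A's perfectly cacheable loop misses on every access; the partitioned cache reduces total misses to the unavoidable minimum, echoing the partition-based heuristics the abstract mentions.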
Making queries tractable on big data with preprocessing. In VLDB, 2013
Cited by 3 (2 self)
A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data offline, so that queries in the class can subsequently be evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set of Π-tractable queries, denoted by ΠT0Q, to characterize classes of queries that can be answered in parallel polylogarithmic time (NC) after PTIME preprocessing. (2) We show that several natural query classes are Π-tractable and are feasible on big data. (3) We also study a set ΠTQ of query classes that can be effectively converted to Π-tractable queries by re-factorizing their data and queries for preprocessing. We introduce a form of NC reductions to characterize such conversions. (4) We show that a natural query class is complete for ΠTQ. (5) We also show that ΠT0Q ⊂ P unless P = NC, i.e., the set ΠT0Q of all Π-tractable queries is properly contained in the set P of all PTIME queries. Nonetheless, ΠTQ = P, i.e., all PTIME query classes can be made Π-tractable via proper re-factorizations. This work is a step towards understanding the tractability of queries in the context of big data.
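The preprocess-then-query paradigm the abstract formalizes can be sketched with the most elementary instance (an illustration of the general idea only, not the paper's ΠT0Q construction; the data and the `count_leq` query are invented for the example): spend PTIME once sorting the data, so that each subsequent query runs in logarithmic time via binary search.

```python
import bisect

# Offline (PTIME) preprocessing: sort once, in O(n log n).
data = [17, 3, 42, 8, 8, 25, 1]
index = sorted(data)

def count_leq(x):
    """Online query: how many stored values are <= x?  O(log n) per query
    after preprocessing, versus O(n) on the raw data."""
    return bisect.bisect_right(index, x)

print(count_leq(8))   # 4 values (1, 3, 8, 8) are <= 8
```

The paper's notion is stronger, requiring parallel polylogarithmic (NC) query evaluation after the PTIME preprocessing step, but the division of labor is the same: pay a one-time polynomial cost so that the per-query cost no longer scales with the full data size.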
On the Sublinear Processor Gap for Multi-Core Architectures
In the past, parallel algorithms were developed, for the most part, under the assumption that the number of processors is Θ(n), and that if in practice the actual number was smaller, this could be resolved using Brent's Lemma to simulate the highly parallel solution on a lower-degree parallel architecture. In this paper, however, we argue that design and implementation issues of algorithms and architectures are significantly different, both in theory and in practice, between computational models with high and low degrees of parallelism. We report an observed gap in the behavior of a CMP/parallel architecture depending on the number of processors. This gap appears repeatedly, both in empirical cases, when studying practical aspects of architecture design and program implementation, and in theoretical instances, when studying the behavior of various parallel algorithms. It separates the performance, design, and analysis of systems with a sublinear number of processors from systems with linearly many processors. More specifically, we observe that systems with either logarithmically many cores or with O(n^α) cores (with α < 1) exhibit qualitatively different behavior than a system whose number of cores is linear in the input size, i.e., Θ(n). The evidence we present suggests the existence of a sharp theoretical gap between the classes of problems that can be efficiently parallelized with o(n) processors and with Θ(n) processors unless NC = P.
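The Brent's Lemma simulation the abstract refers to can be made concrete with a small step-count model (an invented illustration, not the paper's experiment; `brent_steps` and its parameters are assumptions of the sketch). Summing n values with a binary tree takes about log₂ n rounds on n/2 processors; with only p processors, each round's operations are shared out, giving roughly n/p + log n steps in total:

```python
import math

def brent_steps(n, p):
    """Steps to sum n values on p processors, simulating the n/2-processor
    binary-tree algorithm round by round (Brent's Lemma style scheduling)."""
    steps = 0
    remaining = n
    while remaining > 1:
        ops = remaining // 2           # pairwise additions in this round
        steps += math.ceil(ops / p)    # p processors share the round's ops
        remaining = remaining - ops    # ceil(remaining / 2) values survive
    return steps

n = 1 << 20
print(brent_steps(n, n // 2))   # fully parallel: log2(n) = 20 steps
print(brent_steps(n, 1024))     # sublinear p: about n/p + log n steps
```

The simulation preserves total work, so asymptotic efficiency is unharmed; the paper's point is that this accounting hides qualitative differences, observed both empirically and theoretically, between the o(n)-processor and Θ(n)-processor regimes.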