Results 1 - 5 of 5
Graph Algorithms for Multicores with Multilevel Caches
, 2009
Abstract

Cited by 1 (0 self)
Historically, the primary model of computation employed in the design and analysis of algorithms has been the sequential RAM model. However, recent developments in computer architecture have reduced the efficacy of the sequential RAM model for algorithmic development. In response, theoretical computer scientists have developed models of computation which better reflect these modern architectures. In this project, we consider a variety of graph problems on parallel, cache-efficient, and multicore models of computation. We introduce each model by defining how algorithms are analyzed on it. Then, for each model, we present current results for the problems of prefix sums, list ranking, various tree problems, connected components, and minimum spanning tree. Finally, we present our novel results, which include the multicore-oblivious extension of current results on a private-cache multicore model to a more general multilevel multicore
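The abstract lists prefix sums among the surveyed problems. As a minimal illustration (not taken from the project itself), the classic work-efficient parallel prefix-sums pattern can be sketched as an up-sweep/down-sweep over a balanced tree; it is written sequentially here, but each level's loop iterations are independent, which is what makes the pattern parallelizable:

```python
def prefix_sums(a):
    """Exclusive prefix sums via the up-sweep/down-sweep (Blelloch scan)
    pattern. Assumes len(a) is a power of two."""
    n = len(a)
    t = list(a)
    # Up-sweep: accumulate partial sums up a balanced tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            t[i + 2 * d - 1] += t[i + d - 1]
        d *= 2
    # Down-sweep: push prefixes back down the tree.
    t[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            left = t[i + d - 1]
            t[i + d - 1] = t[i + 2 * d - 1]
            t[i + 2 * d - 1] += left
        d //= 2
    return t

# prefix_sums([1, 2, 3, 4]) -> [0, 1, 3, 6]
```

Both sweeps do O(n) total work across O(log n) levels, which is why this pattern appears throughout the parallel and cache-efficient models the project surveys.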
Efficient Scheduling for Parallel Memory Hierarchies (Regular Submission)
Abstract
This paper presents a scheduling algorithm for efficiently implementing nested-parallel computations on parallel memory hierarchies (trees of caches). To capture the cache cost of nested-parallel computations we introduce a parallel version of the ideal cache model. In this model, algorithms can be written cache-obliviously (no choices are made based on machine parameters) and analyzed using a single level of cache with parameters Z (cache size) and L (cache line size), and a parameter α specifying the algorithm's parallelism (for input size n, n^α represents the number of processors that can be effectively used). For several fundamental algorithms we show that the cache cost in the parallel ideal cache model is optimal, matching the sequential bounds, with parallelism α → 1. For example, for cache-oblivious sorting of n keys, the cache cost is Q*(n; Z, L) = Θ((n/L) log_{Z+2} n). Our scheduler guarantees that the number of misses across all caches at each level i of the machine's hierarchy is at most the cache cost Q*(n; Z_i/3, L_i) as analyzed for an algorithm. Machine hierarchies are modeled as trees of caches using a symmetric variant of the parallel memory hierarchy (PMH) model. In this model, every cache at level i has size Z_i, line size L_i, transfer cost C_i (the cost of fetching a line of data from its parent cache at level i + 1), and child fanout f_i. Each leaf node (level 0) is a processor, with parameters set so that its cost corresponds to the processor's work (i.e., its instruction count). Finally, we show that if the algorithm's parallelism exceeds the machine's parallelism (as defined in the paper), the work is balanced, including the cost of cache misses. In particular, for an h-level memory hierarchy, our scheduler guarantees a total runtime of T(n) = O(∑_{i=0}^{h−1} C_i · Q̂_α(n; Z_i/3, L_i))
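The runtime bound at the end of this abstract is just a sum of per-level transfer costs weighted by per-level cache costs. A small sketch, using the sorting cost Q*(n; Z, L) = (n/L) log_{Z+2} n quoted above (constants dropped) and purely hypothetical machine parameters, shows how such a bound would be evaluated for a 3-level hierarchy:

```python
import math

def sort_cache_cost(n, Z, L):
    """Cache cost Q*(n; Z, L) = (n/L) * log_{Z+2}(n) for cache-oblivious
    sorting, as stated in the abstract (asymptotic constants dropped)."""
    return (n / L) * (math.log(n) / math.log(Z + 2))

def total_runtime_bound(n, levels):
    """Evaluate T(n) = sum_i C_i * Q*(n; Z_i / 3, L_i) over an h-level
    hierarchy. `levels` is a list of (Z_i, L_i, C_i) tuples; the Z/3
    shrinkage is the slack the scheduler's guarantee requires."""
    return sum(C * sort_cache_cost(n, Z / 3, L) for Z, L, C in levels)

# Hypothetical hierarchy: (cache size in bytes, line size, transfer cost)
# per level -- illustrative numbers only, not taken from the paper.
hierarchy = [(32_768, 64, 4), (262_144, 64, 12), (8_388_608, 64, 40)]
bound = total_runtime_bound(1_000_000, hierarchy)
```

The point of the model is that the algorithm is analyzed once, cache-obliviously, and the same Q* is then plugged in at every level with that level's parameters.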
Accurate and Fast Simulations of Large Scale Distributed Computing Systems
, 2011
Abstract
Thesis defended publicly on [date to be determined], before a jury composed of:
On the Sublinear Processor Gap for Multi-Core Architectures
Abstract
Abstract. In the past, parallel algorithms were developed, for the most part, under the assumption that the number of processors is Θ(n), and that if in practice the actual number was smaller, this could be resolved using Brent's Lemma to simulate the highly parallel solution on a lower-degree parallel architecture. In this paper, however, we argue that design and implementation issues of algorithms and architectures are significantly different, both in theory and in practice, between computational models with high and low degrees of parallelism. We report an observed gap in the behavior of a CMP/parallel architecture depending on the number of processors. This gap appears repeatedly, both in empirical cases, when studying practical aspects of architecture design and program implementation, and in theoretical instances, when studying the behavior of various parallel algorithms. It separates the performance, design, and analysis of systems with a sublinear number of processors from those of systems with linearly many processors. More specifically, we observe that systems with either logarithmically many cores or with O(n^α) cores (with α < 1) exhibit qualitatively different behavior than a system whose number of cores is linear in the input size, i.e., Θ(n). The evidence we present suggests the existence of a sharp theoretical gap between the classes of problems that can be efficiently parallelized with o(n) processors and with Θ(n) processors, unless NC = P.
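The Brent's Lemma simulation this abstract refers to bounds the running time on p processors by T_p ≤ T_1/p + T_∞, where T_1 is the total work and T_∞ the span (critical-path length). A toy calculation (hypothetical work/span figures, not from the paper) makes the sublinear-vs-linear contrast concrete:

```python
import math

def brent_bound(work, span, p):
    """Upper bound on parallel time from Brent's Lemma:
    T_p <= work / p + span, where work = T_1 and span = T_infinity."""
    return work / p + span

# Example: a reduction over n items has work ~ n and span ~ log n.
n = 1_000_000
work, span = n, math.ceil(math.log2(n))

few = brent_bound(work, span, int(n ** 0.5))  # sublinear: ~n^(1/2) cores
many = brent_bound(work, span, n)             # linear: Theta(n) cores
```

With Θ(n) cores the span term dominates, while with o(n) cores the work/p term does; the abstract's claim is that this quantitative difference reflects a genuine qualitative gap in design and analysis, not just a constant factor Brent's Lemma can absorb.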