Results 1-10 of 29
Oblivious algorithms for multicores and network of processors, 2009
Abstract

Cited by 29 (9 self)
We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multi-level caching model for multicores, and we propose a multicore-oblivious approach to algorithms and schedulers for HM. We instantiate this approach with provably efficient multicore-oblivious algorithms for matrix and prefix sum computations, FFT, the Gaussian Elimination paradigm (which represents an important class of computations including Floyd-Warshall's all-pairs shortest paths, Gaussian Elimination and LU decomposition without pivoting), sorting, list ranking, Euler tours and connected components. We then use the network-oblivious framework proposed earlier as an oblivious framework for a network of processors, and we present provably efficient network-oblivious algorithms for sorting, the Gaussian Elimination paradigm, list ranking, Euler tours and connected components. Many of these network-oblivious algorithms also perform efficiently when executed on the Decomposable-BSP.
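The matrix computations and the Gaussian Elimination paradigm above rest on recursive divide-and-conquer, which is also the standard route to cache-obliviousness. As background (this is the classic sequential cache-oblivious scheme, not the paper's HM algorithms), a minimal sketch of recursive matrix multiplication, assuming square matrices whose side is a power of two:

```python
def co_matmul(A, B, C, n, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0):
    """Add A[ai:ai+n, aj:aj+n] * B[bi:bi+n, bj:bj+n] into C[ci:ci+n, cj:cj+n].

    Classic cache-oblivious divide-and-conquer: no machine parameter
    appears in the code, yet every cache level is used well once a
    subproblem fits in it. Assumes n is a power of two.
    """
    if n == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    for i in (0, h):            # quadrant row of C
        for j in (0, h):        # quadrant column of C
            for k in (0, h):    # the two rank-h updates for that quadrant
                co_matmul(A, B, C, h, ai + i, aj + k, bi + k, bj + j, ci + i, cj + j)
```

In the multicore setting the eight recursive calls become the units the scheduler distributes across cores.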
Low depth cache-oblivious algorithms
In Proc. ACM SPAA, 2010
Abstract

Cited by 14 (1 self)
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Our sorting algorithm yields the first cache-oblivious algorithms with polylogarithmic depth and low sequential cache complexities for list ranking, Euler tour tree labeling, tree contraction, least common ancestors, graph connectivity, and minimum spanning forest. Using known mappings, our results lead to low cache complexities on multicore processors (and shared-memory multiprocessors) with a single level of private caches or a single shared cache. We generalize these
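Prefix sums illustrate the low-depth recipe: contract the input pairwise, recurse on the half-size instance, then expand. A minimal sequential sketch of this O(log n)-depth scan (our illustration, assuming the input length is a power of two; not the paper's full construction):

```python
def prefix_sums(a):
    """Inclusive prefix sums; len(a) must be a power of two."""
    if len(a) == 1:
        return a[:]
    # contract: pairwise sums; in parallel this level costs O(1) depth
    sums = [a[2 * i] + a[2 * i + 1] for i in range(len(a) // 2)]
    rec = prefix_sums(sums)              # recurse on the half-size instance
    # expand: odd positions come straight from rec, even ones add one element
    out = []
    for i in range(len(a)):
        if i == 0:
            out.append(a[0])
        elif i % 2 == 1:
            out.append(rec[i // 2])
        else:
            out.append(rec[i // 2 - 1] + a[i])
    return out
```

Because the natural sequential evaluation order of this recursion scans the arrays, its cache-oblivious cost is the same as a simple sequential scan.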
Parallel External Memory Graph Algorithms
Abstract

Cited by 13 (3 self)
In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking, which leads to solutions for many problems on trees, such as computing the Euler tour, preorder and postorder numbering of the vertices, the depth of each vertex and the sizes of subtrees rooted at each vertex of the tree. We also study the problems of computing the connected components of a graph and the minimum spanning tree of a connected graph. All our solutions provide an optimal speedup of O(p) in parallel I/O complexity compared to the single-processor external memory versions of the algorithms.
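List ranking, the core primitive here, is classically solved by pointer jumping: every node repeatedly splices out its successor, halving its distance to the tail each round. The sketch below is the textbook PRAM-style version written sequentially (background only, not the I/O-efficient PEM algorithm from the paper):

```python
def list_rank(nxt):
    """nxt[i] is the successor of node i; the tail points to itself.
    Returns rank[i] = number of hops from i to the tail."""
    n = len(nxt)
    # a node's initial rank is 1, except the tail
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    nxt = nxt[:]
    # pointer jumping: ceil(log2 n) synchronous rounds suffice;
    # each round doubles the hop length covered by every pointer
    for _ in range(max(1, n.bit_length())):
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank
```

The PEM versions replace these synchronous rounds with block-friendly independent-set contraction to keep the I/O cost low.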
Resource Oblivious Sorting on Multicores
Abstract

Cited by 13 (3 self)
We present a new deterministic sorting algorithm that interleaves the partitioning of a sample sort with merging. Sequentially, it sorts n elements in O(n log n) time cache-obliviously with an optimal number of cache misses. The parallel complexity (or critical path length) of the algorithm is O(log n log log n), which improves on previous bounds for deterministic sample sort. Given a multicore computing environment with a global shared memory and p cores, each having a cache of size M organized in blocks of size B, our algorithm can be scheduled effectively on these p cores in a cache-oblivious manner. We improve on the above cache-oblivious processor-aware parallel implementation by using the Priority Work Stealing Scheduler (PWS) that we presented recently in a companion paper [12]. The PWS scheduler is both processor- and cache-oblivious (i.e., resource oblivious), and it tolerates asynchrony among the cores. Using PWS, we obtain a resource oblivious scheduling of our sorting algorithm that matches the performance of the processor-aware version. Our analysis includes the delay incurred by false sharing. We also establish good bounds for our algorithm with the randomized work stealing scheduler.
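Sample sort, the starting point of the algorithm, partitions the input around pivots drawn from a sample and recurses on each bucket. A toy sequential sketch of that partitioning step (illustrative only; the paper's contribution, interleaving this partitioning with merging, is not shown):

```python
import bisect

def sample_sort(a, k=4):
    """Toy sample sort: split around up to k-1 pivots drawn from a
    regularly spaced sample, then recurse on each bucket."""
    if len(a) <= k:
        return sorted(a)
    step = max(1, len(a) // (2 * k))
    pivots = sorted(a[::step])[1:k]              # up to k-1 splitters
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect.bisect_left(pivots, x)].append(x)
    if max(len(b) for b in buckets) == len(a):   # degenerate sample (e.g. all equal)
        return sorted(a)
    out = []
    for b in buckets:
        out.extend(sample_sort(b, k))
    return out
```

In the parallel algorithm the buckets become independent subproblems, which is what gives sample sort its short critical path.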
A Memory Access Model for Highly-threaded Many-core Architectures
In Proc. ICPADS, 2012
Abstract

Cited by 10 (1 self)
Many-core architectures are excellent at hiding memory-access latency by low-overhead context switching among a large number of threads. The speedup of algorithms carried out on these machines depends on how well the latency is hidden. If the number of threads were infinite, then theoretically these machines should provide the performance predicted by the PRAM analysis of the programs. However, the number of allowable threads per processor is not infinite. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to give more fine-grained performance predictions than the PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the Floyd-Warshall algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the Floyd-Warshall algorithm performs better on these machines.
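Of the APSP algorithms compared, Floyd-Warshall is the simplest to state; its regular triply nested loops are what make it friendly to latency-hiding many-core machines. A minimal sketch (the textbook algorithm, not the TMM-tuned variants analyzed in the paper):

```python
def floyd_warshall(dist):
    """In-place all-pairs shortest paths on an n x n matrix of edge weights
    (dist[i][i] == 0; use float('inf') for absent edges)."""
    n = len(dist)
    for k in range(n):                     # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                d = dist[i][k] + dist[k][j]
                if d < dist[i][j]:
                    dist[i][j] = d
    return dist
```

Each of the n outer iterations performs n^2 independent relaxations, giving a large, uniform pool of threads for the hardware to switch among.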
Scheduling Irregular Parallel Computations on Hierarchical Caches
, 2010
Abstract

Cited by 10 (2 self)
Making efficient use of cache hierarchies is essential for achieving good performance on multicore and other shared-memory parallel machines. Unfortunately, designing algorithms for complicated cache hierarchies can be difficult and tedious. To address this, recent work has developed high-level models that expose locality in a manner that is oblivious to particular cache or processor organizations, placing the burden of making effective use of a parallel machine on a runtime task scheduler rather than the algorithm designer/programmer. This paper continues this line of work by (i) developing a new model for parallel cache cost, (ii) developing a task scheduler for irregular tasks on cache hierarchies, and (iii) proving that the scheduler assigns tasks to processors in a work-efficient manner (including cache costs) relative to the model. As with many previous models, our model allows algorithms to be analyzed using a single level of cache with parameters M (cache size) and B (cache-line size), and algorithms can be written cache-obliviously (with no choices made based on machine parameters). Unlike previous models, our cost Q̂_α(n; M, B), for problem size n, captures costs due to workspace imbalance among tasks, and we prove a lower bound that shows that some sort of penalty is needed to achieve work efficiency. Nevertheless, for many algorithms, Q̂_α() is asymptotically equal to
Efficient resource oblivious algorithms for multicores with false sharing
In Proc. IEEE IPDPS, 2012
Abstract

Cited by 6 (1 self)
We consider algorithms for a multicore environment in which each core has its own private cache and false sharing can occur. False sharing happens when two or more processors access the same block (i.e., cache line) in parallel, and at least one processor writes into a location in the block. False sharing causes different processors to have inconsistent views of the data in the block, and many of the methods currently used to resolve these inconsistencies can cause large delays. We analyze the cost of false sharing both for variables stored on the execution stacks of the parallel tasks and for output variables. Our main technical contribution is to establish a low cost for this overhead for the class of multithreaded block-resilient HBP (Hierarchical Balanced Parallel) computations. Using this and other techniques, we develop block-resilient HBP algorithms with low false sharing costs for several fundamental problems including scans, matrix multiplication, FFT, sorting, and hybrid block-resilient HBP algorithms for list ranking and graph connected components. Most of these algorithms are derived from known multicore algorithms, but are further refined to achieve a low false sharing overhead. Our algorithms make no mention of machine parameters, and our analysis of the false sharing overhead is mostly in terms of the number of tasks generated in parallel during the computation, and thus applies to a variety of schedulers. Keywords: false sharing; cache efficiency; multicores.
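The block-resilient idea can be previewed with a much simpler trick: if each task's output region is padded out to a block boundary, no two tasks ever write into the same block, so false sharing on the output cannot occur. A toy illustration of that layout computation (our sketch, not the paper's HBP machinery; B is the block size in words):

```python
def padded_layout(num_tasks, words_per_task, B):
    """Start offsets (in words) for each task's output region, padded up
    to a multiple of the block size B so that distinct tasks never write
    into the same block (i.e., cache line)."""
    stride = -(-words_per_task // B) * B      # round up to a block boundary
    return [i * stride for i in range(num_tasks)]
```

The price is wasted space, up to B - 1 words per task, which is one reason the paper's analysis instead bounds the delay caused by the sharing rather than padding everything away.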
Towards optimizing energy costs of algorithms for shared memory architectures
In SPAA, 2010
Abstract

Cited by 6 (1 self)
Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone: it must be modeled explicitly. This paper analyzes the energy consumption of parallel algorithms executed on shared memory multicore processors. Specifically, we develop a methodology to evaluate how the energy consumption of a given parallel algorithm changes as the number of cores and their frequency are varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step versus the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected for a wide range of these parameters.
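The trade-off this methodology captures can be previewed with a toy model: more cores shorten the runtime over which static power accrues, but every extra core also leaks, and dynamic energy grows with frequency. All constants and functional forms below are our illustrative assumptions, not the paper's model:

```python
def runtime(p, f, work, serial_frac=0.05):
    """Amdahl-style runtime: a serial fraction plus a perfectly parallel part."""
    return serial_frac * work / f + (1 - serial_frac) * work / (p * f)

def energy(p, f, work, c_dyn=1.0, c_static=0.5, serial_frac=0.05):
    """Toy energy: dynamic energy grows like f^2 per operation; static power
    c_static per core accrues over the whole runtime."""
    return c_dyn * work * f ** 2 + c_static * p * runtime(p, f, work, serial_frac)

def optimal_cores(f, work, deadline, max_p=64):
    """In this toy model total energy increases with p (the serial part keeps
    all p cores leaking), so the smallest core count that meets the
    performance requirement minimizes energy."""
    for p in range(1, max_p + 1):
        if runtime(p, f, work) <= deadline:
            return p
    return None        # deadline infeasible at this frequency
```

Sweeping f as well then reproduces the kind of cores-versus-frequency analysis the paper carries out with its measured parameters.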
I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors
Abstract

Cited by 3 (1 self)
The parallel external memory (PEM) model has been used as a basis for the design and analysis of a wide range of algorithms for private-cache multicore architectures. As a tool for developing geometric algorithms in this model, a parallel version of the I/O-efficient distribution sweeping framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of O(sort
On (Dynamic) Range Minimum Queries in External Memory
Abstract

Cited by 2 (1 self)
We study the one-dimensional range minimum query (RMQ) problem in the external memory model. We provide the first space-optimal solution to the batched static version of the problem. On an instance with N elements and Q queries, our solution takes Θ(sort(N + N+Q
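For context, static RMQ has a classic internal-memory solution: a sparse table of precomputed power-of-two window minima answers any query with two overlapping lookups. The sketch below is that textbook structure (background only; the paper's contribution is a space-optimal batched solution in the external memory model):

```python
def build_sparse_table(a):
    """st[j][i] = min of a[i : i + 2**j]; O(N log N) space and build time."""
    n = len(a)
    st = [a[:]]
    j = 1
    while (1 << j) <= n:
        prev = st[j - 1]
        half = 1 << (j - 1)
        st.append([min(prev[i], prev[i + half]) for i in range(n - (1 << j) + 1)])
        j += 1
    return st

def rmq(st, l, r):
    """Minimum of a[l..r] (inclusive) via two overlapping power-of-two windows."""
    j = (r - l + 1).bit_length() - 1
    return min(st[j][l], st[j][r - (1 << j) + 1])
```

The O(N log N) space of this table is precisely what a space-optimal external-memory solution must avoid.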