Oblivious algorithms for multicores and network of processors, 2009
Cited by 29 (9 self)
Abstract:
We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multi-level caching model for multicores, and we propose a multicore-oblivious approach to algorithms and schedulers for HM. We instantiate this approach with provably efficient multicore-oblivious algorithms for matrix and prefix sum computations, FFT, the Gaussian Elimination paradigm (which represents an important class of computations including Floyd-Warshall's all-pairs shortest paths, Gaussian Elimination and LU decomposition without pivoting), sorting, list ranking, Euler tours and connected components. We then use the network-oblivious framework proposed earlier as an oblivious framework for a network of processors, and we present provably efficient network-oblivious algorithms for sorting, the Gaussian Elimination paradigm, list ranking, Euler tours and connected components. Many of these network-oblivious algorithms also perform efficiently when executed on the Decomposable-BSP.
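To fix intuition for the recursive, parameter-free style these multicore-oblivious algorithms share, here is a minimal cache-oblivious matrix multiplication sketch. It is illustrative only, not the paper's HM-scheduled algorithm; the function names are mine, and it assumes n is a power of two.

```python
# Cache-oblivious matrix multiplication by divide-and-conquer: the recursion
# mentions no cache parameters, yet once a subproblem fits in any level of
# the hierarchy it incurs no further misses at that level. Illustrative
# sketch only; assumes n is a power of two.

def matmul_add(A, B, C, ai, aj, bi, bj, ci, cj, n):
    """Add the product of the n-by-n blocks of A and B (at the given
    top-left corners) into the n-by-n block of C."""
    if n == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    for i in (0, h):            # eight recursive half-size products
        for j in (0, h):
            for k in (0, h):
                matmul_add(A, B, C, ai + i, aj + k, bi + k, bj + j,
                           ci + i, cj + j, h)

def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    matmul_add(A, B, C, 0, 0, 0, 0, 0, 0, n)
    return C
```

The eight half-size subproblems are independent in pairs, which is what a scheduler for HM can exploit without the algorithm ever naming a cache size.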
Low depth cache-oblivious algorithms
In Proc. ACM SPAA, 2010
Cited by 14 (1 self)
Abstract:
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on a variety of parallel cache architectures. The approach is to design nested parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Our sorting algorithm yields the first cache-oblivious algorithms with polylogarithmic depth and low sequential cache complexities for list ranking, Euler tour tree labeling, tree contraction, least common ancestors, graph connectivity, and minimum spanning forest. Using known mappings, our results lead to low cache complexities on multi-core processors (and shared-memory multiprocessors) with a single level of private caches or a single shared cache. We generalize these
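A small example of the low-depth pattern the abstract describes is prefix sums by recursive contraction: the two loops can run in parallel, giving O(log n) depth, while the natural sequential order is a linear pass with O(n/B) cache complexity. This is an illustrative sketch, not one of the paper's algorithms, and it assumes the input length is a power of two.

```python
# Exclusive prefix sums with low depth: contract adjacent pairs, recurse on
# the half-length array, then expand. The loops are independent across i, so
# the span is O(log n); run sequentially, the passes are linear scans.
# Illustrative sketch; assumes len(a) is a power of two.

def prefix_sum(a):
    """Return the exclusive prefix sums of a."""
    n = len(a)
    if n == 1:
        return [0]
    pairs = [a[2 * i] + a[2 * i + 1] for i in range(n // 2)]  # contract
    ps = prefix_sum(pairs)                                    # recurse
    out = [0] * n
    for i in range(n // 2):                                   # expand
        out[2 * i] = ps[i]
        out[2 * i + 1] = ps[i] + a[2 * i]
    return out
```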
Parallel External Memory Graph Algorithms
Cited by 13 (3 self)
Abstract:
In this paper, we study parallel I/O-efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking, which leads to solutions for many problems on trees, such as computing the Euler tour, preorder and postorder numbering of the vertices, the depth of each vertex, and the sizes of the subtrees rooted at each vertex of the tree. We also study the problems of computing the connected components of a graph and the minimum spanning tree of a connected graph. All our solutions provide an optimal speedup of O(p) in parallel I/O complexity compared to the single-processor external memory versions of the algorithms.
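For readers unfamiliar with list ranking, a minimal pointer-jumping version conveys the access pattern the PEM algorithms must make I/O-efficient: O(log n) rounds, each touching every node once. This sketch is illustrative; the paper's algorithm is more involved than plain pointer jumping.

```python
# List ranking by pointer jumping: each node repeatedly doubles its pointer
# and accumulates the distance skipped. Each of the O(log n) rounds is a
# full sweep over the nodes, which is the pattern made I/O-efficient in PEM.
# Illustrative sketch only.

def list_rank(succ):
    """succ[i] is the successor of node i; the tail points to itself.
    Returns rank[i] = distance from node i to the tail."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    for _ in range(n.bit_length()):   # ceil(log2 n) rounds suffice
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank
```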
Resource Oblivious Sorting on Multicores
Cited by 13 (3 self)
Abstract:
We present a new deterministic sorting algorithm that interleaves the partitioning of a sample sort with merging. Sequentially, it sorts n elements in O(n log n) time cache-obliviously with an optimal number of cache misses. The parallel complexity (or critical path length) of the algorithm is O(log n log log n), which improves on previous bounds for deterministic sample sort. Given a multicore computing environment with a global shared memory and p cores, each having a cache of size M organized in blocks of size B, our algorithm can be scheduled effectively on these p cores in a cache-oblivious manner. We improve on the above cache-oblivious processor-aware parallel implementation by using the Priority Work Stealing Scheduler (PWS) that we presented recently in a companion paper [12]. The PWS scheduler is both processor- and cache-oblivious (i.e., resource oblivious), and it tolerates asynchrony among the cores. Using PWS, we obtain a resource oblivious scheduling of our sorting algorithm that matches the performance of the processor-aware version. Our analysis includes the delay incurred by false sharing. We also establish good bounds for our algorithm with the randomized work stealing scheduler.
A Memory Access Model for Highly-threaded Many-core Architectures
In Proc. ICPADS, 2012
Cited by 10 (1 self)
Abstract:
Many-core architectures are excellent at hiding memory-access latency by low-overhead context switching among a large number of threads. The speedup of algorithms carried out on these machines depends on how well the latency is hidden. If the number of threads were infinite, then theoretically these machines should provide the performance predicted by the PRAM analysis of the programs. However, the number of allowable threads per processor is not infinite. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to give more fine-grained performance prediction than the PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the Floyd-Warshall algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the Floyd-Warshall algorithm performs better on these machines.
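For reference, the Floyd-Warshall recurrence d[i][j] = min(d[i][j], d[i][k] + d[k][j]) that the paper analyzes can be sketched as follows; this is the standard sequential form, not the TMM-tuned version.

```python
# Floyd-Warshall all-pairs shortest paths, Theta(n^3) work. The inner loop
# streams row k and row i of the distance matrix, a regular access pattern;
# the paper's TMM analysis credits patterns like this for the advantage
# over Johnson's algorithm at high memory latency.

INF = float('inf')

def floyd_warshall(w):
    """w is an n-by-n weight matrix: 0 on the diagonal, INF where there is
    no edge. Returns the matrix of shortest-path distances."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            dik = d[i][k]
            for j in range(n):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d
```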
Scheduling Irregular Parallel Computations on Hierarchical Caches, 2010
Cited by 10 (2 self)
Abstract:
Making efficient use of cache hierarchies is essential for achieving good performance on multicore and other shared-memory parallel machines. Unfortunately, designing algorithms for complicated cache hierarchies can be difficult and tedious. To address this, recent work has developed high-level models that expose locality in a manner that is oblivious to particular cache or processor organizations, placing the burden of making effective use of a parallel machine on a runtime task scheduler rather than the algorithm designer/programmer. This paper continues this line of work by (i) developing a new model for parallel cache cost, (ii) developing a task scheduler for irregular tasks on cache hierarchies, and (iii) proving that the scheduler assigns tasks to processors in a work-efficient manner (including cache costs) relative to the model. As with many previous models, our model allows algorithms to be analyzed using a single level of cache with parameters M (cache size) and B (cache-line size), and algorithms can be written cache-obliviously (with no choices made based on machine parameters). Unlike previous models, our cost Q̂_α(n; M, B), for problem size n, captures costs due to work-space imbalance among tasks, and we prove a lower bound that shows that some sort of penalty is needed to achieve work efficiency. Nevertheless, for many algorithms, Q̂_α() is asymptotically equal to
Efficient resource oblivious algorithms for multicores with false sharing
In Proc. IEEE IPDPS, 2012
Cited by 6 (1 self)
Abstract:
We consider algorithms for a multicore environment in which each core has its own private cache and false sharing can occur. False sharing happens when two or more processors access the same block (i.e., cache-line) in parallel, and at least one processor writes into a location in the block. False sharing causes different processors to have inconsistent views of the data in the block, and many of the methods currently used to resolve these inconsistencies can cause large delays. We analyze the cost of false sharing both for variables stored on the execution stacks of the parallel tasks and for output variables. Our main technical contribution is to establish a low cost for this overhead for the class of multithreaded block-resilient HBP (Hierarchical Balanced Parallel) computations. Using this and other techniques, we develop block-resilient HBP algorithms with low false sharing costs for several fundamental problems including scans, matrix multiplication, FFT, sorting, and hybrid block-resilient HBP algorithms for list ranking and graph connected components. Most of these algorithms are derived from known multicore algorithms, but are further refined to achieve a low false sharing overhead. Our algorithms make no mention of machine parameters, and our analysis of the false sharing overhead is mostly in terms of the number of tasks generated in parallel during the computation, and thus applies to a variety of schedulers. Keywords: false sharing; cache efficiency; multicores
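The block-sharing situation defined in the abstract can be made concrete with a toy accounting model: map each written address to its block of size B and flag blocks written by more than one processor in the same step. The function name and the counting rule are illustrative assumptions, not the paper's HBP cost analysis.

```python
# Toy model of false sharing: a block (cache-line) of size B is falsely
# shared in a parallel step if two or more distinct processors write into
# it. Block counts, not individual addresses, drive the delay. This is an
# illustrative accounting only, not the paper's analysis.

def false_shared_blocks(writes, B):
    """writes: list of (processor, address) pairs for one parallel step.
    Returns the set of block indices written by 2+ distinct processors."""
    writers = {}
    for proc, addr in writes:
        writers.setdefault(addr // B, set()).add(proc)
    return {blk for blk, procs in writers.items() if len(procs) > 1}
```

With B = 8, processors 0 and 1 writing addresses 4 and 6 collide in block 0 even though they never touch the same location; that is precisely the false-sharing case, as opposed to true sharing of one address.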
Towards optimizing energy costs of algorithms for shared memory architectures
In SPAA, 2010
Cited by 6 (1 self)
Abstract:
Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone: it must be modeled explicitly. This paper analyzes the energy consumption of parallel algorithms executed on shared-memory multicore processors. Specifically, we develop a methodology to evaluate how the energy consumption of a given parallel algorithm changes as the number of cores and their frequency are varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step versus the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected for a wide range of these parameters.
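A hypothetical instance shows why such an optimum exists: under a deadline T, more cores let each run slower, shrinking cubic dynamic power, but each added core pays static power. Every constant and the formula below are illustrative assumptions of mine, not the paper's model.

```python
# Hypothetical energy model: W operations must finish within deadline T on
# p cores, so each core runs at frequency f = W / (p * T). Per-core power
# is c_dyn * f**3 (dynamic) + c_static. Dynamic energy falls as 1/p^2 while
# static energy grows linearly in p, so an interior optimum exists.
# Constants are illustrative, not the paper's.

def energy(p, W, T, c_dyn=1.0, c_static=0.2):
    f = W / (p * T)
    return p * (c_dyn * f ** 3 + c_static) * T

def best_core_count(W, T, max_p=64):
    """Exhaustively pick the core count minimizing energy under deadline T."""
    return min(range(1, max_p + 1), key=lambda p: energy(p, W, T))
```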
I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors
Cited by 3 (1 self)
Abstract:
The parallel external memory (PEM) model has been used as a basis for the design and analysis of a wide range of algorithms for private-cache multi-core architectures. As a tool for developing geometric algorithms in this model, a parallel version of the I/O-efficient distribution sweeping framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of O(sort
On (Dynamic) Range Minimum Queries in External Memory
Cited by 2 (1 self)
Abstract:
We study the one-dimensional range minimum query (RMQ) problem in the external memory model. We provide the first space-optimal solution to the batched static version of the problem. On an instance with N elements and Q queries, our solution takes Θ(sort(N + N+Q
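For context, the classic internal-memory baseline for static RMQ is a sparse table: O(N log N) words of precomputation and O(1) per query. This sketch is only that baseline, to fix what "space-optimal" is being improved upon; it is not the paper's external-memory, batched solution.

```python
# Sparse-table RMQ: table[j][i] holds min of a[i .. i + 2**j - 1]. Queries
# cover [l, r] with two overlapping power-of-two windows. O(N log N) space,
# O(1) query; the paper's external-memory batched solution avoids this
# super-linear space. Internal-memory baseline only.

def build_sparse_table(a):
    n = len(a)
    table = [list(a)]
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        table.append([min(prev[i], prev[i + (1 << (j - 1))])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    return table

def rmq(table, l, r):
    """Minimum of a[l..r], inclusive, in O(1) time."""
    j = (r - l + 1).bit_length() - 1
    return min(table[j][l], table[j][r - (1 << j) + 1])
```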