Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism
Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1998
Cited by 30 (12 self)
The paper envisions an extension to a standard instruction set which efficiently implements PRAM algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation, is introduced; new elements are added where needed. The more detailed presentation is by way of a bridging model. Among other things, a bridging model provides a design space for algorithm designers and programmers, as well as a design space for computer architects. It is convenient to describe our wider vision regarding "parallel-computing-on-a-chip" as a two-stage development and therefore two bridging models are presented: Spawn-based multi-threading (Spawn-MT) and Elastic multi-threading (EMT). The case for Spawn-MT (or, alternatively, EMT) as a bridging model relies on the following evidence. (1) Spawn-MT comprises an "instruction set level", wh...
Ultimate Parallel List Ranking?
Journal of Parallel and Distributed Computing, 2000
Cited by 20 (4 self)
Two improved list-ranking algorithms are presented. The "peeling-off" algorithm leads to an optimal PRAM algorithm, but was designed with application on a real parallel machine in mind. It is simpler than earlier algorithms, and in a range of problem sizes, where previously several algorithms were required for the best performance, now this single algorithm suffices. If the problem size is much larger than the number of available processors, then the "sparse-ruling-sets" algorithm is even better. In previous versions this algorithm had very restricted practical application because of the large number of communication rounds it was performing. This main weakness of the algorithm is overcome by adding two new ideas, each of which reduces the number of communication rounds by a factor of two. 1 Introduction A list is a basic data structure: it consists of nodes which are linked together, so that every node has precisely one predecessor and one successor, except for the initial n...
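For readers unfamiliar with the list-ranking problem itself, the following is a minimal sketch of the classic pointer-jumping technique, simulated sequentially in Python. It is not the "peeling-off" or "sparse-ruling-sets" algorithm from the abstract; the function name `list_rank` and the convention that the last node points to itself are assumptions made only for this illustration.

```python
def list_rank(succ):
    """Return, for every node, its distance to the end of the list.

    succ[i] is the index of node i's successor; the last node is assumed
    (for this sketch) to point to itself. Each loop iteration simulates one
    synchronous parallel round of pointer jumping.
    """
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    # After k rounds every pointer spans min(2**k, distance to tail) links,
    # so ceil(log2(n)) rounds are enough.
    rounds = max(1, (n - 1).bit_length())
    for _ in range(rounds):
        # Both updates read only the old arrays, as a PRAM step would.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank


# The list 3 -> 1 -> 4 -> 0 -> 2, stored in array order.
print(list_rank([2, 4, 2, 1, 0]))  # [1, 3, 0, 4, 2]
```

Pointer jumping performs O(n log n) total work, which is why work-optimal algorithms such as those described in the abstract are of interest.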
Accessing Multiple Sequences Through Set Associative Caches
In Proc, 1999
Cited by 19 (4 self)
The cache hierarchy prevalent in today's high performance processors has to be taken into account in order to design algorithms which perform well in practice. We start from the empirical observation that external memory algorithms often turn out to be good algorithms for cached memory. This is not self-evident since caches have a fixed and quite restrictive algorithm choosing the content of the cache. We investigate the impact of this restriction for the frequently occurring case of access to multiple sequences. We show that any access pattern to k = \Theta(M/B) sequential data streams can be efficiently supported on an a-way set associative cache with capacity M and line size B. The bounds are tight up to lower order terms.
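To make the restriction concrete, here is a small sketch of an a-way set-associative cache with LRU replacement within each set, used to count misses while several sequences are scanned in round-robin order. The class and function names, the LRU policy, and the round-robin access pattern are assumptions chosen for illustration; the abstract does not specify them, and the sketch does not reproduce the paper's bounds.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy model: num_sets sets, `ways` lines per set, LRU within a set."""

    def __init__(self, num_sets, ways, line_size):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, address):
        line = address // self.line_size
        lines = self.sets[line % self.num_sets]  # the set is fixed by the address
        if line in lines:
            lines.move_to_end(line)  # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used line
        lines[line] = None
        return False


def scan_round_robin(cache, starts, length):
    """Advance one pointer per sequence in turn; return the miss count."""
    for offset in range(length):
        for start in starts:
            cache.access(start + offset)
    return cache.misses


# Two sequences whose lines map to the same sets (line size 4, 4 sets).
two_way = SetAssociativeCache(num_sets=4, ways=2, line_size=4)
direct = SetAssociativeCache(num_sets=4, ways=1, line_size=4)
print(scan_round_robin(two_way, [0, 1024], 64))  # 32: one miss per line
print(scan_round_robin(direct, [0, 1024], 64))   # 128: every access misses
```

With two ways, both streams keep their current line resident. With one way, they evict each other on every access, which is the kind of conflict the fixed placement rule can cause.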
Scanning Multiple Sequences Via Cache Memory
Algorithmica, 2003
Cited by 7 (0 self)
We consider the simple problem of scanning multiple sequences. There are k sequences of total length N which are to be scanned concurrently. One pointer into each sequence is maintained and an adversary specifies which pointer is to be advanced. The concept of scanning multiple sequences is ubiquitous in algorithms designed for hierarchical memory.
Solving Fundamental Problems on Sparse-Meshes
IEEE Transactions on Parallel & Distributed Systems, 1998
Cited by 5 (0 self)
A sparse-mesh, which has PUs on the diagonal of a two-dimensional grid only, is a cost-effective distributed memory machine. Variants of this machine have been considered before, but none of them is as simple and pure as a sparse-mesh. Various fundamental problems (routing, sorting, list ranking) are analyzed, proving that sparse-meshes have great potential. The results are extended for higher dimensional sparse-meshes. 1 Introduction On ordinary two-dimensional meshes we must accept that, due to their small bisection width, for most problems the maximum achievable speedup with n^2 processing units (PUs) is only \Theta(n). On the other hand, networks such as hypercubes impose increasing conditions on the interconnection modules with increasing network sizes. Cube-connected-cycles do not have this problem, but are harder to program due to their irregularity. Anyway, because of a basic theorem from VLSI layout [18], all planar architectures have an area that is quadratic in their...
Design and Implementation of a Practical I/O-efficient Shortest Paths Algorithm
Cited by 2 (0 self)
We report on initial experimental results for a practical I/O-efficient Single-Source Shortest-Paths (SSSP) algorithm on general undirected sparse graphs where the ratio between the largest and the smallest edge weight is reasonably bounded (for example integer weights in {1,...,2^32}) and the realistic assumption holds that main memory is big enough to keep one bit per vertex. While our implementation only guarantees average-case efficiency, i.e., assuming randomly chosen edge weights, it turns out that its performance on real-world instances with non-random edge weights is actually even better than on the respective inputs with random weights. Furthermore, compared to the currently best implementation for external-memory BFS [6], which in a sense constitutes a lower bound for SSSP, the running time of our approach always stayed within a factor of five; for the most difficult graph classes the difference was even less than a factor of two. We are not aware of any previous I/O-efficient implementation for the classic general SSSP in a (semi-)external setting: in two recent projects [10, 23], Kumar/Schwabe-like SSSP approaches on graphs of at most 6 million vertices have been tested, forcing the authors to artificially restrict the main memory size, M, to a rather unrealistic 4 to 16 MBytes in order not to leave the semi-external setting or produce huge running times for larger graphs: for random graphs of 2^20 vertices, the best previous approach needed over six hours. In contrast, for a similar ratio of input size vs. M, but on a 128 times larger and even sparser random graph, our approach was less than seven times slower, a relative gain of nearly 20. On a real-world 24 million node street graph, our implementation was over 40 times faster. Even larger gains of over 500 can be estimated for ran
Improved external memory BFS implementations
Breadth first search (BFS) traversal on massive graphs in external memory was considered non-viable until recently, because of the large number of I/Os it incurs. Ajwani et al. [3] showed that the randomized variant of the o(n) I/O algorithm of Mehlhorn and Meyer [24] (MM BFS) can compute the BFS level decomposition for large graphs (around a billion edges) in a few hours for small diameter graphs and a few days for large diameter graphs. We improve upon their implementation of this algorithm by reducing the overhead associated with each BFS level, thereby improving the results for large diameter graphs, which are more difficult for BFS traversal in external memory. Also, we present the implementation of the deterministic variant of MM BFS and show that in most cases, it outperforms the randomized variant. The running time for BFS traversal is further improved with a heuristic that preserves the worst case guarantees of MM BFS. Together, they reduce the time for BFS on large diameter graphs from the days reported in [3] to hours. In particular, on line graphs with random layout on disks, our implementation of the deterministic variant of MM BFS with the proposed heuristic is more than 75 times faster than the previous best result for the randomized variant of MM BFS in [3].
Engineering a Topological Sorting Algorithm for Massive Graphs
We present an I/O-efficient algorithm for topologically sorting directed acyclic graphs (DAGs). No provably I/O-efficient algorithm for this problem is known. Similarly, the performance of our algorithm, which we call IterTS, may be poor in the worst case. However, our experiments show that IterTS achieves good performance in practice. The strategy of IterTS can be summarized as follows. We call an edge satisfied if its tail has a smaller number than its head. A numbering satisfying at least half the edges in the DAG is easy to find: a random numbering is expected to have this property. IterTS starts with such a numbering and then iteratively corrects the numbering to satisfy more and more edges until all edges are satisfied. To evaluate IterTS, we compared its running time to those of three competitors: PeelTS, an I/O-efficient implementation of the standard strategy of iteratively removing sources and sinks; ReachTS, an I/O-efficient implementation of a recent parallel divide-and-conquer algorithm based on reachability queries; and SeTS, standard DFS-based topological sorting built on top of a semi-external DFS algorithm. In our evaluation on various types of input graphs, IterTS consistently outperformed PeelTS and ReachTS, by at least an order of magnitude in most cases. SeTS outperformed IterTS on most graphs whose vertex sets fit in memory. However, IterTS often came close to the running time of SeTS on these inputs and, more importantly, SeTS was not able to process graphs whose vertex sets were beyond the size of main memory, while IterTS was able to process such inputs efficiently.
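The notion of a satisfied edge and the "at least half" starting point can be illustrated with a short in-memory sketch. This is not IterTS and is not I/O-efficient; the function names and the fallback of reversing the numbering are assumptions made for this example. The fallback works because every edge of a DAG is satisfied by exactly one of a numbering and its reversal, so one of the two satisfies at least half the edges.

```python
import random


def count_satisfied(edges, number):
    """Count edges (tail, head) whose tail has a smaller number than its head."""
    return sum(1 for tail, head in edges if number[tail] < number[head])


def numbering_satisfying_half(num_vertices, edges, seed=0):
    """Return a numbering of vertices 0..num_vertices-1 that satisfies
    at least half of the given DAG edges."""
    rng = random.Random(seed)
    order = list(range(num_vertices))
    rng.shuffle(order)
    number = {v: i for i, v in enumerate(order)}
    if 2 * count_satisfied(edges, number) >= len(edges):
        return number
    # Reversing the numbering flips the status of every edge.
    return {v: num_vertices - 1 - i for v, i in number.items()}


edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
number = numbering_satisfying_half(4, edges)
print(count_satisfied(edges, number), "of", len(edges), "edges satisfied")
```

A numbering that satisfies all edges is a topological order; the abstract describes IterTS as iteratively correcting such a starting numbering until that is the case.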