Results 1 -
8 of
8
Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism
- Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1998
"... The paper envisions an extension to a standard instruction set which efficiently implements PRAM algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms throu ..."
Abstract
-
Cited by 24 (11 self)
- Add to MetaCart
The paper envisions an extension to a standard instruction set which efficiently implements PRAM algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation is introduced; new elements are added where needed. The more detailed presentation is by way of a bridging model. Among other things, a bridging model provides a design space for algorithm designers and programmers, as well as a design space for computer architects. It is convenient to describe our wider vision regarding "parallel-computing-on-a-chip" as a two-stage development and therefore two bridging models are presented: Spawn-based multi-threading (Spawn-MT) and Elastic multi-threading (EMT). The case for Spawn-MT (or, alternatively, EMT) as a bridging model relies on the following evidence. (1) Spawn-MT comprises an "instruction set level", wh...
Accessing Multiple Sequences Through Set Associative Caches
- In Proc
, 1999
"... The cache hierarchy prevalent in todays high performance processors has to be taken into account in order to design algorithms which perform well in practice. We start from the empirical observation that external memory algorithms often turn out to be good algorithms for cached memory. This is n ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
The cache hierarchy prevalent in todays high performance processors has to be taken into account in order to design algorithms which perform well in practice. We start from the empirical observation that external memory algorithms often turn out to be good algorithms for cached memory. This is not self evident since caches have a fixed and quite restrictive algorithm choosing the content of the cache. We investigate the impact of this restriction for the frequently occurring case of access to multiple sequences. We show that any access pattern to k = \Theta(M=B ) sequential data streams can be efficiently supported on an a-way set associative cache with capacity M and line size B. The bounds are tight up to lower order terms.
Ultimate Parallel List Ranking?
- Journal of Parallel and Distributed Computing
, 2000
"... Two improved list-ranking algorithms are presented. The "peeling-off" algorithm leads to an optimal PRAM algorithm, but was designed with application on a real parallel machine in mind. It is simpler than earlier algorithms, and in a range of problem sizes, where previously several algorithms wher ..."
Abstract
-
Cited by 17 (4 self)
- Add to MetaCart
Two improved list-ranking algorithms are presented. The "peeling-off" algorithm leads to an optimal PRAM algorithm, but was designed with application on a real parallel machine in mind. It is simpler than earlier algorithms, and in a range of problem sizes, where previously several algorithms where required for the best performance, now this single algorithm suffices. If the problem size is much larger than the number of available processors, then the "sparse-ruling-sets" algorithm is even better. In previous versions this algorithm had very restricted practical application because of the large number of communication rounds it was performing. This main weakness of this algorithm is overcome by adding two new ideas, each of which reduces the number of communication rounds by a factor of two. 1 Introduction A list is a basic data structure: it consists of nodes which are linked together, so that every node has precisely one predecessor and one successor, except for the initial n...
Solving Fundamental Problems on Sparse-Meshes
- IEEE Transactions on Parallel & Distributed Systems
, 1998
"... A sparse-mesh, which has PUs on the diagonal of a two-dimensional grid only, is a cost effective distributed memory machine. Variants of this machine have been considered before, but none of them is so simple and pure as a sparse-mesh. Various fundamental problems (routing, sorting, list ranking) ar ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
A sparse-mesh, which has PUs on the diagonal of a two-dimensional grid only, is a cost effective distributed memory machine. Variants of this machine have been considered before, but none of them is so simple and pure as a sparse-mesh. Various fundamental problems (routing, sorting, list ranking) are analyzed, proving that sparse-meshes have a great potential. The results are extended for higher dimensional sparse-meshes. 1 Introduction On ordinary two-dimensional meshes we must accept that, due to their small bisection width, for most problems the maximum achievable speed-up with n 2 processing units (PUs) is only \Theta(n). On the other hand, networks such as hypercubes impose increasing conditions on the interconnection modules with increasing network sizes. Cube-connected-cycles do not have this problem, but are harder to program due to their irregularity. Anyway, because of a basic theorem from VLSI lay-out [18], all planar architectures have an area that is quadratic in their...
Scanning Multiple Sequences Via Cache Memory
- Algorithmica
, 2003
"... We consider the simple problem of scanning multiple sequences. There are k sequences of total length N which are to be scanned concurrently. One pointer into each sequence is maintained and an adversary specifies which pointer is to be advanced. The concept of scanning multiple sequence is ubiquitou ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
We consider the simple problem of scanning multiple sequences. There are k sequences of total length N which are to be scanned concurrently. One pointer into each sequence is maintained and an adversary specifies which pointer is to be advanced. The concept of scanning multiple sequence is ubiquitous in algorithms designed for hierarchical memory.
Design and Implementation of a Practical I/O-efficient Shortest Paths Algorithm
"... We report on initial experimental results for a practical I/O-efficient Single-Source Shortest-Paths (SSSP) algorithm on general undirected sparse graphs where the ratio between the largest and the smallest edge weight is reasonably bounded (for example integer weights in {1,...,2 32}) and the reali ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
We report on initial experimental results for a practical I/O-efficient Single-Source Shortest-Paths (SSSP) algorithm on general undirected sparse graphs where the ratio between the largest and the smallest edge weight is reasonably bounded (for example integer weights in {1,...,2 32}) and the realistic assumption holds that main memory is big enough to keep one bit per vertex. While our implementation only guarantees average-case efficiency, i.e., assuming randomly chosen edge-weights, it turns out that its performance on real-world instances with non-random edge weights is actually even better than on the respective inputs with random weights. Furthermore, compared to the currently best implementation for external-memory BFS [6], which in a sense constitutes a lower bound for SSSP, the running time of our approach always stayed within a factor of five, for the most difficult graph classes the difference was even less than a factor of two. We are not aware of any previous I/O-efficient implementation for the classic general SSSP in a (semi) external setting: in two recent projects [10, 23], Kumar/Schwabe-like SSSP approaches on graphs of at most 6 million vertices have been tested, forcing the authors to artificially restrict the main memory size, M, to rather unrealistic 4 to 16 MBytes in order not to leave the semi-external setting or produce huge running times for larger graphs: for random graphs of 2 20 vertices, the best previous approach needed over six hours. In contrast, for a similar ratio of input size vs. M, but on a 128 times larger and even sparser random graph, our approach was less than seven times slower, a relative gain of nearly 20. On a real-world 24 million node street graph, our implementation was over 40 times faster. Even larger gains of over 500 can be estimated for ran-
Engineering a Topological Sorting Algorithm for Massive Graphs
"... We present an I/O-efficient algorithm for topologically sorting directed acyclic graphs (DAGs). No provably I/O-efficient algorithm for this problem is known. Similarly, the performance of our algorithm, which we call IterTS, may be poor in the worst case. However, our experiments show that IterTS a ..."
Abstract
- Add to MetaCart
We present an I/O-efficient algorithm for topologically sorting directed acyclic graphs (DAGs). No provably I/O-efficient algorithm for this problem is known. Similarly, the performance of our algorithm, which we call IterTS, may be poor in the worst case. However, our experiments show that IterTS achieves good performance in practise. The strategy of IterTS can be summarized as follows. We call an edge satisfied if its tail has a smaller number than its head. A numbering satisfying at least half the edges in the DAG is easy to find: a random numbering is expected to have this property. IterTS starts with such a numbering and then iteratively corrects the numbering to satisfy more and more edges until all edges are satisfied. To evaluate IterTS, we compared its running time to those of three competitors: PeelTS, an I/O-efficient implementation of the standard strategy of iteratively removing sources and sinks; ReachTS, an I/O-efficient implementation of a recent parallel divide-and-conquer algorithm based on reachability queries; and SeTS, standard DFS-based topological sorting built on top of a semi-external DFS algorithm. In our evaluation on various types of input graphs, IterTS consistently outperformed PeelTS and ReachTS, by at least an order of magnitude in most cases. SeTS outperformed IterTS on most graphs whose vertex sets fit in memory. However, IterTS often came close to the running time of SeTS on these inputs and, more importantly, SeTS was not able to process graphs whose vertex sets were beyond the size of main memory, while IterTS was able to process such inputs efficiently.

