Results 1–10 of 15
STXXL: Standard template library for XXL data sets
In: Proc. of ESA 2005, volume 3669 of LNCS, 2005
Cited by 38 (5 self)
... for processing huge data sets that can fit only on hard disks. It supports parallel disks, overlapping between disk I/O and computation and it is the first I/O-efficient algorithm library that supports the pipelining technique that can save more than half of the I/Os. STXXL has been applied both in academic and industrial environments for a range of problems including text processing, graph algorithms, computational geometry, Gaussian elimination, visualization, and analysis of microscopic images, differential cryptographic analysis, etc. The performance of STXXL and its applications is evaluated on synthetic and real-world inputs. We present the design of the library, how its performance features are supported, and demonstrate how the library integrates with STL. KEY WORDS: very large data sets; software library; C++ standard template library; algorithm engineering
Asynchronous Parallel Disk Sorting
In 15th ACM Symposium on Parallelism in Algorithms and Architectures, 2003
Cited by 22 (8 self)
We develop an algorithm for parallel disk sorting, whose I/O cost approaches the lower bound and that guarantees almost perfect overlap between I/O and computation. Previous algorithms have either suboptimal I/O volume or cannot guarantee that I/O and computations can always be overlapped. We give an efficient implementation that can (at least) compete with the best practical implementations but gives additional performance guarantees. For the experiments we have configured a state of the art machine that can sustain full bandwidth I/O with eight disks and is very cost effective.
A framework for adaptive algorithm selection in STAPL
In Proc. ACM SIGPLAN Symp. Prin. Prac. Par. Prog. (PPoPP), pp. 277–288, 2005
Cited by 21 (5 self)
Writing portable programs that perform well on multiple platforms or for varying input sizes and types can be very difficult because performance is often sensitive to the system architecture, the runtime environment, and input data characteristics. This is even more challenging on parallel and distributed systems due to the wide variety of system architectures. One way to address this problem is to adaptively select the best parallel algorithm for the current input data and system from a set of functionally equivalent algorithmic options. Toward this goal, we have developed a general framework for adaptive algorithm selection for use in the Standard Template Adaptive Parallel Library (STAPL). Our framework uses machine learning techniques to analyze data collected by STAPL installation benchmarks and to determine tests that will select among algorithmic options at runtime. We apply a prototype implementation of our framework to two important parallel operations, sorting and matrix multiplication, on multiple platforms and show that the framework determines runtime tests that correctly select the best performing algorithm from among several competing algorithmic options in 86–100% of the cases studied, depending on the operation and the system.
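As a rough illustration of run-time algorithm selection (not STAPL's framework or API), the Python sketch below picks between two functionally equivalent sorts using a hand-written test on input size and presortedness. The function names, the threshold values, and the presortedness measure are all assumptions made for this example; in the paper such tests are derived by machine learning from installation benchmarks rather than fixed by hand.

```python
def insertion_sort(items):
    """Simple O(n^2) sort that does well on small or nearly sorted inputs."""
    result = list(items)
    for i in range(1, len(result)):
        key = result[i]
        j = i - 1
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key
    return result


def merge_sort(items):
    """O(n log n) top-down merge sort."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def presortedness(items):
    """Fraction of adjacent pairs already in order (1.0 for sorted input)."""
    if len(items) < 2:
        return 1.0
    in_order = sum(1 for a, b in zip(items, items[1:]) if a <= b)
    return in_order / (len(items) - 1)


def select_sort(items, small_n=32, sorted_fraction=0.9):
    """Hypothetical run-time test; the thresholds are illustrative, not learned."""
    if len(items) <= small_n or presortedness(items) >= sorted_fraction:
        return insertion_sort
    return merge_sort


def adaptive_sort(items):
    """Sort with whichever algorithm the run-time test selects."""
    return select_sort(items)(items)
```

Both candidates return the same result for any input, so the selection test only affects running time, which is what makes it safe to swap one for the other at run time.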
Cache oblivious algorithms
Algorithms for Memory Hierarchies, LNCS 2625, 2003
Cited by 14 (0 self)
Abstract. The cache oblivious model is a simple and elegant model to design algorithms that perform well in hierarchical memory models ubiquitous on current systems. This model was first formulated in [22] and has since been a topic of intense research. Analyzing and designing algorithms and data structures in this model involves not only an asymptotic analysis of the number of steps executed in terms of the input size, but also the movement of data optimally among the different levels of the memory hierarchy. This chapter is aimed as an introduction to the “ideal-cache” model of [22] and techniques used to design cache oblivious algorithms. The chapter also presents some experimental insights and results. Part of this work was done while the author was visiting MPI-Saarbrücken.
Getting more from out-of-core columnsort
In 4th Workshop on Algorithm Engineering and Experiments (ALENEX 02), 2002
Cited by 13 (8 self)
Abstract. We describe two improvements to a previous implementation of out-of-core columnsort, in which data reside on multiple disks. The first improvement replaces asynchronous I/O and communication calls by synchronous calls within a threaded framework. Experimental runs show that this improvement reduces the running time to approximately half of the running time of the previous implementation. The second improvement uses algorithmic and engineering techniques to reduce the number of passes over the data from four to three. Experimental evidence shows that this improvement yields modest performance gains. We expect that the performance gain of this second improvement increases when the relative speed of processing and communication increases with respect to disk I/O speeds. Thus, as processing and communication become faster relative to I/O, this second improvement may yield better results than it currently does.
T.H.: Building on a Framework: Using FG for More Flexibility and Improved Performance in Parallel Programs
In: 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2007
Cited by 4 (1 self)
We describe new features of FG that are designed to improve performance and extend the range of computations that fit into its framework. FG (short for Framework Generator) is a programming environment for parallel programs running on clusters. It was originally designed to mitigate latency in accessing data by running a program as a series of asynchronous stages that operate on buffers in a linear pipeline. To improve performance, FG now allows stages to be replicated, either statically by the programmer or dynamically by FG itself. FG also now alters thread priorities to use resources more efficiently; again, this action may be initiated by either the programmer or FG. To extend the range of computations that fit into its framework, FG now incorporates fork-join and DAG structures. Not only do fork-join and DAG structures allow for more programs to be designed for FG, but they also can enable significant performance improvements over linear pipeline structures.
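The idea of asynchronous stages passing buffers along a linear pipeline can be sketched with Python's standard `threading` and `queue` modules. This is not FG's API: the names `run_stage` and `run_pipeline`, the sentinel object, and the queue depth are assumptions made for illustration, and the sketch omits error handling (a stage function that raises would leave the pipeline waiting).

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the buffer stream


def run_stage(work, inbox, outbox):
    """Apply `work` to each buffer from `inbox` and pass the result downstream."""
    while True:
        buf = inbox.get()
        if buf is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(work(buf))


def run_pipeline(buffers, stages, depth=2):
    """Run `stages` as a linear pipeline with one thread per stage.

    Bounded queues of size `depth` let each stage work on one buffer while
    its neighbours handle others, so slow steps overlap with fast ones.
    """
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    results = []

    def drain():
        # Collect finished buffers so the last queue never blocks the pipeline.
        while True:
            buf = queues[-1].get()
            if buf is _DONE:
                return
            results.append(buf)

    threads.append(threading.Thread(target=drain))
    for t in threads:
        t.start()
    for buf in buffers:
        queues[0].put(buf)
    queues[0].put(_DONE)
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline([[3, 1, 2], [9, 8, 7]], [sorted, lambda b: [x * 2 for x in b]])` sorts each buffer in the first stage and doubles its elements in the second. Because each stage is a single thread reading a FIFO queue, buffers leave the pipeline in the order they entered; replicating a stage, as the abstract describes, would need extra bookkeeping to preserve that order.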
Adaptive data partition for sorting using probability distribution
In Proceedings of the International Conference on Parallel Processing, 2004
Cited by 4 (2 self)
Many computing problems benefit from dynamic partition of data into smaller chunks with better parallelism and locality. However, it is difficult to partition all types of inputs with the same high efficiency. This paper presents a new partition method in sorting scenario based on probability distribution, an idea first studied by Janus and Lamagna in early 1980’s on a mainframe computer. The new technique makes three improvements. The first is a rigorous sampling technique that ensures accurate estimate of the probability distribution. The second is an efficient implementation on modern, cache-based machines. The last is the use of probability distribution in parallel sorting. Experiments show 10–30% improvement in partition balance and 20–70% reduction in partition overhead, compared to two commonly used techniques. The new method reduces the parallel sorting time by 33–50% and outperforms the previous fastest sequential sorting technique by up to 30%.
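A minimal sketch of partitioning by an estimated probability distribution follows. It is not the paper's method: the equal-width cells, the piecewise-linear CDF, the sample size, and every function name here are assumptions chosen for illustration. The idea shown is that a key with estimated cumulative probability p goes to bucket floor(p × number of buckets), so buckets receive roughly equal shares when the estimate is good.

```python
import random


def estimate_cdf(sample, num_cells):
    """Build a piecewise-linear CDF estimate from equal-width cells."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / num_cells
    if width == 0:
        width = 1.0  # all sampled keys are equal; any positive width works
    counts = [0] * num_cells
    for x in sample:
        cell = min(int((x - lo) / width), num_cells - 1)
        counts[cell] += 1
    cumulative, running = [0.0], 0
    for c in counts:
        running += c
        cumulative.append(running / len(sample))

    def cdf(x):
        if x <= lo:
            return 0.0
        if x >= hi:
            return 1.0
        pos = (x - lo) / width
        cell = min(int(pos), num_cells - 1)
        step = cumulative[cell + 1] - cumulative[cell]
        # Clamp so the estimate stays monotone across cell boundaries.
        return min(cumulative[cell] + (pos - cell) * step, cumulative[cell + 1])

    return cdf


def partition(keys, num_buckets, sample_size=256, num_cells=64, seed=0):
    """Assign each key to a bucket according to its estimated CDF value."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    cdf = estimate_cdf(sample, num_cells)
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        index = min(int(cdf(k) * num_buckets), num_buckets - 1)
        buckets[index].append(k)
    return buckets


def distribution_sort(keys, num_buckets=8):
    """Partition, sort each bucket independently, then concatenate."""
    if not keys:
        return []
    out = []
    for bucket in partition(list(keys), num_buckets):
        out.extend(sorted(bucket))
    return out
```

Because the estimated CDF is non-decreasing, every key in bucket i is no larger than any key in bucket i + 1, so the buckets can be sorted independently (for instance, one per processor) and concatenated.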
Relaxing the problem-size bound for out-of-core columnsort
In Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2003
Cited by 3 (2 self)
Previous implementations of out-of-core columnsort limit the problem size to $N \le \sqrt{(M/P)^3/2}$, where $N$ is the number of records to sort, $P$ is the number of processors, and $M$ is the total number of records that the entire system can hold in its memory (so that $M/P$ is the number of records that a single processor can hold in its memory). We implemented two variations to out-of-core columnsort that relax this restriction. Subblock columnsort is based on an algorithmic modification of the underlying columnsort algorithm, and it improves the problem-size bound to $N \le (M/P)^{5/3}/4^{2/3}$ but at the cost of additional disk I/O. M-columnsort changes the notion of the column size in columnsort, improving the maximum problem size to $N \le \sqrt{M^3/2}$ but at the cost of additional computation and communication. Experimental results on a Beowulf cluster show that both subblock columnsort and M-columnsort run well but that M-columnsort is faster. A further advantage of M-columnsort is that it handles a wider range of problem sizes than subblock columnsort. This research was supported in part by NSF Grant EIA9802068.
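To get a feel for the three problem-size bounds, the short Python sketch below evaluates them for one example configuration. It reads the bounds in the abstract as N ≤ sqrt((M/P)^3 / 2), N ≤ (M/P)^(5/3) / 4^(2/3), and N ≤ sqrt(M^3 / 2); the function names and the sample values of M and P are assumptions chosen only for illustration.

```python
import math


def columnsort_bound(M, P):
    """Original bound: N <= sqrt((M/P)^3 / 2)."""
    return math.sqrt((M / P) ** 3 / 2)


def subblock_bound(M, P):
    """Subblock columnsort bound: N <= (M/P)^(5/3) / 4^(2/3)."""
    return (M / P) ** (5 / 3) / 4 ** (2 / 3)


def m_columnsort_bound(M):
    """M-columnsort bound: N <= sqrt(M^3 / 2), independent of P."""
    return math.sqrt(M ** 3 / 2)


if __name__ == "__main__":
    # Hypothetical cluster: 2^28 records of total memory over 16 processors.
    M, P = 2 ** 28, 16
    for name, bound in [
        ("columnsort", columnsort_bound(M, P)),
        ("subblock columnsort", subblock_bound(M, P)),
        ("M-columnsort", m_columnsort_bound(M)),
    ]:
        print(f"{name}: N <= about 2^{math.log2(bound):.2f} records")
```

For these example values, M/P = 2^24, so the three limits come to roughly 2^35.5, 2^38.67, and 2^41.5 records respectively.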
PDM Sorting Algorithms That Take A Small Number Of Passes
Proc. International Parallel and Distributed Processing Symposium (IPDPS), 2005
Cited by 3 (1 self)
We live in an era of data explosion that necessitates the discovery of novel out-of-core techniques. The I/O bottleneck has to be dealt with in developing out-of-core methods. The Parallel Disk Model (PDM) has been proposed to alleviate the I/O bottleneck. Sorting is an important problem that has ubiquitous applications. Several asymptotically optimal PDM sorting algorithms are known and now the focus has shifted to developing algorithms for problem sizes of practical interest. In this paper we present several novel algorithms for sorting on the PDM that take only a small number of passes through the data. We also present a generalization of the zero-one principle for sorting. A shuffling lemma is presented as well. These lemmas should be of independent interest for average case analysis of sorting algorithms as well as for the analysis of randomized sorting algorithms.
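For background, the classical zero-one principle states that a comparator network which sorts every input of 0s and 1s sorts every input. The Python sketch below checks a small network this way; it illustrates only the classical principle, not the generalization the paper presents, and the function names and the example network are assumptions chosen for illustration.

```python
from itertools import product


def apply_network(network, values):
    """Run a comparator network: each (i, j) with i < j puts the smaller value at i."""
    out = list(values)
    for i, j in network:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def sorts_all_zero_one_inputs(network, n):
    """Check the network on all 2^n inputs made of 0s and 1s."""
    return all(
        apply_network(network, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )


# A standard five-comparator sorting network on four inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
```

Here `sorts_all_zero_one_inputs(NETWORK_4, 4)` examines only 16 inputs, and by the principle that is enough to conclude the network sorts any four values. Dropping the final comparator makes the check fail, for example on the input (0, 1, 0, 1).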
Oblivious vs. distribution-based sorting: An experimental evaluation
Cited by 3 (3 self)
We compare two algorithms for sorting out-of-core data on a distributed-memory cluster. One algorithm, Csort, is a 3-pass oblivious algorithm. The other, Dsort, makes three passes over the data and is based on the paradigm of distribution-based algorithms. In the context of out-of-core sorting, this study is the first comparison between the paradigms of distribution-based and oblivious algorithms. Dsort avoids two of the four steps of a typical distribution-based algorithm by making simplifying assumptions about the distribution of the input keys. Csort makes no assumptions about the keys. Despite the simplifying assumptions, the I/O and communication patterns of Dsort depend heavily on the exact sequence of input keys. Csort, on the other hand, takes advantage of predetermined I/O and communication patterns, governed entirely by the input size, in order to overlap computation, communication, and I/O. Experimental evidence shows that, even on inputs that followed Dsort’s simplifying assumptions, Csort fared well. The running time of Dsort showed great variation across five input cases, whereas Csort sorted all of them in approximately the same amount of time. In fact, Dsort ran significantly faster than Csort in just one out of the five input cases: the one that was the most unrealistically skewed in favor of Dsort. A more robust implementation of Dsort, one without the simplifying assumptions, would run even slower.