Results 1–10 of 30
GPU Sample Sort
, 2009
Abstract

Cited by 24 (0 self)
In this paper, we present the design of a sample sort algorithm for many-core GPUs. Despite being one of the most efficient comparison-based sorting algorithms for distributed memory architectures, its performance on GPUs was previously unknown. For uniformly distributed keys our sample sort is at least 25% and on average 68% faster than the best comparison-based sorting algorithm, GPU Thrust merge sort, and on average more than 2 times faster than GPU quicksort. Moreover, for 64-bit integer keys it is at least 63% and on average 2 times faster than the highly optimized GPU Thrust radix sort that directly manipulates the binary representation of keys. Our implementation is robust to different distributions and entropy levels of keys and scales almost linearly with the input size. These results indicate that multi-way techniques in general, and sample sort in particular, achieve substantially better performance than two-way merge sort and quicksort.
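The multi-way idea behind sample sort can be illustrated with a minimal sequential sketch (in Python, for brevity). This is not the authors' CUDA implementation, only the generic scheme: choose splitters from a random sample, scatter keys into buckets, and recurse; the parameters `k` and `oversample` are arbitrary illustrative choices.

```python
import bisect
import random

def sample_sort(keys, k=4, oversample=8):
    """Sketch of multi-way sample sort: pick k-1 splitters from a
    random sample, scatter keys into k buckets, sort each bucket."""
    if len(keys) <= k * oversample:
        return sorted(keys)
    sample = sorted(random.sample(keys, k * oversample))
    # every oversample-th sample element becomes a splitter
    splitters = sample[oversample::oversample][:k - 1]
    buckets = [[] for _ in range(k)]
    for x in keys:
        buckets[bisect.bisect_right(splitters, x)].append(x)
    if max(len(b) for b in buckets) == len(keys):
        return sorted(keys)  # degenerate sample (e.g. all keys equal)
    out = []
    for b in buckets:
        out.extend(sample_sort(b, k, oversample))
    return out
```

On a GPU, the scatter step is what makes the multi-way approach attractive: one pass distributes keys to many buckets that can then be sorted independently.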
The Filter-Kruskal Minimum Spanning Tree Algorithm
, 2009
Abstract

Cited by 10 (1 self)
We present Filter-Kruskal, a simple modification of Kruskal’s algorithm that avoids sorting edges that are “obviously” not in the MST. For arbitrary graphs with random edge weights, Filter-Kruskal runs in time O(m + n log n log(m/n)), i.e. in linear time for not too sparse graphs. Experiments indicate that the algorithm has very good practical performance over the entire range of edge densities. An equally simple parallelization seems to be the currently best practical algorithm on multicore machines.
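The filtering idea can be sketched as follows. This is a hedged, sequential Python illustration of the scheme described above (random pivot partitioning plus a union-find filter), not the authors' implementation; the threshold and pivot rule are arbitrary choices.

```python
import random

def filter_kruskal(n, edges, threshold=16):
    """Sketch of Filter-Kruskal on edges given as (weight, u, v):
    partition edges around a random pivot weight, recurse on the light
    part first, then drop heavy edges whose endpoints are already
    connected, falling back to plain Kruskal on small edge sets."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []

    def kruskal(es):
        for w, u, v in sorted(es):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((w, u, v))

    def rec(es):
        if len(es) <= threshold:
            kruskal(es)
            return
        pivot = random.choice(es)[0]
        light = [e for e in es if e[0] <= pivot]
        heavy = [e for e in es if e[0] > pivot]
        if not heavy:  # pivot was the maximum weight; avoid looping
            kruskal(es)
            return
        rec(light)
        # the filter step: heavy edges inside one component are useless
        heavy = [(w, u, v) for (w, u, v) in heavy if find(u) != find(v)]
        rec(heavy)

    rec(list(edges))
    return mst
```

The point of the filter is that, for random edge weights, most heavy edges connect already-joined components and are discarded without ever being sorted.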
Parallel Time-Dependent Contraction Hierarchies
 Master’s thesis, Universität Karlsruhe (TH), Fakultät für Informatik
, 2009
Abstract

Cited by 6 (2 self)
Time-Dependent Contraction Hierarchies is a routing technique that solves the shortest path problem in graphs with time-dependent edge weights, which must satisfy the FIFO property. Although it shows great speedups over Dijkstra’s algorithm, the preprocessing is slow. We present a parallelized version of the preprocessing that takes advantage of the multiple cores present in today’s CPUs. Nodes independent of one another are found and processed in parallel. We give experimental results for the German road network. With 4 and 8 cores, a speedup of up to 3.4 and 5.3 is achieved, respectively.
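The core parallelization idea, contracting sets of mutually non-adjacent nodes at the same time, can be illustrated with a minimal sketch. The adjacency representation and priority order here are hypothetical stand-ins, not the thesis' actual node ordering:

```python
def independent_node_set(adj, order):
    """Greedily scan nodes in the given priority order and pick those
    not adjacent to an already-picked node; picked nodes touch disjoint
    neighborhoods and can therefore be contracted in parallel."""
    picked, blocked = [], set()
    for v in order:
        if v not in blocked:
            picked.append(v)
            blocked.add(v)
            blocked.update(adj[v])
    return picked
```

Repeating this selection on the remaining nodes yields batches of independent contractions that worker threads can process without conflicting updates.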
The GNU libstdc++ Parallel Mode: Software Engineering Considerations
Abstract

Cited by 5 (3 self)
The C++ Standard Library implementation provided with the free GNU C++ compiler, libstdc++, provides a “parallel mode” as of version 4.3. Using this mode enables existing serial code to take advantage of many parallelized STL algorithms, an approach to making use of multicore processors which are now, or soon will be, ubiquitous. This paper describes the software engineering issues discovered during implementation and the results of user testing, and presents possible solutions to outstanding issues. Design issues are addressed concerning the configuration of the software for a wide variety of multicore hardware options, influencing algorithm and parameter choices at compile and run time, standards compliance, and the interplay between execution speed, executable size, library code size, and compilation time.
Parallel and I/O Efficient Set Covering Algorithms
Abstract

Cited by 4 (2 self)
This paper presents the design, analysis, and implementation of parallel and sequential I/O-efficient algorithms for set cover, tying together the line of work on parallel set cover and the line of work on efficient set cover algorithms for large, disk-resident instances. Our contributions are twofold. First, we design and analyze a parallel cache-oblivious set-cover algorithm that offers essentially the same approximation guarantees as the standard greedy algorithm, which achieves the optimal approximation ratio. Our algorithm is the first efficient external-memory or cache-oblivious algorithm for the case when neither the sets nor the elements fit in memory, leading to I/O cost (cache complexity) equivalent to sorting in the Cache-Oblivious or Parallel Cache-Oblivious models. The algorithm also incurs few cache misses on parallel hierarchical memories (again, equivalent to sorting). Second, building on this theory, we engineer variants of the theoretical algorithm optimized for different hardware setups. We provide an experimental evaluation showing substantial speedups over existing algorithms without compromising the solution’s quality.
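For reference, the standard greedy algorithm whose approximation guarantee the paper matches looks like this as a minimal in-memory sketch; the paper's contribution is precisely the external-memory and parallel versions, which this sketch does not attempt.

```python
def greedy_set_cover(universe, sets):
    """Standard greedy set cover: repeatedly take the set covering the
    most still-uncovered elements (ln n approximation). Returns the
    indices of the chosen sets."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        i = max(range(len(sets)), key=lambda j: len(sets[j] & uncovered))
        if not sets[i] & uncovered:
            raise ValueError("universe not coverable by the given sets")
        cover.append(i)
        uncovered -= sets[i]
    return cover
```

The difficulty the paper addresses is that this inner `max` over all sets touches the whole instance in every round, which is exactly what an I/O-efficient algorithm must avoid.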
Building a parallel pipelined external memory algorithm library
 In 23rd IEEE International Parallel & Distributed Processing Symposium (IPDPS)
, 2009
Abstract

Cited by 3 (1 self)
Large, fast, and inexpensive hard disks have enabled the processing of huge amounts of data on a single machine. For this purpose, the well-established STXXL library provides a framework for external memory algorithms with an easy-to-use interface. However, the clock speed of processors cannot keep up with the increasing bandwidth of parallel disks, making many algorithms actually compute-bound. To overcome this steadily worsening limitation, we exploit today’s multicore processors with two new approaches. First, we parallelize the internal computation of the encapsulated external memory algorithms by utilizing the MCSTL library. Second, we augment the unique pipelining feature of the STXXL to enable automatic task parallelization. We show, using synthetic and practical use cases, that the combination of both techniques greatly increases performance.
Algorithm Engineering – An Attempt at a Definition
Abstract

Cited by 1 (0 self)
This paper defines algorithm engineering as a general methodology for algorithmic research. The main process in this methodology is a cycle consisting of algorithm design, analysis, implementation and experimental evaluation that resembles Popper’s scientific method. Important additional issues are realistic models, algorithm libraries, benchmarks with real-world problem instances, and a strong coupling to applications. Algorithm theory with its process of subsequent modelling, design, and analysis is not a competing approach to algorithmics but an important ingredient of algorithm engineering.
Single-Pass List Partitioning
 SCPE
, 2008
Abstract
Parallel algorithms divide computation among several threads. In many cases, the input must also be divided. Consider an input consisting of a linear sequence of elements whose length is unknown a priori. We can divide it evenly in a naïve way either by traversing it twice (first determine the length, then divide) or by using linear additional memory to hold an array of pointers to the elements. Instead, we propose an algorithm that divides a linear sequence into p parts of similar length while traversing the sequence only once and using sublinear additional space. Experiments show that our list partitioning algorithm is effective and fast in practice. Key words: parallel processing, sequences, algorithmic libraries
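One way to realize such a single-pass, sublinear-space scheme is sketched below: remember every step-th position and, when the fixed-size buffer overflows, double the step and keep every other remembered position. This is an illustrative reconstruction under those assumptions; the paper's exact bookkeeping may differ.

```python
def one_pass_partition(it, p, buffer_size=64):
    """Sketch of single-pass splitting of a sequence of unknown length
    into p parts of similar length, using O(buffer_size) extra space.
    Returns p part boundaries (exclusive end indices)."""
    marks, step, n = [], 1, 0
    for n, _ in enumerate(it, 1):
        if (n - 1) % step == 0:
            marks.append(n - 1)
            if len(marks) > buffer_size:
                marks = marks[::2]  # thin out: keep multiples of 2*step
                step *= 2
    # pick p-1 boundaries from the remembered marks, plus the total length
    bounds = [marks[len(marks) * k // p] for k in range(1, p)]
    return bounds + [n]
```

Because the stride only doubles on overflow, each boundary is off from the ideal n/p split by at most one stride, while the buffer never exceeds its fixed size.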
Comparison-Based Sorting for Systems with Multiple GPUs
Abstract
As a basic building block of many applications, sorting algorithms that run efficiently on modern machines are key to the performance of these applications. With the recent shift to using GPUs for general-purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high-performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produces a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.
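The flavor of such an even pivot selection can be sketched on two sorted in-memory blocks: a binary search finds a split so that, after exchanging parts, every element on the low side is no larger than every element on the high side, with both sides keeping their original sizes. This is a generic two-sequence partition in Python, not the paper's CUDA implementation; the `sorted()` calls stand in for the per-GPU merges.

```python
def merge_split(a, b):
    """Split two sorted, equal-length blocks a and b so that the low
    side gets i elements of a and n - i of b, with max(low) <= min(high).
    Returns the merged low and high halves."""
    n = len(a)
    lo, hi = 0, n
    while lo < hi:
        i = (lo + hi) // 2      # elements of a assigned to the low side
        j = n - i               # elements of b assigned to the low side
        if i < n and j > 0 and a[i] < b[j - 1]:
            lo = i + 1          # low side needs more of a
        else:
            hi = i
    i, j = lo, n - lo
    low = sorted(a[:i] + b[:j])
    high = sorted(a[i:] + b[j:])
    return low, high
```

Because the split sizes are fixed at n elements per side, repeated pairwise merge-splits keep the data evenly distributed, which is the property the paper's pivot selection provides across GPUs.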