Results 1-10 of 40
The influence of caches on the performance of sorting
 in Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms
, 1997
Abstract

Cited by 124 (4 self)
We investigate the effect that caches have on the performance of sorting algorithms, both experimentally and analytically. To address the performance problems that high cache miss penalties introduce, we restructure mergesort, quicksort, and heapsort in order to improve their cache locality. For all three algorithms, the improvement in cache performance leads to a reduction in total execution time. We also investigate the performance of radix sort. Despite the extremely low instruction count incurred by this linear-time sorting algorithm, its relatively poor cache performance results in worse overall performance than the efficient comparison-based sorting algorithms. For each algorithm we provide an analysis that closely predicts the number of cache misses incurred by the algorithm.
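The restructuring described here can be illustrated with a tiled mergesort: first sort runs small enough to fit in cache, then merge them. This is a minimal Python sketch of that idea only, not the paper's implementation; `TILE` is an assumed stand-in for the cache-sized run length.

```python
TILE = 8  # assumed cache-sized run length, for illustration only

def tiled_mergesort(a):
    """Sort a list by sorting cache-sized tiles first, then merging runs."""
    n = len(a)
    # Phase 1: sort each tile independently (each tile fits in cache).
    runs = [sorted(a[i:i + TILE]) for i in range(0, n, TILE)]
    # Phase 2: repeatedly merge adjacent runs until one remains.
    while len(runs) > 1:
        runs = [merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []

def merge(x, y):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            out.append(x[i]); i += 1
        else:
            out.append(y[j]); j += 1
    out.extend(x[i:]); out.extend(y[j:])
    return out
```

The point of the two-phase structure is that Phase 1 touches each block of memory once while it is cache-resident, rather than streaming the whole array through the cache on every recursion level.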
Communication-Efficient Parallel Sorting
, 1996
Abstract

Cited by 74 (5 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use O((n log n)/p) internal computation time and O(log n / log(h+1)) communication rounds for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1-1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
Efficient parallel graph algorithms for coarse grained multicomputers and BSP (Extended Abstract)
 in Proc. 24th International Colloquium on Automata, Languages and Programming (ICALP'97)
, 1997
Abstract

Cited by 63 (23 self)
In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well-known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1-7 require O(log p) communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios n/p, i.e., they are fully scalable, and for Problems 3-8 it is assumed that n/p ≥ p^ε, ε > 0, which is true for all commercially available multiprocessors.
Subscription partitioning and routing in content-based publish/subscribe systems
 In DISC 2002: Proceedings of International Symposium on Distributed Computing
, 2004
Abstract

Cited by 30 (1 self)
Content-based publish/subscribe systems allow subscribers to specify events of interest based on event contents, beyond preassigned event topics. When networks of servers are used to provide scalable content-based publish/subscribe services, we have the flexibility of partitioning existing subscriptions and routing new subscriptions among multiple servers to optimize various performance metrics, including total network traffic, load balancing, and system throughput. We propose two approaches to subscription partitioning and routing, one based on partitioning the event space and the other based on partitioning the subscription set, and discuss their tradeoffs. Finally, we collect and analyze a set of real-world stock-quote subscriptions and use that as the basis for our simulation study to demonstrate the effectiveness of the proposed schemes.
Randomized Parallel List Ranking For Distributed Memory Multiprocessors
, 1996
Abstract

Cited by 16 (6 self)
We present a randomized parallel list ranking algorithm for distributed memory multiprocessors, using a BSP-like model. We first describe a simple version which requires, with high probability, log(3p) + log ln(n) = Õ(log p + log log n) communication rounds (h-relations with h = Õ(n/p)) and Õ(n/p) local computation. We then outline an improved version which requires, with high probability, only r ≤ (4k + 6) log((2/3)p) + 8 = Õ(k log p) communication rounds, where k = min{i ≥ 0 : ln^(i+1) n ≤ ((2/3)p)^(2^(i+1))}. Note that k < ln*(n) is an extremely small number. For n ≤ 10^(10^100) and p ≥ 4, the value of k is at most 2. Hence, for a given number of processors, p, the number of communication rounds required is, for all practical purposes, independent of n. For n ≤ 1,500,000 and 4 ≤ p ≤ 2048, the number of communication rounds in our algorithm is bounded, with high probability, by 78, but the actual number of communication rounds observed so far is 25 in the worst case. Fo...
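The classical building block behind parallel list ranking is Wyllie-style pointer jumping, which randomized algorithms such as this one improve upon. Below is a minimal sequential simulation of the synchronous rounds, for orientation only; it is not the paper's randomized algorithm.

```python
def list_rank(succ):
    """Wyllie-style pointer jumping, simulated sequentially.
    succ[i] is node i's successor; the tail points to itself.
    Returns rank[i] = number of links from i to the tail.
    Each while-loop iteration is one synchronous parallel round;
    O(log n) rounds suffice."""
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    while True:
        # All nodes "jump" simultaneously: accumulate the successor's
        # rank, then double the pointer.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        if new_nxt == nxt:  # every pointer already reaches the tail
            break
        rank, nxt = new_rank, new_nxt
    return rank
```

On a distributed-memory machine each round of this loop becomes one communication round (an h-relation), which is exactly the quantity the abstract's bounds count.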
A Randomized Sorting Algorithm on the BSP model
 in Proceedings of IPPS
, 1997
Abstract

Cited by 11 (5 self)
We present a new randomized sorting algorithm on the Bulk-Synchronous Parallel (BSP) model. The algorithm improves upon the parallel slack of previous algorithms to achieve optimality. Tighter probabilistic bounds are also established. It uses sample sorting and utilizes recently introduced search algorithms for a class of data structures on the BSP model. Moreover, our methods are within a 1 + o(1) multiplicative factor of the respective sequential methods in terms of speedup for a wide range of the BSP parameters.
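Sample sorting, which this algorithm builds on, works as follows: pick p-1 splitters from a random sample, route each key to the bucket (one per "processor") whose range contains it, and sort buckets locally. A sequential Python sketch of that generic scheme; `p` and `oversample` are illustrative parameters, not values from the paper.

```python
import bisect
import random

def sample_sort(keys, p=4, oversample=8):
    """One communication round of sample sort, simulated sequentially."""
    if len(keys) <= 1:
        return list(keys)
    # Draw an oversampled random sample and take every oversample-th
    # element as a splitter, giving p-1 splitters.
    sample = sorted(random.choices(keys, k=p * oversample))
    splitters = [sample[i * oversample] for i in range(1, p)]
    buckets = [[] for _ in range(p)]
    for k in keys:
        # bisect_right gives the index of the first splitter > k,
        # i.e. the bucket whose key range contains k.
        buckets[bisect.bisect_right(splitters, k)].append(k)
    # Local sorts, then concatenation in bucket order is globally sorted.
    return [x for b in buckets for x in sorted(b)]
```

Oversampling is what makes the bucket sizes balanced with high probability, which in BSP terms keeps the h-relation routing the keys to their buckets close to h = n/p.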
Towards a Scalable Parallel Object Database: The Bulk Synchronous Parallel Approach
, 1996
Abstract

Cited by 10 (2 self)
Parallel computers have been successfully deployed in many scientific and numerical application areas, although their use in non-numerical and database applications has been scarce. In this report, we first survey the architectural advancements beginning to make general-purpose parallel computing cost-effective, the requirements for non-numerical (or symbolic) applications, and the previous attempts to develop parallel databases. The central theme of the Bulk Synchronous Parallel model is to provide a high-level abstraction of parallel computing hardware whilst providing a realisation of a parallel programming model that enables architecture-independent programs to deliver scalable performance on diverse hardware platforms. Therefore, the primary objective of this report is to investigate the feasibility of developing a portable, scalable, parallel object database based on the Bulk Synchronous Parallel model of computation. In particular, we devise a way of providing high-level abstra...
A comparison of parallel sorting algorithms on different architectures
, 1996
Abstract

Cited by 6 (0 self)
In this paper, we present a comparative performance evaluation of three different parallel sorting algorithms: bitonic sort, sample sort, and parallel radix sort. In order to study the interaction between the algorithms and the architecture, we implemented all the algorithms on three different architectures: a MasPar MP-1202, a mesh-connected computer with 2048 processing elements; an nCUBE 2, a message-passing hypercube with 32 processors; and a Sequent Balance, a distributed shared-memory machine with 10 processors. For each machine, we found that the choice of algorithm depends upon the number of elements to be sorted. In addition, as expected, our results show that the relative performance of the algorithms differed on the various machines. It is our hope that our results can be extrapolated to help select appropriate candidates for implementation on machines with architectures similar to those that we have studied. As evidence for this, our findings on the nCUBE 2, a 32-node hypercube, are in accordance with the results obtained by Blelloch et al. [5] on the CM-2, a hypercube with 1024 processors. In addition, preliminary results we have obtained on the SGI Power Challenge, a distributed shared-memory machine, are in accordance with our findings on the Sequent Balance.
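Of the three algorithms compared, bitonic sort is the one whose compare-exchange pattern is fully data-independent, which is what makes it natural on SIMD meshes and hypercubes like the MasPar. A minimal list-based Python sketch of the algorithm itself (power-of-two input lengths only; not any of the paper's implementations):

```python
def bitonic_sort(a, ascending=True):
    """Bitonic sort; len(a) must be a power of two (or 0/1).
    The compare-exchange positions are fixed in advance, independent
    of the data values."""
    n = len(a)
    if n <= 1:
        return list(a)
    half = n // 2
    # Build a bitonic sequence: ascending first half, descending second.
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(a, ascending):
    """Sort a bitonic sequence into the requested direction."""
    n = len(a)
    if n <= 1:
        return list(a)
    a = list(a)
    half = n // 2
    for i in range(half):
        # Compare-exchange element i with its partner in the other half.
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)
```

On a parallel machine the inner loop runs as one step over all processors, so the whole sort takes O(log^2 n) synchronous compare-exchange steps regardless of the input.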
Load Balancing of Irregular Parallel Divide-and-Conquer Algorithms in Group-SPMD Programming Environments
 Master’s Thesis, PELAB, Linköpings Universitet
, 2006
Abstract

Cited by 4 (1 self)
We study strategies for local load balancing of irregular parallel divide-and-conquer algorithms, such as Quicksort and Quickhull, in SPMD-parallel environments such as MPI and Fork that allow nested parallelism to be exploited by dynamic group splitting.
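One natural group-splitting rule for such algorithms is to divide the current processor group between the two recursive calls in proportion to the subproblem sizes (which, for Quicksort, are only known after partitioning). The following toy sketch illustrates that rule only; it is an assumed strategy for illustration, not necessarily the thesis's exact scheme.

```python
def split_group(group_size, left_n, right_n):
    """Divide a processor group between the two partitions of a
    divide-and-conquer step in proportion to their sizes, keeping
    both subgroups nonempty whenever the group still has at least
    two processors."""
    total = left_n + right_n
    if group_size <= 1 or total == 0:
        return group_size, 0
    left_p = round(group_size * left_n / total)
    # Clamp so neither subgroup is starved of processors.
    left_p = min(max(left_p, 1), group_size - 1)
    return left_p, group_size - left_p
```

The irregularity problem the thesis targets shows up exactly here: a badly skewed partition (e.g. 3 vs. 997 elements) still leaves at least one processor on the small side, so some local rebalancing between the subgroups is needed afterwards.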