Results 1-10 of 38
The influence of caches on the performance of sorting
 In Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms
, 1997
"... We investigate the effect that caches have on the performance of sorting algorithms both experimentally and analytically. To address the performance problems that high cache miss penalties introduce we restructure mergesort, quicksort, and heapsort in order to improve their cache locality. For all t ..."
Abstract

Cited by 121 (4 self)
We investigate the effect that caches have on the performance of sorting algorithms both experimentally and analytically. To address the performance problems that high cache miss penalties introduce we restructure mergesort, quicksort, and heapsort in order to improve their cache locality. For all three algorithms the improvement in cache performance leads to a reduction in total execution time. We also investigate the performance of radix sort. Despite the extremely low instruction count incurred by this linear time sorting algorithm, its relatively poor cache performance results in worse overall performance than the efficient comparison-based sorting algorithms. For each algorithm we provide an analysis that closely predicts the number of cache misses incurred by the algorithm.
Communication-Efficient Parallel Sorting
, 1996
"... We study the problem of sorting n numbers on a pprocessor bulksynchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processortoprocessor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sort ..."
Abstract

Cited by 65 (2 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1−1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
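As a rough illustration of the BSP sorting style this abstract describes, here is a hypothetical single-round sample sort in which p simulated "processors" exchange keys by splitter intervals and then sort locally. The names and parameters (`p`, `oversample`) are illustrative, not from the paper.

```python
import random

# Sketch of one BSP-style sample-sort round: pick p-1 splitters from a
# random sample, route every key to the processor owning its splitter
# interval (the "communication round"), then sort each bucket locally.
def bsp_sample_sort(items, p=4, oversample=8):
    random.seed(0)  # deterministic for illustration only
    sample = sorted(random.sample(items, min(len(items), p * oversample)))
    # p - 1 splitters at evenly spaced positions in the sorted sample.
    splitters = [sample[(i + 1) * len(sample) // p] for i in range(p - 1)]
    # "Communication round": each key goes to the bucket of its interval.
    buckets = [[] for _ in range(p)]
    for x in items:
        dest = sum(1 for s in splitters if x >= s)
        buckets[dest].append(x)
    # Local computation: sorting each bucket; the concatenation of the
    # buckets in order is globally sorted.
    return [x for b in buckets for x in sorted(b)]
```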
Efficient parallel graph algorithms for coarse grained multicomputers and BSP (Extended Abstract)
 in Proc. 24th International Colloquium on Automata, Languages and Programming (ICALP'97)
, 1997
"... In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulksynchronous parallel computer (BSP) models which solve the following well known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and s ..."
Abstract

Cited by 64 (24 self)
In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1-7 require O(log p) communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios n/p, i.e. they are fully scalable, and for Problems 3-8 it is assumed that n/p ≥ p^ε, ε > 0, which is true for all commercially available multiprocessors.
Subscription partitioning and routing in contentbased publish/subscribe systems
 In DISC 2002: Proceedings of International Symposium on Distributed Computing
, 2004
"... Abstract — Contentbased publish/subscribe systems allow subscribers to specify events of interest based on event contents, beyond preassigned event topics. When networks of servers are used to provide scalable contentbased publish/subscribe services, we have the flexibility of partitioning exist ..."
Abstract

Cited by 27 (1 self)
Abstract — Content-based publish/subscribe systems allow subscribers to specify events of interest based on event contents, beyond pre-assigned event topics. When networks of servers are used to provide scalable content-based publish/subscribe services, we have the flexibility of partitioning existing subscriptions and routing new subscriptions among multiple servers to optimize various performance metrics, including total network traffic, load balancing, and system throughput. We propose two approaches to subscription partitioning and routing, one based on partitioning the event space and the other based on partitioning the subscription set, and discuss their tradeoffs. Finally, we collect and analyze a set of real-world stock-quote subscriptions and use that as the basis for our simulation study to demonstrate the effectiveness of the proposed schemes.
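A minimal sketch of the event-space partitioning approach mentioned above, under the simplifying assumption of a single numeric attribute: each server owns a disjoint range, so a subscription is routed to every server whose range it overlaps while an event goes to exactly one server. `SERVER_RANGES` and both functions are hypothetical names, not from the paper.

```python
# Hypothetical event-space partition over one numeric attribute ("price"):
# half-open ranges, one per server.
SERVER_RANGES = [(0, 100), (100, 200), (200, 10**9)]

def route_subscription(lo, hi):
    """Servers whose range overlaps the subscription interval [lo, hi)."""
    return [i for i, (a, b) in enumerate(SERVER_RANGES) if lo < b and hi > a]

def route_event(price):
    """The single server whose range contains the event's price."""
    for i, (a, b) in enumerate(SERVER_RANGES):
        if a <= price < b:
            return i
    raise ValueError("price outside partitioned space")
```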
A Randomized Sorting Algorithm on the BSP model
 In Proceedings of IPPS
, 1997
"... We present a new randomized sorting algorithm on the BulkSynchronousParallel (BSP) model. The algorithm improves upon the parallel slack of previous algorithms to achieve optimality. Tighter probabilistic bounds are also established. It uses sample sorting and utilizes recently introduced search al ..."
Abstract

Cited by 15 (5 self)
We present a new randomized sorting algorithm on the Bulk-Synchronous Parallel (BSP) model. The algorithm improves upon the parallel slack of previous algorithms to achieve optimality. Tighter probabilistic bounds are also established. It uses sample sorting and utilizes recently introduced search algorithms for a class of data structures on the BSP model. Moreover, our methods are within a 1+o(1) multiplicative factor of the respective sequential methods in terms of speedup for a wide range of the BSP parameters.
Randomized Parallel List Ranking For Distributed Memory Multiprocessors
, 1996
"... We present a randomized parallel list ranking algorithm for distributed memory multiprocessors, using a BSP like model. We first describe a simple version which requires, with high probability, log(3p) + log ln(n) = ~ O(logp+ log log n) communication rounds (hrelations with h = ~ O( n p )) and ~ O ..."
Abstract

Cited by 13 (6 self)
We present a randomized parallel list ranking algorithm for distributed memory multiprocessors, using a BSP-like model. We first describe a simple version which requires, with high probability, log(3p) + log ln(n) = Õ(log p + log log n) communication rounds (h-relations with h = Õ(n/p)) and Õ(n/p) local computation. We then outline an improved version which requires, with high probability, only r ≤ (4k + 6) log((2/3)p) + 8 = Õ(k log p) communication rounds, where k = min{i ≥ 0 | ln^(i+1) n ≤ ((2/3)p)^(2^(i+1))}. Note that k < ln*(n) is an extremely small number. For n ≤ 10^(10^100) and p ≥ 4, the value of k is at most 2. Hence, for a given number of processors, p, the number of communication rounds required is, for all practical purposes, independent of n. For n ≥ 1,500,000 and 4 ≤ p ≤ 2048, the number of communication rounds in our algorithm is bounded, with high probability, by 78, but the actual number of communication rounds observed so far is 25 in the worst case. Fo...
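For context, basic list ranking by pointer jumping (the classic PRAM-style technique, not this paper's randomized algorithm) can be sketched in a few lines: each node repeatedly adds its successor's rank and jumps to its successor's successor, converging in O(log n) synchronized rounds.

```python
# Classic pointer-jumping list ranking. succ[i] is node i's successor,
# with succ[t] == t at the tail; the computed rank is the distance to
# the tail. Each loop iteration mimics one BSP superstep.
def list_rank(succ):
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    nxt = list(succ)
    for _ in range(max(1, n.bit_length())):  # ~ceil(log2 n) rounds suffice
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank
```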
Towards a Scalable Parallel Object Database: The Bulk Synchronous Parallel Approach
, 1996
"... Parallel computers have been successfully deployed in many scientific and numerical application areas, although their use in nonnumerical and database applications has been scarce. In this report, we first survey the architectural advancements beginning to make generalpurpose parallel computing co ..."
Abstract

Cited by 9 (2 self)
Parallel computers have been successfully deployed in many scientific and numerical application areas, although their use in non-numerical and database applications has been scarce. In this report, we first survey the architectural advancements beginning to make general-purpose parallel computing cost-effective, the requirements for non-numerical (or symbolic) applications, and the previous attempts to develop parallel databases. The central theme of the Bulk Synchronous Parallel model is to provide a high level abstraction of parallel computing hardware whilst providing a realisation of a parallel programming model that enables architecture independent programs to deliver scalable performance on diverse hardware platforms. Therefore, the primary objective of this report is to investigate the feasibility of developing a portable, scalable, parallel object database, based on the Bulk Synchronous Parallel model of computation. In particular, we devise a way of providing high-level abstra...
Load Balancing of Irregular Parallel Divide-and-Conquer Algorithms in Group-SPMD Programming Environments
 Master’s Thesis, PELAB, Linköpings Universitet
, 2006
"... We study strategies for local load balancing of irregular parallel divideandconquer algorithms such as Quicksort and Quickhull in SPMDparallel environments such as MPI and Fork that allow to exploit nested parallelism by dynamic group splitting. ..."
Abstract

Cited by 4 (1 self)
We study strategies for local load balancing of irregular parallel divide-and-conquer algorithms such as Quicksort and Quickhull in SPMD-parallel environments such as MPI and Fork that allow nested parallelism to be exploited by dynamic group splitting.
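One plausible reading of dynamic group splitting is to divide a processor group into subgroups whose sizes are proportional to the two subproblem sizes, so the work per processor stays balanced after an irregular split. This helper is a hypothetical sketch, not the thesis's implementation.

```python
# Split a group of `group_size` processors between two divide-and-conquer
# subproblems in proportion to their workloads (e.g. quicksort partition
# sizes). Keeps both subgroups non-empty while both sides have work.
def split_group(group_size, left_work, right_work):
    total = left_work + right_work
    if total == 0:
        return group_size, 0
    left = round(group_size * left_work / total)
    if left_work and right_work:
        left = max(1, min(group_size - 1, left))
    return left, group_size - left
```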
d-dimensional range search on multicomputers
 Proc. 11th International Parallel Processing Symposium (IPPS'97)
, 1996
"... The range tree is a fundamental data structure for multidimensional point sets, and as such, is central in a wide range of geometric and database applications. In this paper, we describe the rst nontrivial adaptation of range trees to the parallel distributed memory setting (BSPlike models). Give ..."
Abstract

Cited by 4 (1 self)
The range tree is a fundamental data structure for multidimensional point sets, and as such, is central in a wide range of geometric and database applications. In this paper, we describe the first non-trivial adaptation of range trees to the parallel distributed memory setting (BSP-like models). Given a set of n points in d-dimensional Cartesian space, we show how to construct a distributed range tree T on a coarse grained multicomputer in time O(s/p + Tc(s, p)), where s = n log^(d−1) n is the size of the sequential data structure and Tc(s, p) is the time to perform an h-relation with h = Θ(s/p). We then show how T can be used to answer a given set Q of m = O(n) range queries in time O((s log m)/p + Tc(s, p)) and O((s log m)/p + Tc(s, p) + k/p), for the associative-function and report modes respectively, where k is the number of results to be reported. These parallel construction and search algorithms are both highly efficient, in that their running times are the sequential time divided by the number of processors, plus a constant number of parallel communication rounds.
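For readers unfamiliar with the underlying structure, here is a minimal sequential sketch of a range tree for 2-d range counting (the paper's contribution is the distributed BSP version, which is not reproduced here): points are sorted by x, and a segment tree over that order stores each node's y-coordinates in sorted order, so an axis-aligned box query costs O(log^2 n).

```python
import bisect

# Sequential 2-d range-counting tree. Query count(x1, x2, y1, y2) returns
# the number of points with x in [x1, x2] and y in [y1, y2].
class RangeTree2D:
    def __init__(self, points):
        self.pts = sorted(points)                 # sorted by x
        self.xs = [p[0] for p in self.pts]
        self.nodes = {}                           # (lo, hi) -> sorted ys
        self._build(0, len(self.pts))

    def _build(self, lo, hi):
        if hi - lo <= 0:
            return
        self.nodes[(lo, hi)] = sorted(p[1] for p in self.pts[lo:hi])
        if hi - lo > 1:
            mid = (lo + hi) // 2
            self._build(lo, mid)
            self._build(mid, hi)

    def count(self, x1, x2, y1, y2):
        # Convert the x-range to an index range over the x-sorted points.
        lo = bisect.bisect_left(self.xs, x1)
        hi = bisect.bisect_right(self.xs, x2)
        return self._count(0, len(self.pts), lo, hi, y1, y2)

    def _count(self, nlo, nhi, qlo, qhi, y1, y2):
        if qhi <= nlo or nhi <= qlo or nlo >= nhi:
            return 0                              # no overlap
        if qlo <= nlo and nhi <= qhi:             # node fully inside query
            ys = self.nodes[(nlo, nhi)]
            return bisect.bisect_right(ys, y2) - bisect.bisect_left(ys, y1)
        mid = (nlo + nhi) // 2
        return (self._count(nlo, mid, qlo, qhi, y1, y2) +
                self._count(mid, nhi, qlo, qhi, y1, y2))
```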