Communication-Efficient Parallel Sorting
, 1996
Cited by 64 (2 self)
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^(1-1/c) for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h...
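To make the constant-round claim concrete, the following minimal sketch evaluates the ratio log n / log(h+1) with h = n/p. It ignores the constant hidden by the O-notation, and the function name `round_bound` is hypothetical, chosen only for this illustration. When p = n^(1-1/c), h equals n^(1/c), so the ratio stays below c regardless of n.

```python
import math


def round_bound(n: int, p: int) -> float:
    """Evaluate log n / log(h + 1) for h = n / p.

    This is only the expression inside the O(.) bound; the hidden
    constant factor is not modeled (assumption of this sketch).
    """
    h = n / p
    return math.log(n) / math.log(h + 1)


if __name__ == "__main__":
    n = 2 ** 20
    for c in (2, 4, 5):
        # p = n^(1 - 1/c), rounded to the nearest integer
        p = round(n ** (1 - 1 / c))
        print(f"c={c}  p={p}  log n / log(h+1) = {round_bound(n, p):.3f}")
```

Each printed ratio is below the corresponding c, which is the sense in which the number of rounds is bounded by a constant.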
Efficient parallel graph algorithms for coarse grained multicomputers and BSP (Extended Abstract)
 in Proc. 24th International Colloquium on Automata, Languages and Programming (ICALP'97)
, 1997
Cited by 59 (23 self)
In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well-known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1-7 require O(log p) communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios n/p, i.e., they are fully scalable, and for Problems 3-8 it is assumed that n/p ≥ p^ε, ε > 0, which is true for all commercially
Efficient External Memory Algorithms by Simulating Coarse-Grained Parallel Algorithms
, 2003
Cited by 41 (10 self)
External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posed as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing.
Design and Implementation of a Practical Parallel Delaunay Algorithm
, 1999
Cited by 31 (4 self)
This paper describes the design and implementation of a practical parallel algorithm for Delaunay triangulation that works well on general distributions. Although there have been many theoretical parallel algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well known reduction of 2D Delaunay triangulation to find the 3D convex hull of points on a paraboloid. Based on this reduction we developed a variant of the Edelsbrunner and Shi 3D convex hull algorithm, specialized for the case when the point set lies on a paraboloid. This simplification reduces the work required by the algorithm (number of operations) from O(n log^2 n) to O(n log n). The depth (parallel time) is O(log^3 n) on a CREW PRAM. The algorithm is simpler than previous O(n log n) work parallel algorithms, leading to smaller constants. Initial experiments using a variety of distributions showed that our parallel algorithm was within a factor of 2 in work from the best sequential algorithm. Based on these promising results, the algorithm was implemented using C and an MPI-based toolkit. Compared with previous work, the resulting implementation achieves significantly better speedups over good sequential code, does not assume a uniform distribution of points, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for the IBM SP2, Cray T3D, SGI Power Challenge, and DEC AlphaCluster.
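The reduction mentioned above can be illustrated with the standard lifting map (x, y) -> (x, y, x^2 + y^2). The sketch below is not the paper's algorithm. It only shows the textbook fact behind the reduction: whether a query point lies inside the circle through three others can be decided from the lifted points. The function names `lift`, `det3` and `in_circle` are hypothetical.

```python
def lift(x, y):
    """Map a planar point onto the paraboloid z = x^2 + y^2."""
    return (x, y, x * x + y * y)


def det3(u, v, w):
    """Determinant of the 3x3 matrix with rows u, v, w."""
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def in_circle(a, b, c, d):
    """Sign test for d against the circle through a, b, c.

    Assumes a, b, c are given in counterclockwise order. The points are
    translated so that d is the origin and then lifted; the sign of the
    determinant tells on which side of the plane through the three
    lifted points the lifted query point (the origin) lies.
    Result > 0: d inside the circle, < 0: outside, == 0: on the circle.
    """
    rows = [lift(px - d[0], py - d[1]) for px, py in (a, b, c)]
    return det3(*rows)


if __name__ == "__main__":
    a, b, c = (0, 0), (4, 0), (0, 4)
    for d in ((1, 1), (5, 5), (4, 4)):
        print(d, in_circle(a, b, c, d))
```

A triangle belongs to the Delaunay triangulation exactly when no other input point gives a positive result, which corresponds to the lifted triangle being a facet of the lower convex hull.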
Developing a Practical Projection-Based Parallel Delaunay Algorithm
 in 12th Annual Symposium on Computational Geometry
, 1996
Cited by 16 (2 self)
In this paper we are concerned with developing a practical parallel algorithm for Delaunay triangulation that works well on general distributions, particularly those that arise in scientific computation. Although there have been many theoretical algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well known reduction of 2D Delaunay triangulation to the 3D convex hull of points on a sphere or paraboloid. A variant of the Edelsbrunner and Shi 3D convex hull algorithm is used, but for the special case when the point set lies on either a sphere or a paraboloid. Our variant greatly reduces the constant costs of the 3D convex hull algorithm and seems to be more promising for a practical implementation than other parallel approaches. We have run experiments on the algorithm using a variety of distributions that are motivated by various problems that use Delau...
I/O-Efficient Construction of Voronoi Diagrams
, 2002
Cited by 12 (0 self)
We consider the problems of computing 2- and 3-d Voronoi diagrams for large data sets efficiently. We describe a cache-oblivious distribution data structure (buffer tree) that is the basis for the cache-oblivious implementation of a random incremental construction for geometric problems. We then apply this to the construction of 2- and 3-d Voronoi diagrams. We also describe a very simple variant of the standard random incremental construction based on a history DAG, which has optimal running time and is likely to be I/O-efficient because the pattern of insertions is also local (but we don't have theoretical bounds). Finally, we describe a practical variant that has been implemented and present some experimental results.
Computing the arrangement of curve segments: Divide-and-conquer algorithms via sampling
 in Proc. of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms
, 2000
Cited by 12 (1 self)
We describe two deterministic algorithms for constructing the arrangement determined by a set of (algebraic) curve segments in the plane. They both use a divide-and-conquer approach based on derandomized geometric sampling and achieve the optimal running time O(n log n + k), where n is the number of segments and k is the number of intersections. The first algorithm, a simplified version of one presented in [1], generates a structure of size O(n log log n + k) and its parallel implementation runs in time O(log^2 n). The second algorithm is better in that the decomposition of the arrangement constructed has optimal size O(n + k) and it has a parallel implementation in the EREW PRAM model that runs in time O(log^(3/2) n). The improvements in the second algorithm are achieved by means of an approach
Randomized Parallel List Ranking For Distributed Memory Multiprocessors
, 1996
Cited by 12 (6 self)
We present a randomized parallel list ranking algorithm for distributed memory multiprocessors, using a BSP-like model. We first describe a simple version which requires, with high probability, log(3p) + log ln(n) = Õ(log p + log log n) communication rounds (h-relations with h = Õ(n/p)) and Õ(n/p) local computation. We then outline an improved version which requires, with high probability, only r ≤ (4k + 6) log((2/3)p) + 8 = Õ(k log p) communication rounds, where k = min{i ≥ 0 | ln^(i+1) n ≤ ((2/3)p)^(2i+1)}. Note that k < ln*(n) is an extremely small number. For n ≤ 10^(10^100) and p ≥ 4, the value of k is at most 2. Hence, for a given number of processors, p, the number of communication rounds required is, for all practical purposes, independent of n. For n ≤ 1,500,000 and 4 ≤ p ≤ 2048, the number of communication rounds in our algorithm is bounded, with high probability, by 78, but the actual number of communication rounds observed so far is 25 in the worst case. Fo...
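The claim that k is tiny can be checked numerically. The sketch below assumes the reading k = min{i ≥ 0 | ln^(i+1) n ≤ ((2/3)p)^(2i+1)}, where ln^(i+1) denotes the natural logarithm applied i+1 times; this reading is a reconstruction from the garbled abstract, not a confirmed formula. The function name `k_value` is hypothetical, and it takes ln n rather than n so that values such as n = 10^(10^100) can be handled in floating point.

```python
import math


def k_value(ln_n: float, p: int) -> int:
    """Smallest i >= 0 with ln^(i+1) n <= ((2/3) p)^(2i+1).

    ln_n is the natural logarithm of n (passed directly so that
    astronomically large n can be represented). Assumes p >= 2.
    """
    base = 2.0 * p / 3.0
    x = ln_n  # x holds ln^(i+1) n for the current i
    i = 0
    while x > base ** (2 * i + 1):
        x = math.log(x)  # safe: x > base^(2i+1) > 0 here
        i += 1
    return i


if __name__ == "__main__":
    # n = 10^(10^100), so ln n = 10^100 * ln 10
    print(k_value(1e100 * math.log(10), 4))
    # n = 1,500,000 at both ends of the range 4 <= p <= 2048
    print(k_value(math.log(1_500_000), 4))
    print(k_value(math.log(1_500_000), 2048))
```

Under this reading, the first call agrees with the statement that k is at most 2 for n ≤ 10^(10^100) and p ≥ 4, since larger p or smaller n can only make the condition easier to satisfy.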
Reducing I/O complexity by simulating coarse grained parallel algorithms
 In Proc. IPPS/SPDP
, 1999
Cited by 10 (5 self)
Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a deterministic simulation technique which transforms parallel algorithms into (parallel) external memory algorithms. Specifically, we present a deterministic simulation technique which transforms Coarse Grained Multicomputer (CGM) algorithms into external memory algorithms for the Parallel Disk Model. Our technique optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel external memory algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including 3D convex hulls (2D Voronoi diagrams), and various graph problems. All of the (parallel) external memory algorithms obtained via simulation are analyzed with respect to the computation time, communication time and the number of I/Os. Our results answer the challenge posed by the ACM working group on storage I/O for large-scale computing [8].
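The general idea of running a round-based parallel program on one machine with its state kept on disk can be sketched as follows. This is not the paper's simulation technique, whose details (blocking, disk striping, the Parallel Disk Model) are not given in the abstract. It is a toy illustration under stated assumptions: virtual processors are executed one at a time, each context and each inbox lives in its own file, and all names (`simulate`, `global_sum_step`, the file layout) are hypothetical.

```python
import json
import os
import tempfile


def simulate(step, contexts, rounds, workdir):
    """Run `rounds` supersteps of a v-processor program sequentially.

    Only one virtual processor's context and inbox are in memory at a
    time; everything else stays in files under `workdir`.
    """
    v = len(contexts)

    def ctx_path(i):
        return os.path.join(workdir, f"ctx_{i}.json")

    def inbox_path(r, i):
        return os.path.join(workdir, f"inbox_{r}_{i}.jsonl")

    for i, ctx in enumerate(contexts):
        with open(ctx_path(i), "w") as f:
            json.dump(ctx, f)
        open(inbox_path(0, i), "w").close()  # empty inbox for round 0

    for r in range(rounds):
        for j in range(v):
            open(inbox_path(r + 1, j), "w").close()
        for i in range(v):
            with open(ctx_path(i)) as f:
                ctx = json.load(f)
            with open(inbox_path(r, i)) as f:
                inbox = [json.loads(line) for line in f]
            ctx, outgoing = step(r, i, v, ctx, inbox)
            with open(ctx_path(i), "w") as f:
                json.dump(ctx, f)
            # Deliver messages by appending to the next round's inboxes.
            for dest, payload in outgoing:
                with open(inbox_path(r + 1, dest), "a") as f:
                    f.write(json.dumps(payload) + "\n")

    results = []
    for i in range(v):
        with open(ctx_path(i)) as f:
            results.append(json.load(f))
    return results


def global_sum_step(r, i, v, ctx, inbox):
    """Three-round example program: every processor learns the global sum."""
    if r == 0:  # send the local sum to processor 0
        return ctx, [(0, sum(ctx["data"]))]
    if r == 1:  # processor 0 combines and broadcasts
        if i == 0:
            total = sum(inbox)
            return ctx, [(j, total) for j in range(v)]
        return ctx, []
    ctx["total"] = inbox[0]  # round 2: store the broadcast value
    return ctx, []


if __name__ == "__main__":
    start = [{"data": [1, 2]}, {"data": [3]}, {"data": [4, 5, 6]}]
    with tempfile.TemporaryDirectory() as d:
        print(simulate(global_sum_step, start, 3, d))
```

The sketch makes one point only: a program written as a small number of communication rounds maps naturally onto a sequence of disk passes, which is why a bound on rounds translates into a bound on I/O.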
A Note On Coarse Grained Parallel Integer Sorting
 Parallel Processing Letters
, 1999
Cited by 10 (5 self)
We observe that for n/p ≥ p, which is usually the case in practice, there exists a very simple, deterministic, optimal coarse grained parallel integer sorting [...] p relations and 18 p-relations), O(n/p) memory per processor and O(n/p) local computation. Experimental data indicates that the algorithm has very good performance in practice.