Delaunay Triangulation with Transactions and Barriers
- IEEE Intl. Symp. on Workload Characterization, 2007
Cited by 23 (9 self)
Transactional memory has been widely hailed as a simpler alternative to locks in multithreaded programs, but few nontrivial transactional programs are currently available. We describe an open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallelization strategy. The code is written in C++, for use with the RSTM software transactional memory library (also open source). It employs one of the fastest known sequential algorithms to triangulate geometrically partitioned regions in parallel; it then employs alternating, barrier-separated phases of transactional and partitioned work to stitch those regions together. Experiments on multiprocessor and multicore machines confirm excellent single-thread performance and good speedup with increasing thread count. Since execution time is dominated by geometrically partitioned computation, performance is largely insensitive to the overhead of transactions, but highly sensitive to any costs imposed on sharable data that are currently “privatized”.
Alchemist: A transparent dependence distance profiling infrastructure
- In CGO ’09: Proceedings of the 2009 International Symposium on Code Generation and Optimization, 2009
Cited by 8 (1 self)
Effectively migrating sequential applications to take advantage of parallelism available on multicore platforms is a well-recognized challenge. This paper addresses important aspects of this issue by proposing a novel profiling technique to automatically detect available concurrency in C programs. The profiler, called Alchemist, operates completely transparently to applications, and identifies constructs at various levels of granularity (e.g., loops, procedures, and conditional statements) as candidates for asynchronous execution. Various dependences, including read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW), are detected between a construct and its continuation, that is, the execution following the completion of the construct. The time-ordered distance between the program points forming a dependence gives a measure of the effectiveness of parallelizing that construct, and also identifies the transformations necessary to facilitate such parallelization. Using the notion of post-dominance, our profiling algorithm builds an execution index tree at run-time. This tree is used to differentiate among multiple instances of the same static construct, which improves the accuracy of the computed profile and helps to better identify constructs that are amenable to parallelization. Performance results indicate that the profiles generated by Alchemist pinpoint strong candidates for parallelization, and can help significantly ease the burden of application migration to multicore environments. Keywords: profiling; program dependence; parallelization; execution indexing.
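The dependence kinds named in the abstract above can be illustrated with a minimal sketch. This is not Alchemist's algorithm (which, per the abstract, builds an execution index tree at run-time); it only classifies RAW, WAR, and WAW dependences between two already-recorded access traces and reports the time distance of each. The `Access` record, the `classify_dependences` function, and the sample traces are hypothetical names and data chosen for illustration.

```python
from collections import namedtuple

# Hypothetical trace record: logical timestamp, address, and "R" or "W".
Access = namedtuple("Access", ["time", "addr", "kind"])


def classify_dependences(construct, continuation):
    """Classify dependences from a construct's accesses to its continuation's.

    Returns a list of (kind, addr, distance) tuples, where distance is the
    difference between the two access timestamps.
    """
    last_read, last_write = {}, {}
    # Remember the last read and last write the construct made to each address.
    for acc in construct:
        table = last_read if acc.kind == "R" else last_write
        table[acc.addr] = acc.time

    deps = []
    for acc in continuation:
        if acc.kind == "R":
            # The continuation reads a value the construct wrote.
            if acc.addr in last_write:
                deps.append(("RAW", acc.addr, acc.time - last_write[acc.addr]))
        else:
            # The continuation overwrites something the construct read or wrote.
            if acc.addr in last_read:
                deps.append(("WAR", acc.addr, acc.time - last_read[acc.addr]))
            if acc.addr in last_write:
                deps.append(("WAW", acc.addr, acc.time - last_write[acc.addr]))
            # Later accesses to this address now depend on the continuation's
            # own write, so stop attributing them to the construct.
            last_read.pop(acc.addr, None)
            last_write.pop(acc.addr, None)
    return deps


# Illustrative traces (made-up data).
construct = [Access(1, "x", "W"), Access(2, "y", "R"), Access(3, "z", "W")]
continuation = [Access(5, "x", "R"), Access(7, "y", "W"),
                Access(8, "z", "W"), Access(9, "z", "R")]
print(classify_dependences(construct, continuation))
```

In this sketch a small distance suggests the continuation needs the construct's result almost immediately, so running the construct asynchronously would gain little; a large distance suggests more room for overlap.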
Practical Parallel Divide-and-Conquer Algorithms
- 1997
Cited by 5 (2 self)
Nested data parallelism has been shown to be an important feature of parallel languages, allowing the concise expression of algorithms that operate on irregular data structures such as graphs and sparse matrices. However, previous nested data-parallel languages have relied on a vector PRAM implementation layer that cannot be efficiently mapped to MPPs with high inter-processor latency. This thesis shows that by restricting the problem set to that of data-parallel divide-and-conquer algorithms, I can maintain the expressibility of full nested data-parallel languages while achieving good efficiency on current distributed-memory machines. Specifically, I define
Compact Data Structures with Fast Queries
- 2005
Cited by 4 (0 self)
Many applications dealing with large data structures can benefit from keeping them in compressed form. Compression has many benefits: it can allow a representation to fit in main memory rather than swapping out to disk, and it improves cache performance since it allows more data to fit into the cache. However, a data structure is only useful if it allows the application to perform fast queries (and updates) to the data.
Parallel Poisson Surface Reconstruction
Cited by 4 (1 self)
In this work we describe a parallel implementation of the Poisson Surface Reconstruction algorithm based on multigrid domain decomposition. We compare implementations using different models of data-sharing between processors and show that a parallel implementation with distributed memory provides the best scalability. Using our method, we are able to parallelize the reconstruction of models from one billion data points on twelve processors across three machines, providing a ninefold speedup in running time without sacrificing reconstruction accuracy.
Engineering a compact parallel Delaunay algorithm in 3D
- In Proceedings of the ACM Symposium on Computational Geometry, 2006
Cited by 3 (0 self)
We describe an implementation of a compact parallel algorithm for 3D Delaunay tetrahedralization on a 64-processor shared-memory machine. Our algorithm uses a concurrent version of the Bowyer-Watson incremental insertion, and a thread-safe, space-efficient structure for representing the mesh. Using the implementation we are able to generate significantly larger Delaunay meshes than have previously been generated: 10 billion tetrahedra on a 64-processor SMP using 200GB of RAM. The implementation makes use of a locality-based relabeling of the vertices that serves three purposes: it is used as part of the space-efficient representation, it improves the memory locality, and it reduces the overhead necessary for locks. The implementation also makes use of a caching technique to avoid excessive decoding of vertex information, a technique for backing out of insertions that collide, and a shared work queue for maintaining points that have yet to be inserted.
Implementation and Evaluation of an Efficient 2D Parallel Delaunay Triangulation Algorithm
- 1997
Cited by 1 (1 self)
This paper describes the derivation of an empirically efficient parallel two-dimensional Delaunay triangulation program from a theoretically efficient CREW PRAM algorithm. Compared to previous work, the resulting implementation is not limited to datasets with a uniform distribution of points, achieves significantly better speedups over good serial code, and is widely portable due to its use of MPI as a communications mechanism. Results are presented for a loosely-coupled cluster of workstations, two distributed-memory multicomputers, and a shared-memory multiprocessor. The Machiavelli toolkit used to transform the nested data parallelism inherent in the divide-and-conquer algorithm into achievable task and data parallelism is also described and compared to previous techniques.
Parallel 2D Delaunay Triangulations in HPF and MPI
This paper reports on efficient parallel implementations of two-dimensional Delaunay triangulation in High Performance Fortran (HPF) and in Message Passing Interface (MPI). Our parallelization algorithm performs sub-block triangulation and boundary merge independently at the same time. The sub-block triangulation is by a divide & conquer Delaunay algorithm known for its sequential efficiency, and the boundary triangulation is by an incremental construction algorithm with low overhead. Compared to prior work, our parallelization method is both simple and efficient. In the paper, we also describe a solution to the collinear points problem that usually arises in large data sets for divide & conquer algorithms.

