Results 1-10 of 33
The Tao of Parallelism in Algorithms
In PLDI, 2011
Cited by 40 (12 self)
For more than thirty years, the parallel programming community has used the dependence graph as the main abstraction for reasoning about and exploiting parallelism in “regular” algorithms that use dense arrays, such as finite-differences and FFTs. In this paper, we argue that the dependence graph is not a suitable abstraction for algorithms in new application areas like machine learning and network analysis in which the key data structures are “irregular” data structures like graphs, trees, and sets. To address the need for better abstractions, we introduce a data-centric formulation of algorithms called the operator formulation in which an algorithm is expressed in terms of its action on data structures. This formulation is the basis for a structural analysis of algorithms that we call tao-analysis. Tao-analysis can be viewed as an abstraction of algorithms that distills out algorithmic properties
Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors
The 12th International Conference on High Performance Computing (HiPC 2005), 2005
Cited by 36 (1 self)
Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multigraph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.
On the architectural requirements for efficient execution of graph algorithms
In Proc. 34th Int’l Conf. on Parallel Processing (ICPP), 2005
Cited by 27 (10 self)
Combinatorial problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. The hierarchical memory systems of symmetric multiprocessor (SMP) clusters optimize for local, contiguous memory accesses, and so are inefficient platforms for such algorithms. Few parallel graph algorithms outperform their best sequential implementation on SMP clusters due to long memory latencies and high synchronization costs. In this paper, we consider the performance and scalability of two graph algorithms, list ranking and connected components, on two classes of shared-memory computers: symmetric multiprocessors such as the Sun Enterprise servers and multithreaded architectures
Computational grand challenges in assembling the tree of life: Problems & solutions
The IEEE and ACM Supercomputing Conference 2005 (SC2005) Tutorial, 2005
Cited by 10 (0 self)
The computation of ever larger and more accurate phylogenetic (evolutionary) trees, with the ultimate goal of computing the tree of life, represents one of the grand challenges in High Performance Computing (HPC) Bioinformatics. Unfortunately, the size of trees that can be computed in reasonable time based on elaborate evolutionary models is limited by the severe computational cost inherent to these methods. There exist two orthogonal research directions to overcome this challenging computational burden: first, the development of novel, faster, and more accurate heuristic algorithms, and second, the application of high performance computing techniques. The goal of this chapter is to provide a comprehensive introduction to the field of computational evolutionary biology to an audience with a computing background, interested in participating in research and/or commercial applications of this field. Moreover, we will cover leading-edge technical and algorithmic developments in the field and discuss open problems and potential solutions.
An efficient transactional memory algorithm for computing minimum spanning forest of sparse graphs
In PPoPP ’09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles
Cited by 8 (0 self)
Due to the power wall, memory wall, and ILP wall, we are facing the end of ever-increasing single-threaded performance. For this reason, multicore and manycore processors are arising as a new paradigm to pursue. However, to fully exploit all the cores in a chip, parallel programming is often required, and the complexity of parallel programming raises a significant concern. Data synchronization is a major source of this programming complexity, and Transactional Memory is proposed to reduce the difficulty caused by data synchronization requirements, while providing high scalability and low performance overhead. The previous literature on Transactional Memory mostly focuses on architectural designs. Its impact on algorithms and applications has not yet been studied thoroughly. In this paper, we investigate Transactional Memory from the algorithm designer’s perspective. This paper presents an algorithmic model to assist in the design of efficient Transactional Memory algorithms and a novel Transactional Memory algorithm for computing a minimum spanning forest of sparse graphs. We emphasize multiple Transactional Memory related design issues in presenting our algorithm. We also provide experimental results on an existing software Transactional Memory system. Our algorithm demonstrates excellent scalability in the experiments, but at the same time, the experimental results reveal the clear limitation of software Transactional Memory due to its high performance overhead. Based on our experience, we highlight the necessity of efficient hardware support for Transactional Memory to realize the potential of the technology.
Amorphous Data-parallelism in Irregular Algorithms
Cited by 8 (6 self)
Most client-side applications running on multicore processors are likely to be irregular programs that deal with complex, pointer-based data structures such as large sparse graphs and trees. However, we understand very little about the nature of parallelism in irregular algorithms, let alone how to exploit it effectively on multicore processors. In this paper, we show that, although the behavior of irregular algorithms can be very complex, many of them have a generalized data-parallelism that we call amorphous data-parallelism. The algorithms in our study come from a variety of important disciplines such as data-mining, AI, compilers, networks, and scientific computing. We also argue that these algorithms can be divided naturally into a small number of categories, and that this categorization provides a lot of insight into their behavior. Finally, we discuss how these insights should guide programming language support and parallel system implementation for irregular algorithms.
Ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms
In PPoPP, 2011
Cited by 8 (1 self)
Outside of computational science, most problems are formulated in terms of irregular data structures such as graphs, trees and sets. Unfortunately, we understand relatively little about the structure of parallelism and locality in irregular algorithms. In this paper, we study several algorithms for four such problems: discrete-event simulation, single-source shortest path, breadth-first search, and minimal spanning trees. We show that these algorithms can be classified into two categories that we call unordered and ordered, and demonstrate experimentally that there is a tradeoff between parallelism and work efficiency: unordered algorithms usually have more parallelism than their ordered counterparts for the same problem, but they may also perform more work. Nevertheless, our experimental results show that unordered algorithms typically lead to more scalable implementations, demonstrating that less work-efficient irregular algorithms may be better for parallel execution.
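The ordered/unordered distinction in this abstract can be illustrated with a small sequential sketch for single-source shortest path. This is not code from the paper: the function names, the toy graph, and the choice of a Dijkstra-style priority queue (ordered) versus a FIFO worklist with chaotic relaxation (unordered) are assumptions made for illustration, and edge relaxations are counted as a rough proxy for "work".

```python
import heapq
from collections import deque

# Hypothetical toy graph: node -> list of (neighbor, edge weight).
example_graph = {
    "a": [("b", 1), ("c", 10)],
    "b": [("c", 1)],
    "c": [("d", 1)],
    "d": [],
}

def sssp_ordered(graph, source):
    """Ordered variant: nodes are processed in increasing distance order."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    heap = [(0, source)]
    relaxations = 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry, a shorter path was already found
        for v, w in graph[u]:
            relaxations += 1
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist, relaxations

def sssp_unordered(graph, source):
    """Unordered variant: active nodes may be processed in any order (FIFO here)."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    worklist = deque([source])
    relaxations = 0
    while worklist:
        u = worklist.popleft()
        for v, w in graph[u]:
            relaxations += 1
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                worklist.append(v)  # v became active again
    return dist, relaxations

ordered_dist, ordered_work = sssp_ordered(example_graph, "a")
unordered_dist, unordered_work = sssp_unordered(example_graph, "a")
print(ordered_dist == unordered_dist, ordered_work, unordered_work)
```

Both variants reach the same distances on this graph, but the unordered one revisits node `c` and so performs one extra relaxation. In exchange, its worklist items carry no ordering constraint, which is the property that makes them easier to process concurrently.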
High performance combinatorial algorithm design on the Cell Broadband Engine processor
2007
Cited by 8 (0 self)
The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm
Lock-free parallel algorithms: An experimental study
In Proceedings of the 11th International Conference High Performance Computing, 2004
Cited by 5 (2 self)
Lock-free shared data structures in the setting of distributed computing have received a fair amount of attention. Major motivations of lock-free data structures include increasing fault tolerance of a (possibly heterogeneous) system and getting rid of the problems associated with critical sections such as priority inversion and deadlock. For parallel computers with closely-coupled processors and shared memory, these issues are no longer major concerns. While many of the results are applicable especially when the model used is shared memory multiprocessors, no prior studies have considered improving the performance of a parallel implementation by way of lock-free programming. As a matter of fact, in practice lock-free data structures in a distributed setting often do not perform as well as those that use locks. As the data structures and algorithms for parallel computing are often drastically different from those in distributed computing, it is possible that lock-free programs perform better. In this paper we compare the similarities and differences of lock-free programming in both distributed and parallel computing environments and explore the possibility of adapting lock-free programming to parallel computing to improve performance. Lock-free programming also provides a new way of simulating PRAM and asynchronous PRAM algorithms on current parallel machines.
A study on the locality behavior of minimum spanning tree algorithms
In HiPC ’06: International Conference on High Performance Computing
Cited by 4 (1 self)
Studying locality behavior is crucial for achieving good performance on irregular problems. Graph algorithms with large, sparse inputs, for example, often achieve only a tiny fraction of the potential peak performance on current architectures. Compared with most numerical algorithms, graph algorithms place higher pressure on the memory system. In this paper, using the minimum spanning tree problem as an example, we study the locality behavior of graph algorithms, both sequential and parallel, for arbitrary, sparse instances. We show that the inherent locality of graph algorithms may not be favored by the current architecture, and parallel graph algorithms tend to have significantly poorer locality behavior than their sequential counterparts. As memory hierarchies get deeper and processors start to contain multiple cores, our study suggests that architectural support and new parallel algorithm designs are necessary for achieving good performance for irregular graph problems.