Results 1 - 10
of
17
Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors
- The 12th International Conference on High Performance Computing (HiPC 2005)
, 2005
"... Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that su ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multi-graph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.
On the architectural requirements for efficient execution of graph algorithms
- In Proc. 34th Int’l Conf. on Parallel Processing (ICPP
, 2005
"... Combinatorial problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. The hierarchical memory systems of symmetric multiprocessor (SMP) clusters optimize for local, conti ..."
Abstract
-
Cited by 15 (7 self)
- Add to MetaCart
Combinatorial problems such as those from graph theory pose serious challenges for parallel machines due to non-contiguous, concurrent accesses to global data structures with low degrees of locality. The hierarchical memory systems of symmetric multiprocessor (SMP) clusters optimize for local, contiguous memory accesses, and so are inefficient platforms for such algorithms. Few parallel graph algorithms outperform their best sequential implementation on SMP clusters due to long memory latencies and high synchronization costs. In this paper, we consider the performance and scalability of two graph algorithms, list ranking and connected components, on two classes of sharedmemory computers: symmetric multiprocessors such as the Sun Enterprise servers and multithreaded architectures
SWARM: A Parallel Programming Framework for Multicore Processors
, 2007
"... Due to fundamental physical limitations and power constraints, we are witnessing a radical change in commodity microprocessor architectures to multicore designs. Continued performance on multicore processors now requires the exploitation of concurrency at the algorithmic level. In this paper, we ide ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Due to fundamental physical limitations and power constraints, we are witnessing a radical change in commodity microprocessor architectures to multicore designs. Continued performance on multicore processors now requires the exploitation of concurrency at the algorithmic level. In this paper, we identify key issues in algorithm design for multicore processors and propose a computational model for these systems. We introduce SWARM (SoftWare and Algorithms for Running on Multi-core), a portable open-source parallel library of basic primitives that fully exploit multicore processors. Using this framework, we have implemented efficient parallel algorithms for important primitive operations such as prefixsums, pointer-jumping, symmetry breaking, and list ranking; for combinatorial problems such as sorting and selection; for parallel graph theoretic algorithms such as spanning tree, minimum spanning tree, graph decomposition, and tree contraction; and for computational genomics applications such as maximum parsimony. The main contributions of this paper are the design of the SWARM multicore framework, the presentation of a multicore algorithmic model, and validation results for this model. SWARM is freely available as open-source from
The Euler tour technique and parallel rooted spanning tree
- In Proc. Int’l Conf. on Parallel Processing (ICPP
, 2004
"... Many parallel algorithms for graph problems start with finding a spanning tree and rooting the tree to define some structural relationship on the vertices which can be used by following problem specific computations. The generic procedure is to find an unrooted spanning tree and then root the spanni ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
Many parallel algorithms for graph problems start with finding a spanning tree and rooting the tree to define some structural relationship on the vertices which can be used by following problem specific computations. The generic procedure is to find an unrooted spanning tree and then root the spanning tree using the Euler tour technique. With a randomized work-time optimal unrooted spanning tree algorithm and work-time optimal list ranking, finding rooted spanning trees can be done work-time optimally on EREW PRAM w.h.p. Yet the Euler tour technique assumes as “given ” a circular adjacency list, it is not without implications though to construct the circular adjacency list for the spanning tree found on the fly by a spanning tree algorithm. In fact our experiments show that this “hidden ” step of constructing a circular adjacency list could take as much time as both spanning tree and list ranking combined. In this paper we present new efficient algorithms that find rooted spanning trees without using the Euler tour technique and incur little or no overhead over the underlying spanning tree algorithms. We also present two new approaches that construct Euler tours efficiently when the circular adjacency list is not given. One is a deterministic PRAM algorithm and the other is a randomized algorithm in the symmetric multiprocessor (SMP) model. The randomized algorithm takes a novel approach for the problems of constructing the Euler tour and rooting a tree. It computes a rooted spanning tree first, then constructs an Euler tour directly for the tree using depth-first traversal. The tour constructed is cache-friendly with adjacent edges in the
High performance combinatorial algorithm design on the Cell Broadband Engine processor
, 2007
"... The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular pr ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
The Sony–Toshiba–IBM Cell Broadband Engine (Cell/B.E.) is a heterogeneous multicore architecture that consists of a traditional microprocessor (PPE) with eight SIMD co-processing units (SPEs) integrated on-chip. While the Cell/B.E. processor is architected for multimedia applications with regular processing requirements, we are interested in its performance on problems with non-uniform memory access patterns. In this article, we present two case studies to illustrate the design and implementation of parallel combinatorial algorithms on Cell/B.E.: we discuss list ranking, a fundamental kernel for graph problems, and zlib, a data compression and decompression library. List ranking is a particularly challenging problem to parallelize on current cache-based and distributed memory architectures due to its low computational intensity and irregular memory access patterns. To tolerate memory latency on the Cell/B.E. processor, we decompose work into several independent tasks and coordinate computation using the novel idea of Software-Managed threads (SM-Threads). We apply this generic SPE work-partitioning technique to efficiently implement list ranking, and demonstrate substantial speedup in comparison to traditional cache-based microprocessors. For instance, on a 3.2 GHz IBM QS20 Cell/B.E. blade, for a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation. Our second case study, zlib, is a data compression/decompression library that is extensively used in both scientific as well as general purpose computing. The core kernels in the zlib library are the LZ77 longest subsequence matching algorithm
The filter-kruskal minimum spanning tree algorithm
, 2009
"... We present Filter-Kruskal – a simple modification of Kruskal’s algorithm that avoids sorting edges that are “obviously ” not in the MST. For arbitrary graphs with random edge weights Filter-Kruskal runs in time O ( m + n lognlog m n, i.e. in linear time for not too sparse graphs. Experiments indicat ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
We present Filter-Kruskal – a simple modification of Kruskal’s algorithm that avoids sorting edges that are “obviously ” not in the MST. For arbitrary graphs with random edge weights Filter-Kruskal runs in time O ( m + n lognlog m n, i.e. in linear time for not too sparse graphs. Experiments indicate that the algorithm has very good practical performance over the entire range of edge densities. An equally simple parallelization seems to be the currently best practical algorithm on multicore machines. 1
Computational grand challenges in assembling the tree of life: Problems and solutions
- The IEEE and ACM Supercomputing Conference 2005 (SC2005) Tutorial
, 2005
"... Abstract. The computation of ever larger as well as more accurate phylogenetic (evolutionary) trees with the ultimate goal to compute the tree of life represents one of the grand challenges in High Performance Computing (HPC) Bioinformatics. Unfortunately, the size of trees which can be computed in ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Abstract. The computation of ever larger as well as more accurate phylogenetic (evolutionary) trees with the ultimate goal to compute the tree of life represents one of the grand challenges in High Performance Computing (HPC) Bioinformatics. Unfortunately, the size of trees which can be computed in reasonable time based on elaborate evolutionary models is limited by the severe computational cost inherent to these methods. There exist two orthogonal research directions to overcome this challenging computational burden: First, the development of novel, faster, and more accurate heuristic algorithms and second, the application of high performance computing techniques. The goal of this chapter is to provide a comprehensive introduction to the field of computational evolutionary biology to an audience with computing background, interested in participating in research and/or commercial applications of this field. Moreover, we will cover leading-edge technical and algorithmic developments in the field and discuss open problems and potential solutions.
SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks
"... We present SNAP (Small-world Network Analysis and Partitioning), an open-source graph framework for exploratory study and partitioning of large-scale networks. To illustrate the capability of SNAP, we discuss the design, implementation, and performance of three novel parallel community detection alg ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
We present SNAP (Small-world Network Analysis and Partitioning), an open-source graph framework for exploratory study and partitioning of large-scale networks. To illustrate the capability of SNAP, we discuss the design, implementation, and performance of three novel parallel community detection algorithms that optimize modularity, a popular measure for clustering quality in social network analysis. In order to achieve scalable parallel performance, we exploit typical network characteristics of small-world networks, such as the low graph diameter, sparse connectivity, and skewed degree distribution. We conduct an extensive experimental study on real-world graph instances and demonstrate that our parallel schemes, coupled with aggressive algorithm engineering for smallworld networks, give significant running time improvements over existing modularity-based clustering heuristics, with little or no loss in clustering quality. For instance, our divisive clustering approach based on approximate edge betweenness centrality is more than two orders of magnitude faster than a competing greedy approach, for a variety of large graph instances on the Sun Fire T2000 multicore system. SNAP also contains parallel implementations of fundamental graph-theoretic kernels and topological analysis metrics (e.g., breadth-first search, connected components, vertex and edge centrality) that are optimized for smallworld networks. The SNAP framework is extensible; the graph kernels are modular, portable across shared memory multicore and symmetric multiprocessor systems, and simplify the design of high-level domain-specific applications. 1
An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (SMPs
- In Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005
, 2004
"... We present an experimental study of parallel biconnected components algorithms employing several fundamental parallel primitives, e.g., prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations. Previous experimental studies of these primitives demonstrate reasonable para ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
We present an experimental study of parallel biconnected components algorithms employing several fundamental parallel primitives, e.g., prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations. Previous experimental studies of these primitives demonstrate reasonable parallel speedups. However, when these algorithms are used as subroutines to solve higher-level problems, there are two factors that hinder fast parallel implementations. One is parallel overhead, i.e., the large constant factors hidden in the asymptotic bounds; the other is the discrepancy among the data structures used in the primitives that brings non-negligible conversion cost. We present various optimization techniques and a new parallel algorithm that significantly improve the performance of finding biconnected components of a graph on symmetric multiprocessors (SMPs). Finding biconnected components has application in fault-tolerant network design, and is also used in graph planarity testing. Our parallel implementation achieves speedups up to 4 using 12 processors on a Sun E4500 for large, sparse graphs, and the source code is freely-available at our web site
Techniques for Designing Efficient Parallel Graph Algorithms for SMPs and Multicore Processors
"... Abstract. Graph problems are finding increasing applications in high performance computing disciplines. Although many regular problems can be solved efficiently in parallel, obtaining efficient implementations for irregular graph problems remains a challenge. We propose techniques for designing and ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Abstract. Graph problems are finding increasing applications in high performance computing disciplines. Although many regular problems can be solved efficiently in parallel, obtaining efficient implementations for irregular graph problems remains a challenge. We propose techniques for designing and implementing efficient parallel algorithms for graph problems on symmetric multiprocessors and chip multiprocessors with a case study of parallel tree and connectivity algorithms. The problems we study represent a wide range of irregular problems that have fast theoretic parallel algorithms but no known efficient parallel implementations that achieve speedup without serious restricting assumptions about the inputs. We believe our techniques will be of practical impact in solving largescale graph problems.

