Results 1-10 of 45
Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors
 The 12th International Conference on High Performance Computing (HiPC 2005)
, 2005
Abstract

Cited by 37 (1 self)
Graph theoretic problems are representative of fundamental computations in traditional and emerging scientific disciplines like scientific computing and computational biology, as well as applications in national security. We present our design and implementation of a graph theory application that supports the kernels from the Scalable Synthetic Compact Applications (SSCA) benchmark suite, developed under the DARPA High Productivity Computing Systems (HPCS) program. This synthetic benchmark consists of four kernels that require irregular access to a large, directed, weighted multigraph. We have developed a parallel implementation of this benchmark in C using the POSIX thread library for commodity symmetric multiprocessors (SMPs). In this paper, we primarily discuss the data layout choices and algorithmic design issues for each kernel, and also present execution time and benchmark validation results.
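The data-layout choices the abstract emphasizes begin with how the graph itself is stored. As background (an illustrative sketch, not the SSCA benchmark's actual layout), a compressed sparse row (CSR) representation packs each vertex's outgoing edges into flat arrays indexed by per-vertex offsets:

```python
# Minimal CSR (compressed sparse row) layout for a directed, weighted graph.
# An illustrative sketch, not the SSCA reference data structure.

def build_csr(num_vertices, edges):
    """edges: list of (src, dst, weight) tuples."""
    # Count the out-degree of each vertex.
    offsets = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        offsets[src + 1] += 1
    # Prefix-sum the counts into row offsets.
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    # Scatter edges into the flat adjacency arrays.
    targets = [0] * len(edges)
    weights = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    for src, dst, w in edges:
        pos = cursor[src]
        targets[pos] = dst
        weights[pos] = w
        cursor[src] += 1
    return offsets, targets, weights

def neighbors(offsets, targets, weights, v):
    """Iterate (dst, weight) pairs for vertex v."""
    for i in range(offsets[v], offsets[v + 1]):
        yield targets[i], weights[i]
```

Because a vertex's neighbors are contiguous, CSR favors the streaming accesses that SMP memory systems reward, which is exactly the trade-off the paper's kernels must negotiate.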
On the architectural requirements for efficient execution of graph algorithms
 In Proc. 34th Int’l Conf. on Parallel Processing (ICPP)
, 2005
Abstract

Cited by 26 (10 self)
Combinatorial problems such as those from graph theory pose serious challenges for parallel machines due to noncontiguous, concurrent accesses to global data structures with low degrees of locality. The hierarchical memory systems of symmetric multiprocessor (SMP) clusters optimize for local, contiguous memory accesses, and so are inefficient platforms for such algorithms. Few parallel graph algorithms outperform their best sequential implementation on SMP clusters due to long memory latencies and high synchronization costs. In this paper, we consider the performance and scalability of two graph algorithms, list ranking and connected components, on two classes of shared-memory computers: symmetric multiprocessors such as the Sun Enterprise servers, and multithreaded architectures.
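Of the two kernels studied, connected components is the easier to state. A minimal sequential sketch (union-find with path compression, not the paper's SMP algorithm) shows the pointer-chasing access pattern that makes the problem hard for hierarchical memories:

```python
# Connected components via union-find: a sequential sketch of the kernel
# the paper studies in parallel form. The chains of parent-pointer
# lookups in find() are exactly the noncontiguous accesses the abstract
# describes.

def connected_components(num_vertices, edges):
    parent = list(range(num_vertices))

    def find(x):
        # Path halving: point x at its grandparent while walking up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    # Normalize labels so vertices in one component share a root id.
    return [find(v) for v in range(num_vertices)]
```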
A nonlocal cost aggregation method for stereo matching
 CVPR
, 2012
Abstract

Cited by 18 (2 self)
Matching cost aggregation is one of the oldest and still most popular methods for stereo correspondence. While effective and efficient, cost aggregation methods typically aggregate the matching cost by summing/averaging over a user-specified, local support region. This is obviously only locally optimal, and the computational complexity of the full-kernel implementation usually depends on the region size. In this paper, the cost aggregation problem is re-examined and a nonlocal solution is proposed. The matching cost values are aggregated adaptively based on pixel similarity on a tree structure derived from the stereo image pair to preserve depth edges. The nodes of this tree are all the image pixels, and the edges are all the edges between the nearest neighboring pixels. The similarity between any two pixels is decided by their shortest distance on the tree. The proposed method is nonlocal as every node receives support from all other nodes on the tree. As can be expected, the proposed nonlocal solution outperforms all local cost aggregation methods on the standard (Middlebury) benchmark. Besides, it has a great advantage in its extremely low computational complexity: only a total of 2 addition/subtraction operations and 3 multiplication operations are required for each pixel at each disparity level. This is very close to the complexity of unnormalized box filtering using an integral image, which requires 6 addition/subtraction operations. The unnormalized box filter is the fastest local cost aggregation method but blurs across depth edges. The proposed method was tested on a MacBook Air laptop computer with a 1.8 GHz Intel Core i7 CPU and 4 GB memory. The average runtime on the Middlebury data sets is about 90 milliseconds, and is only about 1.25 × slower than the unnormalized box filter. A nonlocal disparity refinement method is also proposed based on the nonlocal cost aggregation method.
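The constant-time-per-pixel aggregation the abstract describes comes from two sweeps over the tree. The sketch below assumes unit edge weights and a per-edge similarity s = exp(-1/sigma), simplifications of the paper's image-derived weights; the two passes compute A[v] = Σ_u S(v,u)·cost[u] for every node without any explicit pairwise sum:

```python
import math

# Two-pass nonlocal aggregation on a tree (a sketch of the idea, not the
# published implementation). S(u, v) = exp(-d(u, v) / sigma), where d is
# path distance on the tree; here every edge has weight 1, so similarity
# multiplies by s per hop.

def aggregate_on_tree(cost, children, order, sigma):
    """order: vertices listed root-first (parents before children)."""
    s = math.exp(-1.0 / sigma)   # similarity across one unit-weight edge
    # Pass 1 (leaf to root): up[v] = sum of supports from v's subtree.
    up = cost[:]
    for v in reversed(order):
        for c in children[v]:
            up[v] += s * up[c]
    # Pass 2 (root to leaf): fold in support from outside each subtree.
    agg = up[:]
    for v in order:
        for c in children[v]:
            agg[c] = up[c] + s * (agg[v] - s * up[c])
    return agg
```

On a chain 0-1-2 with unit costs, each node's aggregate equals 1 + s·(one-hop neighbors) + s²·(two-hop neighbors), confirming every node receives support from all others.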
Large Graph Algorithms for Massively Multithreaded Architectures
, 2009
Abstract

Cited by 16 (2 self)
Modern Graphics Processing Units (GPUs) provide high computation power at low cost and have been described as desktop supercomputers. GPUs today expose a general, data-parallel programming model in the form of CUDA and CAL, presenting the GPU as a massively multithreaded architecture. Several high-performance, general data processing algorithms, such as sorting and matrix multiplication, have been developed for GPUs. In this paper, we present a set of general graph algorithms on the GPU using the CUDA programming model. We present implementations of breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow algorithms on commodity GPUs. Our implementations exhibit high performance, especially on large graphs. We experiment on random, scale-free, and real-life graphs of up to millions of vertices. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The approach there has been divide-and-conquer, where individual processing nodes solve smaller subproblems followed by a combining step. The massively multithreaded model of the GPU makes it possible to adopt the data-parallel approach even for irregular algorithms like graph algorithms, using O(V) or O(E) simultaneous threads. The algorithms and the underlying techniques presented in this paper are likely to be applicable to many other irregular algorithms on the GPU.
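A level-synchronous breadth-first search illustrates the O(V) threads-per-step pattern: the frontier at depth d expands in one parallel step to the frontier at depth d+1. The sequential sketch below (an illustration, not the paper's CUDA kernels) shows the structure that would map each frontier vertex to a thread:

```python
# Level-synchronous BFS over an adjacency list. The inner loop over the
# frontier is embarrassingly parallel within a level; a GPU version
# would replace the visited check with an atomic update or bitmap.

def bfs_levels(adj, source):
    level = [-1] * len(adj)   # -1 marks unvisited vertices
    level[source] = 0
    frontier = [source]
    d = 0
    while frontier:
        next_frontier = []
        for u in frontier:            # independent per-vertex work
            for v in adj[u]:
                if level[v] == -1:    # first discovery wins
                    level[v] = d + 1
                    next_frontier.append(v)
        frontier = next_frontier
        d += 1
    return level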
Fast Minimum Spanning Tree for Large Graphs on the GPU
Abstract

Cited by 14 (4 self)
Graphics Processor Units are used for many general purpose processing tasks due to the high compute power available on them. Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka’s approach for undirected graphs. We implement it using scalable primitives such as scan, segmented scan and split. The irregular steps of supervertex formation and recursive graph construction are mapped to primitives like split on categories involving vertex ids and edge weights. We obtain 30 to 50 times speedup over the CPU implementation on most graphs and 3 to 10 times speedup over our previous GPU implementation. We construct the minimum spanning tree on a 5 million node and 30 million edge graph in under 1 second on one quarter of the Tesla S1070 GPU.
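Borůvka's round structure, which the paper recasts with scan/segmented-scan/split primitives, can be sketched sequentially: every component picks its lightest outgoing edge, and all chosen edges are contracted at once. The sketch assumes a connected graph with distinct edge weights:

```python
# Sequential sketch of Boruvka's algorithm; it mirrors only the round
# structure, not the paper's GPU primitives.

def boruvka_mst_weight(num_vertices, edges):
    """edges: list of (weight, u, v); returns total MST weight.
    Assumes a connected graph with distinct edge weights."""
    comp = list(range(num_vertices))

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    total, merged = 0, num_vertices
    while merged > 1:
        # cheapest[c] = lightest edge leaving component c this round.
        cheapest = {}
        for w, u, v in edges:
            cu, cv = find(u), find(v)
            if cu == cv:
                continue
            for c in (cu, cv):
                if c not in cheapest or w < cheapest[c][0]:
                    cheapest[c] = (w, cu, cv)
        # Contract all chosen edges (supervertex formation).
        for w, cu, cv in cheapest.values():
            ru, rv = find(cu), find(cv)
            if ru != rv:
                comp[ru] = rv
                total += w
                merged -= 1
    return total
```

Each round at least halves the component count, giving O(log V) rounds, which is why the approach parallelizes so well.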
SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks
Abstract

Cited by 12 (0 self)
We present SNAP (Small-world Network Analysis and Partitioning), an open-source graph framework for exploratory study and partitioning of large-scale networks. To illustrate the capability of SNAP, we discuss the design, implementation, and performance of three novel parallel community detection algorithms that optimize modularity, a popular measure for clustering quality in social network analysis. In order to achieve scalable parallel performance, we exploit typical network characteristics of small-world networks, such as the low graph diameter, sparse connectivity, and skewed degree distribution. We conduct an extensive experimental study on real-world graph instances and demonstrate that our parallel schemes, coupled with aggressive algorithm engineering for small-world networks, give significant running time improvements over existing modularity-based clustering heuristics, with little or no loss in clustering quality. For instance, our divisive clustering approach based on approximate edge betweenness centrality is more than two orders of magnitude faster than a competing greedy approach, for a variety of large graph instances on the Sun Fire T2000 multicore system. SNAP also contains parallel implementations of fundamental graph-theoretic kernels and topological analysis metrics (e.g., breadth-first search, connected components, vertex and edge centrality) that are optimized for small-world networks. The SNAP framework is extensible; the graph kernels are modular, portable across shared-memory multicore and symmetric multiprocessor systems, and simplify the design of high-level domain-specific applications.
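Modularity, the objective these heuristics optimize, is cheap to evaluate for a given partition: Q = Σ_c (e_c/m - (d_c/2m)²), where e_c counts intra-community edges, d_c is the total degree in community c, and m is the edge count. A minimal sketch for an undirected simple graph (not SNAP's parallel code):

```python
# Modularity Q of a partition of an undirected simple graph.
# community maps each vertex to its community label.

def modularity(edges, community):
    m = len(edges)
    intra = {}   # e_c: edges with both endpoints in community c
    degree = {}  # d_c: total degree of vertices in community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)
```

For two triangles joined by a single edge, splitting at that bridge gives Q = 5/14 ≈ 0.357, a typical "good clustering" value.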
Fast and Scalable List Ranking on the GPU
Abstract

Cited by 11 (3 self)
General purpose programming on graphics processing units (GPGPU) has received a lot of attention in the parallel computing community, as it promises to offer the highest performance per dollar. GPUs have been used extensively on regular problems that can be easily parallelized. In this paper, we describe two implementations of list ranking, a traditional irregular algorithm that is difficult to parallelize on such massively multithreaded hardware. We first present an implementation of Wyllie’s algorithm based on pointer jumping. This technique does not scale well to large lists due to the suboptimal work done. We then present a GPU-optimized, Recursive Helman-JáJá (RHJ) algorithm. Our RHJ implementation can rank a random list of 32 million elements in about a second and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss the practical issues relating to the implementation of irregular algorithms on massively multithreaded architectures like that of the GPU. Regular or coalesced memory access patterns and balanced load are critical to achieving good performance on the GPU.
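Wyllie's algorithm is short enough to sketch: in each synchronous round every element adds its successor's rank to its own and jumps its pointer two hops ahead, so all ranks are complete after O(log n) rounds. The loop below simulates the rounds sequentially (an illustration, not the GPU code):

```python
# Wyllie's list ranking by pointer jumping. rank[i] ends up as the
# distance from element i to the tail of the linked list. The O(n log n)
# total work (vs. O(n) sequential) is the suboptimality the abstract
# mentions.

def wyllie_rank(succ):
    """succ[i]: successor of element i, or -1 at the list tail."""
    n = len(succ)
    rank = [0 if s == -1 else 1 for s in succ]
    nxt = succ[:]
    changed = True
    while changed:
        changed = False
        new_rank, new_nxt = rank[:], nxt[:]
        for i in range(n):           # one synchronous parallel round
            if nxt[i] != -1:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]   # jump two hops ahead
                changed = True
        rank, nxt = new_rank, new_nxt
    return rank
```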
Computational grand challenges in assembling the tree of life: Problems & solutions
 The IEEE and ACM Supercomputing Conference 2005 (SC2005) Tutorial
, 2005
Abstract

Cited by 10 (0 self)
The computation of ever larger as well as more accurate phylogenetic (evolutionary) trees, with the ultimate goal of computing the tree of life, represents one of the grand challenges in High Performance Computing (HPC) Bioinformatics. Unfortunately, the size of trees which can be computed in reasonable time based on elaborate evolutionary models is limited by the severe computational cost inherent to these methods. There exist two orthogonal research directions to overcome this challenging computational burden: first, the development of novel, faster, and more accurate heuristic algorithms; and second, the application of high performance computing techniques. The goal of this chapter is to provide a comprehensive introduction to the field of computational evolutionary biology to an audience with a computing background, interested in participating in research and/or commercial applications of this field. Moreover, we will cover leading-edge technical and algorithmic developments in the field and discuss open problems and potential solutions.
The Filter-Kruskal minimum spanning tree algorithm
, 2009
Abstract

Cited by 10 (1 self)
We present Filter-Kruskal – a simple modification of Kruskal’s algorithm that avoids sorting edges that are “obviously” not in the MST. For arbitrary graphs with random edge weights, Filter-Kruskal runs in time O(m + n log n log(m/n)), i.e., in linear time for not-too-sparse graphs. Experiments indicate that the algorithm has very good practical performance over the entire range of edge densities. An equally simple parallelization seems to be the currently best practical algorithm on multicore machines.
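The filtering idea can be sketched directly from the abstract: partition the edges around a pivot weight, build the forest from the light half first, then discard heavy edges whose endpoints are already connected before recursing. The threshold and pivot choice below are simplified assumptions, not the paper's tuned values:

```python
import random

# Sketch of Filter-Kruskal. Light edges are processed before heavy ones,
# so the overall processing order is still Kruskal's sorted order, and
# filtered heavy edges are provably not in the MST.

def filter_kruskal(num_vertices, edges, threshold=16):
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []

    def kruskal(es):
        # Classic Kruskal on a small edge subset: sort, greedily union.
        for w, u, v in sorted(es):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append((w, u, v))

    def rec(es):
        if len(es) <= threshold:
            kruskal(es)
            return
        pivot = random.choice(es)[0]
        light = [e for e in es if e[0] <= pivot]
        heavy = [e for e in es if e[0] > pivot]
        if not heavy:            # degenerate pivot: fall back to sorting
            kruskal(es)
            return
        rec(light)
        # Filter step: drop heavy edges already inside one component.
        rec([e for e in heavy if find(e[1]) != find(e[2])])

    rec(edges)
    return mst
```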
SWARM: A Parallel Programming Framework for Multicore Processors
, 2007
Abstract

Cited by 8 (1 self)
Due to fundamental physical limitations and power constraints, we are witnessing a radical change in commodity microprocessor architectures toward multicore designs. Continued performance growth on multicore processors now requires the exploitation of concurrency at the algorithmic level. In this paper, we identify key issues in algorithm design for multicore processors and propose a computational model for these systems. We introduce SWARM (SoftWare and Algorithms for Running on Multicore), a portable open-source parallel library of basic primitives that fully exploit multicore processors. Using this framework, we have implemented efficient parallel algorithms for important primitive operations such as prefix sums, pointer jumping, symmetry breaking, and list ranking; for combinatorial problems such as sorting and selection; for parallel graph-theoretic algorithms such as spanning tree, minimum spanning tree, graph decomposition, and tree contraction; and for computational genomics applications such as maximum parsimony. The main contributions of this paper are the design of the SWARM multicore framework, the presentation of a multicore algorithmic model, and validation results for this model. SWARM is freely available as open-source from
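The prefix-sums primitive illustrates the typical multicore decomposition: each core scans its own block, the per-block totals are scanned, and each block is then shifted by its offset. The sketch below runs the three phases sequentially, with loop iterations standing in for cores (an illustration, not SWARM's implementation; it assumes at least one element per block):

```python
# Block-decomposed inclusive prefix sum, the classic multicore pattern.
# Phases 1 and 3 are embarrassingly parallel across blocks; phase 2 is a
# tiny scan of num_blocks values.

def blocked_prefix_sum(data, num_blocks):
    n = len(data)
    bounds = [n * b // num_blocks for b in range(num_blocks + 1)]
    # Phase 1: local inclusive scan within each block.
    out = data[:]
    for b in range(num_blocks):
        for i in range(bounds[b] + 1, bounds[b + 1]):
            out[i] += out[i - 1]
    # Phase 2: exclusive scan of the per-block totals.
    offsets = [0] * num_blocks
    for b in range(1, num_blocks):
        offsets[b] = offsets[b - 1] + out[bounds[b] - 1]
    # Phase 3: add each block's offset to its elements.
    for b in range(1, num_blocks):
        for i in range(bounds[b], bounds[b + 1]):
            out[i] += offsets[b]
    return out
```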