Results 1–10 of 22
New acyclic and star coloring algorithms with application to computing Hessians
SIAM Journal on Scientific Computing, 2007
Cited by 20 (14 self)
Acyclic and star coloring problems are specialized vertex coloring problems that arise in the efficient computation of Hessians using automatic differentiation or finite differencing, when both sparsity and symmetry are exploited. We present an algorithmic paradigm for finding heuristic solutions for these two NP-hard problems. The underlying common technique is the exploitation of the structure of two-colored induced subgraphs. For a graph G on n vertices and m edges, the time complexity of our star coloring algorithm is O(n d_2), where d_k, a generalization of vertex degree, denotes the average number of distinct paths of length at most k edges starting at a vertex in G. The time complexity of our acyclic coloring algorithm is larger by a multiplicative factor involving the inverse of Ackermann's function. The space complexity of both algorithms is O(m). To the best of our knowledge, ours is the first practical algorithm for the acyclic coloring problem. For the star coloring problem, our algorithm uses fewer colors and is considerably faster than a previously known O(n d_3)-time algorithm. Computational results from experiments on various large-size test graphs demonstrate that the algorithms are fast and produce highly effective solutions. The use of these algorithms in Hessian computation is expected to reduce overall runtime drastically.
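Both problems have a simple (if color-hungry) baseline that is easy to sketch: any distance-2 coloring is also a star coloring, because in a distance-2 coloring every path on four vertices already uses at least three colors. The sketch below uses that shortcut plus a brute-force verifier; it is illustrative only, not the paper's O(n d_2) algorithm, and all names are invented here.

```python
from itertools import count

def distance2_greedy_coloring(adj):
    """First-fit coloring where each vertex's color differs from every
    colored vertex within distance two. This always yields a valid star
    coloring, though it may use more colors than specialized algorithms."""
    color = {}
    for v in adj:
        forbidden = set()
        for u in adj[v]:
            if u in color:
                forbidden.add(color[u])
            for w in adj[u]:
                if w != v and w in color:
                    forbidden.add(color[w])
        color[v] = next(c for c in count() if c not in forbidden)
    return color

def is_star_coloring(adj, color):
    """Check the definition directly: a proper coloring in which no path
    on four vertices uses only two colors."""
    for v in adj:
        if any(color[u] == color[v] for u in adj[v]):
            return False
    # brute-force scan over all paths a-b-c-d
    for b in adj:
        for a in adj[b]:
            for c in adj[b]:
                if c == a:
                    continue
                for d in adj[c]:
                    if d in (a, b):
                        continue
                    if len({color[a], color[b], color[c], color[d]}) == 2:
                        return False
    return True
```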
Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning methods.
SIAM Journal on Scientific Computing, 2009
Cited by 18 (4 self)
The sparse matrix–vector (SpMV) multiplication is an important kernel in many applications. When the sparse matrix used is unstructured, however, standard SpMV multiplication implementations typically are inefficient in terms of cache usage, sometimes working at only a fraction of peak performance. Cache-aware algorithms take information on specifics of the cache architecture as a parameter to derive an efficient SpMV multiply. In contrast, cache-oblivious algorithms strive to obtain efficient algorithms regardless of cache specifics. In this area, earlier research by …
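For reference, the kernel itself is tiny. A minimal CSR (compressed sparse row) SpMV in Python, with illustrative names: the partitioning methods the paper studies reorder the matrix around a kernel like this rather than changing it, since cache behavior is driven by the access pattern on the input vector.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A.x with A stored in Compressed Sparse Row form:
    row i's nonzeros occupy vals[row_ptr[i]:row_ptr[i+1]],
    with column indices in col_idx over the same range."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]  # irregular reads of x
        y[i] = s
    return y
```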
A scalable parallel graph coloring algorithm for distributed memory computers
In Euro-Par, 2005
Cited by 13 (4 self)
In large-scale parallel applications a graph coloring is often carried out to schedule computational tasks. In this paper, we describe a new distributed-memory algorithm for doing the coloring itself in parallel. The algorithm operates in an iterative fashion; in each round vertices are speculatively colored based on limited information, and then a set of incorrectly colored vertices, to be recolored in the next round, is identified. Parallel speedup is achieved in part by reducing the frequency of communication among processors. Experimental results on a PC cluster using up to 16 processors show that the algorithm is scalable.
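The round structure described in the abstract can be simulated sequentially. In the hedged sketch below (illustrative names, no real message passing), each "processor" colors its block using a stale snapshot of other blocks' colors, then cross-block conflicts are detected and deferred to the next round.

```python
def speculative_coloring(adj, parts, max_rounds=100):
    """Simulate round-based speculative coloring. `parts` is a list of
    vertex lists, one per simulated processor. Within a part, colors are
    seen immediately; across parts, only last round's colors (the
    snapshot) are visible, which is what makes the coloring speculative."""
    color = {v: None for v in adj}
    owner = {v: i for i, part in enumerate(parts) for v in part}
    to_color = set(adj)
    for _ in range(max_rounds):
        if not to_color:
            break
        snapshot = dict(color)  # stale view of other parts' colors
        for i, part in enumerate(parts):
            for v in part:
                if v in to_color:
                    used = {(color if owner[u] == i else snapshot)[u]
                            for u in adj[v]}
                    used.discard(None)
                    color[v] = min(c for c in range(len(adj))
                                   if c not in used)
        # conflict detection: the lower-numbered endpoint keeps its color
        to_color = {v for v in to_color for u in adj[v]
                    if color[u] == color[v] and u < v}
    return color
```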
Predicting locality phases for dynamic memory optimization
J. Parallel Distrib. Comput. 67 (2007) 783–796
A Key-based Adaptive Transactional Memory Executor
, 2006
Cited by 11 (0 self)
Software transactional memory systems enable a programmer to easily write concurrent data structures such as lists, trees, hashtables, and graphs, where non-conflicting operations proceed in parallel. Many of these structures take the abstract form of a dictionary, in which each transaction is associated with a search key. By regrouping transactions based on their keys, one may improve locality and reduce conflicts among parallel transactions. In this paper, we present an executor that partitions transactions among available processors. Our key-based adaptive partitioning monitors incoming transactions, estimates the probability distribution of their keys, and adaptively determines the (usually non-uniform) partitions. By comparing the adaptive partitioning with uniform partitioning and round-robin keyless partitioning on a 16-processor Sun Fire 6800 machine, we demonstrate that key-based adaptive partitioning significantly improves the throughput of fine-grained parallel operations on concurrent data structures.
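The core idea, estimating the key distribution and cutting it into equal-mass ranges, can be sketched with sample quantiles. Names below are illustrative, not the paper's API.

```python
from bisect import bisect_right

def adaptive_boundaries(sample_keys, nparts):
    """Estimate (usually non-uniform) partition boundaries from a sample
    of observed transaction keys, so that each processor receives roughly
    an equal share of the empirical key distribution."""
    s = sorted(sample_keys)
    return [s[(i * len(s)) // nparts] for i in range(1, nparts)]

def assign(key, boundaries):
    """Map a transaction key to a processor index via binary search."""
    return bisect_right(boundaries, key)
```

Uniform partitioning would instead cut the key *range* into equal pieces; under a skewed key distribution that leaves some processors nearly idle, which is what the adaptive scheme avoids.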
Distributed-Memory Parallel Algorithms for Distance-2 Coloring and Their Application to Derivative Computation
, 2010
Cited by 8 (6 self)
The distance-2 graph coloring problem aims at partitioning the vertex set of a graph into the fewest sets consisting of vertices pairwise at distance greater than two from each other. Its applications include derivative computation in numerical optimization and channel assignment in radio networks. We present efficient, distributed-memory, parallel heuristic algorithms for this NP-hard problem as well as for two related problems used in the computation of Jacobians and Hessians. Parallel speedup is achieved through graph partitioning, speculative (iterative) coloring, and a BSP-like organization of parallel computation. Results from experiments conducted on a PC cluster employing up to 96 processors and using large-size real-world as well as synthetically generated test graphs show that the algorithms are scalable. In terms of quality of solution, the algorithms perform remarkably well: the number of colors used by the parallel algorithms was observed to be very close to the number used by the sequential counterparts, which in turn are quite often near optimal. Moreover, the experimental results show that the parallel distance-2 coloring algorithm compares favorably with the alternative approach of solving the distance-2 coloring problem on a graph G by first constructing the square graph G² and then applying a parallel distance-1 coloring algorithm on G². Implementations of the algorithms are made available via the Zoltan load-balancing library.
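The alternative approach mentioned in the abstract is easy to state in code: build the square graph G² and run an ordinary distance-1 greedy coloring on it. A sequential sketch with illustrative names (not the Zoltan implementation); its main drawback in practice is that G² can be much denser than G.

```python
def square_graph(adj):
    """G²: connect every pair of distinct vertices at distance one or
    two in G."""
    adj2 = {v: set(adj[v]) for v in adj}
    for v in adj:
        for u in adj[v]:
            adj2[v].update(w for w in adj[u] if w != v)
    return adj2

def greedy_d1_coloring(adj):
    """First-fit distance-1 greedy coloring."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        color[v] = min(c for c in range(len(adj) + 1) if c not in used)
    return color

# A distance-2 coloring of G is exactly a distance-1 coloring of G².
```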
A Parallel Distance-2 Graph Coloring Algorithm for Distributed Memory Computers
In Proceedings of HPCC-05, the 2005 International Conference on High Performance Computing and Communications, L. T. Yang et al., eds., Lecture Notes in Comput. Sci. 3726, 2005
Cited by 7 (2 self)
The distance-2 graph coloring problem aims at partitioning the vertex set of a graph into the fewest sets consisting of vertices pairwise at distance greater than two from each other. Application examples include numerical optimization and channel assignment. We present the first distributed-memory heuristic algorithm for this NP-hard problem. Parallel speedup is achieved through graph partitioning, speculative (iterative) coloring, and a BSP-like organization of computation. Experimental results show that the algorithm is scalable, and compares favorably with an alternative approach: solving the problem on a graph G by first constructing the square graph G² and then applying a parallel distance-1 coloring algorithm on G².
The Efficacy of Software Prefetching and Locality Optimizations on Future Memory Systems
, 2004
Cited by 6 (0 self)
Software prefetching and locality optimizations are techniques for overcoming the speed gap between processor and memory. In this paper, we provide a comprehensive summary of current software prefetching and locality optimization techniques, and evaluate the impact of memory trends on the effectiveness of these techniques for three types of applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. We find that for many applications, software prefetching outperforms locality optimizations when there is sufficient memory bandwidth, but locality optimizations outperform software prefetching under bandwidth-limited conditions. The break-even point (for 1 GHz processors) occurs at roughly 2.26 GBytes/sec on today's memory systems, and will increase on future memory systems. We also study the interactions between software prefetching and locality optimizations when applied in concert. Naively combining the techniques provides robustness to changes in memory bandwidth and latency, but does not yield additional performance gains. We propose and evaluate several algorithms to better integrate software prefetching and locality optimizations, including a modified tiling algorithm, padding for prefetching, and index prefetching. Finally, we investigate the interactions of stride-based hardware prefetching with our software techniques. We find that combining hardware and software prefetching yields similar performance to software prefetching alone, and that locality optimizations enable stride-based hardware prefetching for benchmarks that do not normally exhibit striding.
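As a concrete instance of the locality optimizations evaluated above, loop tiling restructures a traversal so it works on one cache-sized block at a time. A sketch for an in-place matrix transpose (illustrative only, not the paper's modified tiling algorithm); a plain row-by-row transpose would stride through whole rows and columns at once, while the blocked version keeps its working set to two tiles.

```python
def transpose_tiled(a, n, tile=32):
    """In-place transpose of an n-by-n matrix stored row-major in the
    flat list `a`, visiting tile-by-tile for cache locality. Each pair
    (i, j) with j > i is swapped exactly once."""
    for ii in range(0, n, tile):
        for jj in range(ii, n, tile):          # upper-triangular blocks only
            for i in range(ii, min(ii + tile, n)):
                # on the diagonal block, start past the diagonal itself
                for j in range(max(jj, i + 1), min(jj + tile, n)):
                    a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
```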
The Potential of Computation Regrouping for Improving Locality
, 2004
Cited by 5 (0 self)
Improving program locality has become increasingly important on modern computer systems. An effective strategy is to group computations on the same data so that, once the data are loaded into cache, the program performs all operations on them before the data are evicted. However, computation regrouping is difficult to automate for programs with complex data and control structures. This paper studies the potential of locality improvement through trace-driven computation regrouping. First, it shows that maximizing the locality is different from maximizing the parallelism or maximizing the cache utilization. The problem is NP-hard even without considering data dependences and cache organization. Then the paper describes a tool that performs constrained computation regrouping on program traces. The new tool is unique because it measures the exact control dependences and applies complete memory renaming and reallocation. Using the tool, the paper measures the potential locality improvement in a set of commonly used benchmark programs written in C.
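Ignoring dependences, the regrouping idea reduces to reordering a trace so that operations on the same datum become adjacent. A deliberately simplified sketch (the paper's tool additionally respects control and data dependences and performs memory renaming, none of which appears here):

```python
def regroup_by_data(trace):
    """Regroup a trace of (operation, data_item) pairs so that all
    operations on the same data item run consecutively, preserving the
    original per-item operation order. This models the unconstrained
    upper bound on regrouping, not a legal transformation in general."""
    groups = {}
    for op, item in trace:
        groups.setdefault(item, []).append((op, item))
    # concatenate groups in first-touch order (dicts preserve insertion)
    return [entry for item in groups for entry in groups[item]]
```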
Improving graph coloring on distributed-memory parallel computers
In 18th International Conference on High Performance Computing (HiPC), 2011
Cited by 3 (2 self)
Graph coloring is a combinatorial optimization problem that classically appears in distributed computing to identify the sets of tasks that can be safely performed in parallel. Although many efficient sequential algorithms are known for this NP-complete problem, distributed variants are challenging. Building on an existing distributed-memory graph coloring framework, we investigate two techniques in this paper. First, we investigate the application of two different vertex-visit orderings, namely Largest First and Smallest Last, in a distributed context and show that they can help to significantly decrease the number of colors on small- to medium-scale parallel architectures. Second, we investigate the use of a distributed post-processing operation, called recoloring, which further drastically improves the number of colors while not increasing the runtime more than twofold on large graphs. We also investigate the use of multicore architectures for distributed graph coloring algorithms.