Results 1–10 of 398
A survey of general-purpose computation on graphics hardware
, 2007
Abstract

Cited by 319 (16 self)
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware.
Performance Analysis of k-ary n-cube Interconnection Networks
 IEEE Transactions on Computers
, 1988
Abstract

Cited by 296 (16 self)
VLSI communication networks are wire-limited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. This paper analyzes communication networks of varying dimension under the assumption of constant wire bisection. Expressions for the latency, average-case throughput, and hot-spot throughput of k-ary n-cube networks with constant bisection are derived that agree closely with experimental measurements. It is shown that low-dimensional networks (e.g., tori) have lower latency and higher hot-spot throughput than high-dimensional networks (e.g., binary n-cubes) with the same bisection width. Keywords: communication networks, interconnection networks, concurrent computing, message-passing multiprocessors, parallel processing, VLSI. 1 Introduction: The critical component of a concurrent computer is its communication network. Many algorithms are communication- rather than processing-limited. Fi...
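The trade-off the abstract describes can be sketched numerically. The model below is a back-of-envelope illustration, not the paper's derivation: the constants, the bisection formula, and the average-distance estimate (k/4 hops per dimension on a torus) are all illustrative assumptions.

```python
# Hedged toy model: zero-load latency of a k-ary n-cube torus when the
# total wire budget across the bisection is held constant, so that
# higher-dimensional networks get narrower channels.

def latency(k, n, msg_bits=150, bisection_wires=1024):
    """Estimated average zero-load latency (cycles) of an N = k**n torus."""
    # A torus bisection cuts roughly 2 * k**(n-1) channels (assumption);
    # a fixed wire budget then fixes the per-channel width.
    channels_cut = 2 * k ** (n - 1)
    width = max(1, bisection_wires // channels_cut)
    hops = n * k / 4          # average distance: about k/4 per dimension
    return hops + msg_bits / width

# Networks with the same node count (N = 256) and the same wire budget:
for k, n in [(256, 1), (16, 2), (4, 4), (2, 8)]:
    print(f"{k}-ary {n}-cube: {latency(k, n):.1f} cycles")
```

Under these assumptions the 2-D torus (16-ary 2-cube) comes out ahead of the binary 8-cube, echoing the paper's conclusion that low-dimensional networks win once wiring density is the constraint.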
High-Speed Switch Scheduling for Local Area Networks
 ACM Transactions on Computer Systems
, 1993
Abstract

Cited by 201 (3 self)
Current technology trends make it possible to build communication networks that can support high-performance distributed computing. This paper describes issues in the design of a prototype switch for an arbitrary-topology point-to-point network with link speeds of up to one gigabit per second. The switch deals in fixed-length ATM-style cells, which it can process at a rate of 37 million cells per second. It provides high bandwidth and low latency for datagram traffic. In addition, it supports real-time traffic by providing bandwidth reservations with guaranteed latency bounds. The key to the switch's operation is a technique called parallel iterative matching, which can quickly identify a set of conflict-free cells for transmission in a time slot. Bandwidth reservations are accommodated in the switch by building a fixed schedule for transporting cells from reserved flows across the switch; parallel iterative matching can fill unused slots with datagram traffic. Finally, we note that pa...
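The request/grant/accept structure of parallel iterative matching can be sketched in a few lines. This is a minimal sequential model of the scheme named in the abstract, not the switch's hardware implementation; the data representation (a dict mapping inputs to the sets of outputs they have cells for) is an illustrative assumption.

```python
import random

def pim(requests, iterations=4, seed=0):
    """One scheduling decision by parallel iterative matching.

    requests[i] = set of outputs input port i has queued cells for.
    Returns a conflict-free matching {input: output}.
    """
    rng = random.Random(seed)
    match, taken = {}, set()              # matched inputs / outputs
    for _ in range(iterations):
        grants = {}                       # input -> outputs granting it
        # Request + grant: each still-free output picks, at random, one of
        # the still-free inputs that requested it.
        for out in {o for r in requests.values() for o in r} - taken:
            askers = [i for i, outs in requests.items()
                      if i not in match and out in outs]
            if askers:
                grants.setdefault(rng.choice(askers), []).append(out)
        # Accept: each input accepts one of its grants at random.
        for i, outs in grants.items():
            out = rng.choice(outs)
            match[i] = out
            taken.add(out)
    return match

print(pim({0: {0, 1}, 1: {0}, 2: {1, 2}}))
```

Each iteration can only add edges between still-unmatched ports, so the result is conflict-free by construction; the paper's point is that a few such iterations converge to a good matching quickly.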
Software Protection and Simulation on Oblivious RAMs
, 1993
Abstract

Cited by 163 (13 self)
Software protection is one of the most important issues concerning computer practice. There exist many heuristics and ad-hoc methods for protection, but the problem as a whole has not received the theoretical treatment it deserves. In this paper we provide a theoretical treatment of software protection. We reduce the problem of software protection to the problem of efficient simulation on an oblivious RAM. A machine is oblivious if the sequence in which it accesses memory locations is equivalent for any two inputs with the same running time. For example, an oblivious Turing Machine is one for which the movement of the heads on the tapes is identical for each computation. (Thus, it is independent of the actual input.) What is the slowdown in the running time of any machine if it is required to be oblivious? In 1979 Pippenger and Fischer showed how a two-tape oblivious Turing Machine can simulate, on-line, a one-tape Turing Machine, with a logarithmic slowdown in the running time. We s...
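The definition of obliviousness can be made concrete with the trivial (and expensive) solution: read a logical address by scanning every physical cell, so the observable access sequence never depends on the input. This toy is not the paper's construction, which achieves polylogarithmic rather than linear overhead; it only illustrates the property being demanded.

```python
# Toy illustration of an oblivious memory access: the physical trace is
# identical for any two reads, so a memory-bus adversary learns nothing
# about which address the program actually wanted.

trace = []

def physical_read(mem, addr):
    trace.append(addr)            # record the observable access pattern
    return mem[addr]

def oblivious_read(mem, target):
    """O(n) oblivious read: touch every cell, keep only the target."""
    value = None
    for addr in range(len(mem)):
        v = physical_read(mem, addr)
        if addr == target:        # the selection happens inside the CPU,
            value = v             # invisible to a memory adversary
    return value

mem = [10, 20, 30, 40]
trace.clear(); oblivious_read(mem, 1); t1 = list(trace)
trace.clear(); oblivious_read(mem, 3); t2 = list(trace)
assert t1 == t2 == [0, 1, 2, 3]   # identical traces for different targets
```

The question the abstract poses is exactly how much of this linear-scan overhead is avoidable while keeping the traces indistinguishable.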
Scans as Primitive Parallel Operations
 IEEE Transactions on Computers
, 1987
Abstract

Cited by 157 (12 self)
In most parallel random-access machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can be executed in no more time than these parallel memory references. This paper outlines an extensive study of the effect of including such scan operations as unit-time primitives in the PRAM models. The study concludes that the primitives improve the asymptotic running time of many algorithms by an O(lg n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. We therefore argue that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. This paper describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a mergi...
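The scan primitive and one of its classic uses, the "split" pass inside a radix sort, can be sketched as follows. The code is written sequentially for clarity (the point of the paper is that a PRAM may treat scan as a unit-time parallel primitive); the function names are illustrative.

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[i] = xs[0] + ... + xs[i-1]."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out

def split(keys, bit):
    """Stable partition by one bit using two scans (one radix-sort pass)."""
    flags0 = [1 - ((k >> bit) & 1) for k in keys]   # 1 where bit == 0
    idx0 = exclusive_scan(flags0)                   # slots for the 0-keys
    n0 = sum(flags0)                                # 1-keys start here
    idx1 = exclusive_scan([1 - f for f in flags0])  # slots for the 1-keys
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        out[idx0[i] if flags0[i] else n0 + idx1[i]] = k
    return out

print(split([5, 2, 7, 4, 1], 0))   # → [2, 4, 5, 7, 1]
```

Repeating `split` from the least to the most significant bit yields the full radix sort; because each pass is just scans plus one permutation, a machine with unit-time scans sorts b-bit keys in O(b) steps.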
Photon Mapping on Programmable Graphics Hardware
 GRAPHICS HARDWARE
, 2003
Abstract

Cited by 125 (4 self)
We present a modified photon mapping algorithm capable of running entirely on GPUs. Our implementation uses breadth-first photon tracing to distribute photons using the GPU. The photons are stored in a grid-based photon map that is constructed directly on the graphics hardware using one of two methods: the first method is a multipass technique that uses fragment programs to directly sort the photons into a compact grid. The second method uses a single rendering pass combining a vertex program and the stencil buffer to route photons to their respective grid cells, producing an approximate photon map. We also present an efficient method for locating the nearest photons in the grid, which makes it possible to compute an estimate of the radiance at any surface location in the scene. Finally, we describe a breadth-first stochastic ray tracer that uses the photon map to simulate full global illumination directly on the graphics hardware. Our implementation demonstrates that current graphics hardware is capable of fully simulating global illumination with progressive, interactive feedback to the user.
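The grid-based photon map idea can be sketched on the CPU: bin photons into a uniform grid, then gather the photons near a shading point by searching only the 27 surrounding cells. The cell size, the dict-of-lists storage, and the function names are illustrative assumptions; the paper builds the grid with fragment or vertex programs on the GPU.

```python
from collections import defaultdict

CELL = 1.0   # grid cell edge length (assumption); keep search radius <= CELL

def cell_of(p):
    return tuple(int(c // CELL) for c in p)

def build_photon_map(photons):
    """photons: list of ((x, y, z), power) pairs -> cell -> photon list."""
    grid = defaultdict(list)
    for pos, power in photons:
        grid[cell_of(pos)].append((pos, power))
    return grid

def gather(grid, point, radius):
    """All photons within `radius` of `point` (radius <= CELL assumed,
    so the 3x3x3 cell neighborhood is guaranteed to contain them)."""
    cx, cy, cz = cell_of(point)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for pos, power in grid.get((cx + dx, cy + dy, cz + dz), []):
                    if sum((a - b) ** 2 for a, b in zip(pos, point)) <= radius ** 2:
                        found.append((pos, power))
    return found
```

A radiance estimate at a surface point then sums the gathered photon powers over the disc of radius `radius`, which is why a fast, fixed-footprint neighbor query like this maps well to graphics hardware.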
Translating pseudo-boolean constraints into SAT
 Journal on Satisfiability, Boolean Modeling and Computation
, 2006
Abstract

Cited by 121 (2 self)
In this paper, we describe and evaluate three different techniques for translating pseudo-boolean constraints (linear constraints over boolean variables) into clauses that can be handled by a standard SAT-solver. We show that by applying a proper mix of translation techniques, a SAT-solver can perform on a par with the best existing native pseudo-boolean solvers. This is particularly valuable in those cases where the constraint problem of interest is naturally expressed as a SAT problem, except for a handful of constraints. Translating those constraints to get a pure clausal problem will take full advantage of the latest improvements in SAT research. A particularly interesting result of this work is the efficiency of sorting networks for expressing pseudo-boolean constraints. Although tangential to this presentation, the result suggests how synthesis tools may be modified to produce arithmetic circuits more suitable for SAT-based reasoning. Keywords: pseudo-Boolean, SAT-solver, SAT translation, integer linear programming
Adding Networks
, 2001
Abstract

Cited by 106 (31 self)
An adding network is a distributed data structure that supports a concurrent, lock-free, low-contention implementation of a fetch&add counter; a counting network is an instance of an adding network that supports only fetch&increment. We present a lower bound showing that adding networks have inherently high latency. Any adding network powerful enough to support addition by at least two values a and b, where a > b > 0, has sequential executions in which each token traverses Ω(n/c) switching elements, where n is the number of concurrent processes and c is a quantity we call one-shot contention; for a large class of switching networks and for conventional counting networks the one-shot contention is constant. In contrast, counting networks have O(log n) latency [4,7]. This bound is tight. We present the first concurrent, lock-free, low-contention networked data structure that supports arbitrary fetch&add operations.
GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management
, 2006
Abstract

Cited by 102 (10 self)
We present a new algorithm, GPUTeraSort, to sort billion-record wide-key databases using a graphics processing unit (GPU). Our algorithm uses the data and task parallelism of the GPU to perform memory-intensive and compute-intensive tasks while the CPU is used to perform I/O and resource management. We thereby exploit both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface, and achieve higher memory bandwidth than purely CPU-based algorithms. GPUTeraSort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk; and (2) read, merge, write. It also pipelines disk transfers and achieves near-peak I/O performance. We have tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with a $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60 GB for a penny, the best reported PennySort price-performance. These results suggest that a GPU coprocessor can significantly improve performance on large data processing tasks.
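The two-phase structure described above is that of a classic external sort, which can be sketched on the CPU alone. In the sketch below the in-memory `sorted()` call stands in for the GPU sort of phase 1; the file layout, chunk size, and record format are illustrative assumptions.

```python
import heapq
import os
import tempfile

def phase1_make_runs(records, chunk_size, tmpdir):
    """Phase 1: sort fixed-size chunks (the GPU's job in the paper) and
    write each as a sorted run on disk."""
    runs = []
    for i in range(0, len(records), chunk_size):
        chunk = sorted(records[i:i + chunk_size])
        path = os.path.join(tmpdir, f"run{len(runs)}.txt")
        with open(path, "w") as f:
            f.writelines(line + "\n" for line in chunk)
        runs.append(path)
    return runs

def phase2_merge(runs, out_path):
    """Phase 2: stream all runs through a k-way merge back to disk."""
    files = [open(p) for p in runs]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()

with tempfile.TemporaryDirectory() as d:
    data = ["pear", "apple", "fig", "cherry", "date", "banana"]
    runs = phase1_make_runs(data, 2, d)
    phase2_merge(runs, os.path.join(d, "sorted.txt"))
    print(open(os.path.join(d, "sorted.txt")).read().split())
    # → ['apple', 'banana', 'cherry', 'date', 'fig', 'pear']
```

GPUTeraSort's contribution is to overlap the stages of this pipeline, using the GPU for the sort while the CPU keeps the disks busy, which the sequential sketch deliberately omits.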