Data Processing on FPGAs
Cited by 9 (3 self)
Computer architectures are quickly changing toward heterogeneous many-core systems. Such a trend opens up interesting opportunities but also raises immense challenges since the efficient use of heterogeneous many-core systems is not a trivial problem. In this paper, we explore how to program data processing operators on top of field-programmable gate arrays (FPGAs). FPGAs are very versatile in terms of how they can be used and can also be added as additional processing units in standard CPU sockets. In the paper, we study how data processing can be accelerated using an FPGA. Our results indicate that efficient usage of FPGAs involves non-trivial aspects such as having the right computation model (an asynchronous sorting network in this case); a careful implementation that balances all the design constraints in an FPGA; and the proper integration strategy to link the FPGA to the rest of the system. Once these issues are properly addressed, our experiments show that FPGAs exhibit performance figures competitive with those of modern general-purpose CPUs while offering significant advantages in terms of power consumption and parallel stream evaluation.
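To make the "sorting network" computation model concrete, here is a minimal software sketch of a fixed 4-input network. It only illustrates the idea of a data-independent sequence of compare-exchange elements; it is not the paper's FPGA design, and the names `NETWORK_4` and `sort_with_network` are hypothetical.

```python
# A 4-input sorting network written as a fixed list of comparators.
# Each pair (i, j) with i < j is one compare-exchange element: after it
# fires, position i holds the smaller value and position j the larger.
# The wiring never depends on the data, which is what makes such networks
# attractive for hardware. (Illustrative names, not taken from the paper.)
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort_with_network(values, network=NETWORK_4):
    """Return a sorted copy of `values` by applying each comparator in order."""
    out = list(values)
    for i, j in network:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out
```

In hardware, comparators that touch disjoint wires, such as (0, 1) and (2, 3), could operate in parallel; the sequential loop above only models the data flow.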
Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture
Cited by 8 (2 self)
Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times. This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than 0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results. Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel’s upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.
Sorting Large Records On A Cell Broadband Engine
Cited by 2 (2 self)
We consider the sorting of a large number of multifield records on the Cell Broadband Engine. We show that our method, which generates runs using a 2-way merge and then merges these runs using a 4-way merge, outperforms previously proposed sort methods that use either comb sort or bitonic sort for run generation followed by a 2-way odd-even merging of runs. Interestingly, best performance is achieved by using scalar memory copy instructions rather than vector instructions.
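The two-phase structure described above (run generation by 2-way merging, followed by 4-way merging of runs) can be sketched in plain Python. This is a sequential illustration only, not the authors' Cell implementation; the function names, the `run_len` parameter, and the choice of the first tuple element as the sort key are all assumptions.

```python
import heapq
from operator import itemgetter


def two_way_merge(a, b, key):
    """Merge two sorted lists of records into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if key(b[j]) < key(a[i]):
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def generate_runs(records, run_len, key):
    """Phase 1: build sorted runs of about `run_len` records by 2-way merging."""
    runs = [[r] for r in records]
    width = 1
    while width < run_len and len(runs) > 1:
        merged = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(two_way_merge(runs[k], runs[k + 1], key))
            else:
                merged.append(runs[k])
        runs = merged
        width *= 2
    return runs


def four_way_merge_sort(records, run_len=4, key=itemgetter(0)):
    """Phase 2: merge the runs four at a time until a single run remains."""
    runs = generate_runs(list(records), run_len, key)
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[k:k + 4], key=key))
                for k in range(0, len(runs), 4)]
    return runs[0] if runs else []
```

A 4-way merge halves the number of passes over the data compared with a 2-way merge, which is one plausible reason it helps on a bandwidth-limited architecture; the abstract itself does not state the reason.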
Optimized Mapping of Pipelined Task Graphs on the Cell BE
Cited by 1 (1 self)
Limited bandwidth to off-chip main memory poses a problem in chip multiprocessors for streaming applications, such as Cell BE, and will become more severe with the expected increase in the number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, the generation of memory-efficient code is thus an important compiler optimization. We suggest to use pipelining between the SPEs over the high-bandwidth internal bus of Cell BE to reduce the required main memory bandwidth, and thereby improve the computation throughput for memory-intensive computations. At the same time, we are constrained by the limited size of SPE on-chip memory available for additional buffers that are necessary for the pipelining between SPEs. We investigate mappings of the nodes of a pipelined parallel task graph to the SPEs that are optimal trade-offs between load balancing, buffer memory consumption, and communication load on the on-chip bus. We solve this multiobjective optimization problem by deriving an integer linear programming (ILP) formulation and compute Pareto-optimal solutions for the mapping with a state-of-the-art ILP solver. For larger problem instances, we sketch a two-step approach to reduce problem size. We exemplify our mapping technique with several memory-intensive example problems: with acyclic pipelined task graphs derived from data parallel code, with complete d-ary tree pipelines for parallel mergesort on Cell BE, and with butterfly pipelines for parallel FFT on Cell BE. We validate the mappings with discrete event simulations.
State-of-the-art in heterogeneous computing
, 2010
Node level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy and/or cost efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state-of-the-art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.
Limited Distribution Notice
, 2011
This report has been submitted for publication outside of IBM and will be probably copyrighted if accepted. It has been issued as a Research Report for early dissemination of its contents. In view of the expected transfer of copyright to an outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or copies of the article legally obtained (for example, by payment of royalties).
Faster Funny Matrix Multiplication for the All-Pairs Shortest Paths Problem
Funny Matrix Multiplication (FMM) is a matrix multiplication operation in which the scalar addition and multiplication operations are replaced by the scalar minimization and addition operations, respectively. It is a fundamental computational task for matrices and its applications include the all-pairs shortest paths problem. Recently McAuley and Caetano have proposed a new algorithm whose expected computation time is significantly shorter than that of the straightforward FMM computation, while the worst-case time complexity remains unchanged. This paper gives an improved faster FMM algorithm that exploits instruction-level parallelism. By using this new algorithm, the all-pairs shortest paths problem can be solved much more quickly than with the Floyd-Warshall algorithm.
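As a reference point for the definition of FMM given above, the following sketch shows the straightforward (min, +) product and how repeated squaring of a weight matrix yields all-pairs shortest paths. It is the baseline computation only, not the McAuley and Caetano algorithm or the improved algorithm of this report. The function names are hypothetical, and the weight matrix is assumed to have zeros on the diagonal and `math.inf` for missing edges.

```python
import math

INF = math.inf


def funny_multiply(a, b):
    """Straightforward (min, +) product: c[i][j] = min over k of a[i][k] + b[k][j]."""
    inner = len(b)
    cols = len(b[0])
    return [[min(row[k] + b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def all_pairs_shortest_paths(weights):
    """Square the weight matrix under (min, +) until paths of n - 1 edges are covered.

    Assumes weights[i][i] == 0 and weights[i][j] == INF when there is no edge.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    edges_covered = 1
    while edges_covered < n - 1:
        dist = funny_multiply(dist, dist)
        edges_covered *= 2
    return dist
```

Each product costs O(n^3) scalar operations and about log2(n) products are needed, so this baseline is asymptotically no better than Floyd-Warshall; any speedup has to come from a faster product.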
Sorting On A Cell Broadband Engine SPU
, 2009
We adapt merge sort for a single SPU of the Cell Broadband Engine. This adaptation takes advantage of the vector instructions supported by the SPU. Experimental results indicate that our merge sort adaptation is faster than other sort algorithms (e.g., AA sort, Cell sort, quick sort) proposed for the SPU as well as faster than our SPU adaptations of shaker sort and brick sort. An added advantage is that our merge sort adaptation is a stable sort whereas none of the other sort adaptations is stable.
GRS -- GPU Radix Sort For Multifield Records
We develop a radix sort algorithm suitable to sort multifield records on a graphics processing unit (GPU). We assume the ByField layout for records to be sorted. Our radix sort algorithm, GRS, is benchmarked against the radix sort algorithm in NVIDIA’s CUDA SDK 3.0, which is the fastest known GPU sorting algorithm for 32-bit integers. Our experiments show that GRS is 21% faster than SDK sort while sorting 100M numbers and is faster by between 34% and 55% when sorting 40M records with 1 to 9 32-bit fields. This makes GRS the fastest sort algorithm for GPUs.
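For readers unfamiliar with radix sorting of multifield records, here is a sequential sketch of a least-significant-digit radix sort. It is not GRS and contains no GPU code. It assumes that a field-by-field layout means each field is stored in its own array (the abstract does not define ByField), and the function names and the 8-bit digit size are illustrative choices.

```python
def radix_sort_permutation(keys, bits_per_pass=8, key_bits=32):
    """Return the index order that sorts non-negative integer `keys` (LSD radix sort)."""
    order = list(range(len(keys)))
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Histogram of the current digit.
        counts = [0] * (mask + 1)
        for idx in order:
            counts[(keys[idx] >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into starting offsets.
        total = 0
        for d in range(len(counts)):
            counts[d], total = total, total + counts[d]
        # Stable scatter into the output order.
        out = [0] * len(order)
        for idx in order:
            d = (keys[idx] >> shift) & mask
            out[counts[d]] = idx
            counts[d] += 1
        order = out
    return order


def sort_by_field(fields, key_field=0):
    """Sort records stored as one array per field, using `key_field` as the key."""
    order = radix_sort_permutation(fields[key_field])
    return [[column[i] for i in order] for column in fields]
```

Sorting a permutation first and moving the remaining fields once at the end avoids copying every field on every pass; whether GRS makes the same choice is not stated in the abstract.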
An Efficient Parallel Sorting Algorithm for Multicore Machines
Sorting an array of integers is one of the most basic problems in computer science, and it also matters in high-performance database applications. Although the literature contains a variety of sorting algorithms, different architectures need different optimizations to reduce sorting time. This paper presents a multicore-ready parallel sorting algorithm designed with multicore and manycore architectures in mind. Our study shows that the proposed algorithm is well suited to large inputs on machines with multiple free cores. The paper also accounts for the overhead of parallel programming and suggests two system calls: one to check the availability of free cores and one to reserve a core for a fixed time quantum. Keywords: Multicore.

