Results 1 - 10 of 15
Simplified Parallel Domain Traversal
"... Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributedmemory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a f ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
(Show Context)
Many data-intensive scientific analysis techniques require global domain traversal, which over the years has been a bottleneck for efficient parallelization across distributed-memory architectures. Inspired by MapReduce and other simplified parallel programming approaches, we have designed DStep, a flexible system that greatly simplifies efficient parallelization of domain traversal techniques at scale. To deliver both simplicity for users and scalability on HPC platforms, we introduce a novel two-tiered communication architecture for managing and exploiting asynchronous communication loads. We also integrate our design with advanced parallel I/O techniques that operate directly on native simulation output. We demonstrate DStep by performing teleconnection analysis across ensemble runs of terascale atmospheric CO2 and climate data, and we show scalability results on up to 65,536 IBM BlueGene/P cores.
StreamMR: An Optimized MapReduce Framework for AMD GPUs
"... Abstract—MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on a graphics processing unit (GPU), but generally foc ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on graphics processing units (GPUs), but these have generally focused on NVIDIA GPUs. Our investigation reveals that the design and mapping of the MapReduce framework need to be revisited for AMD GPUs due to their notable architectural differences from NVIDIA GPUs. For instance, current state-of-the-art MapReduce implementations employ atomic operations to coordinate the execution of different threads. However, atomic operations can implicitly cause inefficient memory access and, in turn, severely impact performance. In this paper, we propose StreamMR, an OpenCL MapReduce framework optimized for AMD GPUs. With efficient atomic-free algorithms for output handling and intermediate result shuffling, StreamMR is superior to atomic-based MapReduce designs and can outperform existing atomic-free MapReduce implementations by nearly five-fold on an AMD Radeon HD 5870.
Index Terms—atomics, parallel computing, AMD GPU,
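
To make the criticism concrete: the atomic-based output handling that StreamMR avoids can be sketched as below. This is a hypothetical CUDA kernel of our own (StreamMR itself is OpenCL on AMD hardware), in which every producing thread reserves an output slot by atomically incrementing a global counter, serializing writers and scattering the resulting writes.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread that produces a record reserves a slot in the global
    // output buffer by atomically bumping a counter. The atomic is a
    // global rendezvous point, and the resulting writes are scattered
    // (uncoalesced) -- the pattern the paper identifies as costly.
    __global__ void emit_with_atomics(const int *in, int n,
                                      int *out, unsigned int *count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] % 2 == 0) {                 // keep even values only
            unsigned int slot = atomicAdd(count, 1u);  // serializing step
            out[slot] = in[i];
        }
    }

    int main() {
        const int n = 1 << 20;
        int *in, *out; unsigned int *count;
        cudaMallocManaged(&in, n * sizeof(int));
        cudaMallocManaged(&out, n * sizeof(int));
        cudaMallocManaged(&count, sizeof(unsigned int));
        for (int i = 0; i < n; ++i) in[i] = i;
        *count = 0;
        emit_with_atomics<<<(n + 255) / 256, 256>>>(in, n, out, count);
        cudaDeviceSynchronize();
        printf("emitted %u of %d records\n", *count, n);
        cudaFree(in); cudaFree(out); cudaFree(count);
        return 0;
    }

An atomic-free design instead derives each thread's output offset ahead of time (e.g., with a prefix sum over per-thread counts), so no global rendezvous is needed on the write path.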
Adapting Irregular Computations to Large CPU-GPU Clusters in the MADNESS Framework
"... Graphics Processing Units (GPUs) are becoming the workhorse of scalable computations. MADNESS is a scientific framework used especially for computational chemistry. Most MADNESS applications use operators that involve many small tensor computations, resulting in a less regular organization of compu ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Graphics Processing Units (GPUs) are becoming the workhorse of scalable computations. MADNESS is a scientific framework used especially for computational chemistry. Most MADNESS applications use operators that involve many small tensor computations, resulting in a less regular organization of computations on GPUs: a single GPU kernel may have to perform hundreds of multiplications by small square matrices (with fixed dimensions ranging from 10 to 28). We demonstrate a scalable CPU-GPU implementation of the MADNESS framework on a 500-node partition of the Titan supercomputer. For this hybrid CPU-GPU implementation, we observe up to a 2.3-times speedup compared to an equivalent CPU-only implementation with 16 cores per node. For smaller matrices, we demonstrate a speedup of 2.2-times by using a custom CUDA kernel rather than a cuBLAS-based kernel.
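
As a rough sketch of that workload shape (our own illustrative CUDA, not the MADNESS kernel; the fixed DIM and the naive inner loop are assumptions), a batched small-matrix kernel can assign one thread block per matrix pair and stage the operands in shared memory:

    #include <cuda_runtime.h>

    #define DIM 16  // the paper's matrices have fixed dimensions in the 10..28 range

    // One thread block computes one DIM x DIM product C[m] = A[m] * B[m].
    // Operands are staged in shared memory so the DIM multiply-adds per
    // thread hit fast on-chip storage instead of global memory.
    __global__ void batched_small_gemm(const float *A, const float *B,
                                       float *C, int batch) {
        int m = blockIdx.x;              // one matrix pair per block
        if (m >= batch) return;
        __shared__ float sA[DIM][DIM], sB[DIM][DIM];
        int r = threadIdx.y, c = threadIdx.x;
        sA[r][c] = A[m * DIM * DIM + r * DIM + c];
        sB[r][c] = B[m * DIM * DIM + r * DIM + c];
        __syncthreads();
        float acc = 0.0f;
        for (int k = 0; k < DIM; ++k)
            acc += sA[r][k] * sB[k][c];
        C[m * DIM * DIM + r * DIM + c] = acc;
    }

    // Launch with one block per matrix:
    //   batched_small_gemm<<<batch, dim3(DIM, DIM)>>>(dA, dB, dC, batch);

Batching like this amortizes launch overhead across hundreds of small products, which is where a custom kernel can beat a general BLAS call at these sizes.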
Leveraging on high-performance computing and cloud technologies in digital libraries: A case study
in Proceedings of HPCCloud-11, Workshop on Integration and Application of Cloud Computing to High Performance Computing, 2011
"... Abstract—With the emergence of high-performance computing instances in the cloud, massive scale computations have become available to technically every organization. Digital libraries typically employ a data-intensive infrastructure, but given the resources, advanced services based on data and text ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
With the emergence of high-performance computing instances in the cloud, massive-scale computation has become available to practically every organization. Digital libraries typically employ a data-intensive infrastructure, but given the resources, advanced services based on data and text mining could be developed. A fundamental issue is the ease of development and integration of such services. We demonstrate feasibility with a case study of a visual machine learning algorithm implemented with MapReduce and running on a small cluster in the cloud.
Accelerating Text Mining Workloads in a MapReduce-Based Distributed GPU Environment
J. Parallel Distrib. Comput., 2013
"... Scientific computations have been using GPU-enabled computers success-fully, often relying on distributed nodes to overcome the limitations of device memory. Only a handful of text mining applications benefit from such infras-tructure. Since the initial steps of text mining are typically data-intens ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Scientific computations have been using GPU-enabled computers successfully, often relying on distributed nodes to overcome the limitations of device memory. Only a handful of text mining applications benefit from such infrastructure. Since the initial steps of text mining are typically data-intensive, and the ease of deployment of algorithms is an important factor in developing advanced applications, we introduce a flexible, distributed, MapReduce-based text mining workflow that performs I/O-bound operations on CPUs with industry-standard tools and then runs compute-bound operations on GPUs, which are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVIDIA Tesla M2050 GPUs attached to each, and we achieve considerable speedups for random projection and self-organizing maps.
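
The two optimizations named at the end are standard GPU techniques; a generic illustration (not code from the paper) is the tiled transpose below, where shared-memory staging makes both the global read and the global write coalesced, and the padded tile avoids shared-memory bank conflicts:

    #define TILE 32

    // Tiled matrix transpose: consecutive threads always touch consecutive
    // global addresses, so both the read and the write coalesce; the +1
    // padding column keeps shared-memory accesses conflict-free.
    __global__ void transpose_tiled(const float *in, float *out,
                                    int width, int height) {
        __shared__ float tile[TILE][TILE + 1];
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }

    // Launch: transpose_tiled<<<dim3((width + TILE - 1) / TILE,
    //                               (height + TILE - 1) / TILE),
    //                          dim3(TILE, TILE)>>>(d_in, d_out, width, height);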
VirtCL: A Framework for OpenCL Device Abstraction and Management
2016
Scaling Large-Data Computations on Multi-GPU Accelerators
"... ABSTRACT Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes and overheads of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. F ..."
Abstract
- Add to MetaCart
Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes, and the overheads of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. First, this paper proposes a mechanism and an implementation to automatically pipeline the CPU-GPU memory channel so as to overlap GPU computation with the memory copies, alleviating the data transfer overhead. Second, in doing so, the paper presents a technique called Computation Splitting (COSP) that caters to arbitrary device memory sizes and automatically manages to run out-of-card OpenMP-like applications on GPUs. Third, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size so as to gain the best possible performance. The mechanism adapts to the underlying hardware in the starting phase of a program and chooses the pipeline stage size. The techniques are implemented in a system that is able to translate an input OpenMP program to multiple GPUs attached to the same host CPU. Experimentation on a set of nine benchmarks shows that, on average, the pipelining scheme improves performance by 1.49x while limiting the runtime tuning overhead to 3% of the execution time.
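
The pipelining idea in its minimal form can be sketched with two CUDA streams and double buffering. This is our own illustrative code under simplifying assumptions (a fixed stage size and a toy kernel), whereas COSP tunes the stage size adaptively and handles out-of-card data sets:

    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    // Process a large host array in fixed-size stages, alternating between
    // two streams so the copies for one stage overlap with the compute and
    // copies of the other. Work queued on the same stream serializes, so
    // reusing dev[b] every other iteration is safe.
    void pipelined_scale(float *host, int n, int stage) {
        float *dev[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc(&dev[b], stage * sizeof(float));
            cudaStreamCreate(&s[b]);
        }
        int b = 0;
        for (int off = 0; off < n; off += stage, b ^= 1) {
            int len = (n - off < stage) ? (n - off) : stage;
            cudaMemcpyAsync(dev[b], host + off, len * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            scale<<<(len + 255) / 256, 256, 0, s[b]>>>(dev[b], len);
            cudaMemcpyAsync(host + off, dev[b], len * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < 2; ++i) {
            cudaFree(dev[i]);
            cudaStreamDestroy(s[i]);
        }
    }

Note that the overlap only materializes when `host` is pinned memory (allocated with cudaMallocHost); pageable memory forces the async copies to be staged synchronously.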
Comparison Based Sorting for Systems with Multiple GPUs
"... As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. ..."
Abstract
- Add to MetaCart
(Show Context)
As a basic building block of many applications, sorting algorithms that run efficiently on modern machines are key to the performance of these applications. With the recent shift to using GPUs for general-purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high-performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produces a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records than sorting on one GPU.
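
The role of the pivot can be illustrated with the classic partition search for merging two sorted arrays (a host-side sketch under our own assumptions, compiled with nvcc; the paper's algorithm generalizes this across many GPUs and pairs it with an inter-GPU communication technique): binary-search a cut (i, j) with i + j = half such that every element left of the cut is no greater than every element right of it.

    #include <limits.h>

    // Find the cut (i, j), i + j = half, such that A[0..i) and B[0..j)
    // together hold the (n + m) / 2 smallest elements of the two sorted
    // arrays. Exchanging halves then leaves each side an equal share.
    void find_pivot(const int *A, int n, const int *B, int m,
                    int *pi, int *pj) {
        int half = (n + m) / 2;
        int lo = (half > m) ? half - m : 0;  // keep j = half - i within [0, m]
        int hi = (half < n) ? half : n;
        while (lo < hi) {
            int i = lo + (hi - lo) / 2, j = half - i;
            int a_lo = (i > 0) ? A[i - 1] : INT_MIN;
            int a_hi = (i < n) ? A[i] : INT_MAX;
            int b_lo = (j > 0) ? B[j - 1] : INT_MIN;
            int b_hi = (j < m) ? B[j] : INT_MAX;
            if (a_lo > b_hi)      hi = i;      // cut sits too far right in A
            else if (b_lo > a_hi) lo = i + 1;  // cut sits too far left in A
            else { lo = i; break; }            // balanced cut found
        }
        *pi = lo;
        *pj = half - lo;
    }

Only a logarithmic number of element lookups is needed, which is what keeps the cross-GPU traffic of such a pivot search small.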
Moim: A Multi-GPU MapReduce Framework
"... MapReduce greatly decrease the complexity of devel-oping applications for parallel data processing. To con-siderably improve the performance of MapReduce appli-cations, we design a new MapReduce framework, called Moim, which 1) effectively utilizes both CPUs and GPUs (general purpose Graphics Proces ..."
Abstract
- Add to MetaCart
(Show Context)
MapReduce greatly decreases the complexity of developing applications for parallel data processing. To considerably improve the performance of MapReduce applications, we design a new MapReduce framework, called Moim, which 1) effectively utilizes both CPUs and GPUs (general-purpose graphics processing units), 2) overlaps CPU and GPU computations, 3) enhances load balancing in the map and reduce phases, and 4) efficiently handles not only fixed-size but also variable-size data. We have implemented Moim and compared its performance with an advanced multi-GPU MapReduce framework. Moim achieves 20%-90% speedup for different data sizes and numbers of GPUs used for data processing.