Results 1 - 10 of 21
Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems
2012
"... We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and multi-GPU systems to support dense matrix computations efficiently. The main idea is that we treat a heterogeneous system as a distributed-memory machine, and use a heterogeneous multi-level block cyclic ..."
Cited by 14 (4 self)
We present a new approach to utilizing all CPU cores and all GPUs on heterogeneous multicore and multi-GPU systems to support dense matrix computations efficiently. The main idea is that we treat a heterogeneous system as a distributed-memory machine, and use a heterogeneous multi-level block cyclic distribution method to allocate data to the host and multiple GPUs to minimize communication. We design heterogeneous algorithms with hybrid tiles to accommodate the processor heterogeneity, and introduce an auto-tuning method to determine the hybrid tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our approach is designed for achieving four objectives: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our experiments on a compute node (with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs), as well as on up to 100 compute nodes on the Keeneland system [31], demonstrate great scalability, good load balancing, and efficiency of our approach.
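The distribution idea can be pictured with a toy tile-to-device mapping. Below is a minimal C++ sketch of plain 1-D cyclic assignment of tile columns over one host pool and three GPUs; the device list, tile count, and round-robin rule are illustrative assumptions, and the paper's actual scheme is multi-level and additionally auto-tunes hybrid tile sizes per device class.

```cpp
// Minimal sketch: 1-D cyclic assignment of matrix tile columns to devices.
// Device names and tile count are invented; the heterogeneous scheme in the
// paper also sizes tiles differently for the host and for each GPU.
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // One host pool (all CPU cores) plus three GPUs, as on the test node.
    std::vector<std::string> devices = {"host", "gpu0", "gpu1", "gpu2"};
    const int num_tiles = 12;
    for (int t = 0; t < num_tiles; ++t) {
        int owner = t % static_cast<int>(devices.size()); // cyclic mapping
        std::printf("tile column %2d -> %s\n", t, devices[owner].c_str());
    }
    return 0;
}
```

The cyclic deal is what bounds load imbalance; the heterogeneous refinement changes how much work each "card" in the deal represents.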
Adaptive Input-aware Compilation for Graphics Engines
2012
"... Whileg raphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, thetediousprocessofperformancetuningrequired tooptimizeapplicationsisanobstacletowideradoptionofGPUs. In addition totheprogrammabilitychallengesposed by GPU’s complex memor ..."
Cited by 12 (5 self)
While graphics processing units (GPUs) provide low-cost and efficient platforms for accelerating high performance computations, the tedious process of performance tuning required to optimize applications is an obstacle to wider adoption of GPUs. In addition to the programmability challenges posed by the GPU's complex memory hierarchy and parallelism model, a well-known application design problem is target portability across different GPUs. However, even for a single GPU target, changing a program's input characteristics can make an already-optimized implementation of a program perform poorly. In this work, we propose Adaptic, an adaptive input-aware compilation system to tackle this important, yet overlooked, input portability problem. Using this system, programmers develop their applications in a high-level streaming language and let Adaptic undertake the difficult task of input-portable optimizations and code generation. Several input-aware optimizations are introduced to make efficient use of the memory hierarchy and customize thread composition. At runtime, a properly optimized version of the application is executed based on the actual program input. We perform a head-to-head comparison between the Adaptic-generated and hand-optimized CUDA programs. The results show that Adaptic is capable of generating code that performs on par with its hand-optimized counterparts over certain input ranges and outperforms them when the input falls out of the hand-optimized programs' "comfort zone". Furthermore, we show that input-aware results are sustainable across different GPU targets, making it possible to write and optimize applications once and run them anywhere.
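The runtime half of this idea reduces to dispatching among pre-generated versions based on the actual input. Here is a minimal C++ sketch under that assumption; the variant names and the size threshold are invented for illustration, and Adaptic's real versions come from streaming-language analysis rather than a hand-written branch.

```cpp
// Minimal sketch: runtime selection among pre-generated kernel versions
// based on the actual input size. Variant names and the 2^16 threshold are
// invented; Adaptic derives its versions and decision rules automatically.
#include <cstddef>
#include <cstdio>

void small_input_variant(std::size_t n) { std::printf("small-input variant, n=%zu\n", n); }
void large_input_variant(std::size_t n) { std::printf("large-input variant, n=%zu\n", n); }

void run(std::size_t n) {
    // Dispatch to the version whose optimizations match this input.
    if (n < (1u << 16)) small_input_variant(n);
    else                large_input_variant(n);
}

int main() {
    run(1000);       // falls in the small-input "comfort zone"
    run(1u << 20);   // triggers the large-input version
    return 0;
}
```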
Enhancing data locality for dynamic simulations through asynchronous data transformations and adaptive control
In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT)
2011
"... Abstract—Many dynamic simulation programs contain complex, irregular memory reference patterns, and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state to im ..."
Cited by 8 (3 self)
Many dynamic simulation programs contain complex, irregular memory reference patterns and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state, improving data locality for the next period of execution. In this work, we examine the implications that modern heterogeneous Chip Multiprocessor (CMP) architectures impose on this optimization paradigm. We develop three techniques to enhance the optimizations. The first is asynchronous data transformation, which moves data reordering off the critical path through dependence circumvention. The second is a novel data transformation algorithm, named TLayout, designed specifically to take advantage of modern throughput-oriented processors. Together they provide two complementary ways to attack the benefit-overhead dilemma inherent in traditional techniques. Working with a dynamic adaptation scheme, the techniques produce significant performance improvement for a set of dynamic simulation benchmarks.
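The asynchronous-transformation idea can be sketched with a helper thread that reorders a snapshot of the data while the simulation keeps using the stale layout. The C++ below is a minimal illustration under that assumption: std::reverse stands in for a real reordering such as TLayout, and the swap point is chosen arbitrarily rather than by dependence analysis.

```cpp
// Minimal sketch: data reordering moved off the critical path. A helper
// thread transforms a snapshot while the "simulation" continues on the
// stale layout; std::reverse is a stand-in for a real layout algorithm.
#include <algorithm>
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> layout(1 << 20);
    std::iota(layout.begin(), layout.end(), 0);

    // Launch the transformation asynchronously on a copy of the data...
    auto pending = std::async(std::launch::async, [snapshot = layout]() mutable {
        std::reverse(snapshot.begin(), snapshot.end());
        return snapshot;
    });

    // ...while this simulation step still runs on the old layout.
    long sum = std::accumulate(layout.begin(), layout.end(), 0L);

    layout = pending.get();   // adopt the new layout at a safe point
    std::printf("step sum=%ld, first element after reorder=%d\n",
                sum, layout.front());
    return 0;
}
```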
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines
Parallel Computing
2013
"... In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the w ..."
Cited by 3 (2 self)
In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The cooperative use of multiple CPUs and GPUs attains speedups of 50× and 85× with respect to single-core CPU executions for morphological reconstruction and Euclidean distance transform, respectively.
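A minimal sequential C++ sketch of the IWPP follows: a queue holds the wavefront, and an element propagates to a 4-neighbor when the condition holds (here the morphological-reconstruction rule, min(marker, mask) exceeding the neighbor's marker). The grid size, seed, and mask values are illustrative; the paper's multi-level queue and GPU mapping are not modeled.

```cpp
// Minimal sequential sketch of the irregular wavefront propagation pattern:
// a queue of active elements, each propagating to its 4-neighbors when the
// propagation condition is satisfied. Grid, seed, and mask are invented.
#include <algorithm>
#include <cstdio>
#include <queue>
#include <vector>

int main() {
    const int W = 8, H = 8;
    std::vector<int> marker(W * H, 0), mask(W * H, 5);
    std::queue<int> wavefront;
    marker[0] = 5;
    wavefront.push(0);                                 // seed element

    const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
    while (!wavefront.empty()) {
        int p = wavefront.front(); wavefront.pop();
        int x = p % W, y = p / W;
        for (int k = 0; k < 4; ++k) {
            int nx = x + dx[k], ny = y + dy[k];
            if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
            int q = ny * W + nx;
            int v = std::min(marker[p], mask[q]);      // propagation condition
            if (v > marker[q]) {                       // wave reaches q
                marker[q] = v;
                wavefront.push(q);                     // q joins the wavefront
            }
        }
    }
    std::printf("marker at far corner: %d\n", marker[W * H - 1]);
    return 0;
}
```

The irregularity is visible even here: which elements are touched next depends entirely on runtime data, which is what makes fast-memory queueing on a GPU worthwhile.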
Cooperative Heterogeneous Computing for Parallel Processing on CPU/GPU Hybrids
"... This paper presents a cooperative heterogeneous com-puting framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The pro-posed system exploits at runtime the coarse-grain thread-level parallelism ..."
Cited by 3 (0 self)
This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of the available computing resources of host CPU cores for CUDA kernels, which are designed to run only on the GPU. The proposed system exploits coarse-grain thread-level parallelism across the CPU and GPU at runtime, without any source recompilation. To this end, three features are described in this paper: a work distribution module, a transparent memory space, and a global scheduling queue. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08 compared to the baseline GPU-only processing.
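The global scheduling queue can be approximated with a shared atomic counter from which workers claim chunks of thread-block indices. In the C++ sketch below both "devices" are plain host threads and the chunk sizes are invented; it illustrates only the dynamic distribution, not the transparent memory space.

```cpp
// Minimal sketch of a global scheduling queue: CPU and GPU workers (both
// simulated with host threads here) grab chunks of thread-block indices
// from a shared atomic counter. Chunk sizes are invented for illustration.
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> next_block{0};
const int total_blocks = 100;

void worker(const char* name, int chunk) {
    int done = 0;
    for (;;) {
        int b = next_block.fetch_add(chunk);       // claim the next chunk
        if (b >= total_blocks) break;
        done += std::min(chunk, total_blocks - b); // process claimed blocks
    }
    std::printf("%s processed %d blocks\n", name, done);
}

int main() {
    std::thread cpu(worker, "cpu", 2);    // CPU side takes small chunks
    std::thread gpu(worker, "gpu", 16);   // GPU side takes large chunks
    cpu.join();
    gpu.join();
    return 0;
}
```

Because each worker pulls work only when free, faster devices naturally absorb more blocks without any static split.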
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes
In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID '12)
2012
"... All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately. ..."
Cited by 3 (0 self)
Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU
arXiv preprint arXiv:1311.0378
2013
"... Abstract—We investigate and characterize the performance of an important class of operations on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high resolution sensors, such as image datasets obtained ..."
Cited by 2 (1 self)
We investigate and characterize the performance of an important class of operations on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high resolution sensors, such as image datasets obtained from whole slide tissue specimens using microscopy image scanners. We identify the data access and computation patterns of operations in the object segmentation and feature computation categories. We systematically implement and evaluate the performance of these core operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application. Our results show that (1) the data access pattern and parallelization strategy employed by the operations strongly affect their performance: the performance on a MIC of operations that perform regular data access is comparable to, or sometimes better than, that on a GPU; (2) GPUs are significantly more efficient than MICs for operations and algorithms that access data irregularly, a consequence of the MIC's low performance on random data access; and (3) coordinated execution on MICs and CPUs using a performance-aware task scheduling strategy improves performance by about 1.29× over a first-come-first-served strategy. The example application attained an efficiency of 84% in an execution on 192 nodes (3072 CPU cores and 192 MICs).
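The scheduling result in (3) suggests a simple greedy rule: place each operation on the device that would finish it earliest, given per-device cost estimates. The C++ sketch below illustrates that rule with invented costs for one regular-access and one irregular-access operation; it is not the authors' scheduler, only the contrast with a cost-blind first-come-first-served policy.

```cpp
// Minimal sketch of performance-aware placement: each operation carries
// per-device cost estimates, and the scheduler picks the device that would
// finish it earliest. Costs and the two-device setup are invented.
#include <cstdio>
#include <vector>

struct Task { const char* name; double cpu_cost, mic_cost; };

int main() {
    std::vector<Task> tasks = {
        {"regular-access op",   4.0, 2.0},  // MIC competitive on regular access
        {"irregular-access op", 3.0, 9.0},  // MIC slow on random access
    };
    double cpu_busy = 0.0, mic_busy = 0.0;
    for (const Task& t : tasks) {
        // FCFS would ignore the estimates; here we compare finish times.
        if (cpu_busy + t.cpu_cost <= mic_busy + t.mic_cost) {
            cpu_busy += t.cpu_cost;
            std::printf("%s -> CPU\n", t.name);
        } else {
            mic_busy += t.mic_cost;
            std::printf("%s -> MIC\n", t.name);
        }
    }
    std::printf("makespan: %.1f\n", cpu_busy > mic_busy ? cpu_busy : mic_busy);
    return 0;
}
```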
High-throughput Execution of Hierarchical Analysis Pipelines on Hybrid Cluster Platforms
"... Abstract—We propose, implement, and experimentally evalu-ate a runtime middleware to support high-throughput execution on hybrid cluster machines of large-scale analysis applications. A hybrid cluster machine consists of computation nodes which have multiple CPUs and general purpose graphics process ..."
Cited by 1 (1 self)
We propose, implement, and experimentally evaluate a runtime middleware to support high-throughput execution of large-scale analysis applications on hybrid cluster machines. A hybrid cluster machine consists of computation nodes which have multiple CPUs and general purpose graphics processing units (GPUs). Our work targets scientific analysis applications in which datasets are processed in application-specific data chunks, and the processing of a data chunk is expressed as a hierarchical pipeline of operations. The proposed middleware system combines a bag-of-tasks style execution with coarse-grain dataflow execution. Data chunks and associated data processing pipelines are scheduled across cluster nodes using a demand-driven approach, while within a node operations in a given pipeline instance are scheduled across CPUs and GPUs. The runtime system implements several optimizations, including performance-aware task scheduling, architecture-aware process placement, data-locality-conscious task assignment, and data prefetching with asynchronous data copy, to maximize utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. The application and performance benefits of the runtime middleware are demonstrated using an image analysis application, employed in a brain cancer study, on a state-of-the-art hybrid cluster in which each node has two 6-core CPUs and three GPUs. Our results show that implementing and scheduling application data processing as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than a coarser-grain, monolithic implementation. The proposed runtime system can achieve high-throughput processing of large datasets: we were able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles at a rate of about 150 tiles/second on 100 nodes.
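The demand-driven, bag-of-tasks layer can be sketched as workers that pull the next data chunk from a shared counter and run a fixed pipeline of operations on it. In the C++ illustration below the stage bodies and chunk count are assumptions, and the within-node scheduling of individual pipeline stages across CPUs and GPUs is not modeled.

```cpp
// Minimal sketch of demand-driven chunk assignment with a per-chunk
// two-stage pipeline. Stage bodies and counts are invented stand-ins for
// the segmentation and feature-computation stages the paper describes.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<int> next_chunk{0};
const int num_chunks = 8;

int segment(int chunk) { return chunk * 10; }  // stand-in: segmentation stage
int features(int seg)  { return seg + 1; }     // stand-in: feature computation

void node_worker(int id) {
    for (;;) {
        int c = next_chunk.fetch_add(1);        // demand-driven assignment
        if (c >= num_chunks) break;
        int out = features(segment(c));         // coarse-grain dataflow per chunk
        std::printf("node %d: chunk %d -> %d\n", id, c, out);
    }
}

int main() {
    std::vector<std::thread> nodes;
    for (int i = 0; i < 3; ++i) nodes.emplace_back(node_worker, i);
    for (std::thread& n : nodes) n.join();
    return 0;
}
```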
PSkel: A Stencil Programming Framework for CPU-GPU Systems
"... Abstract The use of Graphics Processing Units (GPUs) for high-performance computing has gained growing momentum in recent years. Unfortunately, GPU-programming platforms like CUDA are complex, user unfriendly, and increase the complexity of developing high-performance parallel applications. In addi ..."
The use of Graphics Processing Units (GPUs) for high-performance computing has gained growing momentum in recent years. Unfortunately, GPU-programming platforms like CUDA are complex, user unfriendly, and increase the complexity of developing high-performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU-GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high-level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high-level abstraction for stencil programming on heterogeneous CPU-GPU systems, while allowing the programmer to partition and assign data and computation to both the CPU and the GPU. Our current implementation uses parallel skeletons to transparently leverage Intel TBB and NVIDIA CUDA. In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared to CPU-only and GPU-only parallel applications, respectively.
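The partitioning idea can be illustrated by splitting stencil rows between two workers according to a ratio. The C++ below simulates both partitions on the host with plain threads; the 5-point averaging kernel, the ratio, and the grid size are assumptions, and PSkel's actual skeletons would dispatch one partition to TBB and the other to CUDA.

```cpp
// Minimal sketch of stencil task partitioning: a ratio splits the grid rows
// between two workers (both host threads here). Kernel and sizes invented.
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

void stencil_rows(const std::vector<float>& in, std::vector<float>& out,
                  int w, int h, int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = in[y * w + x];              // 5-point average stencil
            if (x > 0)     acc += in[y * w + x - 1];
            if (x < w - 1) acc += in[y * w + x + 1];
            if (y > 0)     acc += in[(y - 1) * w + x];
            if (y < h - 1) acc += in[(y + 1) * w + x];
            out[y * w + x] = acc / 5.0f;
        }
}

int main() {
    const int W = 64, H = 64;
    const float cpu_ratio = 0.25f;                  // fraction of rows for "CPU"
    std::vector<float> in(W * H, 1.0f), out(W * H, 0.0f);
    int split = static_cast<int>(cpu_ratio * H);

    std::thread cpu(stencil_rows, std::cref(in), std::ref(out), W, H, 0, split);
    stencil_rows(in, out, W, H, split, H);          // "GPU" partition, simulated
    cpu.join();

    std::printf("out at center: %f\n", out[(H / 2) * W + W / 2]);
    return 0;
}
```

Tuning cpu_ratio is exactly the partitioning decision PSkel exposes to the programmer.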
One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation
"... As an approach to promoting whole-system synergy on a heterogeneous computing system, compilation of fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPU has drawn some recent attentions. This paper concentrates on two important sources of inefficiency that limit existing translat ..."
As an approach to promoting whole-system synergy on heterogeneous computing systems, the compilation of fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPUs has drawn some recent attention. This paper concentrates on two important sources of inefficiency that limit existing translators. The first is overly strong synchronization; the second is thread-level partially redundant computation. In this paper, we point out that both kinds of inefficiency essentially stem from a single cause: nonuniformity among threads. Based on that observation, we present a thread-level dependence analysis, which leads to a code generator with three novel features: an instance-level instruction scheduler for synchronization relaxation, a graph pattern recognition scheme for code shape optimization, and a fine-grained analysis for thread-level partial redundancy removal. Experiments show that the unified solution is effective in resolving both inefficiencies, yielding speedups of up to a factor of 14.
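The baseline such translators start from is the standard barrier-fission transformation: a __syncthreads() in an SPMD kernel splits the kernel body into separate serial loops over the thread index. The C++ toy below shows only that baseline for a hypothetical 4-thread block; the paper's contributions, relaxing the barrier and removing redundant computation, are not implemented here.

```cpp
// Toy sketch of the baseline barrier-fission translation of an SPMD kernel
// to CPU code: the barrier splits the body into two loops over the thread
// index. Synchronization relaxation and redundancy removal are NOT shown.
#include <cstdio>

const int NT = 4;          // threads per block in the toy kernel
int shared_buf[NT];        // models the block's shared memory

// SPMD source: each thread writes shared_buf[tid]; __syncthreads();
// then each thread reads its neighbor's slot.
void block_on_cpu() {
    for (int tid = 0; tid < NT; ++tid)   // loop 1: code before the barrier
        shared_buf[tid] = tid * tid;
    // ---- barrier point: all pre-barrier writes are complete here ----
    for (int tid = 0; tid < NT; ++tid)   // loop 2: code after the barrier
        std::printf("tid %d reads neighbor value %d\n",
                    tid, shared_buf[(tid + 1) % NT]);
}

int main() {
    block_on_cpu();
    return 0;
}
```

The fission enforces the barrier semantics but serializes more than necessary when threads behave uniformly, which is precisely the over-synchronization the paper targets.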