An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
Cited by 7 (1 self)
Heterogeneous computing combines general-purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous computing, called Asymmetric Distributed Shared Memory (ADSM), that maintains a shared logical memory space for CPUs to access objects in the accelerator physical memory but not vice versa. The asymmetry allows lightweight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data objects to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. We argue that ADSM reduces programming effort for heterogeneous computing systems and enhances application portability. We present a software implementation of ADSM, called GMAC, on top of CUDA in a GNU/Linux environment. We show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This paper presents the GMAC system and evaluates different design choices. We further suggest additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.
CellMR: A Framework for Supporting MapReduce on Asymmetric Cell-Based Clusters
Cited by 4 (1 self)
The use of asymmetric multi-core processors with on-chip computational accelerators is becoming common in a variety of environments ranging from scientific computing to enterprise applications. The focus of current research has been on making efficient use of individual systems, and porting applications to asymmetric processors. In this paper, we take the next step by investigating the use of multi-core-based systems, especially the popular Cell processor, in a cluster setting. We present CellMR, an efficient and scalable implementation of the MapReduce framework for asymmetric Cell-based clusters. The novelty of CellMR lies in its adoption of a streaming approach to supporting MapReduce, and its adaptive resource scheduling schemes: instead of allocating workloads to the components once, CellMR slices the input into small work units and streams them to the asymmetric nodes for efficient processing. Moreover, CellMR removes I/O bottlenecks by design, using a number of techniques, such as double-buffering and asynchronous I/O, to maximize cluster performance. Our evaluation of CellMR using typical MapReduce applications shows that it achieves 50.5% better performance compared to the standard non-streaming approach, introduces a very small overhead on the manager irrespective of application input size, scales almost linearly with increasing number of compute nodes (a speedup of 6.9 on average, when using eight nodes compared to a single node), and effectively adapts the parameters of its resource management policy between applications with varying computation density.
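The streaming idea in this abstract (slice the input into small work units and overlap loading with processing) can be illustrated with a minimal Python sketch. This is not CellMR's implementation or API: the names `slice_input`, `stream_map_reduce`, `count_words`, and `merge_counts` are hypothetical, and a depth-1 queue fed by a loader thread only approximates double-buffering on a single machine.

```python
import queue
import threading
from collections import Counter


def slice_input(data, unit_size):
    """Yield consecutive work units of at most unit_size items."""
    for start in range(0, len(data), unit_size):
        yield data[start:start + unit_size]


def stream_map_reduce(data, unit_size, map_fn, reduce_fn, initial):
    """Stream work units through map_fn and fold results with reduce_fn.

    Assumption: a queue of depth 1 holds the prefetched unit while the
    consumer processes the current one (an approximation of double-buffering).
    """
    buffers = queue.Queue(maxsize=1)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for unit in slice_input(data, unit_size):
            buffers.put(unit)  # blocks while the prefetch slot is occupied
        buffers.put(done)

    loader = threading.Thread(target=producer, daemon=True)
    loader.start()

    result = initial
    while True:
        unit = buffers.get()
        if unit is done:
            break
        result = reduce_fn(result, map_fn(unit))
    loader.join()
    return result


def count_words(lines):
    """Map step: count the words in one work unit (a list of lines)."""
    return Counter(word for line in lines for word in line.split())


def merge_counts(total, partial):
    """Reduce step: merge a partial count into the running total."""
    total.update(partial)
    return total


if __name__ == "__main__":
    lines = ["a b", "b c", "c c", "a"]
    print(stream_map_reduce(lines, 2, count_words, merge_counts, Counter()))
```

The work-unit size is the tuning knob here: smaller units give the loader more chances to overlap with processing, at the cost of more per-unit overhead.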
Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers
Cited by 3 (1 self)
Heterogeneous computers with processors and accelerators are becoming widespread in scientific computing. However, it is difficult to program hybrid architectures and there is no commonly accepted programming model. Ideally, applications should be written in a way that is portable to many platforms, but providing this portability for general programs is a hard problem. By restricting the class of programs considered, we can make this portability feasible. We present Liszt, a domain-specific language for constructing mesh-based PDE solvers. We introduce language statements for interacting with an unstructured mesh, and storing data at its elements. Program analysis of these statements enables our compiler to expose the parallelism, locality, and synchronization of Liszt programs. Using this analysis, we generate applications for multiple platforms: a cluster, an SMP, and a GPU. This approach allows Liszt applications to perform within 12% of hand-written C++, scale to large clusters, and experience order-of-magnitude speedups on GPUs.
Designing Accelerator-Based Distributed Systems for High Performance
, 2010
Cited by 2 (2 self)
Multi-core processors with accelerators are becoming commodity components for high-performance computing at scale. While accelerator-based processors have been studied in some detail, the design and management of clusters based on these processors have not received the same focus. In this paper, we present an exploration of four design and resource management alternatives, which can be used on large-scale asymmetric clusters with accelerators. Moreover, we adapt the popular MapReduce programming model to our proposed configurations. We enhance MapReduce with new dynamic data streaming and workload scheduling capabilities, which enable application writers to use asymmetric accelerator-based clusters without being concerned with the capabilities of individual components. We evaluate the presented designs in a physical setting and show that they can provide significant performance advantages. Compared to a standard static MapReduce design, we achieve 62.5%, 73.1%, and 82.2% performance improvement using accelerators with limited general-purpose resources, well-provisioned shared general-purpose resources, and well-provisioned dedicated general-purpose resources, respectively.
Efficient High Performance Collective Communication for the Cell Blade
Cited by 1 (1 self)
This paper presents high-performance collective communication algorithms and implementations that exploit the unique architectural features of the Cell heterogeneous multicore processor. This paper specifically describes novel algorithms for the barrier, broadcast, reduce, all-reduce, and all-gather collective operations, and shows the efficiency of these by comparing them to the previous fastest known implementations of these operations targeting the Cell. The new implementations are faster than the published state-of-the-art, achieving up to 19.21 times the performance (95% reduction in latency) of the previous published collective communication work for the Cell [19, 25]. The results presented show performance both within a chip and across the two Cell chips on a Cell blade [10].
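The abstract pairs a speedup figure with a latency reduction. As a quick check of how the two relate, the sketch below assumes "19.21 times the performance" means latency falls to 1/19.21 of its previous value; the function name `latency_reduction` is hypothetical and not from the paper.

```python
def latency_reduction(speedup):
    """Fraction of latency removed when an operation runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # Under the stated assumption, 19.21x corresponds to roughly a 95% reduction.
    print(f"{latency_reduction(19.21):.1%}")  # prints 94.8%
```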
NEC Europe Ltd
an interface for one-sided communication, also known as remote memory access (RMA). It was designed with the goal that it should permit efficient implementations on multiple platforms and networking technologies, and also in heterogeneous environments and non-cache-coherent systems. Nonetheless, even 12 years after its introduction, the MPI-2 RMA interface remains scarcely used for a number of reasons. This paper discusses the limitations of the MPI-2 RMA specification, outlines the goals and requirements for a new RMA API that would better meet the needs of both users and implementers, and presents a strawman proposal for such an API. We also study the tradeoffs facing the design of this new API and discuss how it may be implemented efficiently on both cache-coherent and non-cache-coherent systems.
Vshmem: Shared-Memory OS-Support for Multicore-based HPC systems
As a result of the huge performance potential of multi-core microprocessors, HPC infrastructures are rapidly integrating them into their architectures in order to expedite the performance growth of next-generation HPC systems. However, as the number of cores per processor increases to hundreds or thousands, they pose revolutionary challenges to various aspects of the software stack. In our research, we investigate novel solutions to the problem of extracting high performance. In this paper, we advocate the use of virtualization as an alternative to traditional operating systems for next-generation multicore-based HPC systems. In particular, we investigate an efficient mechanism for shared-memory communication between HPC applications executing within virtual machine (VM) instances that are co-located on the same hardware platform. This system, called Vshmem, implements a low-latency IPC mechanism that allows the programmer to selectively share memory regions between user-space processes residing in co-located virtual machines. Our contributions addressed
Carbon Nanotube Coated High-Throughput Neurointerfaces in Assistive Environments
Losing motor activity due to impaired or damaged nerves or muscles affects millions of people worldwide. The resulting lack of mobility and/or impaired communication bears enormous personal, economic, and social costs. While several assistive technologies exist, they rely on device surrogates to compensate for the lack of movement and thus provide limited utility and an unnatural interface with the user. The ability to interface populations of neurons with super-high-density multielectrode arrays (SD-MEA) can provide sensing from and control of bionic devices by thought. Here we propose a neurointerfacing approach using SD-MEAs coated with carbon nanotubes and high-speed computing to overcome the latency and long-term electrical viability bottlenecks that are critical in assistive environments. The proposed approach provides the ability for fast integration of recording/stimulation from thousands of individually addressable electrodes, while coordinating a real-time computing approach to register, recognize, analyze, and respond appropriately to the biological signals from the motor neurons and sensory signals from the robotic prosthesis. Categories and Subject Descriptors J.3 [Life and Medical Sciences]: High-speed data acquisition,
on a complex multicore platform
, 2010
In this report, we consider the problem of scheduling streaming applications described by complex task graphs on a heterogeneous multi-core platform, the IBM QS 22 platform, embedding two STI Cell BE processors. We first derive a complete computation and communication model of the platform, based on comprehensive benchmarks. Then, we use this model to express the problem of maximizing the throughput of a streaming application on this platform. Although the problem is proven NP-complete, we present an optimal solution based on mixed linear programming. We also propose simpler scheduling heuristics to compute a mapping of the application task graph onto the platform. We then return to the platform and propose scheduling software to deploy streaming applications on it. This allows us to thoroughly test our scheduling strategies on the real platform. We thus show that we are able to achieve a good speed-up, either with the mixed linear programming solution or with involved scheduling heuristics. Keywords: Scheduling, multicore processor, streaming application, Cell processor.

