CellSs: a Programming Model for the Cell BE Architecture
- ACM/IEEE Conference on Supercomputing, 2006
Cited by 54 (5 self)
In this work we present Cell superscalar (CellSs), which addresses the automatic exploitation of the functional parallelism of a sequential program across the different processing elements of the Cell BE architecture. The focus is on the simplicity and flexibility of the programming model. Based on a simple annotation of the source code, a source-to-source compiler generates the necessary code, and a runtime library exploits the existing parallelism by building a task dependency graph at runtime. The runtime takes care of task scheduling and data handling between the different processors of this heterogeneous architecture. In addition, locality-aware task scheduling has been implemented to reduce the overhead of data transfers. The approach has been implemented and tested with a set of examples, and the results obtained so far are promising.
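As a rough illustration of how a runtime might derive a task dependency graph from declared inputs and outputs, here is a minimal Python sketch. It is not the CellSs implementation: the task tuples, the function names `build_task_graph` and `schedule_waves`, and the wave-based ordering are all assumptions made for this example, and the real annotation syntax and scheduler are not shown in the abstract.

```python
from collections import defaultdict

def build_task_graph(tasks):
    """tasks: list of (name, inputs, outputs) in program order (hypothetical format).
    Returns a dict mapping each task name to the set of tasks it must wait for."""
    last_writer = {}               # datum -> last task that wrote it
    readers = defaultdict(list)    # datum -> tasks that read it since that write
    deps = {}
    for name, inputs, outputs in tasks:
        wait_for = set()
        for d in inputs:
            if d in last_writer:               # read-after-write
                wait_for.add(last_writer[d])
        for d in outputs:
            if d in last_writer:               # write-after-write
                wait_for.add(last_writer[d])
            wait_for.update(readers[d])        # write-after-read
        deps[name] = wait_for
        for d in inputs:
            readers[d].append(name)
        for d in outputs:
            last_writer[d] = name
            readers[d] = []
    return deps

def schedule_waves(deps):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    remaining = {t: set(d) for t, d in deps.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(ready)
    return waves

# Hypothetical program: t1 and t2 only read A, so they can run in parallel.
tasks = [
    ("t1", ["A"], ["B"]),
    ("t2", ["A"], ["C"]),
    ("t3", ["B", "C"], ["D"]),
    ("t4", ["D"], ["A"]),
]
print(schedule_waves(build_task_graph(tasks)))
```

A locality-aware scheduler of the kind the abstract mentions would additionally prefer to place a ready task on the processor that already holds its inputs; that refinement is left out of this sketch.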
Compilation for explicitly managed memory hierarchies
- In PPoPP ’07: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007
Cited by 16 (2 self)
We present a compiler for machines with an explicitly managed memory hierarchy and suggest that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk operations at varying scales of the application and of the machine. We evaluate the performance of our compiler using several benchmarks running on a Cell processor.
A Tuning Framework for Software-Managed Memory Hierarchies
Cited by 10 (1 self)
Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine’s particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of interdependent parameters that require tuning, and manually searching the resulting large, non-linear space of program parameters is a tedious process of trial and error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony PlayStation 3 consoles.
Porting the GROMACS Molecular Dynamics Code to the Cell Processor
- In: Proc. 21st Intl. Parallel and Distr. Processing Symp. (IPDPS 2007), 2007
Cited by 2 (0 self)
The Cell processor offers substantial computational power which can be effectively utilized only if application design and implementation are tuned to the Cell architecture. In this paper, we examine application characteristics which facilitate efficient use of the Cell processor, and those which present obstacles to it. Moreover, we consider possible solutions designed to mitigate inefficiencies. The target application in our study is the GROMACS molecular dynamics package. We have accelerated the most often used compute-intensive kernel while maintaining the constraints imposed by the structure of the surrounding program. The significant contribution of this paper is the consideration of the kernel in the context of a complex end-to-end application, with irregular data and code patterns, rather than as an isolated kernel code. For this challenging scenario, our results show a 2x speedup versus hand-tuned VMX/SSE code running on high-end PowerPC and x86 uniprocessor machines.
MapReduce on the Cell Broadband Engine Architecture
- 2007
Cited by 2 (0 self)
In this paper, we propose the evaluation of MapReduce on the Cell processor by way of the Marching Cubes application. We argue that the Cell architecture and the MapReduce parallel programming model complement each other well, and that the Marching Cubes application is a good application through which to evaluate this potential synergy. For the interested reader, a preliminary design and a plan of evaluation are both presented.
Mapping and Synchronizing Streaming Applications on Cell Processors
Cited by 1 (0 self)
Developing streaming applications on heterogeneous multi-processor architectures like the Cell is difficult. Currently, application developers need to know about hardware details to deal with issues like scheduling, memory management, and communication/synchronization. Worse, with multiple alternatives for communication available, developers spend significant time picking the most appropriate one. A poor choice often results in bad performance. With Cell-Space, we shield users from hardware details without compromising performance. Its runtime is based on an evaluation of the different communication primitives. In Cell-Space, developers specify a streaming application as a data flow graph of interacting components. Both task- and data-parallelism are easily expressed, and advanced features such as dynamic reconfiguration are fully supported. Beneath a simple interface we include a slew of optimizations not present in other Cell runtime environments. We demonstrate the impact of these optimizations and show that Cell-Space applications can efficiently exploit the resources offered by the Cell.
Exploring Multi-Grained Parallelism in Compute-Intensive DEVS Simulations
Cited by 1 (1 self)
We propose a computing technique for efficient parallel simulation of compute-intensive DEVS models on the IBM Cell processor, combining multi-grained parallelism and various optimizations to speed up the event execution. Unlike most existing parallelization strategies, our approach explicitly exploits the massive fine-grained event-level parallelism inherent in the simulation process, while most of the logical processes are virtualized, making the achievable parallelism more deterministic and predictable. Together, the parallelization and optimization strategies produced promising experimental results, accelerating the simulation of a 3D environmental model by a factor of up to 33.06. The proposed methods can also be applied to other multicore and shared-memory architectures. Keywords: DEVS formalism; Cell-DEVS formalism; multi-grained parallelism; multicore computing; Cell processor
Design and Implementation of Software-Managed Caches for Multicores with Local Memory
Cited by 1 (0 self)
Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to hold their code and data. Programmers of such heterogeneous multicore architectures must explicitly manage data transfers between the local memory of a core and the globally shared main memory. This is a tedious and error-prone programming task. A software-managed cache (SMC), implemented in local memory, can be programmed to automatically handle data transfers at runtime, thus simplifying the task of the programmer. In this paper, we propose a new software-managed cache design, called extended set-index cache (ESC). It has the benefits of both set-associative and fully associative caches. Its tag search speed is comparable to that of the set-associative cache and its miss rate is comparable to that of the fully associative cache. We examine various line replacement policies for SMCs and discuss their trade-offs. In addition, we propose adaptive execution strategies that select the optimal cache line size and replacement policy for each program region at runtime. To evaluate the effectiveness of our approach, we implement the ESC and other SMC designs on the Cell BE architecture, and measure their performance with 8 OpenMP applications. The evaluation results show that the ESC outperforms other SMC designs. The results also show that our adaptive execution strategies work well with the ESC. In fact, our approach is applicable to all cores with access to both local and global memory in a multicore architecture.
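To make the idea of a software-managed cache concrete, the following Python sketch models a plain set-associative, read-only cache with LRU replacement sitting in front of a "global memory" list. It is a baseline illustration only, not the ESC design from the paper; the class name `SoftwareCache`, its parameters, and the use of list slicing to stand in for a DMA transfer are all assumptions made for this example.

```python
class SoftwareCache:
    """Read-only set-associative cache over a list, LRU replacement per set.
    Hypothetical model: a slice of global_mem stands in for a DMA line fetch."""

    def __init__(self, global_mem, num_sets=4, ways=2, line_size=4):
        self.mem = global_mem
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # Each set holds up to `ways` (tag, line_data) pairs, most recently used last.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        line_no, offset = divmod(addr, self.line_size)
        index = line_no % self.num_sets      # which set the line maps to
        tag = line_no // self.num_sets       # identifies the line within that set
        lines = self.sets[index]
        for i, (t, data) in enumerate(lines):
            if t == tag:                     # hit: refresh LRU position
                self.hits += 1
                lines.append(lines.pop(i))
                return data[offset]
        self.misses += 1
        if len(lines) == self.ways:          # set full: evict least recently used
            lines.pop(0)
        start = line_no * self.line_size
        data = self.mem[start:start + self.line_size]  # fetch the whole line
        lines.append((tag, data))
        return data[offset]

# Small demonstration with a deliberately tiny cache to force a conflict miss.
cache = SoftwareCache(list(range(64)), num_sets=2, ways=1, line_size=4)
values = [cache.read(a) for a in (0, 1, 8, 0)]
print(values, cache.hits, cache.misses)
```

Only the tag search inside one set is sequential here, which is why set-associative lookup is fast; a fully associative cache would search every line but suffer fewer conflict misses, the trade-off the abstract says ESC aims to balance.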
An Experimental Study of Sorting and Branch Prediction
Cited by 1 (0 self)
Sorting is one of the most important and well-studied problems in Computer Science. Many good algorithms are known which offer various trade-offs in efficiency, simplicity, memory use, and other factors. However, these algorithms do not take into account features of modern computer architectures that significantly influence performance. Caches and branch predictors are two such features, and while there has been a significant amount of research into the cache performance of general purpose sorting algorithms, there has been little research on their branch prediction properties. In this paper we empirically examine the behaviour of the branches in all the most common sorting algorithms. We also consider the effect of cache optimizations on the predictability of the branches in these algorithms. We find insertion sort to have the fewest branch mispredictions of any comparison-based sorting algorithm, that bubble and shaker sort operate in a fashion which makes their branches highly unpredictable, that the unpredictability of shellsort’s branches improves its caching behaviour, and that several cache optimizations have little effect on mergesort’s branch mispredictions. We also find that optimizations to quicksort, for example the choice of pivot, have a strong influence on the predictability of its branches. We point out a simple way of removing branch instructions from a classic heapsort implementation, and also show that unrolling a loop in a cache-optimized heapsort implementation improves the predictability of its branches. Finally, we note that when sorting random data two-level adaptive branch predictors are usually no better than simpler bimodal predictors. This is despite the fact that two-level adaptive predictors are almost always superior to bimodal predictors in general.
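The abstract does not say how branches are removed from heapsort, so the sketch below shows only one well-known possibility, which may differ from the authors' method: the if/else that picks the larger child in sift-down is replaced by adding the 0-or-1 result of a comparison to the child index. The function names are chosen for this example, and Python is used only to show the arithmetic; any branch-prediction benefit would arise in compiled code, where the comparison can become a flag-setting instruction rather than a conditional jump.

```python
def sift_down(a, root, end):
    """Restore the max-heap property for the heap stored in a[0:end], starting at root."""
    while 2 * root + 1 < end:
        child = 2 * root + 1                 # left child
        right = min(child + 1, end - 1)      # clamp so a missing right child maps to the left
        child += a[right] > a[child]         # comparison result (0 or 1) selects the larger child
        if a[root] >= a[child]:
            return
        a[root], a[child] = a[child], a[root]
        root = child

def heapsort(a):
    """In-place heapsort using the arithmetic child selection above."""
    n = len(a)
    for root in range(n // 2 - 1, -1, -1):   # build the max-heap bottom-up
        sift_down(a, root, n)
    for end in range(n - 1, 0, -1):          # repeatedly move the maximum to the end
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)
    return a

print(heapsort([5, 2, 9, 1, 5, 6]))
```

The remaining comparison against the root is still a data-dependent branch; only the child-selection decision, which is close to a coin flip on random data, has been turned into arithmetic.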
Optimized Mapping of Pipelined Task Graphs on the Cell BE
Cited by 1 (1 self)
Limited bandwidth to off-chip main memory poses a problem for streaming applications on chip multiprocessors such as Cell BE, and will become more severe with the expected increase in the number of cores. Especially for streaming computations where the ratio between computational work and memory transfer is low, the generation of memory-efficient code is thus an important compiler optimization. We suggest using pipelining between the SPEs over the high-bandwidth internal bus of Cell BE to reduce the required main memory bandwidth, and thereby improve the computation throughput for memory-intensive computations. At the same time, we are constrained by the limited size of SPE on-chip memory available for the additional buffers that are necessary for the pipelining between SPEs. We investigate mappings of the nodes of a pipelined parallel task graph to the SPEs that are optimal trade-offs between load balancing, buffer memory consumption, and communication load on the on-chip bus. We solve this multiobjective optimization problem by deriving an integer linear programming (ILP) formulation and computing Pareto-optimal solutions for the mapping with a state-of-the-art ILP solver. For larger problem instances, we sketch a two-step approach to reduce problem size. We exemplify our mapping technique with several memory-intensive example problems: with acyclic pipelined task graphs derived from data-parallel code, with complete d-ary tree pipelines for parallel mergesort on Cell BE, and with butterfly pipelines for parallel FFT on Cell BE. We validate the mappings with discrete event simulations.

