Scheduling Task Parallelism on Multi-Socket Multicore Systems
Cited by 3 (2 self)
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run time system. This is a welcome development for scientific computing as supercomputer nodes grow "fatter" with multicore and manycore processors. But efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and NUMA characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO
Scioto: A Framework for Global-View Task Parallelism
Cited by 1 (0 self)
We introduce Scioto, Shared Collections of Task Objects, a lightweight framework for providing task management on distributed memory machines under one-sided and global-view parallel programming models. Scioto provides locality-aware dynamic load balancing and interoperates with MPI, ARMCI, and Global Arrays. Additionally, Scioto's task model and programming interface are compatible with many other existing parallel models including UPC, SHMEM, and CAF. Through task parallelism, the Scioto framework provides a solution for overcoming irregularity, load imbalance, and heterogeneity as well as dynamic mapping of computation onto emerging architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed-shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.
Task Parallelism and Synchronization: An Overview of Explicit Parallel Programming Languages
Cited by 1 (1 self)
Programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs in order to avoid the programming errors frequent when expressing parallelism. Since task parallelism is often considered more error-prone than data parallelism, we survey six popular and efficient parallel programming languages that tackle this difficult issue: Cilk, Chapel, X10, Habanero-Java, OpenMP and OpenCL. Using a parallel implementation of the computation of the Mandelbrot set as a single running example, this paper describes how the fundamentals of task parallel programming, namely collective and point-to-point synchronization and mutual exclusion, are dealt with in these languages. Our study suggests that, even though these languages introduce a wealth of names and notions, they all boil down to three key task concepts: creation, synchronization and atomicity. The paper is designed to give users and language and compiler designers an overview of current parallel languages.
SPIRE: A Sequential to Parallel Intermediate Representation Extension
, 2012
Cited by 1 (1 self)
SPIRE is a new, generic, parallel extension for the intermediate representations used in compilation frameworks of sequential languages; it aims to easily leverage their existing infrastructure to address both control and data parallel languages. Since the efficiency and power of the transformations and optimizations performed by compilers are closely related to the presence of a suitable program representation, SPIRE introduces both abstract and low-level features adapted to a wide spectrum of optimization goals; a formal definition of its operational semantics is also provided. We use the PIPS intermediate representation as a use case for our approach, extend it with SPIRE parallel primitives and show how examples from the OpenCL, Habanero-Java and X10 parallel languages can be dealt with. Our goal with the development of SPIRE is to provide a powerful parallel program representation that will ease the design of efficient task partitioning applications and, more generally, to draw a possible roadmap for the compiler designers who need to introduce parallel features into their own infrastructures.
Handling Task Dependencies Under Strided and Aliased References
Cited by 1 (0 self)
The emergence of multicore processors has increased the need for simple parallel programming models usable by nonexperts. The ability to specify subparts of a bigger data structure is an important trait of High Productivity Programming Languages. Such a concept can also be applied to dependency-aware task-parallel programming models. In those paradigms, tasks may have data dependencies, and those are used for scheduling them in parallel. However, calculating dependencies between subparts of bigger data structures is challenging. Accessed data may be strided, and can fully or partially overlap the accesses of other tasks. Techniques that are too approximate may produce too many extra dependencies and limit parallelism. Techniques that are too precise may be impractical in terms of time and space. We present the abstractions, data structures and algorithms to calculate dependencies between tasks with strided and possibly different memory access patterns. Our technique is performed at run time from a description of the inputs and outputs of each task and is not affected by pointer arithmetic nor reshaping. We demonstrate how it can be applied to increase programming productivity. We also demonstrate that scalability is comparable to other solutions and in some cases higher due to better parallelism extraction.
A Parallel Numerical Solver Using Hierarchically Tiled Arrays
Cited by 1 (0 self)
Solving linear systems is an important problem for scientific computing. Exploiting parallelism is essential for solving complex systems, and this traditionally involves writing parallel algorithms on top of a library such as MPI. The SPIKE family of algorithms is one well-known example of a parallel solver for linear systems. The Hierarchically Tiled Array data type extends traditional data-parallel array operations with explicit tiling and allows programmers to directly manipulate tiles. The tiles of the HTA data type map naturally to the block nature of many numeric computations, including the SPIKE family of algorithms. The higher level of abstraction of the HTA enables the same program to be portable across different platforms. Current implementations target both shared-memory and distributed-memory models. In this paper we present a proof-of-concept for portable linear solvers. We implement two algorithms from the SPIKE family using the HTA library. We show that our implementations of SPIKE exploit the abstractions provided by the HTA to produce compact, clean code that can run on both shared-memory and distributed-memory models without modification. We discuss how we map the algorithms to HTA programs as well as examine their performance. We compare the performance of our HTA codes to comparable codes written in MPI as well as current state-of-the-art linear algebra routines.
Legion: Expressing Locality and Independence with Logical Regions
Cited by 1 (1 self)
Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism. Legion also enables explicit, programmer-controlled movement of data through the memory hierarchy and placement of tasks based on locality information via a novel mapping interface. We evaluate our Legion implementation on three applications: fluid-flow on a regular grid, a three-level AMR code solving a heat diffusion equation, and a circuit simulation.
OPTIMISING COMPONENT COMPOSITION USING INDEXED DEPENDENCE METADATA
This paper explores the use of dependence metadata for optimising composition in component-based parallel programs. The idea is for each component to carry additional information about how points in its iteration space map to memory locations associated with its input and output data structures. When two components are composed this information can be used to implement optimisations that would otherwise require expensive analysis of the components' code at the time of composition. This dependence metadata facilitates a number of cross-component optimisations; in this paper we focus on loop fusion and array contraction. We describe a prototype framework, based on the CLooG loop generator tool, that embodies these ideas and report experimental performance results for three non-trivial parallel benchmarks. Our results show execution time reductions of up to 50% using the proposed framework on an 8-core Xeon.
Prepared by
, 2008
Reports produced after January 1, 1996, are generally available free via the U.S. Department of Energy (DOE) Information Bridge: Web Site:
As-If-Serial Exception Handling Semantics for Java Futures
Exception handling enables programmers to specify the behavior of a program when an exceptional event occurs at runtime. Exception handling, thus, facilitates software fault tolerance and the production of reliable and robust software systems. With the recent emergence of multi-processor systems and parallel programming constructs, techniques are needed that provide exception handling support in these environments that is intuitive and easy to use. Unfortunately, extant semantics of exception handling for concurrent settings are significantly more complex to reason about than their serial counterparts. In this paper, we investigate a similarly intuitive semantics for exception handling for the future parallel programming construct in Java. Futures are used by programmers to identify potentially asynchronous computations and to introduce parallelism into sequential programs. The intent of futures is to provide some performance benefits through the use of method-level concurrency while maintaining as-if-serial semantics that novice programmers can easily understand: the semantics of a program with futures is the same as that of an equivalent serial version of the program. We extend this model to provide as-if-serial exception handling semantics. Using this model, our runtime delivers exceptions to the same point it would deliver them if the program were executed sequentially. We present the design and implementation of our approach and evaluate its efficiency using an open source Java virtual machine.

