Compilation for explicitly managed memory hierarchies
In PPoPP ’07: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007
Cited by 16 (2 self)
We present a compiler for machines with an explicitly managed memory hierarchy and suggest that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk operations at varying scales of the application and of the machine. We evaluate the performance of our compiler using several benchmarks running on a Cell processor.
Transactional Execution of Java Programs
2005
Cited by 14 (1 self)
Parallel programming is difficult due to the complexity of dealing with conventional lock-based synchronization. To simplify parallel programming, there have been a number of proposals to support transactions directly in hardware and eliminate locks completely. Although hardware support for transactions has the potential to completely change the way parallel programs are written, initially transactions will be used to execute existing parallel programs. In this paper we investigate the implications of using transactions to execute existing parallel Java programs. Our results show that transactions can be used to support all aspects of Java multithreaded programs. Moreover, the conversion of a lock-based application into transactions is largely straightforward. The performance that these converted applications achieve is equal to or sometimes better than the original lock-based implementation.
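As a rough illustration of the lock-based Java code the abstract above refers to, the sketch below guards a shared counter with a `synchronized` block. The class `LockedCounter` and its methods are hypothetical and are not taken from the paper. The critical section is the kind of region that a transactional system would presumably execute as an atomic transaction, but the conversion mechanism itself is not described in this listing.

```java
public class LockedCounter {
    // Hypothetical shared state protected by a conventional monitor lock.
    private final Object lock = new Object();
    private long count = 0;

    // Critical section: only one thread at a time may update the counter.
    void increment() {
        synchronized (lock) {
            count++;
        }
    }

    long get() {
        synchronized (lock) {
            return count;
        }
    }

    // Starts `threads` workers that each call increment() `perThread` times,
    // waits for all of them, and returns the final count.
    static long runThreads(int threads, int perThread) throws InterruptedException {
        final LockedCounter counter = new LockedCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runThreads(4, 10000));
    }
}
```

Because every update happens inside the lock, the final count equals the number of threads times the increments per thread.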
DiSTM: A software transactional memory framework for clusters
In Proc. of the International Conference on Parallel Processing (ICPP), 2008
Cited by 12 (1 self)
While Transactional Memory (TM) research on shared-memory chip multiprocessors has flourished in recent years, limited research has been conducted in the cluster domain. In this paper, we introduce a research platform for exploiting software TM on clusters. The Distributed Software Transactional Memory (DiSTM) system has been designed for easy prototyping of TM coherence protocols and it does not rely on a software or hardware implementation of distributed shared memory. Three TM coherence protocols have been implemented and evaluated with established TM benchmarks. The decentralized Transactional Coherence and Consistency protocol has been compared against two centralized protocols that utilize leases. Results indicate that, depending on network congestion and the amount of contention, different protocols perform better.
CAPSULE: Hardware-assisted parallel execution of component-based programs
In Proceedings of the 39th Annual International Symposium on Microarchitecture, 2006
-
Cited by 10 (0 self)
Since processor performance scalability will now mostly be achieved through thread-level parallelism, there is a strong incentive to parallelize a broad range of applications, including those with complex control flow and data structures. And writing parallel programs is a notoriously difficult task. Beyond processor performance, the architect can help by facilitating the task of the programmer, especially by simplifying the model exposed to the programmer. In this article, among the many issues associated with writing parallel programs, we focus on finding the appropriate parallelism granularity, and efficiently mapping tasks with complex control and data flow to threads. We propose to relieve the user and compiler of both tasks by delegating the parallelization decision to the architecture at run-time, through a combination of hardware and
The Parallel Computing Laboratory at U.C. Berkeley: A Research Agenda Based on the Berkeley View
2008
Cited by 9 (4 self)
Supinski: Toward Enhancing OpenMP’s Work-Sharing Directives
In the Euro-Par06 Conference, 2006
Cited by 8 (6 self)
OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.
Qthreads: An API for programming with millions of lightweight threads
In IPDPS, 2008
Cited by 8 (3 self)
Large scale hardware-supported multithreading, an attractive means of increasing computational power, benefits significantly from low per-thread costs. Hardware support for lightweight threads is a developing area of research. Each architecture with such support provides a unique interface, hindering development for them and comparisons between them. A portable abstraction that provides basic lightweight thread control and synchronization primitives is needed. Such an abstraction would assist in exploring both the architectural needs of large scale threading and the semantic power of existing languages. Managing thread resources is a problem that must be addressed if massive parallelism is to be popularized. The qthread abstraction enables development of large-scale multithreading applications on commodity architectures. This paper introduces the qthread API and its Unix implementation, discusses resource management, and presents performance results from the HPCCG benchmark.
J.W.: Towards a safer interaction with transactional memory by tracking object visibility
In SCOOL ’05, Workshop on Synchronization and Concurrency in Object-Oriented Languages, 2005
Cited by 6 (2 self)
Lately there has been an increasing interest in Transactional Memory (TM), a programming API that helps programmers write scalable concurrent programs using sequential code. It is well known that writing concurrent programs using locks is a difficult task: coarse-grained locking results in poor performance, while fine-grained locking introduces the risk of deadlock and makes program maintenance difficult. With TM, the programmer only specifies what is required to be executed atomically, without worrying about the synchronization required to achieve this task. There are suggested implementations for TM both in hardware (HTM) and software (STM). While HTM is much faster than STM, it also has more limitations than STM does, and therefore it is not likely that we will have a widely available TM that is implemented purely in hardware in the near future. While STM can compensate for most of the HTM limitations, it imposes a new difficulty on the programmer: STM does not allow concurrent access to an object by transactional and non-transactional code. That is, if an object is accessed using regular read and write operations while it is also being accessed by a transaction, the atomicity of the transaction involving this object might break. Requiring the programmer to keep track of which objects may be involved in transactions will only result in more error-prone concurrent programs. On the other hand, accessing all objects by only transactional code would be safe, but probably not practical as long as we do not have an efficient and robust pure HTM so-
Language and Virtual Machine Support for Efficient Fine-Grained Futures in Java
In International Conference on Parallel Architectures and Compilation Techniques (PACT), 2007
Cited by 6 (4 self)
In this work, we investigate the implementation of futures in Java J2SE v5.0. Java 5.0 provides an interface-based implementation of futures that enables users to encapsulate potentially asynchronous computation and to define their own execution engines for futures. Although this methodology decouples thread scheduling from application logic, for applications with fine-grained parallelism, this model imposes an undue burden on average users and introduces significant performance overhead. To address these issues, we investigate the use of lazy futures and offer an alternative implementation to the Java 5.0 approach. In particular, we present a directive-based programming model for using futures in Java that uses annotations in Java 5.0 (as opposed to interfaces) and a lazy future implementation to significantly simplify programmer effort. Our directive-based future system employs novel compilation and runtime techniques that transparently and adaptively split and spawn futures for parallel execution. All such decisions are automatic and guided by dynamically determined future granularity and underlying resource availability. We empirically evaluate our future implementation using different Java Virtual Machine configurations and common Java benchmarks that implement fine-grained parallelism. We compare directive-based lazy futures with lazy and Java 5.0 futures and show that our approach is significantly more scalable.
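For context, the interface-based Java 5.0 futures that the abstract above takes as its baseline can be sketched with the standard `java.util.concurrent` types (`Callable`, `Future`, `ExecutorService`). The class `FutureSketch`, the helper `sumRange`, and the chunking scheme are assumptions made for illustration. This is not the paper's directive-based system, whose annotations are not shown in this listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureSketch {
    // Hypothetical helper: sequentially sums values[from..to).
    static long sumRange(int[] values, int from, int to) {
        long total = 0;
        for (int i = from; i < to; i++) {
            total += values[i];
        }
        return total;
    }

    // The programmer chooses the granularity (chunks) and the execution
    // engine (a fixed thread pool) by hand, then wraps each piece of work
    // in a Callable and collects the resulting Future objects.
    static long parallelSum(final int[] values, int chunks)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<Long>> parts = new ArrayList<Future<Long>>();
            int step = (values.length + chunks - 1) / chunks;
            for (int start = 0; start < values.length; start += step) {
                final int from = start;
                final int to = Math.min(values.length, start + step);
                parts.add(pool.submit(new Callable<Long>() {
                    @Override
                    public Long call() {
                        return sumRange(values, from, to);
                    }
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk has finished
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int[] values = new int[100];
        for (int i = 0; i < values.length; i++) {
            values[i] = i + 1;
        }
        System.out.println(parallelSum(values, 4));
    }
}
```

Here the programmer picks the chunk count and the thread pool explicitly, which is the kind of manual granularity and scheduling decision the abstract says its approach makes automatically.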
Programmability of the HPCS Languages: A Case Study with a Quantum Chemistry Kernel
Cited by 5 (1 self)
As high-end computer systems present users with rapidly increasing numbers of processors, possibly also incorporating attached co-processors, programmers are increasingly challenged to express the necessary levels of concurrency with the dominant parallel programming model, Fortran+MPI+OpenMP (or minor variations). In this paper, we examine the languages developed under the DARPA High-Productivity Computing Systems (HPCS) program (Chapel, Fortress, and X10) as representatives of a different parallel programming model which might be more effective on emerging high-performance systems. The application used in this study is the Hartree-Fock method from quantum chemistry, which combines access to distributed data with a task-parallel algorithm and is characterized by significant irregularity in the computational tasks. We present several different implementation strategies for load balancing of the task-parallel computation, as well as distributed array operations, in each of the three languages. We conclude that the HPCS languages provide a wide variety of mechanisms for expressing parallelism, which can be combined at multiple levels, making them quite expressive for this problem.

