Language Virtualization for Heterogeneous Parallel Computing
Cited by 12 (6 self)
As heterogeneous parallel systems become dominant, application developers are being forced to turn to an incompatible mix of low-level programming models (e.g. OpenMP, MPI, CUDA, OpenCL). However, these models do little to shield developers from the difficult problems of parallelization, data decomposition and machine-specific details. Most programmers are having a difficult time using these programming models effectively. To provide a programming model that addresses the productivity and performance requirements for the average programmer, we explore a domain-specific approach to heterogeneous parallel programming. We propose language virtualization as a new principle that enables the construction of highly efficient parallel domain-specific languages that are embedded in a common host language. We define criteria for language virtualization and present techniques to achieve them. We present two concrete case studies of domain-specific languages that are implemented using our virtualization approach.
Space Profiling for Parallel Functional Programs
Cited by 9 (2 self)
This paper presents a semantic space profiler for parallel functional programs. Building on previous work in sequential profiling, our tools help programmers to relate runtime resource use back to program source code. Unlike many profiling tools, our profiler is based on a cost semantics. This provides a means to reason about performance without requiring a detailed understanding of the compiler or runtime system. It also provides a specification for language implementers. This is critical in that it enables us to separate cleanly the performance of the application from that of the language implementation. Some aspects of the implementation can have significant effects on performance. Our cost semantics enables programmers to understand the impact of different scheduling policies yet abstracts away from many of the details of their implementations. We show applications where the choice of scheduling policy has asymptotic effects on space use. We explain these use patterns through a demonstration of our tools. We also validate our methodology by observing similar performance in our implementation of a parallel extension of Standard ML.
Space-profiling semantics of the call-by-value lambda calculus and the CPS transformation
In The 3rd International Workshop on Higher Order Operational Techniques in Semantics, volume 26 of Electronic Notes in Theoretical Computer Science, 1999
Cited by 8 (2 self)
We show that the CPS transformation from the call-by-value lambda calculus to a CPS language preserves the space required to execute a program to within a constant factor. For the call-by-value lambda calculus we adopt a space-profiling semantics based on the profiling semantics of NESL by Blelloch and Greiner. However, we noticed that their semantics treats stack space and heap space inconsistently. To obtain our result, we therefore revise the semantics so that it treats space in a more consistent manner.
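For readers unfamiliar with the CPS transformation mentioned in this abstract, the following is a minimal sketch of the idea only. It is written in Python for illustration, whereas the paper works with the call-by-value lambda calculus; the function names `fact_direct` and `fact_cps` are hypothetical and do not come from the paper.

```python
def fact_direct(n):
    # Direct style: the pending multiplication "n * ..." is remembered
    # implicitly on the call stack until the recursive call returns.
    if n == 0:
        return 1
    return n * fact_direct(n - 1)


def fact_cps(n, k):
    # Continuation-passing style: the "rest of the computation" is the
    # explicit argument k. Each step builds a new closure that records
    # the pending multiplication, so the bookkeeping that lived in stack
    # frames now lives in closures.
    if n == 0:
        return k(1)
    return fact_cps(n - 1, lambda v: k(n * v))


# The identity continuation returns the final value unchanged.
result = fact_cps(5, lambda v: v)
```

The sketch shows why comparing space use before and after the transformation is a meaningful question: what the direct version keeps in stack frames, the CPS version keeps in closures. Python does not eliminate tail calls, so this code says nothing about the constant-factor result itself.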
Scalable Real-time Parallel Garbage Collection for Symmetric Multiprocessors
2001
"... model for garbage collection. ..."
Nested Parallelism in Transactional Memory
Cited by 8 (0 self)
This paper describes XCilk, a runtime-system design for software transactional memory in a Cilk-like parallel programming language, which uses a work-stealing scheduler. XCilk supports transactions that themselves can contain nested parallelism and nested transactions, both of unbounded nesting depth. Thus, XCilk allows users to call a library function within a transaction, even if that function itself exploits concurrency and uses transactions. XCilk provides transactional memory with strong atomicity, eager updates, eager conflict detection and lazy cleanup on aborts. XCilk uses a new algorithm and data structure, called XConflict, to facilitate conflict detection between transactions. Using XConflict, XCilk guarantees provably good performance for closed-nested transactions of arbitrary depth in the special case when all accesses are writes and there is no memory contention. More precisely, XCilk executes a program with work $T_1$ and critical-path length $T_\infty$ in $O(T_1/p + pT_\infty)$ time on $p$ processors if all memory accesses are writes and all concurrent paths of the XCilk program access disjoint sets of memory locations. Although this bound holds only under rather optimistic assumptions, to our knowledge, this result is the first theoretical performance bound on a TM system that supports transactions with nested parallelism.
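The shape of the XCilk bound (work divided by processors, plus processors times critical-path length) can be explored numerically. The sketch below is illustrative only: the abstract does not give the constants hidden by the O-notation, so they are assumed to be 1 here, and the helper names `bound_shape` and `best_p` are hypothetical.

```python
import math


def bound_shape(t1, t_inf, p):
    # T1/p + p*T_inf, with the hidden constant factors assumed to be 1.
    return t1 / p + p * t_inf


def best_p(t1, t_inf):
    # Setting the derivative -T1/p^2 + T_inf to zero gives the
    # real-valued minimizer p = sqrt(T1 / T_inf).
    return math.sqrt(t1 / t_inf)


# Hypothetical workload: work of 10,000 units, critical path of 1 unit.
p_star = best_p(10_000, 1)
at_optimum = bound_shape(10_000, 1, p_star)
```

Unlike the familiar T1/p + T_inf form, this expression is not monotone in p: under the stated assumptions it stops improving once p exceeds roughly the square root of the parallelism T1/T_inf.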
Adaptive Work Stealing with Parallelism Feedback
Cited by 7 (0 self)
We present an adaptive work-stealing thread scheduler, A-STEAL, for fork-join multithreaded jobs, like those written using the Cilk multithreaded language or the Hood work-stealing library. The A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. A-STEAL provides continual parallelism feedback to a job scheduler in the form of processor requests, and the job must adapt its execution to the processors allotted to it. Assuming that the job scheduler never allots any job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. Our analysis models the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the system environment and the job scheduler's administrative policies. We analyze the performance of A-STEAL using "trim analysis," which allows us to prove that our thread scheduler performs poorly on at most a small number of time steps, while exhibiting near-optimal behavior on the vast majority. To be precise, suppose that a job has work $T_1$ and critical-path length $T_\infty$. On a machine with $P$ processors, A-STEAL completes the job in expected $O(T_1/\tilde{P} + T_\infty + L \lg P)$ time steps, where $L$ is the length of a scheduling quantum and $\tilde{P}$ denotes the $O(T_\infty + L \lg P)$-trimmed availability. This quantity is the average of the processor availability over all but
Software challenges in extreme scale systems
Journal of Physics: Conference Series, 2009
Cited by 7 (2 self)
Computer systems anticipated in the 2015 – 2020 timeframe are referred to as Extreme Scale because they will be built using massive multi-core processors with 100’s of cores per chip. The largest capability Extreme Scale system is expected to deliver Exascale performance of the order of $10^{18}$ operations per second. These systems pose new critical challenges for software in the areas of concurrency, energy efficiency and resiliency. In this paper, we discuss the implications of the concurrency and energy efficiency challenges on future software for Extreme Scale Systems. From an application viewpoint, the concurrency and energy challenges boil down to the ability to express and manage parallelism and locality by exploring a range of strong scaling and new-era weak scaling techniques. For expressing parallelism and locality, the key challenges are the ability to expose all of the intrinsic parallelism and locality in a programming model, while ensuring that this expression of parallelism and locality is portable across a range of systems. For managing parallelism and locality, the OS-related challenges include parallel scalability, spatial partitioning of OS and application functionality, direct hardware access for inter-processor communication, and asynchronous rather than interrupt-driven events, which are accompanied by runtime system challenges for scheduling, synchronization, memory management, communication, performance monitoring, and power management. We conclude by discussing the importance of software-hardware codesign in addressing the fundamental challenges for application enablement on Extreme Scale systems.
Pipelining with Futures
1997
Cited by 6 (4 self)
Pipelining has been used in the design of many PRAM algorithms to reduce their asymptotic running time. Paul, Vishkin, and Wagener (PVW) used the approach in a parallel implementation of 2-3 trees. The approach was later used by Cole in the first O(lg n) time sorting algorithm on the PRAM not based on the AKS sorting network, and has since been used to improve the time of several other algorithms. Although the approach has improved the asymptotic time of many algorithms, there are two practical problems: maintaining the pipeline is quite complicated for the programmer, and the pipelining forces highly synchronous code execution. Synchronous execution is less practical on asynchronous machines and makes it difficult to modify a schedule to use less memory or to take better advantage of locality.
A framework for measuring supercomputer productivity
The International Journal of High Performance Computing Applications, 18(4), Winter 2004
Cited by 6 (1 self)
We propose a framework for measuring the productivity of high performance computing (HPC) systems, based on common economic definitions of productivity and on utility theory. We discuss how these definitions can capture essential aspects of HPC systems, such as the importance of time-to-solution and the trade-off between programming time and execution time. Finally, we outline a research program that would lead to the definition of effective productivity metrics for HPC that fit within the proposed framework.
A consistent semantics of self-adjusting computation
2006
Cited by 5 (4 self)
This paper presents a semantics of self-adjusting computation and proves that the semantics is correct and consistent. The semantics integrates change propagation with the classic idea of memoization to enable reuse of computations under mutation to memory. During evaluation, reuse of a computation via memoization triggers a change propagation that adjusts the reused computation to reflect the mutated memory. Since the semantics combines memoization and change-propagation, it involves both non-determinism and mutation. Our consistency theorem states that the non-determinism is not harmful: any two evaluations of the same program starting at the same state yield the same result. Our correctness theorem states that mutation is not harmful: self-adjusting programs are consistent with purely functional programming. We formalized the semantics and its meta-theory in the LF logical framework and machine-checked the proofs in Twelf.

