Home-based Shared Virtual Memory
, 1998
Cited by 51 (4 self)
In this dissertation, I investigate how to improve the performance of shared virtual memory (SVM) by examining consistency models, protocols, hardware support and applications. The main conclusion of this research is that the performance of shared virtual memory can be significantly improved when performance-enhancing techniques from all these areas are combined. This dissertation proposes home-based lazy release consistency as a simple, effective, and scalable way to build shared virtual memory systems. In home-based protocols each shared page has a home to which all writes are propagated and from which all copies are derived. Two home-based protocols are described, implemented and evaluated on two hardware and software platforms: Automatic Update Release Consistency (AURC), which requires hardware support for fine-grained remote writes (automatic updates), and Home-based Lazy Release Consistency (HLRC), which is implemented exclusively in software. The dissertation investigates the ...
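The home-based idea in this abstract (every shared page has a home that receives all writes and serves all copies) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the class names `HomeNode` and `Worker`, the dictionary-based page representation, and passing the invalidated page list directly to `acquire` are all assumptions made for illustration.

```python
class HomeNode:
    """Holds the master copy of every page it is home for (hypothetical)."""

    def __init__(self):
        self.pages = {}  # page_id -> {offset: value}

    def apply_diff(self, page_id, diff):
        # All writes are propagated to the home copy.
        self.pages.setdefault(page_id, {}).update(diff)

    def fetch(self, page_id):
        # All copies are derived from the home copy.
        return dict(self.pages.get(page_id, {}))


class Worker:
    """A node that caches pages and pushes its writes home at release."""

    def __init__(self, home):
        self.home = home
        self.cache = {}  # locally cached page copies
        self.dirty = {}  # page_id -> {offset: value} written since last release

    def _ensure_cached(self, page_id):
        if page_id not in self.cache:
            self.cache[page_id] = self.home.fetch(page_id)

    def read(self, page_id, offset):
        self._ensure_cached(page_id)
        return self.cache[page_id].get(offset, 0)

    def write(self, page_id, offset, value):
        self._ensure_cached(page_id)
        self.cache[page_id][offset] = value
        self.dirty.setdefault(page_id, {})[offset] = value

    def release(self):
        # Send only the modified words (the diff) to each page's home.
        for page_id, diff in self.dirty.items():
            self.home.apply_diff(page_id, diff)
        self.dirty.clear()

    def acquire(self, invalid_pages):
        # Assumption: the caller supplies the pages written by others.
        # Dropping them forces a fresh fetch from the home on next access.
        for page_id in invalid_pages:
            self.cache.pop(page_id, None)
```

In this sketch a reader keeps seeing its cached copy until it performs an acquire, which is the lazy part of the protocol; the single `HomeNode` stands in for what would be one home per page spread over many nodes.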
Shared Virtual Memory: Progress and Challenges
- Proceedings of the IEEE
, 1999
Cited by 33 (4 self)
This paper is a survey of the first 12 years of research in SVM, placing the multi-track flow of ideas and results obtained so far in a comprehensive framework. The contributions indicated in Figure 1 are classified in four categories, each belonging primarily to one layer: relaxed consistency models, protocol laziness, architectural support, and applications and application-driven research. A section of the paper is devoted to each category. The last section discusses other important emerging issues related to SVM: the alternative of fine-grained software coherence, hybrid protocols that implement software shared memory across multiple hardware-coherent multiprocessors, and scalability. The paper summarizes comparative performance results from the literature, discusses their limitations, places existing protocols in a framework based on laziness, and identifies the lessons learned so far and some key outstanding questions.
Data Prefetching for Software DSMs
- In Proceedings of the 1998 International Conference on Supercomputing
, 1998
Cited by 28 (2 self)
In this paper we propose and evaluate the Adaptive++ technique, a novel runtime-only data prefetching strategy for software-based distributed shared-memory systems (software DSMs). Adaptive++ improves the performance of regular parallel applications running on software DSMs by using the past history of memory access faults to adapt between repeated-phase and repeated-stride prefetching modes. Adaptive++ does not issue prefetches during periods when the application is not exhibiting one of these two types of behavior and is thus behaving irregularly. Through detailed execution-driven simulations of several applications, we show that our prefetching technique is very successful at reducing the data access overheads of regular applications running on the TreadMarks software DSM. Adaptive++ also reduces the overhead of applications that are not strictly regular but that exhibit periods of regularity. In terms of overall performance, our results show that Adaptive++ can provide speedup impr...
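The repeated-stride mode mentioned in this abstract can be illustrated with a small sketch. It is a simplified stand-in, not the Adaptive++ algorithm itself: the function name `predict_stride_prefetch`, its parameters, and the rule "prefetch only when the most recent fault strides are identical" are assumptions chosen for illustration.

```python
def predict_stride_prefetch(fault_history, min_repeats=2, depth=2):
    """Return page numbers to prefetch, or [] when no stride is evident.

    fault_history: page numbers of past access faults, oldest first.
    min_repeats:   how many identical consecutive strides are required.
    depth:         how many pages ahead to prefetch.
    """
    needed = min_repeats + 1  # n strides need n + 1 faults
    if len(fault_history) < needed:
        return []
    recent = fault_history[-needed:]
    strides = [b - a for a, b in zip(recent, recent[1:])]
    stride = strides[0]
    # Irregular behavior: issue no prefetches at all.
    if stride == 0 or any(s != stride for s in strides):
        return []
    last = recent[-1]
    return [last + stride * i for i in range(1, depth + 1)]
```

For example, faults on pages 10, 14 and 18 suggest a stride of 4, so the sketch proposes pages 22 and 26, while an irregular history yields no prefetches.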
Helper Threads via Virtual Multithreading On An Experimental Itanium 2 Processor-based Platform
- In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2004
Cited by 17 (1 self)
Helper threading is a technology to accelerate a program by exploiting a processor’s multithreading capability to run “assist” threads. Previous experiments on hyper-threaded processors have demonstrated significant speedups by using helper threads to prefetch hard-to-predict delinquent data accesses. In order to apply this technique to processors that do not have built-in hardware support for multithreading, we introduce virtual multithreading (VMT), a novel form of switch-on-event user-level multithreading, capable of fly-weight multiplexing of event-driven thread executions on a single processor without additional operating system support. The compiler plays a key role in minimizing synchronization cost by judiciously partitioning register usage among the user-level threads. The VMT approach makes it possible to launch dynamic helper thread instances in response
Software-Controlled Multithreading Using Informing Memory Operations
- In International Symposium on High-Performance Computer Architecture
, 1998
Cited by 14 (0 self)
Memory latency is becoming an increasingly important performance bottleneck, especially in multiprocessors. One technique for tolerating memory latency is multithreading, whereby we switch between threads upon expensive cache misses. In contrast with previous work on multithreading, we explore a new approach that is software-controlled rather than hardware-controlled. To implement software-controlled multithreading, we use informing memory operations to quickly trap upon cache misses to a miss handler which performs the actual thread switching in software. Our experimental results demonstrate that software-controlled multithreading can result in significant performance gains on a shared-memory multiprocessor, with the majority of applications speeding up by 10% or more, and one application speeding up by 16%. In addition, we find that by selectively applying a register partitioning optimization to reduce the thread-switching overhead, we can increase the overall speedups to as much as ...
Optimizing OpenMP Programs on Software Distributed Shared Memory Systems
- International Journal of Parallel Programming
, 2003
Cited by 10 (3 self)
This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed systems. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques.
Combining Compile-Time and Run-Time Support for Efficient Software Distributed Shared Memory
- In Proceedings of IEEE, Special Issue on Distributed Shared Memory
, 1999
Cited by 9 (1 self)
We describe an integrated compile-time and run-time system for efficient shared memory parallel computing on distributed memory machines. The combined system presents the user with a shared memory programming model, with its well-known benefits in terms of ease of use. The run-time system implements a consistent shared memory abstraction using memory access detection and automatic data caching. The compiler improves the efficiency of the shared memory implementation by directing the runtime system to exploit the message passing capabilities of the underlying hardware. To do so, the compiler analyzes shared memory accesses, and transforms the code to insert calls to the run-time system that provide it with the access information computed by the compiler. The run-time system is augmented with the appropriate entry points to use this information to implement bulk data transfer and to reduce the overhead of run-time consistency maintenance. In those cases where the compiler analysis succee...
Responsiveness without Interrupts
, 1999
Cited by 7 (0 self)
... this paper is a characterization of the delays actually observed in a suite of applications. We show that the majority of notification delays result from a small number of large delays. These delays can dominate any gains achieved through use of new network technologies. The impact of these delays can be considerable. Our applications averaged more than 31% slower without interrupts than with them. This result argues that the problem is serious, and needs to be addressed either by including interrupts in emerging standards, or through use of the techniques discussed below.
Active Correlation Tracking
- The 19th International Conference on Distributed Computing Systems
, 1999
Cited by 4 (2 self)
We describe methods of identifying and exploiting sharing patterns in multi-threaded DSM applications. Active correlation tracking is used to determine the affinity, or amount of sharing, in pairs of threads. Thread affinities are combined to create correlation maps, which summarize sharing between all pairs of threads in the application.
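A correlation map of the kind this abstract describes can be sketched as a table of pairwise thread affinities. The abstract does not say how sharing is measured, so counting the pages that both threads touched is an assumption, as are the function name `correlation_map` and the input layout.

```python
from itertools import combinations


def correlation_map(pages_by_thread):
    """Map each pair of thread ids to the number of pages both accessed.

    pages_by_thread: dict of thread id -> set of page numbers it touched.
    The shared-page count is an assumed affinity metric for illustration.
    """
    return {
        (a, b): len(pages_by_thread[a] & pages_by_thread[b])
        for a, b in combinations(sorted(pages_by_thread), 2)
    }
```

With threads 0, 1 and 2 touching pages {1, 2, 3}, {2, 3} and {9}, only the pair (0, 1) has a non-zero affinity, which is the kind of signal a scheduler could use to place those two threads on the same node.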
Improving performance of OpenMP for SMP clusters through overlapped page migrations
- International Workshop on OpenMP (IWOMP’06)
, 2006
Cited by 3 (1 self)
Costly page migration is a major obstacle to integrating OpenMP and page-based software distributed shared memory (SDSM) to realize the easy-to-use programming paradigm for SMP clusters. To reduce the impact of the page migration overhead on the execution time of an application, previous research has mainly focused on reducing the number of page migrations and hiding the page migration overhead by overlapping computation and communication. We propose the ‘collective-prefetch’ technique, which overlaps page migrations themselves even when the prior approach cannot be effectively applied. Experiments with a communication-intensive application show that our technique reduces the page migration overhead significantly, and the overall execution time was reduced to 57%~79%.

