Performance Analysis of Distributed Applications using Automatic Classification of Communication Inefficiencies
, 2000
Cited by 24 (3 self)
We present a technique for performance analysis that helps users understand the communication behavior of their message passing applications. Our method automatically classifies individual communication operations and it reveals the cause of communication inefficiencies in the application. This classification allows the developer to focus quickly on the culprits of truly inefficient behavior, rather than manually foraging through massive amounts of performance data. Specifically, we trace the message operations of MPI applications and then classify each individual communication event using decision tree classification, a supervised learning technique. We train our decision tree using microbenchmarks that demonstrate both efficient and inefficient communication. Since our technique adapts to the target system's configuration through these microbenchmarks, we can simultaneously automate the performance analysis process and improve classification accuracy. Our experiments on four applications demonstrate that our technique can improve the accuracy of performance analysis, and dramatically reduce the amount of data that users must encounter.
Thread Migration and Communication Minimization in DSM Systems
- Proc. of the IEEE, Special Issue on Distributed Shared Memory
, 1999
Cited by 17 (0 self)
Networks of workstations are characterized by dynamic resource capacities. Such environments can only be efficiently exploited by applications that are dynamically re-configurable. This paper explores mechanisms and policies that enable online reconfiguration of shared-memory applications through thread migration. We describe the design and preliminary performance of a DSM system that performs online re-mappings of threads to nodes based on sharing behavior. Our system obtains complete sharing information through a novel correlation-tracking phase that avoids the thread thrashing that characterizes previous approaches. This information is used to evaluate the communication required by a given thread mapping, and to predict the resulting performance. 1. Introduction Meta-computer environments can be characterized by distribution, heterogeneity, and changing resource capacities. Metacomputers consist of networks of machines, some of which might be shared memory multiprocessors. Distri...
Performance Measurements for Multithreaded Programs
, 1998
Cited by 14 (0 self)
Multithreaded programming is an effective way to exploit concurrency, but it is difficult to debug and tune a highly threaded program. This paper describes a performance tool called Tmon for monitoring, analyzing and tuning the performance of multithreaded programs. The performance tool has two novel features: it uses "thread waiting time" as a measure and constructs thread waiting graphs to show thread dependencies and thus performance bottlenecks, and it identifies "semi-busy-waiting" points where CPU cycles are wasted in condition checking and context switching. We have implemented the Tmon tool and, as a case study, we have used it to measure and tune a heavily threaded file system. We used four workloads to tune different aspects of the file system. We were able to improve the file system bandwidth and throughput significantly. In one case, we were able to improve the bandwidth by two orders of magnitude. 1 Introduction Multithreading is a powerful technique to exploit parallelism on multipr...
A High-Level Abstraction of Shared Accesses
- The 3rd Symposium on Operating System Design and Implementation
, 2000
Cited by 3 (1 self)
Peter J. Keleher (keleher@cs.umd.edu), Department of Computer Science, University of Maryland, College Park, MD 20742. Key words: shared memory, DSM, programming libraries, update protocols. We describe the design and use of the tape mechanism, a new high-level abstraction of accesses to shared data for software DSMs. Tapes consolidate and generalize a number of recent protocol optimizations, including update-based locks and record-replay barriers. Tapes are usually created by "recording" shared accesses. The resulting recordings can be used to anticipate future accesses by tailoring data movement to application semantics. Tapes-based mechanisms are layered on top of existing shared memory protocols, and are largely independent of the underlying memory model. Tapes can also be used to emulate the data-movement semantics of several update-based protocol implementations, without altering the underlying protocol implementation. We have used tapes to create the Tape...
Prescriptive Performance Tuning: The RX Approach
, 1998
Programmers often rely on performance analysis tools to provide feedback about the execution of their applications. However, the nature of this feedback is far from satisfactory. Often the feedback is purely descriptive and at a very low level, making it difficult for the programmer to rectify performance problems. This dissertation demonstrates a new approach to performance tuning: prescriptive performance debugging. Our approach can greatly reduce the burdens imposed on the programmer compared to existing performance analysis tools. The basis of this approach is a set of requirements that must be satisfied by a performance analysis tool. In problem domains where these requirements can be met, a performance tool can prescribe source-level changes to improve performance. Rx is one such tool that we have developed to improve the performance of explicitly parallel shared memory programs. Rx targets inter-process synchronization and data communication, two significant sources of...
A Split Data Cache Organization Based on Run-Time Data Locality Estimation
, 2000
Cache memories are used extensively in modern computer organizations in order to reduce the performance gap between fast microprocessors and slower main memory. Cache memory hides the main memory access latency by exploiting the data locality present in pre-fetched memory blocks in the cache. Conventional pre-fetching policies used in traditional cache organizations have the potential to waste the available cache bandwidth and space by bringing non-usable data into the cache. Conventional caches cannot meet the different sized storage requirements of data that exhibit spatial or temporal locality characteristics when their address spaces vary and are non-overlapping. Data characterized by spatial or temporal locality could be more efficiently accommodated if caches with different line sizes based on the locality type could be used. The fixed line size of a conventional cache restricts this efficiency. To reduce the performance bottleneck of conventional caches, an alternative cache organization is explored in this research. The SPEC92 benchmarks as well as other standard benchmark programs are used to observe run-time data locality. Based on the locality analysis, a simple locality prediction technique was designed in hardware capable of estimating the data locality bias of the cache-resident data during run-time. This prediction hardware is used to design a split data cache that uses two sub-caches: a spatial cache and a temporal cache. This organization stores data in the respective sub-caches based on the dynamic locality estimation during run-time of the executing programs. The split data cache organization showed a considerable performance increase over a conventional unified data cache by reducing the overall cache miss rate and bus data traffic. A better utilization of t...
Performance Tuning Software . . .
- JOURNAL OF SUPERCOMPUTING
Small organisations can now have access to high raw processing power using networks of workstations (NOW) as parallel computing platforms. Software Distributed Shared Memory (Software DSM) packages have been developed to facilitate the programming of such systems. However, because of the high interprocess latencies in a NOW, the performance of a software DSM application is more susceptible to the partitioning of the problem than what might be expected. This paper presents an approach for a tool to visualise the execution of a program in a way that highlights performance bottlenecks. The tool associates identified bottlenecks with the corresponding source code lines in order to determine what piece of code is the cause of poor performance. The visualisation technique is demonstrated in two case studies. They clearly show that the visualisation is indeed useful and provides an effective way to acquire an understanding of what characterises an application's sharing behaviour.
Domain Decomposition of the Fourth-Order AGE Method on Heat Equation with MPI
A parallel implementation of the fourth-order Alternating Group Explicit (AGE-4) method on the 1-D heat equation in a distributed computing environment through the Message Passing Interface (MPI) is reported. The numerical method is implicit and is based on a splitting strategy which is applied alternately at each half time step. The parallelization of the program is implemented by a domain decomposition strategy on MIMD parallel architectures using the MPI platform. The parallelization strategy and performance are discussed. It is concluded that the efficiency is strongly dependent on the grid size, block numbers and the number of processors. Different strategies to improve the computational efficiency are proposed.

