Analysis of Benchmark Characteristics and Benchmark Performance Prediction
ACM Transactions on Computer Systems, 1992
Cited by 99 (4 self)
Standard benchmarking provides the run times for given programs on given machines, but fails to provide insight as to why those results were obtained (either in terms of machine or program characteristics), and fails to provide run times for that program on some other machine, or some other programs on that machine. We have developed a machine-independent model of program execution to characterize both machine performance and program execution. By merging these machine and program characterizations, we can estimate execution time for arbitrary machine/program combinations. Our technique allows us to identify those operations, either on the machine or in the programs, which dominate the benchmark results. This information helps designers in improving the performance of future machines, and users in tuning their applications to better utilize the performance of existing machines. Here we apply our methodology to characterize benchmarks and predict their execution times. We present extensi...
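As a rough illustration of the merging step this abstract describes, the sketch below assumes the machine characterization is a table of per-operation times and the program characterization is a table of dynamic operation counts, so that the estimated run time is their dot product. The operation names, the numbers, and the function names are hypothetical and are not taken from the paper.

```python
def predict_time(machine_times, program_counts):
    """Estimate run time as the sum over operations of count * time per operation."""
    missing = set(program_counts) - set(machine_times)
    if missing:
        raise KeyError(f"machine has no timing for: {sorted(missing)}")
    return sum(count * machine_times[op] for op, count in program_counts.items())


def dominant_operations(machine_times, program_counts, top=3):
    """Return the operations contributing most to the estimate, largest first."""
    contributions = {
        op: count * machine_times[op] for op, count in program_counts.items()
    }
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]


# Hypothetical machine characterization: seconds per abstract operation.
machine = {"fp_add": 2e-9, "fp_mul": 3e-9, "load": 5e-9, "branch": 1e-9}
# Hypothetical program characterization: dynamic operation counts.
program = {"fp_add": 1_000_000, "fp_mul": 500_000, "load": 2_000_000, "branch": 250_000}

estimate = predict_time(machine, program)
hotspots = dominant_operations(machine, program, top=2)
```

Because the two tables are kept separate, the same program counts can be paired with a different machine table to estimate the run time on another machine, which is the kind of machine/program combination the abstract refers to.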
A cost framework for evaluating integrated restructuring optimizations
Int’l. Conf. on Parallel Architectures and Compilation Techniques, 2001
Cited by 12 (1 self)
Loop transformations and array restructuring optimizations usually improve performance by increasing the memory locality of applications, but not always. For instance, loop and array restructuring can either complement or compete with one another. Previous research has proposed integrating loop and array restructuring, but there existed no analytic framework for determining how best to combine the optimizations for a given program. Since the choice of which optimizations to apply, alone or in combination, is highly application- and input-dependent, a cost framework is needed if integrated restructuring is to be automated by an optimizing compiler. To this end, we develop a cost model that considers standard loop optimizations along with two potential forms of array restructuring: conventional copying-based restructuring and remapping-based restructuring that exploits a smart memory controller. We simulate eight applications on a variety of input sizes and with a variety of hand-applied restructuring optimizations. We find that employing a fixed strategy does not always deliver the best performance. Finally, our cost model accurately predicts the best combination of restructuring optimizations among those we examine, and yields performance within a geometric mean of 5% of the best combination across all benchmarks and input sizes.
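To make the "geometric mean of 5%" figure concrete, here is a minimal sketch of how such a summary could be computed, assuming the metric is the geometric mean of the ratio between the run time of the predicted combination and the run time of the best combination, taken over all benchmark/input pairs. The function name and the sample timings are hypothetical; the paper's exact definition is not given in this abstract.

```python
import math


def geometric_mean_overhead(chosen_times, best_times):
    """Geometric mean of chosen/best run-time ratios, expressed as a fractional overhead."""
    if len(chosen_times) != len(best_times) or not chosen_times:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [chosen / best for chosen, best in zip(chosen_times, best_times)]
    # Average in log space so no single benchmark dominates the summary.
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return math.exp(log_mean) - 1.0


# Hypothetical run times (seconds) for three benchmark/input pairs.
chosen = [1.0, 2.2, 3.0]  # combination picked by a cost model
best = [1.0, 2.0, 3.0]    # best combination found by exhaustive simulation
overhead = geometric_mean_overhead(chosen, best)
```

An overhead of 0.0 would mean the model always picked the best combination; a value of 0.05 would correspond to the 5% quoted above under this assumed definition.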
MAD Kernels: An Experimental Testbed to Study Multiprocessor Memory System Behavior
1992
Cited by 4 (1 self)
On large-scale multiprocessors, access to common memory is one of the key performance limiting factors. The shared-memory performance depends not only on the characteristics of the memory hierarchy itself, but also upon the characteristics of the memory address streams and the interaction between the two. We present a technique for multiprocessor workload construction and a family of artificial kernels, called MAD-kernels, to systematically investigate the behavior of the memory hierarchy. The measured performance is independent of any particular application or algorithm. The proposed methodology is demonstrated on two commercial shared-memory systems.
Keywords: Performance evaluation, shared-memory multiprocessors, memory hierarchy, interconnection networks, resource contention, synchronization overhead, memory access patterns, unit grain characterization.
This work was supported in part by NSF grants ECS-88-14027, MIP-88-11815, CDA-9121641 and MIP-9204066, and DOE grant DE-FG02-93...
A Framework for Multiprocessor Performance Characterization and Calibration
1992
Cited by 2 (2 self)
By Arun K. Nanda. In parallel programs using the shared-variable paradigm, run-time communication overhead manifests itself along three principal dimensions, namely, shared data accesses (including memory contention, cache misses and non-local memory access latencies), inter-process synchronization operations, and global barrier synchronizations. Performance measurements that quantify the rate at which communication costs for an algorithm increase as more processors are used are integral to the study of an algorithm's efficiency and scalability. In this thesis, we explore the problem of performance characterization of a multiprocessor in the context of the shared-variable programming model with emphasis on characterizing the dynamic run-time behavior. We have developed a hierarchical model to characterize multiprocessor system performance using a multi-phase computation structure with concurrent asynchronous exec...
ThreadMarks: A Framework for Input-Aware Prediction of Parallel Application Behavior
Chip-multiprocessors (CMPs) are quickly becoming entrenched as the mainstream architectural platform in computer systems. One of the critical challenges facing CMPs is designing applications to effectively leverage the computational resources they provide. Modifying applications to effectively run on CMPs requires understanding the bottlenecks in applications, which necessitates a detailed understanding of architectural features. Unfortunately, identifying bottlenecks is complex and often requires enumerating a wide range of behaviors. To assist in identifying bottlenecks, this paper presents a framework for developing analytical models based on dynamic program behaviors. That is, given a program and set of training inputs, the framework will generate several analytical models that accurately predict online program behaviors such as memory utilization and synchronization overhead, while taking program input into consideration. These models can prove invaluable for online optimization systems and input-specific analysis of program behavior. We demonstrate that this framework is practical and accurate on a wide range of synthetic and real-world parallel applications over various workloads.
Cost-Model Driven Integration of Restructuring Optimizations
International Conference on Parallel Architectures and Compilation Techniques, 2001
Loop transformation and array restructuring are important compiler optimizations that improve memory locality in complementary ways. Although previous researchers have proposed integrating the two techniques, there exists no analytical framework for determining how best to combine them for a given program. In this paper, we propose a cost model for choosing between all combinations of loop and array restructuring options for a given loop nest. Since the choice of which optimization to apply, alone or in combination, is highly application- and/or input-dependent, such a cost model is crucial if integrated restructuring is to be automated by an optimizing compiler. Our cost model considers two potential forms of array restructuring: conventional copying-based restructuring and remapping-based restructuring that exploits a smart memory controller. We simulate six benchmark programs on a variety of input sizes and with a variety of restructuring optimizations. We find that employing a fixed strategy, e.g., only loop transformations or only copying-based restructuring, does not always deliver the best performance. We further find that for the benchmarks we examine, our cost model chooses the best combination of restructuring optimizations the vast majority of the time, and yields performance within an average of 10% of optimal across all benchmarks and input sizes.
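The selection step described in this abstract can be sketched as an exhaustive search over option pairs, assuming the cost model is available as a function that returns an estimated cost for each (loop option, array option) pair. The option names, the `choose_restructuring` helper, and the toy cost table are hypothetical stand-ins; the paper's actual cost model is not reproduced here.

```python
from itertools import product

# Assumed option sets: "none" means the optimization is not applied.
LOOP_OPTIONS = ("none", "loop_transform")
ARRAY_OPTIONS = ("none", "copying", "remapping")


def choose_restructuring(estimate_cost):
    """Return the (loop, array) pair with the lowest estimated cost."""
    combos = product(LOOP_OPTIONS, ARRAY_OPTIONS)
    return min(combos, key=lambda combo: estimate_cost(*combo))


# Toy cost table standing in for a real cost model (arbitrary units).
TOY_COSTS = {
    ("none", "none"): 100,
    ("none", "copying"): 85,
    ("none", "remapping"): 80,
    ("loop_transform", "none"): 82,
    ("loop_transform", "copying"): 90,
    ("loop_transform", "remapping"): 70,
}


def toy_cost(loop_option, array_option):
    """Look up the hypothetical cost of one combination."""
    return TOY_COSTS[(loop_option, array_option)]


best_combo = choose_restructuring(toy_cost)
```

In this toy table the two techniques compete when loop transformation is paired with copying (90 is worse than either alone) and complement each other when paired with remapping, which is why a fixed strategy would not always pick the cheapest entry.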

