Value Locality and Load Value Prediction
, 1996
Cited by 331 (18 self)
Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describe how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e. branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict entire 32- and 64-bit register values based on previously-seen values. We find that, just as condition bits are fairly predictable on a per-static-branch basis, full register values being loaded from memory are frequently predictable as well. Furthermore, we show that simple microarchitectural enhancements to two modern microprocessor implementations (based on the PowerPC 620 and Alpha 21164) that enable load value prediction can effectively exploit value locality to collapse true dependencies, reduce average memory latency and bandwidth requirements, and provide measurable performance gains.
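The definition of value locality in the abstract above (the likelihood that a previously-seen value recurs within a storage location) can be illustrated with a small measurement sketch. This is not the paper's predictor hardware; the function name `value_locality`, the `(key, value)` trace format, and the `depth` parameter are assumptions made for illustration, with `key` standing in for whatever identifies a location (for example a load address or a static load's program counter).

```python
from collections import defaultdict, deque


def value_locality(trace, depth=1):
    """Return the fraction of accesses whose value matches one of the last
    `depth` distinct values previously seen for the same key.

    `trace` is a list of (key, value) pairs (hypothetical format).
    """
    # Per-key history of the most recent distinct values, newest last.
    history = defaultdict(lambda: deque(maxlen=depth))
    hits = 0
    for key, value in trace:
        seen = history[key]
        if value in seen:
            hits += 1
            seen.remove(value)  # re-append below so it becomes the newest entry
        seen.append(value)      # deque drops the oldest entry when full
    return hits / len(trace) if trace else 0.0


# Made-up trace: one location always holds 7, another alternates 1, 2, 1.
example_trace = [
    (0x100, 7), (0x100, 7), (0x104, 1),
    (0x100, 7), (0x104, 2), (0x104, 1),
]
locality_depth1 = value_locality(example_trace, depth=1)
locality_depth2 = value_locality(example_trace, depth=2)
```

With a history depth of 1 only the repeated 7s count as hits; a depth of 2 also catches the alternating location, which is why a deeper history can only raise the measured locality.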
Reducing State Loss for Effective Trace Sampling of Superscalar Processors
- In Proceedings of the 1996 International Conference on Computer Design (ICCD)
, 1996
Cited by 88 (2 self)
There is a wealth of technological alternatives that can be incorporated into a processor design. These include reservation station designs, functional unit duplication, and processor branch handling strategies. The performance of a given design is measured through the execution of application programs and other workloads. Presently, trace-driven simulation is the most popular method of processor performance analysis in the development stage of system design. Current techniques of trace-driven simulation, however, are extremely slow and expensive. In this paper, a fast and accurate method for statistical trace sampling of superscalar processors is proposed.
Value Locality And Speculative Execution
, 1997
Cited by 51 (1 self)
This thesis introduces a program attribute called value locality and proposes speculative execution under the weak dependence model. The weak dependence model lays a theoretical foundation for exploiting value locality and other program attributes by speculatively relaxing and deferring the detection and enforcement of control- and data-flow dependences between instructions to expose more instruction-level parallelism without violating program correctness. Value locality is a program attribute that describes the likelihood of the recurrence of a previously-seen value within a storage location inside a computer system. Most modern processors already exploit value locality through the use of control speculation (i.e. branch prediction), which seeks to predict the future values of condition code bits and branch-target addresses based on previously-seen values. Experimental results indicate that value locality exists for condition codes and branch target addresses, and for general-purpose ...
Calibration of Microprocessor Performance Models
, 1998
Cited by 47 (5 self)
This paper outlines a method for calibrating a superscalar processor performance model. It is adapted from and integrates well with the existing tools and method used in industry for hardware functional validation. The goal of this study is to identify the risks of developing a performance model with inspection-based validation. It is shown that such a model can exhibit major latency and behavioral discrepancies against the actual hardware. A performance model constructed for the PowerPC 604 processor is shown to model the latency of only 50% of all instructions, and the pipeline flow behavior of only 30% of all instructions. Using our calibration method it is possible to very quickly identify enough model errors and improve the above percentages to 96% and 75% respectively. Three types of errors are documented and discussed: modeling errors, specification errors, and abstraction errors. Based on the calibration approach used in this study, we propose a systematic method for ensuring t...
MASE: A Novel Infrastructure for Detailed Microarchitectural Modeling
- in Proceedings of the 2001 International Symposium on Performance Analysis of Systems and Software
, 2001
Cited by 45 (1 self)
MASE (Micro Architectural Simulation Environment) is a novel infrastructure that provides a flexible and capable environment to model modern microarchitectures. Many popular simulators, such as SimpleScalar, are predominantly trace-based, where the performance simulator is driven by a trace of instructions read from a file or generated on-the-fly by a functional simulator. Trace-driven simulators are well-suited for oracle studies and provide a clean division between performance modeling and functional emulation. A major problem with this approach, however, is that it does not accurately model timing-dependent computations, an increasing trend in microarchitecture designs such as those found in multiprocessor systems. MASE implements a micro-functional performance model that combines timing and functional components into a single core. In addition, MASE incorporates a trace-driven functional component used to implement oracle studies and check the results of instructions as they commit. The check feature reduces the burden of correctness on the micro-functional core and also serves as a powerful debugging aid. MASE also implements a callback scheduling interface to support resources with non-deterministic latencies such as those found in highly concurrent memory systems. MASE was built on top of the current version of SimpleScalar. Analyses show that the performance statistics are comparable without a significant increase in simulation time.
A Worst Case Timing Analysis Technique for Multiple-Issue Machines
- In Proc. 19th IEEE Real-Time Systems Symposium (RTSS'98)
, 1998
Cited by 29 (0 self)
We propose a worst case timing analysis technique for in-order, multiple-issue machines. In the proposed technique, timing information for each program construct is represented by a directed acyclic graph (DAG) that shows dependences among instructions in the program construct. From this information, we derive for each pair of instructions the distance bounds between their issue times. Using these distance bounds, we identify the sets of instructions that can be issued at the same time. Deciding such instructions is an essential task in reasoning about the timing behavior of multiple-issue machines. In order to reduce the complexity of analysis, the distance bounds are progressively refined through a hierarchical analysis over the program syntax tree in a bottom-up fashion. Our experimental results show that the proposed technique can predict the worst case execution times for in-order, multiple-issue machines as accurately as ones for simpler RISC processors.
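The idea of deriving pairwise issue-time distance bounds from a dependence DAG, as described in the abstract above, can be sketched in a much simplified form. The sketch below is an assumption-laden illustration and not the paper's algorithm: it computes only lower bounds from data dependences (longest latency path between two instructions), ignores resource constraints, upper bounds, and the hierarchical refinement over the syntax tree, and the names `min_issue_distances` and `can_coissue` are hypothetical.

```python
def min_issue_distances(n, deps):
    """Lower bounds on issue-time distances in a dependence DAG.

    n    -- number of instructions, indexed 0..n-1 in program order
    deps -- dict mapping (producer, consumer) -> latency in cycles,
            with producer < consumer (assumed, so program order is
            a topological order)
    Returns dist where dist[i][j] is the minimum number of cycles by which
    instruction j must issue after instruction i, or None if j does not
    depend on i.
    """
    dist = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    # Visit edges in order of consumer so every bound into the producer
    # is final before it is extended across the edge.
    for (p, c), lat in sorted(deps.items(), key=lambda e: e[0][1]):
        for i in range(n):
            if dist[i][p] is not None:
                cand = dist[i][p] + lat
                if dist[i][c] is None or cand > dist[i][c]:
                    dist[i][c] = cand
    return dist


def can_coissue(dist, i, j):
    """True if no dependence forces i and j into different cycles."""
    a, b = dist[i][j], dist[j][i]
    return (a is None or a == 0) and (b is None or b == 0)


# Made-up four-instruction construct: 0 feeds 1 and 2, both feed 3.
example_deps = {(0, 1): 2, (0, 2): 1, (1, 3): 1, (2, 3): 3}
example_dist = min_issue_distances(4, example_deps)
```

In this example instructions 1 and 2 have no dependence path between them, so `can_coissue(example_dist, 1, 2)` holds, while instruction 3 must issue at least 4 cycles after instruction 0 because of the longer path through instruction 2.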
Visualizing Application Behavior on Superscalar Processors
- In InfoVis
, 1999
Cited by 13 (1 self)
The advent of superscalar processors with out-of-order execution makes it increasingly difficult to determine how well an application is utilizing the processor and how to adapt the application to improve its performance. In this paper, we describe a visualization system for the analysis of application behavior on superscalar processors. Our system provides an overview-plus-detail display of the application's execution. A timeline view of pipeline performance data shows the overall utilization of the pipeline, indicating regions of poor instruction throughput. This information is displayed using multiple time scales, enabling the user to drill down from a high-level application overview to a focus region of hundreds of cycles. This region of interest is displayed in detail using an animated cycle-by-cycle view of the execution. This view shows how instructions are reordered and executed and how functional units are being utilized. Additional context views correlate instructions in this detailed view with the relevant source code for the application. This allows the user to discover the root cause of the poor pipeline utilization and make changes to the application to improve its performance.
Using Visualization To Understand The Behavior Of Computer Systems
, 2001
Cited by 8 (1 self)
As computer systems continue to grow rapidly in both complexity and scale, developers need tools to help them understand the behavior and performance of these systems. While information visualization is a promising technique, most existing computer systems visualizations have focused on very specific problems and data sources, limiting their applicability. This dissertation introduces Rivet, a general-purpose environment for the development of computer systems visualizations. Rivet can be used for both real-time and post-mortem analyses of data from a wide variety of sources. The modular architecture of Rivet enables sophisticated visualizations to be assembled using simple building blocks representing the data, the visual representations, and the mappings between them. The implementation of Rivet enables the rapid prototyping of visualizations through a scripting language interface while still providing high-performance graphics and data management. The effectiveness of Rivet as a tool for computer systems analysis is demonstrated through a collection of case studies. Visualizations created using Rivet have been used to display: (a) line-by-line execution data from the SUIF Explorer interactive parallelizing compiler, enabling programmers to maximize the parallel speedups of their applications; (b) detailed memory system utilization data from the FlashPoint memory profiler, providing insights on both sequential and parallel program bottlenecks; (c) the behavior of applications running on superscalar processors, allowing developers to take full advantage of these complex CPUs; and (d) the real-time performance of computer systems and clusters, drawing attention to interesting or anomalous behavior. In addition to these focused examples, Rivet has also been used in co...
STATS: A Framework for Microprocessor and System-Level Design Space Exploration
- J. Syst. Arch
, 1999
Cited by 4 (0 self)
As microprocessor-based systems grow in complexity, and the processor-memory speed gap widens further, more emphasis needs to be placed on early design space exploration in order to produce the highest performance systems with minimal schedule impact. We discuss the critical issues associated with architectural evaluation of complex microprocessor-based systems, and present a methodology for the comprehensive and semi-automatic evaluation of processor, cache hierarchy, system interconnect, and main memory architectural and technological alternatives. We discuss the implementation of the methodology, and describe how it can be used in early design space exploration. The unique aspects of the methodology are further illustrated through two architectural investigations performed using the toolset. 1 Introduction The design of commercial microprocessor computer systems typically begins with the comparative evaluation of architectural alternatives, resulting in architectural specification...

