Microarchitectural Exploration with Liberty
- In Proceedings of the 35th International Symposium on Microarchitecture
, 2002
Cited by 80 (27 self)
To find the best designs, architects must rapidly simulate many design alternatives and have confidence in the results. Unfortunately, the most prevalent simulator construction methodology, hand-writing monolithic simulators in sequential programming languages, yields simulators that are hard to retarget, limiting the number of designs explored, and hard to understand, instilling little confidence in the model. Simulator construction tools have been developed to address these problems, but analysis reveals that they do not address the root cause, the error-prone mapping between the concurrent, structural hardware domain and the sequential, functional software domain. This paper presents an analysis of these problems and their solution, the Liberty Simulation Environment (LSE). LSE automatically constructs a simulator from a machine description that closely resembles the hardware, ensuring fidelity in the model. Furthermore, through a strict but general component communication contract, LSE enables the creation of highly reusable component libraries, easing the task of rapidly exploring ever more exotic designs.
Frequent Value Compression in Data Caches
, 2000
Cited by 31 (4 self)
Since the area occupied by cache memories on processor chips continues to grow, an increasing percentage of power is consumed by memory. We present the design and evaluation of the compression cache (CC), a first-level cache designed so that each cache line can hold either one uncompressed line or two cache lines that have each been compressed to at least half their length. We use a novel data compression scheme based on encoding a small number of values that appear frequently during memory accesses. This compression scheme preserves the ability to randomly access individual data items. We observed that the contents of 40%, 52% and 51% of the memory blocks of size 4, 8, and 16 words, respectively, in SPECint95 benchmarks can be compressed to at least half their sizes by encoding the top 2, 4, and 8 frequent values, respectively. Compression allows greater amounts of data to be stored, leading to substantial reductions in miss rates (0-36.4%), off-chip traffic (3.9-48.1%)...
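The compressibility test described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's encoding: it assumes 32-bit words and a hypothetical format in which every word carries a 1-bit flag followed by either a short index into the frequent-value table or the raw word. The function names are made up for the example.

```python
from collections import Counter
from math import ceil, log2

WORD_BITS = 32  # assumption: 32-bit data words


def top_frequent_values(trace, k):
    """Return the k most common values seen in a trace of accessed words."""
    return [value for value, _ in Counter(trace).most_common(k)]


def encoded_bits(block, frequent):
    """Bits needed for a block under the assumed flag+index / flag+raw format."""
    index_bits = max(1, ceil(log2(len(frequent)))) if frequent else 0
    total = 0
    for word in block:
        total += 1  # 1-bit flag: frequent value or raw word
        total += index_bits if word in frequent else WORD_BITS
    return total


def compressible_to_half(block, frequent):
    """True if the encoded block fits in half of its uncompressed size."""
    return encoded_bits(block, frequent) * 2 <= len(block) * WORD_BITS
```

Under these assumptions, a 4-word block with three frequent words and one raw word needs 2 + 2 + 2 + 33 = 39 bits and fits in half a line (64 bits), while a block with two raw words needs 70 bits and does not.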
Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation
- Design, Automation and Test in Europe Conference and Exhibition
, 2003
Cited by 29 (5 self)
Given the growth in application-specific processors, there is a strong need for a retargetable modeling framework that is capable of accurately capturing complex processor behaviors and generating efficient simulators. We propose the operation state machine (OSM) computation model to serve as the foundation of such a modeling framework. The OSM model separates the processor into two interacting layers: the operation layer where operation semantics and timing are modeled, and the hardware layer where disciplined hardware units interact. This declarative model allows for direct synthesis of micro-architecture simulators as it encapsulates precise concurrency semantics of microprocessors. We illustrate the practical benefits of this model through two case studies: the StrongARM core and the PowerPC-750 superscalar processor. The experimental results demonstrate that the OSM model has excellent modeling productivity and model efficiency. Additional applications of this modeling framework include derivation of information required by compilers and formal analysis for processor validation.
Optimizing Static Power Dissipation by Functional Units in Superscalar Processors
, 2002
Cited by 26 (0 self)
We present a novel approach that combines compiler, instruction set, and microarchitecture support to reduce static power dissipation by turning off functional units that are idle for long periods of time, using power gating [2, 9]. The compiler identifies program regions in which functional units are expected to be idle and communicates this information to the hardware by issuing directives for turning units off at entry points of idle regions and directives for turning them back on at exits from such regions. The microarchitecture is designed to treat the compiler directives as hints, ignoring a pair of off and on directives if they are too close together. Experimental results show that some of the functional units can be kept off for over 90% of the time at the cost of a minimal performance degradation of under 1%.
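The "directives as hints" behavior can be illustrated with a small trace-level model. This is only a sketch under stated assumptions: it works offline on a recorded directive stream for a single functional unit, and the `min_idle_cycles` threshold and the `(cycle, action)` tuple format are hypothetical, not taken from the paper.

```python
def filter_directives(directives, min_idle_cycles):
    """Drop off/on pairs whose idle interval is shorter than min_idle_cycles.

    directives: list of (cycle, action) tuples for one functional unit,
    in program order, with action equal to "off" or "on".
    """
    kept = []
    pending_off = None  # cycle of the most recent unmatched "off" directive
    for cycle, action in directives:
        if action == "off":
            pending_off = cycle
        elif action == "on" and pending_off is not None:
            # Honor the pair only if the unit would stay off long enough.
            if cycle - pending_off >= min_idle_cycles:
                kept.append((pending_off, "off"))
                kept.append((cycle, "on"))
            pending_off = None
    return kept
```

For example, with a threshold of 50 cycles, an off/on pair at cycles 10 and 12 is ignored, while a pair at cycles 100 and 500 is honored.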
Superscalar Execution With Dynamic Data Forwarding
- In Proceedings of the 1998 ACM/IEEE Conference on Parallel Architectures and Compilation Techniques
, 1998
Cited by 24 (1 self)
We empirically demonstrate that in order to take advantage of increasing issue widths, superscalar processors require quadratically growing instruction window sizes. Since conventional central window design aims to provide full data fan-out to all the instructions in the window, designing large instruction windows using conventional techniques is not feasible. We show that full data fan-out is not necessary for achieving high performance when a novel approach is used to distribute the values. We implement the instruction window by direct matching in a small on-chip memory called the wait memory, and bring a small subset of instructions that are likely to become ready into a match unit, where instruction selection and operand matching are performed. We show that the match unit needs to grow only linearly with the issue width. We use SPEC95 benchmarks to demonstrate that at a given instruction window size our algorithm provides over 90 percent of the IPC that ca...
Energy Efficient Frequent Value Data Cache Design
- Int. Symp. on Microarchitecture
, 2002
Cited by 21 (4 self)
Recent work has shown that a small number of distinct frequently occurring values often account for a large portion of memory accesses. In this paper we demonstrate how this frequent value phenomenon can be exploited in designing a cache that trades off performance with energy efficiency. We propose the design of the Frequent Value Cache (FVC), in which a frequent value requires few bits because it is stored in encoded form, while all other values are stored in unencoded form using 32 bits. The data array is partitioned into two arrays such that if a frequent value is accessed only the first data array is accessed; otherwise an additional cycle is needed to access the second data array. Experiments with some of the SPEC95 benchmarks show that on average a 64Kb/64-value FVC provides a 28.8% reduction in L1 cache energy and a 3.38% increase in execution time over a conventional 64Kb cache.
Instruction based memory distance analysis and its application to optimization
- In Proceedings of the 14th International Conference on Parallel Architectures and Compilation
, 2005
Cited by 18 (2 self)
Feedback-directed optimization has become an increasingly important tool in designing and building optimizing compilers, as it provides a means to analyze complex program behavior that is not possible using traditional static analysis. Feedback-directed optimization offers the compiler opportunities to analyze and optimize the memory behavior of programs even when traditional array-based analysis is not applicable. As a result, both floating-point and integer programs can benefit from memory hierarchy optimization. In this paper we examine the notion of memory distance as it is applied to the instruction space of a program and to feedback-directed optimization. Memory distance is defined as a dynamic distance in terms of memory references between two accesses to the same memory location. We use memory distance to predict the miss rates of instructions in a program. Using the miss rates, we then identify the program's critical instructions, the set of high-miss instructions whose cumulative misses account for 95% of the L2 cache misses in the program, in both integer and floating-point programs. Our experiments show that memory-distance analysis can effectively identify critical instructions in both integer and floating-point programs. Additionally, we apply memory-distance analysis to memory disambiguation in out-of-order issue processors, using those distances to determine when a load may be speculated ahead of a preceding store. Our experiments show that memory-distance-based disambiguation on average achieves close to the performance gain of the store set technique, which requires a hardware table.
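The two definitions in this abstract, memory distance and critical instructions, can be made concrete with a short sketch. It assumes the simplest reading of the definition (the count of intervening memory references, not the count of distinct addresses), and the function names and the dictionary of per-instruction miss counts are hypothetical inputs for illustration.

```python
def memory_distances(trace):
    """For each access, count the references since the previous access
    to the same address; None marks the first access to an address."""
    last_seen = {}
    distances = []
    for position, address in enumerate(trace):
        previous = last_seen.get(address)
        distances.append(None if previous is None else position - previous - 1)
        last_seen[address] = position
    return distances


def critical_instructions(miss_counts, coverage=0.95):
    """Pick the highest-miss instructions until their cumulative misses
    reach the given fraction of all misses."""
    total = sum(miss_counts.values())
    if total == 0:
        return []
    selected, covered = [], 0
    ranked = sorted(miss_counts.items(), key=lambda item: item[1], reverse=True)
    for instruction, misses in ranked:
        if covered >= coverage * total:
            break
        selected.append(instruction)
        covered += misses
    return selected
```

On the trace `a b c a b b`, the second access to `a` has distance 2 and the final access to `b` has distance 0. With miss counts of 90, 6, 3, and 1 for four loads, the first two loads already cover 96% of the misses and form the critical set.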
The Liberty Simulation Environment: A deliberate approach to high-level system modeling
- ACM Transactions on Computer Systems
, 2004
Cited by 10 (3 self)
In digital hardware system design, the quality of the product is directly related to the number of meaningful design alternatives properly considered. Unfortunately, existing modeling methodologies and tools have properties which make them less than ideal for rapid and accurate design-space exploration. This article identifies and evaluates the shortcomings of existing methods to motivate the Liberty Simulation Environment (LSE). LSE is a high-level modeling tool engineered to address these limitations, allowing for the rapid construction of accurate high-level simulation models. LSE simplifies model specification with low-overhead component-based reuse techniques and an abstraction for timing control. As part of a detailed description of LSE, this article presents these features, their impact on model specification effort, their implementation, and optimizations created to mitigate their otherwise deleterious impact on simulator execution
Microarchitecture Modeling for Design-Space Exploration
, 2004
Cited by 9 (2 self)
To identify the best processor designs, designers explore a vast design space. To assess the quality of candidate designs, designers construct and use simulators. Unfortunately, simulator construction is a bottleneck in this design-space exploration because existing simulator construction methodologies lead to long simulator development times. This bottleneck limits exploration to a small set of designs, potentially diminishing quality of the final design.
Load Redundancy Removal through Instruction Reuse
- In International Conference on Parallel Processing
, 2000
Cited by 9 (2 self)
Instruction reuse techniques have been developed to detect and remove redundancy at runtime. By maintaining the execution history of an instruction, reuse techniques detect if a subsequent execution of an instruction will yield the same result as its previous execution, and if this is the case, the result is made available to dependent instructions without executing the instruction. This approach eliminates same instruction redundancy, that is, redundancy across different dynamic instances of the same static instruction. However, the main limitation of existing instruction reuse techniques is that they do not detect or eliminate different instruction redundancy, that is, redundancy across dynamic instances of statically distinct instructions. We present instruction reuse techniques for load redundancy removal that eliminate both same and different instruction redundancy. We first present a study that shows that in addition to significant levels of same instruction redundancy (average of ...
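The baseline scheme this abstract starts from, same-instruction reuse, can be sketched as a table keyed by the static instruction and its operand values. This is a simplified software model, not the paper's design: the `ReuseBuffer` class, its unbounded table, and the `compute` callback are assumptions for illustration, and it does not model the invalidation a real load-reuse scheme needs when memory changes.

```python
class ReuseBuffer:
    """Toy model of same-instruction reuse keyed by (pc, operand values)."""

    def __init__(self):
        self._entries = {}  # (pc, operands) -> previously computed result
        self.hits = 0

    def execute(self, pc, operands, compute):
        """Return the cached result when this static instruction has
        already run with the same operands; otherwise execute it."""
        key = (pc, tuple(operands))
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        result = compute(*operands)
        self._entries[key] = result
        return result
```

Because the key includes the program counter, a different static instruction with identical operands still misses in this model, which is the different-instruction redundancy the abstract says existing techniques fail to capture.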

