Filtering Techniques to Improve Trace-Cache Efficiency
- in Proc. of the International Conference on Parallel Architectures and Compilation Techniques
, 2001
Cited by 19 (4 self)
The trace cache is becoming an important building block of modern wide-issue processors. So far, trace-cache-related research has focused on increasing fetch bandwidth. Trace caches have been shown to effectively increase the number of “useful” instructions that can be fetched into the machine, thus enabling more instructions to be executed each cycle. However, the trace cache has another important benefit that has received less attention in recent research: especially for variable-length ISAs, such as Intel’s IA-32 architecture (x86), reducing instruction decoding power is particularly attractive. Keeping the instruction traces in decoded format implies that the decoding power is paid only when a trace is built, thus reducing the overall power consumption of the
Parallelism in the Front-End
- in Proceedings of the 30th annual international symposium on Computer architecture
, 2003
Cited by 9 (0 self)
As processor back-ends get more aggressive, front-ends will have to scale as well. Although the back-ends of superscalar processors have continued to become more parallel, the front-ends remain sequential. This paper describes techniques for fetching and renaming multiple non-contiguous portions of the dynamic instruction stream in parallel using multiple fetch and rename units. It demonstrates that parallel front-ends are a viable alternative to high-performance sequential front-ends.
Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture
- In 2nd International Symposium on Code Generation and Optimization
, 2004
Cited by 9 (0 self)
We study several major characteristics of dynamic optimization within the PARROT power-aware, trace-cache-based microarchitectural framework. We investigate the benefit of providing optimizations which, although tightly coupled with the microarchitecture in substance, are decoupled in time. The tight coupling in substance provides the potential for tailoring optimizations to the microarchitecture in a manner impossible or impractical not only for traditional static compilers but even for a JIT. We show that the contribution of common, generic optimizations to processor performance and energy efficiency may be more than doubled by creating a more intimate correlation between hardware specifics and the optimizer. In particular, dynamic optimizations can profit greatly from hardware supporting fused and SIMDified operations. At the same time, the decoupling in time allows optimizations to be arbitrarily aggressive without significant performance loss. We demonstrate that requiring up to 512 repetitions before a trace is optimized sacrifices almost no performance or efficiency as compared with lower thresholds. These results confirm the feasibility of an energy-efficient hardware implementation of an aggressive optimizer.
Selecting long atomic traces for high coverage
- in Proceedings of the 17th International Conference on Supercomputing
, 2003
Cited by 6 (1 self)
This paper performs a comprehensive investigation of dynamic selection for long atomic traces. It introduces a classification of trace selection methods and discusses existing and novel dynamic selection approaches, including loop unrolling, procedure inlining, and incremental merging of traces based on dynamic bias. The paper empirically analyzes a number of selection schemes in an idealized framework. Observations based on the SPEC-CPU2000 benchmarks show that: (a) selection based on dynamic bias is necessary to achieve the best performance across all benchmarks, (b) the best selection scheme is benchmark and maximum trace-length specific, (c) simple selection, based on program structure information only, is sufficient to achieve the best performance for several benchmarks. Consequently, two alternatives for the trace selection mechanism are established: (a) a “best performance” approach relying on complex dynamic criteria; (b) a “value” approach that provides the best performance (and potentially the best power consumption) based on simpler static criteria. Another emerging alternative advocates adaptive mechanisms to adjust selection criteria.
Finding parallelism for future EPIC machines
- in Proceedings of the 4th Workshop on Explicitly Parallel Instruction Computing Techniques
, 2005
Cited by 4 (3 self)
Parallelism has been the primary architectural mechanism to increase computer system performance. To continue pushing the performance envelope, identifying new sources of parallelism for future architectures is critical. Current hardware exploits local instruction-level parallelism (ILP) as hardware resources and information communicated by the instruction-set architecture (ISA) permit. From this perspective, the Explicitly Parallel Instruction Computing (EPIC) ISA is an interesting model for future machines, as its primary design allows software to expose analysis information to the underlying processor for it to exploit parallelism. In this effort, EPIC processors have been more or less successful; however, the question of how to identify (or create) additional parallelism remains. This paper analyzes the potential of future relationships of compilers, ISAs, and hardware resources to collectively exploit new levels of parallelism. By experimentally studying the ILP of applications under ideal execution conditions (e.g., perfect memory disambiguation, infinite instruction-issue window, and infinite machine resources), the impact of aggressive compiler optimization and the underlying processor ISA on parallelism can be explored. Experimental comparisons involving an Itanium-based EPIC model and an Intel x86-based CISC (Complex Instruction Set Computing) model indicate that the compiler and certain ISA details directly affect local and distant instruction-level parallelism. The experimental results also suggest promising research directions for extracting the distant ILP.
A Survey of prefetching techniques
, 2000
Cited by 2 (0 self)
As the gap between processor and memory speeds increases, memory latencies have become a critical bottleneck for computer performance. To reduce the bottleneck, designers have had to create methods to hide these latencies. One popular method is prefetching. This method fetches the data from the memory system before being asked for by the processor, with the expectation that it will soon be referenced. An effective prefetching scheme reduces cache miss rates and therefore hides the memory latency. The aim of this paper is to provide a survey of hardware prefetching techniques. To achieve this goal, we provide a brief introduction to the concepts behind prefetching. An overview of software prefetching techniques is also given. We are then in a position to examine a number of instruction and data prefetching schemes that have previously been proposed.
Keywords: prefetching, caches
Computing Review Categories: B3.2, D3.4
Code Cache Management in Dynamic Optimization Systems
, 2004
Cited by 2 (0 self)
Dynamic optimization systems store optimized or translated code in software-managed code caches in order to maximize reuse of transformed code. Code caches store superblocks that are not fixed in size, may contain links to other superblocks, and carry a high replacement overhead. These additional constraints reduce the effectiveness of conventional cache management policies. This dissertation investigates the code cache management problem in dynamic optimization systems and presents three major advances that cover the design space of cache management decisions. Through code cache simulations, we show that a FIFO replacement policy outperforms other traditional policies, as it enables contiguous cache evictions, allows for a simple circular buffer implementation, and results in comparable cache miss rates to LRU. Furthermore, a pseudo-circular FIFO algorithm is presented, which handles the problem of un-deletable cache blocks. An investigation of cache eviction granularities also reveals that evicting more than the minimum number of superblocks from the code cache at a time results in
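The FIFO policy described in this abstract can be illustrated with a minimal sketch. It assumes a cache whose capacity is counted in bytes and superblocks identified by an arbitrary key; the class name `FifoCodeCache` and its methods are hypothetical and are not taken from the dissertation. The sketch models only FIFO ordering and size accounting for variable-size superblocks. It does not model inter-superblock links, undeletable blocks, or the pseudo-circular variant.

```python
from collections import deque


class FifoCodeCache:
    """Toy FIFO code cache for variable-size superblocks (hypothetical names)."""

    def __init__(self, capacity):
        self.capacity = capacity   # total cache size, e.g. in bytes
        self.used = 0              # space currently occupied
        self.queue = deque()       # (block_id, size), oldest entry first
        self.resident = set()      # ids of superblocks currently cached

    def lookup(self, block_id):
        """Return True on a hit, False on a miss."""
        return block_id in self.resident

    def insert(self, block_id, size):
        """Insert a superblock, evicting the oldest ones until it fits.

        Returns the list of evicted block ids, in eviction order.
        """
        if size > self.capacity:
            raise ValueError("superblock larger than the whole cache")
        if block_id in self.resident:
            return []
        evicted = []
        # FIFO eviction: always remove from the head of the queue, which in a
        # circular-buffer layout frees one contiguous region.
        while self.used + size > self.capacity:
            old_id, old_size = self.queue.popleft()
            self.resident.discard(old_id)
            self.used -= old_size
            evicted.append(old_id)
        self.queue.append((block_id, size))
        self.resident.add(block_id)
        self.used += size
        return evicted
```

Because eviction always removes the oldest entries, no per-access bookkeeping is needed on a hit, which is one reason a FIFO policy is cheap to implement compared with LRU.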
On Augmenting Trace Cache for High-Bandwidth Value Prediction
, 2002
Value prediction is a technique that breaks true data dependences by predicting the outcome of an instruction and speculatively executes its data-dependent instructions based on the predicted outcome. As the instruction fetch rate and issue rate of processors increase, the potential data dependences among instructions issued in the same cycle also increase. Value prediction and speculative execution become critical to keep the issue rate high. Unfortunately, most of the proposed value prediction schemes focused only on the accuracy of the prediction. They have yet to consider the bandwidth required to access the value prediction tables.
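As background for the idea of predicting an instruction's outcome, here is a minimal sketch of a generic last-value predictor, one of the simplest value prediction schemes. It is not the trace-cache-augmented design this paper proposes; the class name `LastValuePredictor`, the table size, and the helper `accuracy` are assumptions made for illustration only.

```python
class LastValuePredictor:
    """Direct-mapped last-value predictor (illustrative, hypothetical names)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # last result seen in each slot

    def _index(self, pc):
        # Index the table with the low-order bits of the instruction address.
        return pc % self.entries

    def predict(self, pc):
        """Predict that the instruction produces the same value as last time."""
        return self.table[self._index(pc)]

    def update(self, pc, actual):
        """Record the value the instruction actually produced."""
        self.table[self._index(pc)] = actual


def accuracy(predictor, trace):
    """Fraction of (pc, value) pairs in the trace that were predicted correctly."""
    correct = 0
    for pc, value in trace:
        if predictor.predict(pc) == value:
            correct += 1
        predictor.update(pc, value)
    return correct / len(trace) if trace else 0.0
```

A wide-issue machine would need several such lookups and updates in the same cycle, which is the table-bandwidth problem the abstract points to.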
Visualizing Potential Parallelism in Sequential Programs
This paper presents ParaMeter, an interactive program analysis and visualization system for large traces. Using ParaMeter, a software developer can locate and analyze regions of code that may yield to parallelization efforts and possibly extract performance from multicore hardware. The key contributions in the paper are (1) a method to use interactive visualization of traces to find and exploit parallelism, (2) interactive-speed visualization of large-scale trace dependencies, (3) interactive-speed visualization of code interactions, and (4) a BDD variable ordering for BDD-compressed traces that results in fast visualization, fast analysis, and good compression. ParaMeter’s effectiveness is demonstrated by finding and exploiting parallelism in 175.vpr. Measurements of ParaMeter’s visualization algorithms show that they are up to seventy-five thousand times faster than prior approaches.

