Generational Cache Management of Code Traces in Dynamic Optimization Systems
, 2003
Cited by 17 (5 self)
A dynamic optimizer is a runtime software system that groups a program's instruction sequences into traces, optimizes those traces, stores the optimized traces in a software-based code cache, and then executes the optimized code in the code cache. To maximize performance, the vast majority of the program's execution should occur in the code cache and not in the different aspects of the dynamic optimization system. In the past, designers of dynamic optimizers have used the SPEC2000 benchmark suite to justify their use of simple code cache management schemes. In this paper, we show that the problem and importance of code cache management change dramatically as we move from SPEC2000, with its relatively small number of dynamically generated code traces, to large interactive Windows applications. We also propose and evaluate a new cache management algorithm based on generational code caches that results in an average miss rate reduction of 18% over a unified cache, which translates into 19% fewer instructions spent in the dynamic optimizer. The algorithm categorizes code traces based on their expected lifetimes and groups traces with similar lifetimes together in separate storage areas. Using this algorithm, short-lived code traces can easily be removed from a code cache without introducing fragmentation and without suffering the performance penalties associated with evicting long-lived code traces.
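The idea of grouping traces by expected lifetime can be illustrated with a small sketch. The abstract does not give the paper's promotion or eviction rules, so everything below is an assumption made for illustration: two generations (a nursery and a persistent area), FIFO eviction in both, and a hypothetical `promote_threshold` hit count that decides whether a trace evicted from the nursery is treated as long-lived.

```python
from collections import OrderedDict


class GenerationalCodeCache:
    """Toy two-generation trace cache (hypothetical policy, not the paper's)."""

    def __init__(self, nursery_slots, persistent_slots, promote_threshold=2):
        self.nursery = OrderedDict()      # trace_id -> hit count, oldest first
        self.persistent = OrderedDict()   # trace_id -> hit count, oldest first
        self.nursery_slots = nursery_slots
        self.persistent_slots = persistent_slots
        self.promote_threshold = promote_threshold  # assumed tuning knob
        self.misses = 0

    def access(self, trace_id):
        """Return True on a hit, False when the trace must be regenerated."""
        for generation in (self.persistent, self.nursery):
            if trace_id in generation:
                generation[trace_id] += 1
                return True
        self.misses += 1
        self._insert_into_nursery(trace_id)
        return False

    def _insert_into_nursery(self, trace_id):
        if len(self.nursery) >= self.nursery_slots:
            # Evict the oldest nursery trace; keep it only if it proved useful.
            victim, hits = self.nursery.popitem(last=False)
            if hits >= self.promote_threshold:
                self._promote(victim)
        self.nursery[trace_id] = 0

    def _promote(self, trace_id):
        if len(self.persistent) >= self.persistent_slots:
            self.persistent.popitem(last=False)
        self.persistent[trace_id] = 0
```

In this sketch, short-lived traces cycle through the nursery and are discarded without disturbing the persistent area, which is the property the abstract attributes to separating traces by lifetime.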
A Study of Memory System Performance of Multimedia Applications
- in Proceedings of the ACM Joint International Conference on Measurement & Modeling of Computer Systems (SIGMETRICS 2001)
, 2001
Cited by 13 (3 self)
Multimedia applications are fast becoming one of the dominating workloads for modern computer systems. Since these applications normally have large data sets and little data reuse, many researchers believe that they have poor memory behavior compared to traditional programs, and that current cache architectures cannot handle them well. It is therefore important to quantitatively characterize the memory behavior of these applications in order to provide insights for future design and research of memory systems. However, very few results on this topic have been published. This paper presents a comprehensive study of the memory requirements of a group of programs that are representative of multimedia applications. These programs include a subset of the popular MediaBench suite and several large multimedia programs running on the Linux, Windows NT and Tru64 UNIX operating systems. We performed extensive measurement and trace-driven simulation experiments. We then compared the memory utilization of these programs to that of SPECint95 applications. We found that multimedia applications actually have better memory behavior than SPECint95 programs. The high cache hit rates of multimedia applications can be attributed to the following three factors. First, most multimedia applications apply block partitioning algorithms to the input data, and work on small blocks of data that easily fit into the cache. Second, within these blocks, there is significant data reuse as well as spatial locality. Third, a large number of references generated by multimedia applications are to their internal data structures, which are relatively small and can also easily fit into reasonably sized caches.
Performance and Memory-Access Characterization of Data Mining Applications
- Workshop held in conjunction with the 31st Annual International Symposium on Microarchitecture
, 1998
Cited by 12 (1 self)
This paper characterizes the performance and memory-access behavior of a decision tree induction program, a previously unstudied application used in data mining and knowledge discovery in databases. Performance is studied via RSIM, an execution-driven simulator, for three uniprocessor models that exploit instruction-level parallelism to varying degrees. Several properties of the program are noted. Out-of-order dispatch and multiple issue provide a significant performance advantage: 50% to 250% improvement in IPC for out-of-order dispatch versus in-order dispatch, and 5% to 120% improvement in IPC for four-way issue versus single issue. Multiple issue provides a greater performance improvement for larger L2 cache sizes, when the program is limited by CPU performance; out-of-order dispatch provides a greater performance improvement for smaller L2 cache sizes. The program has a very small instruction footprint: for an 8-kB L1 instruction cache the instruction miss rate is below 0.1%. A small (8 kB) L1 data cache is sufficient to capture most of the locality of the data references, resulting in L1 miss rates between 10% and 20%. Increasing the size of the L2 data cache does not significantly improve performance until a significant fraction (over 1/4) of the dataset fits into the L2 cache. Lastly, a procedure is developed for scaling the cache sizes when using scaled-down datasets, allowing the results for smaller datasets to be used to predict the performance of full-sized datasets.
Access Pattern based Local Memory Customization for Low Power Embedded Systems
- In DATE
, 2001
Cited by 10 (3 self)
Memory accesses represent a major bottleneck in embedded systems power and performance. Traditionally, the local memory relied on a large cache to store all the variables in the application. However, especially in large real-life applications, different types of data exhibit divergent access patterns, with diverse locality and bandwidth needs. Traditional caches had to compromise between the different types of locality required by the access patterns, and trade off performance against bandwidth requirements. Instead, our approach customizes the local memory architecture to match the diverse access patterns and locality types present in the application, reducing the main memory bandwidth requirement and significantly improving power consumption without sacrificing performance. Our approach generated an average 30% memory power reduction without degrading performance on a set of large multimedia/general-purpose applications and scientific kernels, over the best traditional cache configuration of similar size, demonstrating the utility of our algorithm.
Tracing and Characterization of NT-based System Workloads
- Digital Technical Journal
, 1998
Cited by 7 (0 self)
Trace-driven simulation is commonly used by the computer architecture research community to pursue answers to a wide variety of architectural design issues. Traces taken from benchmark execution (e.g., SPEC, Bytemark, SPLASH) have been studied extensively to optimize the design of pipelines, branch predictors, and especially cache memories. Today's computer designs have been optimized based on the characteristics of these benchmarks. As applications become more dependent on services and APIs provided by the hosting operating system, the overall application performance becomes more dependent on efficient operating system interaction. It has been acknowledged that operating system overhead can greatly affect the benefits provided by a new design feature. The reason why the operating system interaction has, for the most part, been ignored in past architectural studies is the lack of available tools that can generate kernel-laden traces. In this contribution we describe the ongoing efforts...
Trace Sampling for Desktop Applications on Windows NT
- Appears in Workload Characterization: Methodology and Case Studies, edited by John and Maynard, IEEE Computer
, 1998
Cited by 6 (0 self)
This paper examines trace sampling for a suite of desktop application traces on Windows NT. This paper makes two contributions: we compare the accuracy of several sampling techniques to determine cache miss rates for these workloads, and we present a victim cache architecture study that demonstrates that sampling can be used to drive such studies. Of the sampling techniques used for the cache miss ratio determinations, stitch, which assumes that the state of the cache at the beginning of a sample is the same as the state at the end of the previous sample, is the most effective for these workloads. This technique is more accurate than the others and is reliable for caches up to 64KB in size.
1 Introduction
Trace-driven simulation is a common approach for evaluating memory systems. Trace-driven simulations demand large amounts of space and time, particularly for large caches and long-running applications. These demands can be greatly reduced by employing sampling techniques at the expens...
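The stitch assumption described in the abstract can be sketched as a trace-driven simulation in which cache contents are carried from one sample into the next, contrasted with a cold start at every sample. This is a minimal illustration only: the direct-mapped cache model, the function name `sampled_miss_rate`, and its parameters are assumptions and are not taken from the paper.

```python
def sampled_miss_rate(samples, num_lines, line_bytes=32, stitch=True):
    """Estimate the miss rate of a direct-mapped cache from trace samples.

    samples: list of lists of byte addresses, one inner list per sample.
    stitch=True carries cache contents from one sample into the next;
    stitch=False starts every sample with an empty (cold) cache.
    """
    tags = [None] * num_lines
    misses = references = 0
    for sample in samples:
        if not stitch:
            tags = [None] * num_lines  # cold start: discard previous state
        for address in sample:
            block = address // line_bytes
            index = block % num_lines
            references += 1
            if tags[index] != block:
                misses += 1
                tags[index] = block
    return misses / references if references else 0.0
```

With cold starts, every sample pays compulsory misses again for blocks that would already have been resident, so the estimated miss rate is never lower than the stitched estimate in this model.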
Using Virtual Memory to Improve Cache and TLB Performance
, 1998
Cited by 5 (0 self)
by Theodore Haynes Romer. Chairperson of Supervisory Committee: Professor Brian N. Bershad, Computer Science and Engineering.
This thesis introduces new operating system policies that use virtual memory to dynamically improve memory system performance. Overall application execution time is increasingly dependent on memory system performance, motivating the development of new techniques for reducing cache and translation lookaside buffer (TLB) miss rates. My thesis is that the operating system can effectively manage cache and TLB resources at runtime on behalf of applications. I develop and evaluate two examples of operating system memory system optimizations. First, I show how the operating system can optimize TLB performance by dynamically constructing superpages. I introduce a policy that analyzes the cause of each TLB miss, and uses this information to selectively create large pages, reducing the TLB miss rate without the increa...
The Architectural Implications of Pipeline and Batch Sharing in Scientific Workloads
, 2002
Cited by 5 (2 self)
We present a study of six batch-pipelined scientific workloads. Whereas other studies focus on the behavior of a single application, we characterize an emerging type of workload which consists of pipelines of sequential processes that use file storage for communication and also share significant data across a batch. This study includes measurements of the memory, CPU, and I/O requirements of individual components, analyses of I/O sharing within complete batches, and a discussion of the architectural ramifications of these new types of workloads.
Architectural Techniques to Accelerate Multimedia Applications on General-Purpose Processors
, 2001
Cited by 5 (3 self)
General-purpose processors (GPPs) have been augmented with multimedia extensions to improve performance on multimedia-rich workloads. These extensions operate in a single instruction multiple data (SIMD) fashion to extract data level parallelism in multimedia and digital signal processing (DSP) applications. This dissertation consists of a comprehensive evaluation of the execution characteristics of multimedia applications on SIMD enhanced GPPs, detection of bottlenecks in the execution of multimedia applications on SIMD enhanced GPPs, and the design and implementation of architectural techniques to eliminate and alleviate the impact of the various bottlenecks to accelerate multimedia applications. This dissertation identifies several bottlenecks in the processing of SIMD enhanced multimedia and DSP applications on GPPs. It is found that approximately 75-85% of instructions in the dynamic instruction stream of media workloads are not performing useful computations but merely supporting the useful computations by performing address generation, address transformation/data reorganization, loads/stores, and loop branches. This leads to an under-utilization of the SIMD computation units with only 1-12% of the peak SIMD throughput being achieved. This dissertation proposes the use of hardware support to efficiently execute the overhead/supporting instructions by overlapping them with the useful computation instructions. A 2-way GPP with SIMD extensions augmented with the proposed MediaBreeze hardware significantly outperforms a 16-way SIMD GPP without MediaBreeze hardware on multimedia kernels. On multimedia applications, a 2-/4-way SIMD GPP augmented with MediaBreeze hardware is superior to a 4-/8-way SIMD GPP without MediaBreeze hardware. The performance improvements are achieved at an area cost that is less than 0.3% of current GPPs and power consumption that is less than 1% of the total processor power without elongating the critical path of the processor.

