The YAGS branch prediction scheme
In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, 1998
Cited by 96 (0 self)
The importance of an accurate branch prediction mechanism has been well documented. Since the introduction of gshare [1] and the observation that aliasing in the pattern history table (PHT) is a major factor in reducing prediction accuracy [2,3,4,5], several schemes have been proposed to reduce aliasing in the PHT [6, 7, 8, 9]. All these schemes are aimed at maximizing the prediction accuracy with the fewest resources. In this paper we introduce Yet Another Global Scheme (YAGS), a new scheme to reduce the aliasing in the PHT that combines the strong points of several previous schemes. YAGS introduces tags into the PHT, which allows it to be reduced without sacrificing key branch outcome information. The size reduction more than offsets the cost of the tags. Our experimental results show that YAGS gives better
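As a rough illustration of the aliasing problem this abstract refers to, the following Python sketch models a gshare-style predictor: a table of 2-bit saturating counters indexed by the branch address XORed with a global history register. It is not the YAGS design itself. The class name `GsharePredictor`, the helper `aliases`, the 4-bit index width, and the example addresses are all assumptions chosen for illustration.

```python
class GsharePredictor:
    """Hypothetical gshare-style predictor (illustration only, not YAGS)."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.history = 0                      # global branch history register
        self.pht = [1] * (1 << index_bits)    # 2-bit counters, start weakly not-taken

    def index(self, pc):
        # gshare indexing: branch address XOR global history, truncated to table size
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken"
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the outcome into the history register
        self.history = ((self.history << 1) | int(taken)) & self.mask


def aliases(predictor, pc_a, pc_b):
    """True if two branch addresses currently share one PHT entry."""
    return predictor.index(pc_a) == predictor.index(pc_b)
```

With a 4-bit index and an empty history, `aliases(GsharePredictor(), 0x13, 0x23)` is true: both addresses map to entry 3, so the two branches train the same counter and can disturb each other's predictions. Tagging entries, as the abstract describes, is one way to tell such branches apart.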
Operating system benchmarking in the wake of lmbench: A case study of the performance of NetBSD on the Intel x86 architecture
In ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1997
Cited by 79 (1 self)
The lmbench suite of operating system microbenchmarks provides a set of portable programs for use in cross-platform comparisons. We have augmented the lmbench suite to increase its flexibility and precision, and to improve its methodological and statistical operation. This enables the detailed study of interactions between the operating system and the hardware architecture. We describe modifications to lmbench, and then use our new benchmark suite, hbench:OS, to examine how the performance of operating system primitives under NetBSD has scaled with the processor evolution of the Intel x86 architecture. Our analysis shows that off-chip memory system design continues to influence operating system performance in a significant way and that key design decisions (such as suboptimal choices of DRAM and cache technology, and memory-bus and cache coherency protocols) can essentially nullify the performance benefits of the aggressive execution core and sophisticated on-chip memory system of a modern processor such as the Intel Pentium Pro.
A Decompositional Approach to Computer System Performance Evaluation
1997
Cited by 12 (2 self)
Contents: 1 Introduction; 2 Decomposing the Performance of the Operating System Kernel; 2.1 Related Work: Benchmarking Operating Systems; 2.2 Microbenchmark Tools: Revising lmbench into hbench-OS; 2.2.1 Timing Methodology; 2.2.2 Statistical Methodology; 2.2.3 Increased Parameterization; 2.2.4 Context Switch Latency; 2.2.5 Memory Bandwidths; 2.2.6 New Output Format; 2.3 Case Study: A Performance Decomposition for NetBSD on the Intel x86 Platform; 2.3.1 Bulk Data Transfer; 2.3.2 Process Creation; 2.3.3 Signal Handler Installation; 3 Extending the Performance Decomposition to User Applications; 3.2 Case Study: Developing Tools; 3.3 Case Study: The Apache Web Server; 3.3.1 Step 1: Decomposing Apache's Internal Structure; 3.3.2 Step 2: Connecting the Application and Operating System Hierarchies; 3.4 Related Work: Understanding Application Performance; 4 Distilling the Detail: Performance at the OS-Application Abstraction Boundary; 4.2 Analysis of Met
Benchmarking Web Server Architectures: A Simulation Study on Micro Performance
In Proceedings of the Fifth Workshop on Computer Architecture Evaluation using Commercial Workloads, 2002
Cited by 5 (0 self)
As the Internet expands, the number of application servers, especially Web servers, has been increasing exponentially.
Branch History Register Cache
Cited by 1 (0 self)
Recent superscalar processors depend heavily on efficient branch prediction to exploit a large amount of instruction-level parallelism. However, it is known that branch prediction accuracy is degraded when process switches are present. Multithreading architectures are considered a better approach to increase the total throughput. In multithreading architectures, the process switch rate is very high and even a second-level cache miss causes a process switch. In such an environment, branch prediction is severely degraded. In this paper, we propose a hardware technique to reduce the impact of process switches. This technique consists of adding a cache to store the Branch History Register on a process switch. We show that this scheme reduces the impact of process switches significantly, especially when process switches occur very frequently. 1 Introduction In order to fully exploit the performance of recent superscalar processors, effective branch prediction schemes are essential. To impro...
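The idea of saving the Branch History Register across process switches can be modeled in software as a small cache of saved history values. The sketch below is only a toy model under stated assumptions. The abstract does not say how the hardware cache is indexed, sized, or replaced, so keying by a process ID, the capacity of 4, the oldest-saved-first eviction, and the names `BHRCache` and `process_switch` are all hypothetical.

```python
from collections import OrderedDict


class BHRCache:
    """Toy model of a cache holding saved Branch History Register values.

    Assumptions (not from the paper): entries are keyed by process ID and
    the least recently saved entry is evicted when the cache is full.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()

    def save(self, pid, bhr):
        # Store the outgoing process's history and mark it most recently saved
        self.entries[pid] = bhr
        self.entries.move_to_end(pid)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest saved entry

    def restore(self, pid, default=0):
        # On a miss, fall back to an empty history
        return self.entries.get(pid, default)


def process_switch(cache, old_pid, new_pid, current_bhr):
    """Save the old process's BHR and return the BHR to load for the new one."""
    cache.save(old_pid, current_bhr)
    return cache.restore(new_pid)
```

For example, switching from process 1 to process 2 with history `0b1011` stores that value. Switching back to process 1 later returns `0b1011` instead of an empty history, so the returning process resumes with its own history rather than one left behind by another process.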
OS-aware Tuning: Improving Instruction Cache Energy Efficiency on System Workloads
Cited by 1 (0 self)
Low power is considered an important issue in instruction cache (I-cache) designs. Several studies have shown that the I-cache can be tuned to reduce power. These techniques, however, focus exclusively on user-level applications, even though there is evidence that many commercial and emerging workloads often involve heavy use of the operating system (OS). This study goes beyond previous work to explore the opportunities to design energy-efficient I-caches for system workloads. Employing a full-system experimental framework and a wide range of workloads, we characterize user and OS I-cache accesses and motivate OS-aware I-cache tuning to save power. We then present two techniques (OS-aware cache way lookup and OS-aware cache set drowsy mode) to reduce the dynamic and the static power consumption of the I-cache. The proposed OS-aware cache way lookup reduces the number of parallel tag comparisons and data array read-outs for cache accesses to save dynamic I-cache power in a given operation mode. The proposed OS-aware cache set drowsy mode puts I-cache regions that are heavily used only by another operation mode into drowsy mode to reduce leakage power. The proposed mechanisms require minimal hardware modification and addition. Simulation-based experiments show that, with no or negligible impact on performance, applying OS-aware tuning techniques yields significant dynamic and static power savings across the applications tested. To our knowledge, this is the first work to explore cache power optimization by considering the interactions of application, OS, and hardware. It is our belief that the proposed techniques can be applied to improve the I-cache energy efficiency of server processors that mostly target modern commercial applications that heavily invoke OS activities.
Improving Hardware Branch Predictors using Artificial Neural Networks
1999
This research shows that using an Artificial Neural Network as the hardware branch predictor of a microprocessor leads to performance as good as standard branch predictors for comparable chip area. The results were obtained by running several SPEC95 benchmarks on an augmented version of the SimpleScalar architecture simulator. The approach taken in this research is the first attempt to use Neural Networks to improve the design of hardware branch predictors; it points to a combination of static and dynamic techniques using artificial intelligence. The prediction rates achieved by the holistic, non-adaptive Neural Network predictor are promising. Even a simple Neural Network structure without an on-line adaptive mechanism performed better than current techniques for small predictor sizes. The neural net predictor achieved almost the same rates for most of the benchmarks of the SPEC95 set, and it was even 20% more accurate for one of them. However, the NN predictors developed were not able to achieve the same prediction rates as larger standard predictor configurations. The performance of the non-adaptive NN predictors substantially decreases when the number of dynamic
Operating System Benchmarking in the Wake of Lmbench: A Case Study of the Performance of NetBSD on the Intel x86 Architecture
The lmbench suite of operating system microbenchmarks provides a set of portable programs for use in cross-platform comparisons. We have augmented the lmbench suite to increase its flexibility and precision, and to improve its methodological and statistical operation. This enables the detailed study of interactions between the operating system and the hardware architecture. We describe modifications to lmbench, and then use our new benchmark suite, hbench:OS, to examine how the performance of operating system primitives under NetBSD has scaled with the processor evolution of the Intel x86 architecture. Our analysis shows that off-chip memory system design continues to influence operating system performance in a significant way and that key design decisions (such as suboptimal choices of DRAM and cache technology, and memory-bus and cache coherency protocols) can essentially nullify the performance benefits of the aggressive execution core and sophisticated on-chip memory system of a modern processor such as the Intel Pentium Pro.
A Decompositional Approach to
Contents: 1 Introduction; 2 Decomposing the Performance of the Operating System Kernel; 2.1 Related Work: Benchmarking Operating Systems; 2.2 Microbenchmark Tools: Revising lmbench into hbench-OS; 2.2.1 Timing Methodology; 2.2.2 Statistical Methodology; 2.2.3 Increased Parameterization; 2.2.4 Context Switch Latency; 2.2.5 Memory Bandwidths; 2.2.6 New Output Format; 2.3 Case Study: A Performance Decomposition for NetBSD on the Intel x86 Platform; 2.3.1 Bulk Data Transfer; 2.3.2 Process Creation; 2.3.3 Signal Handler Installation; 3 Extending the Performance Decomposition to User Applications; 3.2 Case Study: Developing Tools; 3.3 Case Study: The Apache Web Server; 3.3.1 Step 1: Decomposing Apache's Internal Structure; 3.3.2 Step 2: Connecting the Application and Operating System Hierarchies; 3.4 Related Work: Understanding Application Performance; 4 Distilling the Detail: Performance at the OS-Application Abstraction Boundary; 4.2 Analysis of Meth

