Optimizing Instruction Cache Performance for Operating System Intensive Workloads
- IEEE Transactions on Computers
, 1995
Cited by 61 (11 self)
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. Therefore, it is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes in detail the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely-executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. As a result, interference...
The Performance of the Cedar Multistage Switching Network
- IEEE Transactions on Parallel and Distributed Systems
, 1997
Cited by 17 (0 self)
While multistage switching networks for vector multiprocessors have been studied extensively, detailed evaluations of their performance are rare. Indeed, analytical models, simulations with pseudo-synthetic loads, studies focused on average-value parameters, and measurements of networks disconnected from the machine, all provide limited information. In this paper, instead, we present an in-depth empirical analysis of a multistage switching network in a realistic setting: we use hardware probes to examine the performance of the omega network of the Cedar shared-memory machine executing real applications. The machine is configured with 16 vector processors. The analysis suggests that the performance of multistage switching networks is limited by traffic non-uniformities. We identify two major non-uniformities that degrade Cedar's performance and are likely to slow down other networks too. The first one is the contention caused by the return messages in a vector access as they converge fr...
Improving the Data Cache Performance of Multiprocessor Operating Systems
- 2nd International Symposium on High-Performance Computer Architecture
, 1996
Cited by 6 (3 self)
Bus-based shared-memory multiprocessors with coherent caches have recently become very popular. To achieve high performance, these systems rely on increasingly sophisticated cache hierarchies. However, while these machines often run loads with substantial operating system activity, performance measurements have consistently indicated that the operating system uses the data cache hierarchy poorly. In this paper, we address the issue of how to eliminate most of the data cache misses in a multiprocessor operating system while still using off-the-shelf processors. We use a performance monitor to examine traces of a 4-processor machine running four system-intensive loads under UNIX. Based on our observations, we propose hardware and software support that targets block operations, coherence activity, and cache conflicts. For block operations, simple cache bypassing or prefetching schemes are undesirable. Instead, it is best to use a DMA-like scheme that pipelines the data transfer in the bus ...
Comprehensive Hardware and Software Support for Operating Systems to Exploit MP Memory Hierarchies
- IEEE Transactions on Computers
, 1999
Cited by 4 (0 self)
High-performance multiprocessor workstations are becoming increasingly popular. Since many of the workloads running on these machines are operating-system intensive, we are interested in exploring the types of support for the operating system that the memory hierarchy of these machines should provide. In this paper, we evaluate a comprehensive set of hardware and software supports that minimize the performance losses for the operating system in a sophisticated cache hierarchy. These supports, selected from recent papers, are code layout optimization, guarded sequential instruction prefetching, instruction stream buffers, support for block operations, support for coherence activity, and software data prefetching. We evaluate these supports under a simulated environment. We show that they have a largely complementary impact and that, when combined, speed up the operating system by an average of 40 percent. Finally, a cost-performance comparison of these schemes suggests that the most cost-effective ones are code layout optimization and block operation support, while the least cost-effective one is software data prefetching. Index Terms: Cache hierarchies, shared-memory multiprocessors, architectural support for operating system, prefetching, trace-driven simulations, performance, block operations.
Low Perturbation Address Trace Collection for Operating System, Multiprogrammed, and Parallel Workloads in Multiprocessors
- Technical report, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign
, 1996
Cited by 4 (2 self)
While address trace analysis is a popular method to evaluate the memory system of computers, getting accurate traces of operating system, multiprogrammed, and parallel workloads in multiprocessors is hard. This is because the true behavior of these real-time loads can be easily altered by the tracing activity. To minimize perturbation, it is usually necessary to use trace-gathering hardware devices. Unfortunately, these devices usually gather limited information. For example, they often only capture physical addresses, can only collect references that miss in on-chip caches, and can only monitor a very small time window. In this paper, we improve the capability of such devices. We present a methodology to instrument operating system and applications to transfer a wide variety of information to the trace-gathering hardware with very little perturbation. For instance, the information transferred includes the virtual to physical address mapping or the sequence of basic blocks executed. Ea...
Exploiting Multiprocessor Memory Hierarchies For Operating Systems
, 1996
Cited by 1 (0 self)
d this mentorship into a joyful and valuable life experience. Working very closely together, under his guidance, we persisted through numerous difficult times together as well as shared many happy rewards of success. I believe that, as I move into my career, I will carry with me continuous benefits from all I have learned from him. I also thank the members of my thesis committee, Professor David Padua and Professor Sharad Mehrotra, for their comments and suggestions in this dissertation. I would like to thank my friends and graduate students in CSRD: Zheng Zhang, Liuxi Yang, Russell Daigle, and Sharad Mehrotra. Zheng Zhang introduced me to Professor Torrellas and brought me into the world of computer architecture research. He consistently helped me with his insights on many important research issues. We also shared and enjoyed our common interests beyond research. My friendship with Liuxi Yang began in our undergraduate years. Beyond his remarkable role in my personal life, he always d
Low Perturbation Address Trace Collection with Simple Hardware Performance Monitors
While address trace analysis is a popular method to evaluate the memory system of computers, getting accurate traces of operating system, multiprogrammed, and parallel workloads in multiprocessors is hard. This is because the true behavior of these real-time loads can be easily altered by the trace collecting activity. To minimize perturbation, it is usually necessary to use trace-gathering hardware performance monitors. Unfortunately, these devices often have limitations. For example, they often only capture physical addresses, can only collect references that miss in on-chip caches, and can only monitor a very small time window. In this paper, we show how to improve the capability of such devices. We present a simple methodology to instrument operating system and applications to transfer a wide variety of information to the trace-gathering hardware with little perturbation. For instance, the information transferred includes the virtual to physical address mapping or the seq...

