Performance Tradeoffs In Multithreaded Processors
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1991
Cited by 111 (5 self)
... utilization. By maintaining multiple process contexts in hardware and switching among them in a few cycles, multithreaded processors can overlap computation with memory accesses and reduce processor idle time. This paper presents an analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects. The model is validated through our own simulations and by comparison with previously published simulation results. Our results indicate that processors can substantially benefit from multithreading, even in systems with small caches. Large caches yield close to full processor utilization with as few as two to four contexts, while small caches may require up to four times as many contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limit the best possible utilization.
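As a rough illustration of why utilization climbs with the number of contexts and then levels off, the following is a minimal sketch of a first-order textbook model, not the paper's model: it ignores the cache interference, network contention, and data-sharing effects described in the abstract. The function name and its parameters (`run_length`, `latency`, `switch_cost`) are assumptions chosen for this example.

```python
def utilization(n_contexts, run_length, latency, switch_cost):
    """First-order processor utilization with n_contexts hardware contexts.

    Assumptions (illustrative only):
      run_length  - average cycles a context runs between long-latency misses
      latency     - cycles needed to service one miss
      switch_cost - cycles spent on each context switch
    A switch is charged on every miss, even with a single context.
    """
    if n_contexts < 1:
        raise ValueError("need at least one context")
    # Linear region: too few contexts to hide the whole miss latency.
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    # Saturated region: latency is fully hidden, only switch overhead remains.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(utilization(n, run_length=20, latency=80, switch_cost=5), 3))
```

In this simplified form the ceiling `run_length / (run_length + switch_cost)` depends only on the switching overhead; a model that also accounts for network bandwidth would lower it further.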
Automatic and Efficient Evaluation of Memory Hierarchies for Embedded Systems
, 1999
Cited by 23 (2 self)
Automation is the key to the design of future embedded systems as it permits application-specific customization while keeping design costs low. A key problem faced by automatic design systems is evaluating the performance of the vast number of alternative designs in a timely manner. For this paper, we focus on an embedded system consisting of the following components: a VLIW processor, instruction cache, data cache, and second-level unified cache. A hierarchical approach of partitioning the system into its constituent components and evaluating each component individually is utilized. The performance of each processor is evaluated independent of its memory hierarchy, and each of the caches is simulated using the traces from a single reference processor. Since the changes in the processor architecture do indeed affect the address traces and thus the performance of the memory hierarchy, the overall performance is inaccurate. To overcome this error, the changes in the processor architecture are modeled as a dilation of the reference processor's address trace, where each instruction block in the trace is conceptually stretched out by the dilation coefficient. This approach provides a projected cache performance that more accurately accounts for changes in the processor architecture. In order to understand the accuracy of the dilation model, we separate the possible errors that the model introduces and quantify these errors on a set of benchmarks. The results show the dilation model is effective for most of the design space and facilitates efficient automatic design.
Techniques for Cache and Memory Simulation Using Address Reference Traces
- Int. J. Comput. Simul
, 1990
Cited by 13 (1 self)
Simulation using address reference traces is one of the primary methods for the performance evaluation of the memory hierarchy of computer systems. In this paper we survey the techniques used in such a simulation. In both the uniprocessor and shared-memory multiprocessor cases, the issues can be divided into trace collection, trace storage, and trace usage. Trace collection can employ several hardware or software methods. Common concerns are that the collection method capture all of the address references of interest, that the execution overhead of the collection method is not excessive, and that the trace is of adequate length. The increasing size of caches heightens the adequate length concern. Trace storage is of concern because of the large size of traces. Techniques for trace compression and trace reduction have been developed. Trace usage is of concern because of the length of a simulation. Under some circumstances it is possible to evaluate multiple cache sizes in a si...
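For readers unfamiliar with the trace-usage step, here is a minimal sketch of trace-driven simulation for a single set-associative cache with LRU replacement. It is an illustration under assumed parameters (`num_sets`, `ways`, `block_size`, and a trace given as a plain list of byte addresses), not a technique taken from the survey itself.

```python
from collections import OrderedDict


def simulate_cache(trace, num_sets, ways, block_size):
    """Count misses for an address trace on an LRU set-associative cache.

    trace is any iterable of integer byte addresses (an assumed format).
    """
    # One OrderedDict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // block_size
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses


if __name__ == "__main__":
    trace = [0, 64, 0, 64]
    print(simulate_cache(trace, num_sets=1, ways=1, block_size=64))
    print(simulate_cache(trace, num_sets=1, ways=2, block_size=64))
```

Running the same trace against several configurations, as in the usage lines, is the simplest (and slowest) way to compare cache designs.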
Scheduling for Cache Affinity in Parallelized Communication Protocols
- In Proceedings of 1995 SIGMETRICS/Performance International Conference on Measurement and Modeling of Computer Systems
, 1994
Cited by 7 (4 self)
In this paper, we explore the benefits of processor cache affinity scheduling of parallelized network protocol processing. We find that affinity scheduling, which has not previously been shown to be of significant benefit to common applications, can provide large performance gains in the context of parallelized protocol processing. We conduct a set of multiprocessor experiments designed to measure packet processing time in a UDP/IP/FDDI protocol stack in the x-kernel on an SGI Challenge XL multiprocessor. These measurements are then used to parameterize a combined simulation/analytic model of multiprocessor protocol processing. Our simulation results show that affinity scheduling can significantly reduce message delay associated with protocol processing, allowing a host to support a greater number of concurrent streams, to provide a higher maximum throughput to individual streams, and to decrease the end-to-end latency seen by an application. We find the reduction in end-to-end l...
Efficient Profile-Based Evaluation of Randomising Set Index Functions For Cache Memories
- In 2nd International Symposium on Performance Analysis of Systems and Software
, 2001
Cited by 7 (2 self)
The performance of direct-mapped caches is degraded by conflict misses. It has been shown that conflict misses can be reduced by using randomising set index functions, such that repeated conflicts are avoided. However, optimising the set index function requires time-consuming simulations, because the design space of randomising set index functions is very large. Therefore, we developed a profile-based technique that allows one to make a fast estimation of the miss ratio incurred by a set index function. Using this technique, one can perform a fast, initial exploration of the design space of set index functions, followed by a slower, but more accurate, analysis using simulation. The profile-based technique is based on a new representation of randomising set index functions using null spaces. The profile-based technique consists of two phases. In the first phase, a program is profiled and in the second phase, a score is computed from the profile data and the null space of a set index function. We show that the computed score closely reflects the miss ratio incurred by that set index function. Computing a score is a simple operation that requires no simulation time. Therefore, only one profiling run is required to estimate the miss ratios for a wide range of set index functions.
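To make the idea of a randomising set index function concrete, the sketch below assumes the common XOR-based form, in which each index bit is the parity of a chosen subset of block-address bits. It then counts direct-mapped misses by plain simulation. The mask representation and all names are assumptions for illustration; the paper's profile-based scoring is not reproduced here.

```python
def set_index(block_addr, masks):
    """Set index where bit i is the parity of (block_addr & masks[i])."""
    index = 0
    for bit, mask in enumerate(masks):
        parity = bin(block_addr & mask).count("1") & 1
        index |= parity << bit
    return index


def direct_mapped_misses(blocks, masks):
    """Count misses in a direct-mapped cache using the given index function."""
    resident = {}  # set index -> block address currently stored there
    misses = 0
    for block in blocks:
        s = set_index(block, masks)
        if resident.get(s) != block:
            misses += 1
            resident[s] = block
    return misses


# 16 sets: conventional modulo indexing uses the low 4 block-address bits.
MODULO_MASKS = [1 << i for i in range(4)]
# Hypothetical XOR function: index bit i = address bit i XOR address bit i+4.
XOR_MASKS = [(1 << i) | (1 << (i + 4)) for i in range(4)]

if __name__ == "__main__":
    blocks = [0, 16, 32, 48] * 10   # stride equal to the number of sets
    print(direct_mapped_misses(blocks, MODULO_MASKS))
    print(direct_mapped_misses(blocks, XOR_MASKS))
```

Because such a function is linear over bitwise XOR, two blocks `a` and `b` map to the same set exactly when `a ^ b` maps to index 0, which is the null-space property that the abstract's representation builds on.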
Pseudo-Address Generation Algorithm of Packet Destinations for Internet Performance Simulation
, 2001
Cited by 6 (0 self)
This paper investigates the stochastic property of the packet destinations and proposes an address generation algorithm which is applicable for describing various Internet access patterns. We assume that a stochastic process of Internet access satisfies the stationary condition and derive the fundamental structure of the address generation algorithm. The pseudo IP-address sequence generated by our algorithm gives dependable cache performance and reproduces the results obtained from trace-driven simulation. The proposed algorithm is applicable not only to the destination IP address but also to the destination URLs of packets, and is useful for simulation studies of Internet performance, Web caching, DNS, and so on. Keywords: Internet, Destination address, LRU stack, World Wide Web, Caching.
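The keywords mention the LRU stack, so the following sketch shows the classic LRU stack model for generating a synthetic reference stream: each step draws a stack depth from a fixed (stationary) distribution, emits the address at that depth, and moves it to the top. This is a generic illustration and should not be read as the paper's algorithm; `pool`, `depth_weights`, and `seed` are assumed names.

```python
import random


def generate_addresses(pool, depth_weights, count, seed=0):
    """Generate `count` pseudo-addresses with an LRU stack model.

    pool          - initial addresses, most recently used first (assumed input)
    depth_weights - relative probability of referencing each stack depth;
                    must not be longer than pool
    """
    if len(depth_weights) > len(pool):
        raise ValueError("more depth weights than addresses in the pool")
    rng = random.Random(seed)
    stack = list(pool)                      # index 0 = most recently used
    depths = list(range(len(depth_weights)))
    out = []
    for _ in range(count):
        d = rng.choices(depths, weights=depth_weights)[0]
        addr = stack.pop(d)                 # reference the address at depth d
        stack.insert(0, addr)               # it becomes the most recent one
        out.append(addr)
    return out


if __name__ == "__main__":
    pool = ["10.0.0.%d" % i for i in range(1, 6)]
    # Heavier weights near the top of the stack give stronger locality.
    print(generate_addresses(pool, [8, 4, 2, 1, 1], count=10))
```

A stream produced this way can be fed to a cache simulator in place of a recorded trace; how closely it matches real traffic depends entirely on the chosen depth distribution.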
Comparing Caching Techniques for Multitasking Real-Time Systems
, 1997
Cited by 3 (0 self)
Correctness in real-time computing depends on the logical result and the time when it is available. Real-time operating systems need to know the timing behavior of applications to ensure correct real-time system behavior. Thus, predictability in the underlying hardware operation is required. Unfortunately, standard, embedded cache management policies in microprocessors are designed for excellent probabilistic behavior but lack predictability, especially in a multitasking environment. In this article we examine the two popular cache management policies that support predictable cache behavior in a multitasking environment and quantitatively compare them. Using a novel application of an existing analytical cache model we show that neither policy is best in general and delimit the system characteristics where each is most effective.
1 Introduction
In real-time computing, correct operation depends on both the logical result and when it is available. Real-time systems have the characterist...
Dependability Analysis of Fault-Tolerant Multiprocessor Architectures through Simulated Fault Injection (Chapter 5 and 6)
, 1993
Introduction
Computer systems achieve fault-tolerance primarily through redundancy. Multiple versions of a software routine can be executed to overcome implementation errors in the application code. Hardware can be replicated and operated in parallel or sequentially, as a series of spares, to survive logic faults. Redundant software is expensive to develop, and increases memory requirements and execution time. Redundant hardware is difficult to design, and adds to the cost, size, weight and power consumption of a machine. Many fault-tolerant applications, such as the control of fly-by-wire aircraft and deep space probes, have physical limitations on the amount of redundancy that can be incorporated into a system. Cost is always a consideration when adding redundancy to improve fault-tolerance. The level of redundancy needed is determined by dependability requirements and the nature of the faults and errors that can be expected to affect a system. The behavior of processors in th

