A fast and accurate framework to analyze and optimize cache memory behavior
ACM Transactions on Programming Languages and Systems, 2004
Cited by 16 (2 self)
The gap between processor and main memory performance increases every year. In order to overcome this problem, cache memories are widely used. However, they are only effective when programs exhibit sufficient data locality. Compile-time program transformations can significantly improve the performance of the cache. To apply most of these transformations, the compiler requires a precise knowledge of the locality of the different sections of the code, both before and after being transformed. Cache miss equations (CMEs) allow us to obtain an analytical and precise description of the cache memory behavior for loop-oriented codes. Unfortunately, a direct solution of the CMEs is computationally intractable due to its NP-complete nature. This article proposes a fast and accurate approach to estimate the solution of the CMEs. We use sampling techniques to approximate the absolute miss ratio of each reference by analyzing a small subset of the iteration space. The size of the subset, and therefore the analysis time, is determined by the accuracy selected by the user. In order to reduce the complexity of the algorithm to solve CMEs, effective mathematical techniques have been developed to analyze the subset of the iteration space that is being considered. These techniques exploit some properties of the particular polyhedra represented by CMEs.
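As an illustration of the sampling idea in this abstract, the following is a minimal sketch, not the authors' CME solver. It assumes a standard worst-case sample-size formula for estimating a proportion, and it replaces the CME analysis of each sampled iteration point with a hypothetical oracle, `toy_is_miss`, that models only spatial locality in a row-major array sweep. All function names and parameters are assumptions made for the example.

```python
import math
import random


def sample_size(half_width, z=1.96):
    """Samples needed so a proportion estimate lies within +/- half_width
    at the confidence level implied by z (worst case p = 0.5)."""
    return math.ceil(z * z * 0.25 / (half_width * half_width))


def toy_is_miss(i, j, elems_per_line=4):
    # Hypothetical oracle standing in for the per-point CME analysis:
    # a row-major sweep of a[i][j] misses only on the first element of
    # each cache line (spatial locality only, no conflict misses).
    return j % elems_per_line == 0


def estimate_miss_ratio(rows, cols, half_width, seed=0):
    """Estimate the miss ratio from a random subset of the iteration space."""
    rng = random.Random(seed)
    n = sample_size(half_width)
    misses = sum(
        toy_is_miss(rng.randrange(rows), rng.randrange(cols))
        for _ in range(n)
    )
    return misses / n, n
```

The requested accuracy fixes the number of sampled points independently of the size of the iteration space, which is why the analysis time follows the accuracy chosen by the user rather than the loop bounds.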
Tailoring Router Architectures to Performance Requirements in Cut-Through Networks
1999
Cited by 11 (0 self)
Message-passing parallel machines have emerged as a cost-effective platform for exploiting concurrency in a variety of applications. These multicomputer systems employ a wide range of policies for routing, switching, arbitration, queueing, and flow control, implemented in the router hardware that connects an individual processing node to the interconnection fabric and manages traffic flowing through the node en route to other destinations. To address the requirements of emerging applications, we develop new techniques for designing and evaluating new router architectures that tailor network policies to application characteristics. These results facilitate the development of effective support for communication in real-time systems and local area networks, as well as more traditional multicomputer domains like high-speed scientific computing. Most modern routers employ cut-through switching schemes, such as virtual cut-through and wormhole switching, that permit an arriving packet to proceed directly to an idle outgoing link. We develop analytical models for evaluating cut-through routing algorithms with different degrees of adaptivity. The analytical results permit an efficient evaluation of large networks, while detailed comparisons with simulation results characterize the subtle effects of the simplifying assumptions in the analysis; in particular, cut-through networks introduce unique dependencies between adjacent nodes. Additional simulation experiments show that the network topologies, routing algorithms, and traffic patterns in modern multicomputers exacerbate these effects. Based on these results, we present a routing algorithm that capitalizes on inter-node dependencies to improve network performance.
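To make the benefit of cut-through switching concrete, here is a small sketch of the usual contention-free textbook latency model. It is not the analytical model developed in this work. It assumes one flit crosses a link per cycle and a fixed per-hop routing delay; the function and parameter names are hypothetical.

```python
def store_and_forward_latency(hops, packet_flits, routing_delay=1):
    """Each hop receives the whole packet before forwarding it."""
    return hops * (routing_delay + packet_flits)


def cut_through_latency(hops, packet_flits, routing_delay=1):
    """The header advances as soon as the outgoing link is idle;
    the remaining flits follow it in a pipeline."""
    header_cycles = hops * (routing_delay + 1)
    return header_cycles + (packet_flits - 1)
```

With no contention, a 16-flit packet crossing 4 hops takes 68 cycles under store-and-forward and 23 cycles under cut-through in this model; for a single hop the two coincide.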
Efficient Instruction Cache Simulation and Execution Profiling with a Threaded-Code Interpreter
Cited by 11 (2 self)
We present an extension to an existing SPARC V8 instruction set simulator, SimICS, to support accurate profiling of branches and instruction cache misses. SimICS had previously supported profiling data cache efficiency and virtual memory performance (TLB misses), and estimated execution profiling using sampling. The new design allows a system-level, threaded-code simulator of a computer system to efficiently support a relatively complete range of instrumentation. Principal applications include computer architecture studies and performance tuning of software. Both application areas require reasonable performance in order to support realistic workloads, and both benefit from the flexibility, generality, and portability of a fast threaded-code simulator. The presented design supports multiprocessor simulation, system-level (operating system) programs, and, in principle, arbitrary user programs including run-time generated code. We evaluate the performance using the SPECint95 benchmark suite, and the result, with full profiled instrumentation enabled, is an execution time 26-108 times slower than native execution.
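For readers unfamiliar with instruction cache profiling, the sketch below shows the basic bookkeeping a simulator performs on every instruction fetch. It is a generic direct-mapped cache model and does not reproduce the SimICS design; the class name, cache geometry, and trace are assumptions chosen for the example.

```python
class DirectMappedICache:
    """Toy direct-mapped instruction cache that counts fetch misses."""

    def __init__(self, num_lines=256, line_bytes=32):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # block number held by each line
        self.accesses = 0
        self.misses = 0

    def fetch(self, pc):
        """Record one instruction fetch; return True on a hit."""
        block = pc // self.line_bytes
        index = block % self.num_lines
        self.accesses += 1
        if self.tags[index] != block:
            self.tags[index] = block  # fill the line on a miss
            self.misses += 1
            return False
        return True


def count_misses(trace, num_lines, line_bytes):
    cache = DirectMappedICache(num_lines, line_bytes)
    for pc in trace:
        cache.fetch(pc)
    return cache.misses, cache.accesses
```

A profiler built this way can attribute each miss to the fetching program counter, which is the kind of per-instruction statistic the abstract describes.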
Improving Performance of Bus-Based Multiprocessors
1995
Cited by 11 (3 self)
Processors have become both cheaper and faster in recent years, to the point where it has become practical to use multiple processors in a single system. An important aspect of such systems is how processors are connected to each other and to main memory. The simplest design uses a single, common bus. The main problem with single bus multiprocessors is the lack of bus bandwidth, which limits the number of processors that can be connected effectively. This dissertation addresses this problem by examining protocol improvements for single-bus multiprocessors as well as investigating the use of clustering to increase the number of processors that can be connected together effectively. We developed a subblock cache coherency protocol to enable programs to take advantage of a program's spatial locality, while av...
PP-MESS-SIM: A flexible and extensible simulator for evaluating multicomputer networks
IEEE Transactions on Parallel and Distributed Systems, 1997
Cited by 10 (3 self)
This paper presents pp-mess-sim, an object-oriented discrete-event simulation environment for evaluating interconnection networks in message-passing systems. The simulator provides a toolbox of various network topologies, communication workloads, routing-switching algorithms, and router models. By carefully defining the boundaries between these modules, pp-mess-sim creates a flexible and extensible environment for evaluating different aspects of network design. The simulator models emerging multicomputer networks that can support multiple routing and switching schemes simultaneously; pp-mess-sim achieves this flexibility by associating routing-switching policies, traffic patterns, and performance metrics with collections of packets, instead of the underlying router model. Besides providing a general framework for evaluating router architectures, pp-mess-sim includes a cycle-level model of the PRC, a programmable router for point-to-point distributed systems. The PRC model captures low-level implementation details, while another high-level model facilitates experimentation with general router design issues. Sample simulation experiments capitalize on this flexibility to compare network architectures under various application workloads.
Index Terms: Multicomputers, routers, routing, switching, object-oriented simulation.
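The core of any discrete-event simulator is a time-ordered event queue. The sketch below shows that mechanism in isolation; it is not taken from pp-mess-sim, and the `EventQueue` class, the `simulate_packet` helper, and the fixed link delay are all hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Minimal discrete-event kernel: events run in timestamp order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal timestamps
        self.now = 0

    def schedule(self, delay, action):
        heapq.heappush(self._heap, (self.now + delay, next(self._seq), action))

    def run(self):
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action()


def simulate_packet(hops, link_delay=3):
    """Send one packet across `hops` links and log (time, node) arrivals."""
    sim = EventQueue()
    log = []

    def arrive(node):
        log.append((sim.now, node))
        if node < hops:
            sim.schedule(link_delay, lambda: arrive(node + 1))

    sim.schedule(0, lambda: arrive(0))
    sim.run()
    return log
```

Keeping the kernel separate from the objects that schedule events is one simple way to obtain the module boundaries the abstract emphasizes: topologies, workloads, and router models only need to agree on what they schedule.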
An architecture workbench for multicomputers
In Proc. of the 11th Int. Parallel Processing Symposium, 1997
Cited by 9 (5 self)
The large design space of modern computer architectures calls for performance modelling tools to facilitate the evaluation of different alternatives. In this paper, we give an overview of the Mermaid multicomputer simulation environment. This environment allows the evaluation of a wide range of architectural design tradeoffs while delivering reasonable simulation performance. To achieve this, simulation takes place at a level of abstract machine instructions rather than at the level of real instructions. Moreover, a less detailed mode of simulation is also provided, so when accuracy is not the primary objective, this mode can yield high simulation efficiency. As a consequence, Mermaid makes both fast prototyping and accurate evaluation of multicomputer architectures feasible.
Applying Programming Language Implementation Techniques to Processor Simulation
2000
Cited by 8 (2 self)
This memoization makes the simulator run 5-12 times faster, with no change in simulation results (e.g., cycle count). Combining direct-execution and memoization, FastSim simulates a MIPS R10000-like microarchitecture with a 190-360 times slowdown (i.e., simulation time over native benchmark execution time on the host), which is an order of magnitude faster than SimpleScalar.
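The claim that memoization speeds up simulation without changing its results can be illustrated with a generic sketch. This is not FastSim's mechanism: `slow_step` is a hypothetical stand-in for detailed pipeline simulation, and the state encoding is invented for the example. The only assumption that matters is that the step function is deterministic in its input state.

```python
def make_memoized(step):
    """Wrap a deterministic step function with a result cache."""
    cache = {}
    stats = {"hits": 0, "misses": 0}

    def fast_step(state):
        if state in cache:
            stats["hits"] += 1
        else:
            stats["misses"] += 1
            cache[state] = step(state)  # run the detailed model once
        return cache[state]

    return fast_step, stats


def slow_step(state):
    # Hypothetical detailed model: maps a small pipeline state to the
    # next state and the number of cycles the step consumed.
    pending, stall = state
    cycles = 1 + stall
    next_state = ((pending * 3 + 1) % 4, pending % 2)
    return next_state, cycles


def run(step, state, n):
    """Advance n steps and return the total cycle count."""
    total = 0
    for _ in range(n):
        state, cycles = step(state)
        total += cycles
    return total
```

Because the cache replays exactly what the detailed model would have computed, the cycle count is identical with and without it; the speedup comes from how often the same state recurs.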
The Mermaid Architecture-workbench for Multicomputers
1996
Cited by 4 (3 self)
[Figure 3: The template architecture models. Surviving labels: Cache hierarchy, Bus.]
... defines a bus component. It is a simple forwarding mechanism, carrying out arbitration upon multiple accesses. The parameters used to configure this component include buswidth, bus cycle-time and arbitration details. Changing the bus to a more complex structure, such as a multistage network, can be done without too much remodelling effort. In that case, only a new Pearl module needs to be written, replacing the bus component within the template model. Finally, the memory component simulates a simple DRAM memory. It is parameterized with memory size, memory refresh rate, and memory access latencies.
4.2 Multi-node communication model
A node within the communication template model is constructed from an abstract processor, a router and multiple communication links. This setup is shown in Figure 3(b). The nodes are connected in a topology that reflects the physical interconnection scheme of the multicomputer, resulting in ...
Performance Debugging and Tuning using an Instruction-Set Simulator
1997
Cited by 4 (1 self)
Instruction-set simulators allow programmers a detailed level of insight into, and control over, the execution of a program, including parallel programs and operating systems. In principle, instruction set simulation can model any target computer and gather any statistic. Furthermore, such simulators are usually portable, independent of compiler tools, and deterministic, allowing bugs to be recreated or measurements repeated. Though often viewed as being too slow for use as a general programming tool, in the last several years their performance has improved considerably. We describe SIMICS, an instruction set simulator of SPARC-based multiprocessors developed at SICS, in its rôle as a general programming tool. We discuss some of the benefits of using a tool such as SIMICS to support various tasks in software engineering, including debugging, testing, analysis, and performance tuning. We present in some detail two test cases, where we've used SimICS to support analysis and performance ...
Validating an Architectural Simulator
Department of Computer Science, University of Massachusetts at Amherst, 1996
Cited by 3 (0 self)
This paper reports on our experiences in building an execution-driven architectural simulator that is meant to accurately capture performance costs of a machine for a particular class of software, namely, network protocol stacks such as TCP/IP. The simulator models a single processor of our Silicon Graphics Challenge shared-memory multiprocessor, which has 100 MHz MIPS R4400 chips and two levels of cache memory. We describe our validation approach, show accuracy results averaging within 5 percent, and present the lessons learned in validating an architectural simulator.
1 Introduction
We have designed and implemented an execution-driven uniprocessor simulator for our 100 MHz R4400-based SGI Challenge [6]. The purpose of this simulator is to understand the performance costs of a network protocol stack running in user space on our SGI machine, and to guide us in identifying and reducing bottlenecks [11]. The primary goal of this simulator has been to accurately model performance costs for...

