Improving Memory-System Performance of Sparse Matrix-Vector Multiplication
- IBM Journal of Research and Development, 1997
Cited by 61 (0 self)
Sparse matrix-vector multiplication is an important kernel that often runs inefficiently on superscalar RISC processors. This paper describes techniques that increase instruction-level parallelism and improve performance. The techniques include reordering to reduce cache misses (originally due to Das et al.), blocking to reduce load instructions, and prefetching to prevent multiple load-store units from stalling simultaneously. The techniques improve performance from about 40 Mflops (on a well-ordered matrix) to over 100 Mflops on a 266 Mflops machine. The techniques are applicable to other superscalar RISC processors as well and have improved performance on a Sun UltraSparc I workstation, for example.
1 Introduction
Sparse matrix-vector multiplication is an important computational kernel in many iterative linear solvers (see [5], for example). Unfortunately, on many computers this kernel runs slowly relative to other numerical codes, such as dense matrix computations. This paper propos...
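For readers unfamiliar with the kernel, the following is a minimal sketch of an unoptimized sparse matrix-vector product. It assumes the matrix is held in compressed sparse row (CSR) form, which the abstract does not specify; the names `spmv_csr`, `row_ptr`, `col_idx`, and `values` are illustrative and not taken from the paper. It shows only the baseline loop, not the reordering, blocking, or prefetching techniques the paper describes.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (an assumed layout)."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for i in range(num_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read indirectly through col_idx, so its access pattern
            # depends on the matrix ordering rather than on the loop index.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 3x3 matrix:
#   [4 0 1]
#   [0 3 0]
#   [2 0 5]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

Each nonzero costs one load from `values`, one from `col_idx`, and one indirect load from `x`, which is why the load count and the locality of accesses to `x` are natural targets for optimization.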
Register File Design Considerations in Dynamically Scheduled Processors
- In Proceedings of the Second IEEE Symposium on High-Performance Computer Architecture, 1995
Cited by 40 (1 self)
We have investigated the register file requirements of dynamically scheduled processors using register renaming and dispatch queues running the SPEC92 benchmarks. We looked at processors capable of issuing either four or eight instructions per cycle and found that, in most cases, implementing precise exceptions requires a relatively small number of additional registers compared to imprecise exceptions. Systems with aggressive non-blocking load support were able to achieve performance similar to processors with perfect memory systems at the cost of some additional registers. Given our machine assumptions, we found that the performance of a four-issue machine with a 32-entry dispatch queue tends to saturate around 80 registers. For an eight-issue machine with a 64-entry dispatch queue, performance does not saturate until about 128 registers. Assuming the machine cycle time is proportional to the register file cycle time, the eight-issue machine yields only 20% higher performance than the four-issue machine due in part...
Tools and Techniques for Memory System Design and Analysis
- 1995
Cited by 3 (1 self)
As processor cycle times decrease, memory system performance becomes ever more critical to overall performance. Continually changing technology and workloads create a moving target for computer architects in their effort to design cost-effective memory systems. Meeting the demands of ever-changing workloads and technology requires the following:
- Efficient techniques for evaluating memory system performance,
- Tuning programs to better use the memory system, and
- New memory system designs.
This thesis makes contributions in each of these areas. Hardware and software developers rely on simulation to evaluate new ideas. In this thesis, I present a new interface for writing memory system simulators, the active memory abstraction, designed specifically for simulators that process memory references as the application executes and avoid storing them to tape or disk. Active memory allows simulators to optimize for the common case, e.g., cache hits, achieving simulation times only 2-6 t...
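The idea of simulating memory references as they are produced, rather than writing a trace to disk first, can be illustrated with a small sketch. This is not the thesis's active memory interface, whose details the abstract does not give; it is a generic direct-mapped cache model, and the names `DirectMappedCache` and `strided_references` are hypothetical. A generator stands in for the running application, so no trace is ever stored.

```python
class DirectMappedCache:
    """Toy direct-mapped cache model that counts hits and misses."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.blocks = [None] * num_lines  # block number held by each line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_size
        index = block % self.num_lines
        if self.blocks[index] == block:
            # Common case: a hit needs only a compare and a counter update.
            self.hits += 1
            return True
        # Miss: replace whatever block currently occupies this line.
        self.blocks[index] = block
        self.misses += 1
        return False


def strided_references(base, count, stride):
    """Stand-in for an application: yields addresses one at a time."""
    for i in range(count):
        yield base + i * stride


cache = DirectMappedCache(num_lines=4, line_size=16)
for address in strided_references(base=0, count=16, stride=4):
    cache.access(address)  # each reference is consumed as it is produced

print(cache.hits, cache.misses)
```

With 16-byte lines and a 4-byte stride, every fourth reference touches a new line, so the hit path dominates; keeping that path short is the kind of common-case optimization the abstract mentions.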
Static Instruction Scheduling For Dynamic Issue Processors
- 1997
Many modern computer processors are based on an out-of-order superscalar execution model. These processors make use of a sophisticated dynamic issue mechanism that reorders the program instructions at execution time to overcome hazards and expose more Instruction-Level Parallelism (ILP). However, this valuable mechanism is ignored by the compilers commonly used on these architectures. For a given program, current compiler technology focuses on exposing as much ILP as possible without taking into consideration the instruction reordering performed by the processor at runtime. This thesis presents a novel approach to the instruction scheduling problem for dynamic issue processors. Our approach aims at generating an instruction sequence with low register pressure and a high level of ILP exploitable by the dynamic issue mechanism of the processor. Our objective is to improve the performance of the program by taking advantage of the out-of-order execution and register renaming mechanis...

