Results 1 - 10
of
88
Design and Evaluation of a Compiler Algorithm for Prefetching
- in Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1992
"... Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memo ..."
Abstract
-
Cited by 450 (21 self)
- Add to MetaCart
Software-controlled data prefetching is a promising technique for improving the performance of the memory subsystem to match today's high-performance processors. While prefetching is useful in hiding the latency, issuing prefetches incurs an instruction overhead and can increase the load on the memory subsystem. As a result, care must be taken to ensure that such overheads do not exceed the benefits.
FAST VOLUME RENDERING USING A SHEAR-WARP FACTORIZATION OF THE VIEWING TRANSFORMATION
, 1995
"... Volume rendering is a technique for visualizing 3D arrays of sampled data. It has applications in areas such as medical imaging and scientific visualization, but its use has been limited by its high computational expense. Early implementations of volume rendering used brute-force techniques that req ..."
Abstract
-
Cited by 422 (2 self)
- Add to MetaCart
Volume rendering is a technique for visualizing 3D arrays of sampled data. It has applications in areas such as medical imaging and scientific visualization, but its use has been limited by its high computational expense. Early implementations of volume rendering used brute-force techniques that require on the order of 100 seconds to render typical data sets on a workstation. Algorithms with optimizations that exploit coherence in the data have reduced rendering times to the range of ten seconds but are still not fast enough for interactive visualization applications. In this thesis we present a family of volume rendering algorithms that reduces rendering times to one second. First we present a scanline-order volume rendering algorithm that exploits coherence in both the volume data and the image. We show that scanline-order algorithms are fundamentally more efficient than commonly-used ray casting algorithms because the latter must perform analytic geometry calculations (e.g. intersecting rays with axis-aligned boxes). The new scanline-order algorithm simply streams through the volume and the image in storage order. We describe variants of the algorithm for both parallel and perspective projections and
The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
- HPCA-4
, 1998
"... As we look to the future, and the prospect of a billion transistors on a chip, it seems inevitable that microprocessors will exploit having multiple parallel threads. To achieve the full potential of these "single-chip multiprocessors," however, we must find a way to parallelize non-numeric applicat ..."
Abstract
-
Cited by 210 (8 self)
- Add to MetaCart
As we look to the future, and the prospect of a billion transistors on a chip, it seems inevitable that microprocessors will exploit having multiple parallel threads. To achieve the full potential of these "single-chip multiprocessors," however, we must find a way to parallelize non-numeric applications. Unfortunately, compilers have had little success in parallelizing non-numeric codes due to their complex access patterns. This paper explores the potential for using thread-level data speculation (TLDS) to overcome this limitation by allowing the compiler to view parallelization solely as a cost/benefit tradeoff, rather than something which is likely to violate program correctness. Our experimental results demonstrate that with realistic compiler support, TLDS can offer significant program speedups. We also demonstrate that through modest hardware extensions, a generic single-chip multiprocessor could support TLDS by augmenting its cache coherence scheme to detect dependence violations, and by using the primary data caches to buffer speculative state.
Compiler-Based Prefetching for Recursive Data Structures
- In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems
, 1996
"... Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, its potential in pointer-based applications has ..."
Abstract
-
Cited by 171 (13 self)
- Add to MetaCart
Software-controlled data prefetching offers the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. While prefetching has enjoyed considerable success in array-based numeric codes, its potential in pointer-based applications has remained largely unexplored. This paper investigates compilerbased prefetching for pointer-based applications---in particular, those containing recursive data structures. We identify the fundamental problem in prefetching pointer-based data structures and propose a guideline for devising successful prefetching schemes. Based on this guideline, we design three prefetching schemes, we automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler, and we evaluate the performance of all three schemes on a modern superscalar processor similar to the MIPS R10000. Our results demonstrate that compiler-inserted prefetching can significantly improve the execution...
Using the SimOS Machine Simulator to Study Complex Computer Systems
- ACM TRANSACTIONS ON MODELING AND COMPUTER SIMULATION
, 1997
"... ... This paper identifies two challenges that machine simulators such as SimOS must overcome in order to effectively analyze large complex workloads: handling long workload execution times and collecting data effectively. To study long-running workloads, SimOS includes multiple interchangeable simul ..."
Abstract
-
Cited by 143 (5 self)
- Add to MetaCart
... This paper identifies two challenges that machine simulators such as SimOS must overcome in order to effectively analyze large complex workloads: handling long workload execution times and collecting data effectively. To study long-running workloads, SimOS includes multiple interchangeable simulation models for each hardware component. By selecting the appropriate combination of simulation models, the user can explicitly control the tradeoff between simulation speed and simulation detail. To handle the large amount of low-level data generated by the hardware simulation models, SimOS contains flexible annotation and event classification mechanisms that map the data back to concepts meaningful to the user. SimOS has been extensively used to study new computer hardware designs, to analyze application performance, and to study operating systems. We include two case studies that demonstrate how a low-level machine simulator such as SimOS can be used to study large and complex workloads.
Trace-Driven Memory Simulation: A Survey
- ACM Computing Surveys
, 2004
"... This article surveys and analyzes these developments by establishing criteria for evaluating trace-driven methods, and then applies these criteria to describe, categorize, and compare over 50 trace-driven simulation tools. We discuss the strengths and weaknesses of different approaches and show t ..."
Abstract
-
Cited by 133 (0 self)
- Add to MetaCart
This article surveys and analyzes these developments by establishing criteria for evaluating trace-driven methods, and then applies these criteria to describe, categorize, and compare over 50 trace-driven simulation tools. We discuss the strengths and weaknesses of different approaches and show that no single method is best when all criteria, including accuracy, speed, memory, flexibility, portability, expense, and ease of use are considered. In a concluding section, we examine fundamental limitations to trace-driven simulation, and survey some recent developments in memory simulation that may overcome these bottlenecks
Fine-grained dynamic instrumentation of commodity operating system kernels
, 1999
"... We have developed a technology, fine-grained dynamic instrumentation of commodity kernels, which can splice (insert) dynamically generated code before almost any machine code instruction of a completely unmodified running commodity operating system kernel. This technology is well-suited to performan ..."
Abstract
-
Cited by 107 (5 self)
- Add to MetaCart
We have developed a technology, fine-grained dynamic instrumentation of commodity kernels, which can splice (insert) dynamically generated code before almost any machine code instruction of a completely unmodified running commodity operating system kernel. This technology is well-suited to performance profiling, debugging, code coverage, security auditing, runtime code optimizations, and kernel extensions. We have designed and implemented a tool called KernInst that performs dynamic instrumentation on a stock production Solaris kernel running on an UltraSPARC. On top of KernInst, we have implemented a kernel performance profiling tool, and used it to understand kernel and application performance under a Web proxy server workload. We used this information to make two changes (one to the kernel, one to the proxy) that cumulatively reduce the percentage of elapsed time that the proxy spends opening disk cache files from 40 % to 7%. 1
The Superthreaded Processor Architecture
, 1999
"... The common single-threaded execution model limits processors to exploiting only the relatively small amount of instruction-level parallelism available in application programs. The superthreaded processor, on the other hand, is a concurrent multithreaded architecture (CMA) that can exploit the multip ..."
Abstract
-
Cited by 82 (13 self)
- Add to MetaCart
The common single-threaded execution model limits processors to exploiting only the relatively small amount of instruction-level parallelism available in application programs. The superthreaded processor, on the other hand, is a concurrent multithreaded architecture (CMA) that can exploit the multiple granularities of parallelism available in general-purpose application programs. Unlike other CMAs that rely primarily on hardware for run-time dependence detection and speculation, the superthreaded processor combines compiler-directed thread-level speculation of control and data dependences with run-time data dependence verification hardware. This hybrid of a superscalar processor and a multiprocessor-ona -chip can utilize many of the existing compiler techniques used in traditional parallelizing compilers developed for multiprocessors. Additional unique compiler techniques, such as the conversion of data speculation into control speculation, are also introduced to generate the superthre...
Efficient Superscalar Performance through Boosting
, 1992
"... The foremost goal of superscalar processor design is to increase performance through tie exploitation of instruction-level parallelism (ILP). Previous studies have shown that speculative execution is required for high instruction per cycle (IPC) rates in non-numerical applications. The general trend ..."
Abstract
-
Cited by 78 (5 self)
- Add to MetaCart
The foremost goal of superscalar processor design is to increase performance through tie exploitation of instruction-level parallelism (ILP). Previous studies have shown that speculative execution is required for high instruction per cycle (IPC) rates in non-numerical applications. The general trend has been toward supporting speculative execution in complicated, dynamically-scheduled processors. Performance, though, is more than just a high IPC rate; it also depends upon instruction count and cycle time. Boosting is an architectural technique that supports general speculative execution in simpler, statically-scheduled processors. Boosting labels speculative instructions with their control dependence information. This Iabelling eliminates control dependence constraints on instruction scheduling while still providing full dependence information to the hardwere. We have incorporated boosting into a trace-based, global scheduling algorithm that exploits ILP without adversely affecting the instruction count of a program. We use this algorithm and estimates of the boosting hardware involved to evaluate how much speculative execution support is rerdly necessary to achieve good performance. We find that a statically-scheduled superscalar processor using a minimal implementation of boosting can easily reach the performance of a much more complex dynamically-schcduled superscalar processor.
Design tradeoffs for software-managed TLBs
- ACM Transactions on Computer Systems
, 1993
"... An increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties, which are highly dependent on the operating system’s structure and its use of virtual memory. This work explores software-managed TLB de ..."
Abstract
-
Cited by 70 (9 self)
- Add to MetaCart
An increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties, which are highly dependent on the operating system’s structure and its use of virtual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of operating systems including monolithic and microkernel designs. Through hardware monitoring and simulation, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation running Ultrix, OSF/1, and three versions of Mach 3.0. Results: New operating systems are changing the relative frequency of different types of TLB misses, some of which may not be efficiently handled by current architectures. For the same application binaries, total TLB service time varies by as much as an order of magnitude under different operating systems. Reducing the handling cost for kernel TLB misses reduces total TLB service time up to 40%. For TLBs between 32 and 128 slots, each doubling of the TLB size reduces total TLB service time by as much as

