Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
Cited by 67 (4 self)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2-5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
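To make the Morton layout named in this abstract concrete, the following is a minimal Python sketch of one common formulation: the linear offset of element (i, j) is formed by interleaving the bits of the row and column indices. The function names and the default 16-bit index width are assumptions made for illustration; the paper's own implementation is not shown in this listing.

```python
def morton_index(i: int, j: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of i and j (Morton / Z-order).

    Bit k of j lands at bit 2k of the result; bit k of i lands at bit 2k+1.
    The bit width is a hypothetical parameter chosen for this sketch.
    """
    if i < 0 or j < 0 or i >= (1 << bits) or j >= (1 << bits):
        raise ValueError("index out of range for the chosen bit width")
    z = 0
    for k in range(bits):
        z |= ((j >> k) & 1) << (2 * k)
        z |= ((i >> k) & 1) << (2 * k + 1)
    return z


def row_major_index(i: int, j: int, ncols: int) -> int:
    """Conventional row-major offset, shown only for comparison."""
    return i * ncols + j


if __name__ == "__main__":
    # Print the Morton offsets of a 4x4 array, one row per line.
    for i in range(4):
        print([morton_index(i, j) for j in range(4)])
```

For a 4x4 array the rows come out as [0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15], so each aligned 2x2 block occupies four consecutive offsets, which is the locality property a nonlinear layout aims for.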
Parallelized direct execution simulation of message-passing parallel programs
- IEEE Transactions on Parallel and Distributed Systems
, 1996
Cited by 42 (10 self)
As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture
, 1999
Cited by 26 (4 self)
New instruction-set architectures (ISAs) live or die depending on how quickly they develop a large software base. This paper describes SoftSDV, a presilicon software-development environment that has enabled at least eight commercial operating systems and numerous large applications to be ported and tuned to IA-64, well in advance of the Itanium™ processor's first silicon. IA-64 versions of Microsoft Windows 2000 and Trillian Linux that were developed on SoftSDV booted within ten days of the availability of the Itanium processor.
METRIC: Tracking Down Inefficiencies in the Memory Hierarchy via Binary Rewriting
, 2003
Cited by 22 (12 self)
In this paper, we present METRIC, an environment for determining memory inefficiencies by examining data traces. METRIC is designed to alter the performance behavior of applications that are mostly constrained by their latency to resolve memory references. We make four primary contributions in this paper. First, we present methods to extract partial data traces from running applications by observing their memory behavior via dynamic binary rewriting. Second, we present a methodology to represent partial data traces in constant space for regular references through a novel technique for online compression of reference streams. Third, we employ offline cache simulation to derive indications about memory performance bottlenecks from partial data traces. By exploiting summarized memory metrics, by-reference metrics as well as cache evictor information, we can pinpoint the sources of performance problems. Fourth, we demonstrate the ability to derive opportunities for optimizations and assess their benefits in several experiments resulting in up to 40% lower miss ratios.
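The idea of offline cache simulation with evictor information can be illustrated with a small Python sketch. It replays a trace of (reference id, address) pairs through a direct-mapped cache and records, for every conflict miss, which reference evicted which. The cache geometry, the trace format, and the function name are all assumptions for illustration; this is not METRIC's simulator.

```python
from collections import Counter


def simulate_direct_mapped(trace, num_lines=4, line_size=16):
    """Replay a list of (ref_id, address) pairs through a direct-mapped cache.

    Returns (miss_ratio, evictors), where evictors counts
    (evicting_ref, evicted_ref) pairs. All parameters are hypothetical.
    """
    lines = [None] * num_lines   # each slot holds (tag, ref_id) or None
    misses = 0
    evictors = Counter()
    for ref_id, addr in trace:
        block = addr // line_size
        index = block % num_lines
        tag = block // num_lines
        resident = lines[index]
        if resident is not None and resident[0] == tag:
            continue  # hit: the line keeps the ref_id that loaded it
        misses += 1
        if resident is not None:
            # Conflict: remember who displaced whom.
            evictors[(ref_id, resident[1])] += 1
        lines[index] = (tag, ref_id)
    miss_ratio = misses / len(trace) if trace else 0.0
    return miss_ratio, evictors
```

With the default geometry, two references that alternate between addresses 0 and 64 map to the same cache line and evict each other on every access, which is the kind of pattern that per-reference evictor counts make visible.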
Partial Data Traces: Efficient Generation and Representation
, 2001
Cited by 7 (2 self)
Binary manipulation techniques are increasing in popularity. They support program transformations tailored toward certain program inputs, and these transformations have been shown to yield performance gains beyond the scope of static code optimizations without profile-directed feedback. They even deliver moderate gains in the presence of profile-guided optimizations. In addition, transformations can be performed on the entire executable, including library routines. This work focuses on program instrumentation, yet another application of binary manipulation. This paper reports preliminary results on generating partial data traces through dynamic binary rewriting. The contributions are threefold. First, a portable method for extracting precise data traces for partial executions of arbitrary applications is developed. Second, a set of hierarchical structures for compactly representing these accesses is developed. Third, an efficient online algorithm to detect regular accesses is introduced. These efforts are part of a larger project to counter the increasing gap between processor and main memory speeds by means of software optimization and hardware enhancements.
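One simple way to detect regular accesses online, sketched below in Python, is to fold consecutive addresses that share a constant stride into (start, stride, count) descriptors in a single pass. This is an illustrative stand-in under assumed names and a flat descriptor format; it is not the paper's algorithm or its hierarchical structures.

```python
def compress_strides(addresses):
    """Fold an address stream into (start, stride, count) runs in one pass."""
    runs = []
    for addr in addresses:
        if runs:
            start, stride, count = runs[-1]
            last = start + stride * (count - 1)
            if count == 1:
                # A second address always fixes the stride of the open run.
                runs[-1] = (start, addr - start, 2)
                continue
            if addr - last == stride:
                # The address continues the current regular pattern.
                runs[-1] = (start, stride, count + 1)
                continue
        # Pattern broken (or first address): open a new run.
        runs.append((addr, 0, 1))
    return runs


def expand(runs):
    """Inverse of compress_strides: regenerate the original address stream."""
    out = []
    for start, stride, count in runs:
        out.extend(start + stride * k for k in range(count))
    return out
```

A fully regular stream collapses to a single descriptor regardless of its length, while irregular addresses each cost at most one descriptor, so the encoding stays lossless.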
Detailed Cache Coherence Characterization for OpenMP Benchmarks
- In International Conference on Supercomputing, 287–297
, 2004
Cited by 6 (5 self)
Past work on studying cache coherence in shared-memory symmetric multiprocessors (SMPs) concentrates on studying aggregate events, often from an architecture point of view. However, this approach provides insufficient information about the exact sources of inefficiencies in parallel applications. For SMPs in contemporary clusters, application performance is impacted by the pattern of shared memory usage, and it becomes essential to understand coherence behavior in terms of the application program constructs, such as data structures and source code lines. The ...
Analytical Computation of Ehrhart Polynomials and its Application in Compile-Time Generated Cache Hints
, 2004
Cited by 5 (1 self)
In modern micro-architectures, computation speed is often reduced by cache misses. Cache analysis is therefore imperative to obtain effective optimization. We present an analytical technique based on reuse distances that focuses on efficiently determining the behavior of fully associative caches and extends to set-associative caches. In this technique, the number of cache misses is obtained by counting the number of integer points in a parameterized polytope.
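As a point of reference for the reuse-distance concept this abstract builds on, here is a brute-force Python sketch. For a fully associative LRU cache with a given number of lines, an access hits exactly when fewer distinct blocks than that capacity were touched since the previous access to the same block. The function names are assumptions, and the code enumerates the trace directly; it does not implement the analytical polytope-counting method the paper describes.

```python
def reuse_distances(blocks):
    """Return, per access, the number of distinct other blocks touched since
    the previous access to the same block (None for a first access)."""
    stack = []  # LRU stack: most recently used block is at the end
    out = []
    for b in blocks:
        if b in stack:
            pos = stack.index(b)
            # Everything above b in the stack was touched since its last use.
            out.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            out.append(None)
        stack.append(b)
    return out


def lru_misses(blocks, capacity):
    """Misses of a fully associative LRU cache holding `capacity` blocks."""
    return sum(1 for d in reuse_distances(blocks)
               if d is None or d >= capacity)
```

For the block sequence a, b, c, a, b, c every reuse has distance 2, so a three-line cache sees only the three cold misses while a two-line cache misses on all six accesses.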
Cache conscious programming in undergraduate computer science
- In SIGCSE
, 1999
Cited by 3 (0 self)
The wide-spread use of microprocessor based systems that utilize cache memory to alleviate excessively long DRAM access times introduces a new dimension in the quest to obtain good program performance. To fully exploit the performance potential of these fast processors, programmers must reason about their program’s cache performance. Heretofore, this topic has been restricted to the supercomputer, multiprocessor, and academic research community. It is now time to introduce this topic into undergraduate computer science curriculum.
... performance potential of fast processors, programmers must explicitly consider cache behavior, restructuring their codes to increase locality. As fast processors proliferate, techniques for improving cache performance must move beyond the supercomputer, multiprocessor, and academic research communities and into the mainstream of computing. To expedite this transfer of knowledge, as part of the CURIOUS (Center for Undergraduate education and Research: Integration thrOUgh performance and viSualization) project at Duke ...
Tools and Techniques for Memory System Design and Analysis
, 1995
Cited by 3 (1 self)
As processor cycle times decrease, memory system performance becomes ever more critical to overall performance. Continually changing technology and workloads create a moving target for computer architects in their effort to design cost-effective memory systems. Meeting the demands of ever changing workloads and technology requires the following: efficient techniques for evaluating memory system performance, tuning programs to better use the memory system, and new memory system designs. This thesis makes contributions in each of these areas. Hardware and software developers rely on simulation to evaluate new ideas. In this thesis, I present a new interface for writing memory system simulators, the active memory abstraction, designed specifically for simulators that process memory references as the application executes and avoid storing them to tape or disk. Active memory allows simulators to optimize for the common case, e.g., cache hits, achieving simulation times only 2-6 t...
METRIC: Tracking memory bottlenecks via binary rewriting
, 2003
Cited by 3 (2 self)
(Under the direction of Assistant Professor Dr. Frank Mueller). Over recent decades, computing speeds have grown much faster than memory access speeds. This differential rate of improvement between processor speeds and memory speeds has led to an ever-increasing processor-memory gap. Overall computing speeds for most applications are now dominated by the cost of their memory references. Furthermore, memory access costs will grow increasingly dominant as the processor-memory gap widens. In this scenario, characterizing and quantifying application program memory usage to isolate, identify and eliminate memory access bottlenecks will have significant impact on overall application computing performance. This thesis presents METRIC, an environment for determining memory access inefficiencies by examining access traces. This thesis makes three primary contributions. First, we present methods to extract partial access traces from running applications by observing their memory behavior via dynamic binary rewriting. Second, we present a methodology to represent partial access traces in constant space for regular references through a novel technique for online compression of reference streams. Third, we employ offline cache simulation to derive indications about memory performance bottlenecks from partial access traces. By examining summarized and by-reference metrics as well as cache evictor information, we can pinpoint the sources of performance problems. We perform validation experiments of the framework with respect to accuracy, compression and execution overheads for several benchmarks. Finally, we demonstrate the ability to derive opportunities for optimizations and assess their benefits in several case studies, resulting in up to 40% lower miss ratios.

