Results 1 - 10 of 115
Data Transformations for Eliminating Conflict Misses
- In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation
, 1998
"... Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-variable padding adjusts variable base addresses ..."
Abstract
-
Cited by 134 (12 self)
- Add to MetaCart
(Show Context)
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occurring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PadLite only uses array and column dimension sizes, relying on assumptions about common array reference patterns. Pad analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PadLite can eliminate conflicts for benchmarks, but Pad is more effective over a range of cache and problem sizes. Padding reduces c...
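Both transformations are easy to picture in code. The following is a minimal sketch, not the paper's Pad/PadLite tools: it assumes a direct-mapped cache holding 2048 doubles (16 KB, 8-byte elements), globals laid out contiguously in declaration order, and illustrative sizes throughout.

/* Minimal sketch of both padding transformations (illustrative values,
 * not the paper's Pad/PadLite implementation). Assume a direct-mapped
 * cache holding 2048 doubles and contiguous layout of globals. */

#define N   2048
#define PAD 8      /* hypothetical pad; Pad derives pads via gcd analysis */

double a[N];       /* b starts exactly one cache-size multiple after a,  */
double b[N];       /* so a[i] and b[i] map to the same set and evict     */
                   /* each other in loops like c[i] = a[i] + b[i]        */

/* Inter-variable padding: a dummy array shifts the next base address
 * so corresponding elements fall into different cache sets.            */
double inter_pad[PAD];
double b_padded[N];

/* Intra-variable padding: walking down a column of x (x[0][j], x[1][j],
 * ...) strides by the row length; growing it from N to N + PAD breaks
 * the power-of-two stride that funnels the whole column into one set.  */
double x[64][N + PAD];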
Precise Miss Analysis for Program Transformations with Caches of Arbitrary Associativity
- In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
"... Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processormemory performance gap include compiler- or programmerapplied optimizations like data structure padding, matrix blocking, and ot ..."
Abstract
-
Cited by 87 (1 self)
- Add to MetaCart
(Show Context)
Analyzing and optimizing program memory performance is a pressing problem in high-performance computer architectures. Currently, software solutions addressing the processor-memory performance gap include compiler- or programmer-applied optimizations like data structure padding, matrix blocking, and other program transformations. Compiler optimization can be effective, but the lack of precise analysis and optimization frameworks makes it impossible to confidently make optimal, rather than heuristic-based, program transformations. Imprecision is most problematic in situations where hard-to-predict cache conflicts foil heuristic approaches. Furthermore, the lack of a general framework for compiler memory performance analysis makes it impossible to understand the combined effects of several program transformations. The Cache Miss Equation (CME) framework discussed in this paper addresses these issues. We express memory reference and cache conflict behavior in terms of sets of equations. The ...
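For intuition, such per-reference equations are built over the standard set-index mapping. The fragment below shows only that building block, not the CME framework itself; the cache geometry is an assumption.

/* Only the mapping such miss equations are written over, not the CME
 * framework itself. Cache geometry below is assumed for illustration. */
#include <stdint.h>

#define LINE_SIZE 32u    /* bytes per cache line (assumed) */
#define NUM_SETS  512u   /* number of cache sets (assumed) */

static inline uint32_t cache_set(uintptr_t addr) {
    return (uint32_t)((addr / LINE_SIZE) % NUM_SETS);
}

/* For two references A[i] and B[i] with bases a0, b0 and element size s,
 * a miss equation asks for which iterations i
 *     cache_set(a0 + s*i) == cache_set(b0 + s*i),
 * i.e. whether a0 - b0 is (nearly) a multiple of LINE_SIZE * NUM_SETS:
 * a linear congruence the compiler can solve exactly rather than
 * approximate with heuristics. */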
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. ..."
Abstract
-
Cited by 76 (5 self)
- Add to MetaCart
(Show Context)
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2-5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
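Of the two layouts, Morton order is the easier to sketch. The index function below is our illustration (16-bit coordinates assumed): it replaces the usual row-major offset i*n + j by interleaving coordinate bits, so nearby (i, j) pairs stay nearby in memory at every power-of-two granularity.

/* Sketch of the Morton (Z-order) index: interleave the bits of the
 * row and column coordinates. 16-bit coordinates assumed. */
#include <stdint.h>

static uint32_t spread_bits(uint32_t x) {
    /* space out the low 16 bits of x so bit k moves to bit 2k */
    x &= 0xFFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* morton(i, j) replaces the row-major offset i*n + j */
static uint32_t morton(uint32_t i, uint32_t j) {
    return (spread_bits(i) << 1) | spread_bits(j);
}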
Tiling Optimizations for 3D Scientific Computations
, 2000
"... Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cann ..."
Abstract
-
Cited by 69 (4 self)
- Add to MetaCart
(Show Context)
Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121% for key scientific kernels, including a 27% average improvement for the key computational loop nest in the SPEC/NAS benchmark MGRID.
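A minimal sketch of the loop structure involved follows (our illustration with hypothetical tile sizes; the paper's contribution is the cost models that pick non-conflicting tile shapes and pads). The i and j loops are tiled while k is left untiled, so the few planes a k-sweep reuses stay cache-resident.

/* Sketch of i/j tiling for a 3-D relaxation; TI and TJ are
 * illustrative, not values chosen by the paper's cost models. */
#define N  200
#define TI 16
#define TJ 16

void relax(double a[N][N][N], double b[N][N][N]) {
    for (int ii = 1; ii < N - 1; ii += TI)
        for (int jj = 1; jj < N - 1; jj += TJ)
            for (int k = 1; k < N - 1; k++)   /* k stays untiled */
                for (int i = ii; i < ii + TI && i < N - 1; i++)
                    for (int j = jj; j < jj + TJ && j < N - 1; j++)
                        b[k][i][j] = (a[k-1][i][j] + a[k+1][i][j] +
                                      a[k][i-1][j] + a[k][i+1][j] +
                                      a[k][i][j-1] + a[k][i][j+1]) / 6.0;
}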
Synthesizing transformations for locality enhancement of imperfectly-nested loop nests
- In Proceedings of the 2000 ACM International Conference on Supercomputing
, 2000
"... We present an approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop ..."
Abstract
-
Cited by 64 (3 self)
- Add to MetaCart
We present an approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of every statement in a loop nest into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like code sinking and loop fusion that are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space is then transformed further to enhance locality, after which fully permutable loops are tiled, and code is generated. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. 1. BACKGROUND AND PREVIOUS WORK Sophisticated algorithms based on polyhedral algebra have been developed for determining good sequences of linear loop transformations (permutation, skewing, reversal and scaling) for enhancing locality in perfectly-nested loops (a perfectly-nested loop is a set of loops in which all assignment statements are contained in the innermost loop). Highlights of this technology are the following. The iterations of the loop nest are modeled as points in an integer lattice, and linear loop transformations are modeled as nonsingular matrices mapping one lattice to another. A sequence of loop transformations is modeled by the product of matrices representing the individual transformations; since the set of nonsingular matrices is closed under matrix product, this means that a sequence of linear loop transformations can be represented by a nonsingular matrix. The problem of finding an optimal sequence of linear loop transformations is thus reduced to the problem of finding an integer matrix that satisfies some desired property, permitting the full machinery of matrix methods and lattice theory to ... (This work was supported by NSF grants CCR-9720211, EIA-9726388, ACI-9870687, EIA-9972853.)
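Code sinking, the simplest of the embeddings being generalized here, is easy to show by hand. The example below is hypothetical code of ours, not from the paper.

/* A hand-worked instance of code sinking. The initialization of s[i]
 * sits at depth 1 while the update sits at depth 2, so the nest is
 * imperfect:
 *
 *   for (i = 0; i < n; i++) {
 *       s[i] = 0.0;
 *       for (j = 0; j < n; j++)
 *           s[i] += a[i][j] * x[j];
 *   }
 *
 * Sinking embeds the statement into the inner iteration space, guarded
 * so it still executes once per i, yielding a perfect nest that linear
 * transformations and tiling can then operate on: */
void matvec_sunk(int n, double s[n], double a[n][n], double x[n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (j == 0) s[i] = 0.0;   /* sunk, guarded initialization */
            s[i] += a[i][j] * x[j];
        }
}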
A Comparison of Compiler Tiling Algorithms
, 1999
"... Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low associativity caches. We propose a new technique based on intra-variable padding and compare its performance with exis ..."
Abstract
-
Cited by 55 (8 self)
- Add to MetaCart
Linear algebra codes contain data locality which can be exploited by tiling multiple loop nests. Several approaches to tiling have been suggested for avoiding conflict misses in low associativity caches. We propose a new technique based on intra-variable padding and compare its performance with existing techniques. Results show padding improves performance of matrix multiply by over 100% in some cases over a range of matrix sizes. Comparing the efficacy of different tiling algorithms, we discover rectangular tiles are slightly more efficient than square tiles. Overall, tiling improves performance from 0-250%. Copying tiles at run time proves to be quite effective.
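The copy optimization mentioned in the last sentence looks roughly like the following simplified sketch of ours (illustrative sizes; c is assumed zero-initialized): each tile of B is copied once into a small contiguous buffer, so its footprint maps to distinct cache lines regardless of the matrix leading dimension.

/* Simplified sketch of run-time tile copying in matrix multiply.
 * N and T are illustrative; assumes c starts zeroed. */
#define N 512
#define T 64

void mm_tiled_copy(double c[N][N], double a[N][N], double b[N][N]) {
    static double tile[T][T];        /* contiguous, conflict-free copy */
    for (int kk = 0; kk < N; kk += T)
        for (int jj = 0; jj < N; jj += T) {
            for (int k = 0; k < T; k++)      /* copy the B tile once...*/
                for (int j = 0; j < T; j++)
                    tile[k][j] = b[kk + k][jj + j];
            for (int i = 0; i < N; i++)      /* ...then reuse it N times */
                for (int k = 0; k < T; k++)
                    for (int j = 0; j < T; j++)
                        c[i][jj + j] += a[i][kk + k] * tile[k][j];
        }
}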
Static timing analysis of embedded software
- In Proc. Design Automation Conf.
, 1997
"... Abstract This paper examines the problem of statically analyzing the performance of embedded software. This problem is motivated by the increasing growth of embedded systems and a lack of appropriate analysis tools. We study dierent performance metrics that need to be considered in this context and ..."
Abstract
-
Cited by 52 (0 self)
- Add to MetaCart
(Show Context)
Abstract This paper examines the problem of statically analyzing the performance of embedded software. This problem is motivated by the increasing growth of embedded systems and a lack of appropriate analysis tools. We study different performance metrics that need to be considered in this context and examine a range of techniques that have been proposed for analysis. Very broadly these can be classified into path analysis and system utilization analysis techniques. It is observed that these are interdependent, and thus need to be considered together in any analysis framework. 1 The Emergence of Embedded Systems Embedded systems are characterized by the presence of processors running application-specific programs. Typical examples include printers, cellular phones, automotive engine controller units, etc. A key difference between an embedded system and a general-purpose computer is that the software in the embedded system is part of the system specification and does not change once the system is shipped to the end user. Recent years have seen a large growth of embedded systems. The migration from application-specific logic to application-specific code running on processors is driven by the demands of more complex system features, lower system cost and shorter development cycles. These can be better met with software programmable solutions made possible by embedded systems. Two distinct points are responsible for this. Flexibility of Software Software is easier to develop and is more flexible than hardware. It can implement more complex algorithms. By using different software versions, a family of products based on the same hardware can be developed to target different market segments, reducing both hardware cost and design time. Software permits the designer to enhance the system features quickly so as to suit the end users' changing requirements and to differentiate the product from its competitors. Increasing Integration Densities The increase in integration densities makes available 1-10 million transistors on a single IC today. With these resources, the notion of a "system on a chip" is becoming a viable implementation technology. This integrates processors, memory, peripherals and a gate array ASIC on a single IC. This high level of integration reduces size, power consumption and cost of the system. The programmable component of the design increases the applicability of the design and thus the sales volume, amortizing high manufacturing setup costs. Less reusable application-specific logic is getting increasingly expensive to develop and manufacture and is the solution only when speed constraints rule out programmable solutions. The pull effect offered by the flexibility of software and the push effect from increasingly expensive application-specific logic solutions make embedded systems an attractive solution.
As system complexity grows and microprocessor performance increases, the embedded system design approach for application-specific systems is becoming more appealing. Thus, we are seeing a movement from the logic gate being the basic unit of computation on silicon, to an instruction running on an embedded processor. This motivates research efforts in the analysis of embedded software. Our capabilities as researchers and tool developers to model, analyze and optimize the gate component of the design must now be extended to handle the embedded software component. This paper examines one such aspect for embedded software: techniques for statically analyzing the timing behavior (i.e., the performance) of embedded software. By static analysis, we refer to techniques that use results of information collected at or before compile time. This may include information collected in profiling runs of the code executed before the final compilation. In contrast, dynamic performance analysis refers to on-the-fly performance monitoring while the embedded software is installed and running. We limit the scope of this paper by considering only single software components, i.e. the execution of a single program on a known processor. The analysis of multiple processes belongs to the larger field of system-level performance analysis. We start by examining the various performance metrics of interest in Section 2. Next, we look at the different applications of performance analysis in Section 3. In Section 4 we examine the different components that make this analysis task difficult, and for each we summarize the analysis techniques that are described in existing literature. Finally, in Section 5 we conclude and point out interesting future directions for research. 2 Performance Metrics for Embedded Software Extreme Case Performance Embedded systems generally interact with the outside world. This may involve measuring sensors and controlling actuators, communicating with other systems, or interacting with users. These tasks may have to be performed at precise times. A system with such timing
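The flavor of path analysis reduces to a simple worst-case computation. The toy example below is ours, not from the paper; the block costs and loop bound are hypothetical stand-ins for what a real tool would extract from the object code and user annotations.

/* Toy static WCET bound: combine per-basic-block cycle costs with
 * worst-case execution counts from loop bounds and branch structure.
 * All numbers are hypothetical. */
#include <stdio.h>

int main(void) {
    const int MAX_ITER      = 100; /* loop bound from an annotation      */
    const int cost_header   = 4;   /* per-block cycle costs from a model */
    const int cost_filter   = 60;  /* then-branch body                   */
    const int cost_fallback = 25;  /* else-branch body                   */

    /* path analysis: the worst path through the branch costs
       max(cost_filter, cost_fallback) on every iteration */
    int worst_branch = cost_filter > cost_fallback ? cost_filter
                                                   : cost_fallback;
    long wcet = (long)MAX_ITER * (cost_header + worst_branch);
    printf("static WCET bound: %ld cycles\n", wcet);
    return 0;
}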
Analytical modeling of set-associative cache behavior
- IEEE Trans. Comput.
, 1999
"... ..."
(Show Context)
Tuning Strassen's Matrix Multiplication for Memory Efficiency
- In Proceedings of SC98 (CD-ROM)
, 1998
"... Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of thi ..."
Abstract
-
Cited by 41 (5 self)
- Add to MetaCart
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order that is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms th...
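The truncation-point idea lends itself to a short sketch. The code below is ours, not the paper's, and uses a fixed MAX_LEAF where the paper selects the point dynamically: deeper recursion shrinks the leaf kernel but forces the matrix to be padded to a larger power-of-two multiple, so the depth is chosen to minimize padding.

/* Sketch of choosing the Strassen recursion depth to minimize padding.
 * MAX_LEAF is an illustrative cap on the conventional-kernel size. */
#include <stdio.h>

#define MAX_LEAF 128

int padded_size(int n, int depth) {
    int unit = 1 << depth;                 /* padded n must be a      */
    return ((n + unit - 1) / unit) * unit; /* multiple of 2^depth     */
}

int best_depth(int n) {
    int best = 0, best_pad = 1 << 30;
    for (int d = 0; (n >> d) > 0; d++) {
        int leaf = padded_size(n, d) >> d;   /* leaf matrix size       */
        if (leaf > MAX_LEAF) continue;       /* leaf kernel too large  */
        int pad = padded_size(n, d) - n;
        if (pad < best_pad) { best_pad = pad; best = d; }
    }
    return best;
}

int main(void) {
    int n = 1000;
    int d = best_depth(n);
    printf("n=%d: recurse %d levels, pad to %d\n", n, d, padded_size(n, d));
    return 0;
}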
Modulo Scheduling for a Fully-Distributed Clustered . . .
, 2000
"... Clustering is an approach that many microprocessors are adopting in recent times in order to mitigate the increasing penalties of wire delays. In this work we propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo sche ..."
Abstract
-
Cited by 40 (4 self)
- Add to MetaCart
Clustering is an approach that many recent microprocessors have adopted to mitigate the increasing penalties of wire delays. In this work we propose a novel clustered VLIW architecture which has all its resources partitioned among clusters, including the cache memory. A modulo scheduling scheme for this architecture is also proposed. This algorithm takes into account both register and memory inter-cluster communications so that the final schedule results in a cluster assignment that favors cluster locality in cache references and register accesses. It has been evaluated for both 2- and 4-cluster configurations and for differing numbers and latencies of inter-cluster buses. The proposed algorithm produces schedules with very low communication requirements and outperforms previous cluster-oriented schedulers. 1. Introduction Technology projections point to wire delays as being one of the main hurdles for improving instruction throughput of future microprocessors [23]. ...
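For readers unfamiliar with modulo scheduling, the quantity every such scheduler works against is the minimum initiation interval (MII). The toy computation below is ours, with hypothetical numbers: resource and recurrence constraints each impose a lower bound, and inter-cluster bus latency enters through the dependence cycles.

/* Toy MII computation: MII = max(ResMII, RecMII). All numbers are
 * hypothetical per-cluster figures for an assumed clustered VLIW. */
#include <stdio.h>

int main(void) {
    /* ResMII: ceil(ops needing a resource / copies of that resource) */
    int mem_ops = 6,  mem_units = 2;
    int alu_ops = 10, alu_units = 4;
    int res_mii = (mem_ops + mem_units - 1) / mem_units;
    int alu_mii = (alu_ops + alu_units - 1) / alu_units;
    if (alu_mii > res_mii) res_mii = alu_mii;

    /* RecMII: max over dependence cycles of ceil(latency / distance);
       an inter-cluster transfer adds its bus latency to the cycle */
    int cycle_latency  = 7 + 2;  /* op latencies + 2-cycle bus transfer */
    int cycle_distance = 2;      /* iteration distance of the cycle     */
    int rec_mii = (cycle_latency + cycle_distance - 1) / cycle_distance;

    int mii = res_mii > rec_mii ? res_mii : rec_mii;
    printf("ResMII=%d RecMII=%d MII=%d\n", res_mii, rec_mii, mii);
    return 0;
}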