Analytical bounds for optimal tile size selection
In Proc. 21st Int. Conf. on Compiler Construction, CC’12, 2012
Cited by 6 (1 self)
Abstract. In this paper, we introduce a novel approach to guide tile size selection by employing analytical models to limit empirical search within a subspace of the full search space. Two analytical models are used together: 1) an existing conservative model, based on the data footprint of a tile, which ignores intra-tile cache block replacement, and 2) an aggressive new model that assumes optimal cache block replacement within a tile. Experimental results on multiple platforms demonstrate the practical effectiveness of the approach by reducing the search space for the optimal tile size by 1,307× to 11,879× for an Intel Core 2 Quad system; 358× to 1,978× for an Intel Nehalem system; and 45× to 1,142× for an IBM Power7 system. The execution of rectangularly tiled code tuned by a search of the subspace identified by our model achieves speed-ups of up to 1.40× (Intel Core 2 Quad), 1.28× (Nehalem) and 1.19× (Power7) relative to the best possible square tile sizes on these different processor architectures. We also demonstrate the integration of the analytical bounds with existing search optimization algorithms. Our approach not only reduces the total search time from Nelder-Mead Simplex and Parallel Rank Ordering methods by factors of up to 4.95× and 4.33×, respectively, but also finds better tile sizes that yield higher performance in tuned tiled code.
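To make the "data footprint of a tile" idea concrete, here is a minimal Python sketch of a footprint-based filter. It is not the model from the paper: the matrix-multiply kernel, the function names, the 8-byte element size, the 32 KiB capacity, and the candidate sizes are all assumptions chosen for illustration. It only shows how a conservative capacity check can discard candidate tile sizes before any empirical search is run.

```python
from itertools import product


def matmul_tile_footprint_bytes(ti, tj, tk, elem_bytes=8):
    """Distinct bytes touched by one (ti x tj x tk) tile of
    C[i][j] += A[i][k] * B[k][j] (hypothetical example kernel)."""
    # One ti x tk block of A, one tk x tj block of B, one ti x tj block of C.
    return (ti * tk + tk * tj + ti * tj) * elem_bytes


def tiles_fitting_cache(candidates, cache_bytes, elem_bytes=8):
    """Keep only the (ti, tj, tk) triples whose footprint fits in the cache.

    This ignores replacement effects inside a tile, so it is a
    conservative filter, not a performance prediction.
    """
    return [
        t
        for t in product(candidates, repeat=3)
        if matmul_tile_footprint_bytes(*t, elem_bytes=elem_bytes) <= cache_bytes
    ]


sizes = [8, 16, 32, 64, 128]          # assumed candidate tile extents
fitting = tiles_fitting_cache(sizes, cache_bytes=32 * 1024)
print(len(fitting), "of", len(sizes) ** 3, "candidate tiles fit")
```

An empirical search would then time only the tiles in `fitting`, which is the general sense in which an analytical bound shrinks the search space.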
Using Graph-Based Program Characterization for Predictive Modeling
Cited by 5 (3 self)
Using machine learning has proven effective at choosing the right set of optimizations for a particular program. For machine learning techniques to be most effective, compiler writers have to develop expressive means of characterizing the program being optimized. The current state-of-the-art techniques for characterizing programs include using a fixed-length feature vector of either source code features extracted during compile time or performance counters collected when running the program. For the problem of identifying optimizations to apply, models constructed using performance counter characterizations of a program have been shown to outperform models constructed using source code features. However, collecting performance counters requires running the program multiple times, and this “dynamic” method of characterizing programs can be specific to inputs of the program. It would be preferable to
Model-Driven Tile Size Selection for DOACROSS Loops on GPUs
2011
Cited by 5 (0 self)
Abstract. DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops than DOALL loops on GPUs. This paper presents a model-driven approach to automating this process. Validation using 1D, 2D and 3D SOR solvers shows that our framework can find the tile sizes for these representative DOACROSS loops to achieve performance close to the best observed for a range of problem sizes tested.
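The wavefront idea behind skewing can be illustrated with a small Python sketch. It is a simplified, element-level illustration rather than the GPU tiling scheme of the paper: the dependence pattern (each cell reads its upper and left neighbours), the boundary value of 1, and all function names are assumptions. Cells with the same `i + j` do not depend on one another, so each anti-diagonal could be processed in parallel.

```python
def wavefronts(n, m):
    """Yield lists of (i, j) cells with equal i + j for an n x m grid."""
    for w in range(n + m - 1):
        lo, hi = max(0, w - m + 1), min(n, w + 1)
        yield [(i, w - i) for i in range(lo, hi)]


def update(grid, i, j):
    """Hypothetical loop body with dependences on (i-1, j) and (i, j-1)."""
    up = grid[i - 1][j] if i > 0 else 1
    left = grid[i][j - 1] if j > 0 else 1
    grid[i][j] = up + left


def run_row_major(n, m):
    """Reference: the original sequential loop order."""
    grid = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            update(grid, i, j)
    return grid


def run_wavefront(n, m):
    """Same computation, visiting one anti-diagonal at a time."""
    grid = [[0] * m for _ in range(n)]
    for front in wavefronts(n, m):
        for i, j in front:  # cells in one front are mutually independent
            update(grid, i, j)
    return grid


print(run_row_major(4, 5) == run_wavefront(4, 5))
```

Both orders produce the same grid because every cell's inputs lie on an earlier wavefront; tiling a skewed loop nest applies the same reasoning to whole tiles instead of single cells.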
Dynamic selection of tile sizes
In Proceedings of the International Conference on High Performance Computing, 2011
Cited by 1 (0 self)
Abstract—Tiling is a key program transformation to achieve effective data reuse. But the performance of tiled programs can vary considerably with different tile sizes. Hence the selection of good tile sizes is crucial. Although there has been considerable research on analytical models for selecting tile sizes, they have not been shown to be effective in finding optimal tile sizes across a range of programs and target architectures. Auto-tuning is a viable alternative that is often used in practice, and involves the execution of different combinations of tile sizes in a systematic fashion to find the best ones. But this is sometimes infeasible, for instance when the program is to be run on unknown platforms (e.g., cloud environments). We propose a novel approach for generating code to enable dynamic tile size selection, based on monitoring the performance of a few loop iterations. The selection operates at run time on the “production” run, without any a priori knowledge of the execution environment. We discuss the theory and implementation of a parametric tiled code generator that enables run-time tile size tuning and describe a search strategy to determine effective tile sizes. Experimental results demonstrate the effectiveness of the approach.
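The general shape of run-time selection, timing a few tiles per candidate size and keeping the cheapest, might look like the Python sketch below. It is not the paper's code generator or search strategy: `select_tile_size`, `make_kernel`, the exhaustive scan over candidates, and the list-summing stand-in workload are all hypothetical names and choices made for illustration.

```python
import time


def select_tile_size(kernel, candidates, trial_tiles=4, timer=time.perf_counter):
    """Run `trial_tiles` tiles per candidate and return the candidate with
    the lowest measured time per processed element.

    `kernel(tile_size)` must execute one tile of real work and return the
    number of elements it processed, so trial runs still make progress.
    """
    best, best_cost = None, float("inf")
    for tile_size in candidates:
        start = timer()
        work = 0
        for _ in range(trial_tiles):
            work += kernel(tile_size)
        cost = (timer() - start) / max(work, 1)
        if cost < best_cost:
            best, best_cost = tile_size, cost
    return best


def make_kernel(data):
    """Build a stand-in kernel that walks `data` one tile at a time."""
    pos = 0

    def kernel(tile_size):
        nonlocal pos
        end = min(pos + tile_size, len(data))
        _ = sum(data[pos:end])  # placeholder for the real tile body
        done = end - pos
        pos = end % len(data)   # wrap around when the end is reached
        return done

    return kernel


kernel = make_kernel(list(range(1 << 16)))
print("chosen tile size:", select_tile_size(kernel, [64, 256, 1024, 4096]))
```

The `timer` parameter is there so the selection logic can be exercised with a deterministic clock; in a real run the measurements are noisy, and a practical scheme would repeat or smooth them.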
Neural Network Assisted Tile Size Selection
Cited by 1 (0 self)
Abstract. Data locality optimization plays a significant role in reducing the execution time of many loop-intensive kernels. Loop tiling at various levels is often used to effectively exploit data locality in deep memory hierarchies. The recent development of frameworks for parametric loop tiling of user code has led to a widening of the range of applications that could benefit from auto-tuning of tile sizes. Current model-driven approaches suffer from limitations, such as the inability to accurately model the complex interplay between multiple hardware components that affect performance. Auto-tuning libraries such as ATLAS rely on extensive empirical search for tile size optimization, which has been shown to be very effective. However, the effectiveness of such approaches for arbitrary parametrically tiled user code has not been demonstrated. We consider the problem of selecting the best tile sizes for arbitrary user-defined programs, by sampling in the full space of tile sizes. We have developed a technique to build a performance predictor associated with a specific program. Our approach uses statistical machine learning to train an artificial neural network (ANN) to predict the performance distribution of execution time for scientific kernels. We show how this search strategy significantly improves over the variability of random search. Our observations and results on various kernels also show promise for the use of ANNs in predicting the runtime behavior for variations of tiling configurations.
An Improved Machine Learning Approach for Selecting a Polyhedral Model Transformation
Algorithms in fields like image manipulation, signal processing, and statistics frequently employ tight CPU-bound loops, whose performance is highly dependent on efficient utilization of the CPU and memory bus. The polyhedral model allows the automatic generation of loop nest transformations that are semantically equivalent to the original. The challenge, however, is to select the transformation that gives the highest performance on a given architecture. In this paper, we present an improved machine learning approach to select the best transformation. Our approach can be used as a stand-alone method that yields accuracy comparable to the best previous approach but offers a substantially faster selection process. In addition, our approach can be combined with the best previous approach into a higher-level selection process that is more accurate than either method alone. Compared to prior work, the key distinguishing characteristics of our approach are formulating the problem as a classification problem rather than a regression problem, using static structural features in addition to dynamic performance counter features, performing feature selection, and using ensemble methods to boost the performance of the classifier.