Results 1 - 10 of 10
Feature selection and policy optimization for distributed instruction placement using reinforcement learning
- In The 17th International Conference on Parallel Architectures and Compilation Techniques, 2008
Cited by 7 (5 self)
Communication overheads are one of the fundamental challenges in a multiprocessor system. As the number of processors on a chip increases, communication overheads and the distribution of computation and data become increasingly important performance factors. Explicit Dataflow Graph Execution (EDGE) processors, in which instructions communicate with one another directly on a distributed substrate, give the compiler control over communication overheads at a fine granularity. Prior work shows that compilers can effectively reduce fine-grained communication overheads in EDGE architectures using a spatial instruction placement algorithm with a heuristic-based cost function. While this algorithm is effective, the cost function must be painstakingly tuned. Heuristics tuned to perform well across a variety of applications leave users with little ability to tune
Identifying Energy-Efficient Concurrency Levels Using Machine Learning
Cited by 3 (3 self)
Multicore microprocessors have been largely motivated by the diminishing returns in performance and the increased power consumption of single-threaded ILP microprocessors. With the industry already shifting from multicore to many-core microprocessors, software developers must extract more thread-level parallelism from applications. Unfortunately, low power-efficiency and diminishing returns in performance remain major obstacles with many cores. Poor interaction between software and hardware, and bottlenecks in shared hardware structures often prevent scaling to many cores, even in applications where a high degree of parallelism is potentially available. In some cases, throwing additional cores at a problem may actually harm performance and increase power consumption. Better use of otherwise limitedly beneficial cores by software components such as hypervisors and operating systems can improve system-wide performance and reliability, even in cases where power consumption is not a main concern. In response to these observations, we evaluate an approach to throttle concurrency in parallel programs dynamically. We throttle concurrency to levels with higher predicted efficiency from both performance and energy standpoints, and we do so via machine learning, specifically artificial neural networks (ANNs). One advantage of using ANNs over similar techniques previously explored is that the training phase is greatly simplified, thereby reducing the burden on the end user. Using machine learning in the context of concurrency throttling is novel. We show that ANNs are effective for identifying energy-efficient concurrency levels in multithreaded scientific applications, and we do so using physical experimentation on a state-of-the-art quad-core Xeon platform.
Automatic feature generation for setting compilers heuristics
- In 2nd Workshop on Statistical and Machine Learning Approaches to Architectures and Compilation, 2008
Cited by 2 (1 self)
Heuristics in compilers are often designed by manually analyzing sample programs. Recent advances have successfully applied machine learning to automatically generate heuristics. The typical format of these approaches reduces the input loops, functions or programs to a finite vector of features. A machine learning algorithm then learns a mapping from these features to the desired heuristic parameters. Choosing the right features is important and requires expert knowledge since no machine learning tool will work well with poorly chosen features. This paper introduces a novel mechanism to generate features. Grammars describing languages of features are defined and from these grammars sentences are randomly produced. The features are then evaluated over input data and computed values are given to machine learning tools. We propose the construction of domain-specific feature languages for different purposes in different parts of the compiler. Using these feature languages, complex, machine-generated features are extracted from program code. Using our observation that some functions can benefit from setting different compiler options, while others cannot, we demonstrate the use of a decision tree classifier to automatically identify the former using the automatically generated features. We show that our method outperforms human-generated features on problems of loop unrolling and phase ordering, achieving a statistically significant decrease in run-time compared to programs compiled using GCC’s heuristics.
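The grammar-driven feature generation described in this abstract can be illustrated with a minimal sketch. It is not the authors' feature language: the toy grammar (`<feature> ::= count(<pred>)`, `<pred> ::= kind | not <pred> | <pred> or <pred>`), the instruction kinds, and every function name below are assumptions made for illustration. The sketch randomly expands the grammar into feature expressions and evaluates each one over a list of instruction kinds to obtain a feature vector.

```python
import random

# Hypothetical instruction kinds; a real compiler IR would offer far more.
KINDS = ["load", "store", "branch", "add"]


def random_pred(rng, depth=0, max_depth=3):
    """Randomly expand <pred> ::= kind | not <pred> | <pred> or <pred>."""
    # Stop recursing at max_depth so every expansion terminates.
    if depth >= max_depth or rng.random() < 0.5:
        return ("kind", rng.choice(KINDS))
    if rng.random() < 0.5:
        return ("not", random_pred(rng, depth + 1, max_depth))
    return ("or",
            random_pred(rng, depth + 1, max_depth),
            random_pred(rng, depth + 1, max_depth))


def random_feature(rng):
    """Randomly expand <feature> ::= count(<pred>)."""
    return ("count", random_pred(rng))


def holds(pred, instr):
    """Return True if the instruction kind `instr` satisfies `pred`."""
    tag = pred[0]
    if tag == "kind":
        return instr == pred[1]
    if tag == "not":
        return not holds(pred[1], instr)
    return holds(pred[1], instr) or holds(pred[2], instr)


def evaluate(feature, instrs):
    """Evaluate a count(...) feature: how many instructions satisfy it."""
    return sum(1 for instr in instrs if holds(feature[1], instr))


def show(node):
    """Render a generated feature or predicate as a readable sentence."""
    tag = node[0]
    if tag == "kind":
        return f"is_{node[1]}"
    if tag == "not":
        return f"not({show(node[1])})"
    if tag == "or":
        return f"({show(node[1])} or {show(node[2])})"
    return f"count({show(node[1])})"


def feature_vector(instrs, n_features=5, seed=0):
    """Generate n random features and evaluate each over `instrs`."""
    rng = random.Random(seed)
    features = [random_feature(rng) for _ in range(n_features)]
    return [(show(f), evaluate(f, instrs)) for f in features]
```

Calling `feature_vector(["load", "add", "store", "load", "branch"])` returns five `(sentence, value)` pairs; the values would then be handed to a learner such as a decision tree classifier.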
Applying support vector machines to discover method-specific compilation strategies
2010
Cited by 1 (1 self)
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is converted to, or otherwise made available in digital form, the University of Alberta will advise potential users of the thesis of these terms. The author reserves all other publication and other rights in association with the copyright in the thesis and, except as hereinbefore provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatsoever without the author’s prior written permission.
Exploring and Predicting the Architecture/Optimising Compiler Co-Design Space
Embedded processor performance is dependent on both the underlying architecture and the compiler optimisations applied. However, designing both simultaneously is extremely difficult to achieve due to the time constraints designers must work under. Therefore, current methodology involves designing compiler and architecture in isolation, leading to sub-optimal performance of the final product. This paper develops a novel approach to this co-design space problem. For any microarchitectural configuration we automatically predict the performance that an optimising compiler would achieve without actually building it. Once trained, a single run of -O1 on the new architecture is enough to make a prediction with just a 1.6% error rate. This allows the designer to accurately choose an architectural configuration with knowledge of how an optimising compiler will perform on it. We use this to find the best optimising compiler/architectural configuration in our co-design space and demonstrate that it achieves an average 13% performance improvement and energy savings of 23% compared to the baseline, leading to an energy-delay (ED) value of 0.67.
Team Alchemy Architectures, Languages and Compilers to Harness the End of Moore Years
Activity Report
Using Machines to Learn Method-Specific Compilation Strategies
Support Vector Machines (SVMs) are used to discover method-specific compilation strategies in Testarossa, a commercial Just-in-Time (JiT) compiler employed in the IBM® J9 Java™ Virtual Machine. The learning process explores a large number of different compilation strategies to generate the data needed for training models. The trained machine-learned model is integrated with the compiler to predict a compilation plan that balances code quality and compilation effort on a per-method basis. The machine-learned plans outperform the original Testarossa for start-up performance, but not for throughput performance, for which Testarossa has been highly hand-tuned for many years.
ThreadMarks: A Framework for Input-Aware Prediction of Parallel Application Behavior
Chip-multiprocessors (CMPs) are quickly becoming entrenched as the mainstream architectural platform in computer systems. One of the critical challenges facing CMPs is designing applications to effectively leverage the computational resources they provide. Modifying applications to effectively run on CMPs requires understanding the bottlenecks in applications, which necessitates a detailed understanding of architectural features. Unfortunately, identifying bottlenecks is complex and often requires enumerating a wide range of behaviors. To assist in identifying bottlenecks, this paper presents a framework for developing analytical models based on dynamic program behaviors. That is, given a program and set of training inputs, the framework will generate several analytical models that accurately predict online program behaviors such as memory utilization and synchronization overhead, while taking program input into consideration. These models can prove invaluable for online optimization systems and input-specific analysis of program behavior. We demonstrate that this framework is practical and accurate on a wide range of synthetic and real-world parallel applications over various workloads.
Approximate Graph Clustering for Program Characterization
2012
An important aspect of system optimization research is the discovery of program traits or behaviors. In this paper, we present an automated method of program characterization which is able to examine and cluster program graphs, i.e., dynamic data graphs or control flow graphs. Our novel approximate graph clustering technology allows users to find groups of program fragments which contain similar code idioms or patterns in data reuse, control flow, and context. Patterns of this nature have several potential applications including development of new static or dynamic optimizations to be implemented in software or in hardware. For the SPEC CPU 2006 suite of benchmarks, our results show that approximate graph clustering is effective at grouping behaviorally similar functions. Graph-based clustering also produces clusters that are more homogeneous than previously proposed non-graph-based clustering methods. Further qualitative analysis of the clustered functions shows that our approach is also able to identify some frequent unexploited program behaviors. These results suggest that our approximate graph clustering methods could be very useful for program characterization.
Predictive Modeling in a Polyhedral Optimization Space
- In International Symposium on Code Generation and Optimization (CGO'11), 2011
Significant advances in compiler optimization have been made in recent years, enabling many transformations such as tiling, fusion, parallelization and vectorization on imperfectly nested loops. Nevertheless, the problem of finding the best combination of loop transformations remains a major challenge. Polyhedral models for compiler optimization have demonstrated strong potential for enhancing program performance, in particular for compute-intensive applications. But existing static cost models to optimize polyhedral transformations have significant limitations, and iterative compilation has become a very promising alternative to these models to find the most effective transformations. But since the number of polyhedral optimization alternatives can be enormous, it is often impractical to iterate over a significant fraction of the entire space of polyhedrally transformed variants. Recent research has focused on iterating over this search space either with manually-constructed heuristics or with automatic but very expensive search algorithms (e.g., genetic algorithms) that can eventually find good points in the polyhedral space. In this paper, we propose the use of machine learning to address the problem of selecting the best polyhedral optimizations. We show that these models can quickly find high-performance program variants in the polyhedral space, without resorting to extensive empirical search. We introduce models that take as input a characterization of a program based on its dynamic behavior, and predict the performance of aggressive high-level polyhedral transformations that include tiling, parallelization and vectorization. We allow for a minimal empirical search on the target machine, discovering on average 83% of the search-space-optimal combinations in at most 5 runs. Our end-to-end framework is validated using numerous benchmarks on two multi-core platforms.
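The combination of a predictive model with a minimal empirical search can be sketched as follows. This is only an illustration of the general idea, not the paper's framework: the function name `select_by_prediction`, the `predict` and `measure` callables, and the toy numbers in the usage note are all hypothetical. A model ranks the candidate transformation sequences by predicted speedup, and only the top few (here at most 5) are actually run on the target machine.

```python
def select_by_prediction(candidates, predict, measure, max_runs=5):
    """Pick a variant by measuring only the model's top-ranked candidates.

    candidates: iterable of optimization variants (any hashable label).
    predict:    hypothetical model, maps a variant to a predicted speedup.
    measure:    runs the variant on the target machine, returns its time.
    max_runs:   budget of real executions (the minimal empirical search).
    """
    # Rank every candidate by predicted speedup, best first.
    ranked = sorted(candidates, key=predict, reverse=True)
    # Execute only the shortlist instead of the whole space.
    shortlist = ranked[:max_runs]
    timed = [(measure(variant), variant) for variant in shortlist]
    # Keep the variant with the lowest measured execution time.
    best_time, best = min(timed, key=lambda pair: pair[0])
    return best, best_time, len(shortlist)
```

For example, with predicted speedups `{"tile": 1.0, "tile+par": 1.8, "tile+vec": 2.0, "none": 0.5}`, measured times `{"tile": 5.0, "tile+par": 2.0, "tile+vec": 3.0, "none": 9.0}`, and `max_runs=2`, the sketch measures only `tile+vec` and `tile+par` and returns `tile+par`, since the model's second choice turns out to be fastest on the machine.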

