Results 11 - 20 of 25
Exact Dependence Analysis for Increased Communication Overlap
"... Abstract. MPI programs are often challenged to scale up to several million cores. In doing so, the programmer tunes every aspect of the application code. However, for large applications, this is often not practical and expensive tracing tools and post-mortem analysis are employed to guide the tuning ..."
Cited by 1 (0 self)
Abstract. MPI programs are often challenged to scale up to several million cores. In doing so, the programmer tunes every aspect of the application code. However, for large applications this is often not practical, and expensive tracing tools and post-mortem analysis are employed to guide the tuning efforts in finding hot-spots and performance bottlenecks. In this paper we revive the use of compiler analysis techniques to automatically unveil opportunities for communication/computation overlap, using the result of exact data dependence analysis provided by the polyhedral model. We apply our technique to a 5-point stencil code, showing performance improvements of up to 28% on 512 cores.
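A minimal C/MPI sketch of the overlap pattern this abstract targets, assuming a 1-D row decomposition of the 5-point stencil: halo exchanges are posted with nonblocking calls, interior rows are updated while messages are in flight, and only the boundary rows wait for the halos. The decomposition and names are our assumptions, not the paper's generated code.

```c
/* Sketch of communication/computation overlap in a 1-D-decomposed 5-point
 * stencil; illustrative only. Grid is (local_rows + 2) x ncols, with halo
 * rows 0 and local_rows + 1; up/down may be MPI_PROC_NULL at the ends. */
#include <mpi.h>

void stencil_step(double *u, double *unew, int local_rows, int ncols,
                  int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Post the halo exchange with both neighbors. */
    MPI_Irecv(&u[0],                        ncols, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(&u[(local_rows + 1) * ncols], ncols, MPI_DOUBLE, down, 1, comm, &req[1]);
    MPI_Isend(&u[1 * ncols],                ncols, MPI_DOUBLE, up,   1, comm, &req[2]);
    MPI_Isend(&u[local_rows * ncols],       ncols, MPI_DOUBLE, down, 0, comm, &req[3]);

    /* Overlap: interior rows 2..local_rows-1 need no remote data. */
    for (int i = 2; i < local_rows; i++)
        for (int j = 1; j < ncols - 1; j++)
            unew[i * ncols + j] = 0.25 * (u[(i - 1) * ncols + j] + u[(i + 1) * ncols + j]
                                        + u[i * ncols + j - 1] + u[i * ncols + j + 1]);

    /* Boundary rows 1 and local_rows depend on the halos. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    for (int j = 1; j < ncols - 1; j++) {
        unew[1 * ncols + j] = 0.25 * (u[0 * ncols + j] + u[2 * ncols + j]
                                    + u[1 * ncols + j - 1] + u[1 * ncols + j + 1]);
        unew[local_rows * ncols + j] = 0.25 * (u[(local_rows - 1) * ncols + j]
                                    + u[(local_rows + 1) * ncols + j]
                                    + u[local_rows * ncols + j - 1]
                                    + u[local_rows * ncols + j + 1]);
    }
}
```

The dependence information decides which iterations are safe to run before the wait; here that split is hand-written, whereas the paper derives it automatically from the polyhedral dependence analysis.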
Initial Design of a Test Suite for Automatic Performance Analysis Tools
, 2002
"... Automatic performance tools must of course be tested as to whether they perform their task correctly. Because performance tools are meta-programs, tool testing is more complex than ordinary program testing, and comprises at least three aspects. First, it must be ensured that the tools do neither alt ..."
Cited by 1 (1 self)
Automatic performance tools must, of course, be tested as to whether they perform their task correctly. Because performance tools are meta-programs, tool testing is more complex than ordinary program testing and comprises at least three aspects. First, it must be ensured that the tools neither alter the semantics nor distort the runtime behavior of the application under investigation. Next, it must be verified that the tools collect the correct performance data as required by their specification. Finally, it must be checked that the tools indeed perform their intended tasks: for badly performing applications, relevant performance problems must be detected and reported; on the other hand, tools should not diagnose performance problems for well-tuned programs without such problems. In short, performance tools should be semantics-preserving, complete, and correct. Focusing on the correctness aspect, testing can be done by using synthetic test functions with controllable performance properties, and/or real-world applications with known performance behavior. A systematic test suite can be built from such functions and other components, possibly with the help of tools that assist the user in putting the pieces together into executable test programs.
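As a hedged illustration of a synthetic test function with a controllable performance property (our own sketch, not taken from the proposed test suite), the C/MPI fragment below injects a tunable load imbalance in front of a barrier; a correct analysis tool should detect and quantify the resulting waiting time.

```c
/* Synthetic test function with one controlled property: rank-dependent
 * load imbalance. The 'imbalance' parameter (seconds) is the property
 * under test; a correct tool should report barrier waiting time. */
#include <mpi.h>
#include <time.h>

static void burn_cpu(double seconds)
{
    clock_t start = clock();
    while ((double)(clock() - start) / CLOCKS_PER_SEC < seconds)
        ;  /* busy-wait: deterministic, measurable work */
}

void imbalance_at_barrier(double imbalance, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Each rank works a different amount; rank size-1 works longest. */
    burn_cpu(imbalance * rank / (size > 1 ? size - 1 : 1));
    MPI_Barrier(comm);  /* fast ranks block here: the expected "problem" */
}
```

Because the injected imbalance is known exactly, the tool's report can be checked against it, which is precisely the controllability the abstract calls for.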
Available task-level parallelism on the Cell BE
, 2009
"... Abstract. There is a clear industrial trend towards chip multiprocessors (CMP) as the most power efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power efficiency trend by using multiple types of processors, tailored to the workloads the ..."
Abstract. There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models that allow the programmer to identify parallel tasks, combined with runtime management of the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to higher numbers of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction, and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental factor in performance, while the working set is not that large. We expect a significant performance impact if task management were run using a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.
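To make the runtime management of inter-task dependencies concrete, here is a minimal C sketch of the underlying idea, a dependency counter per task that releases successors when it reaches zero. The structures and names are hypothetical and do not reflect the actual Cell Superscalar implementation.

```c
/* Hypothetical task-graph bookkeeping: a task becomes ready when its
 * last predecessor finishes. A real runtime would use atomic counters
 * and manage the graph memory; this sketch shows only the core step. */
typedef struct task {
    void (*run)(void *);
    void *arg;
    int unmet_deps;             /* predecessors not yet finished */
    struct task **successors;   /* tasks waiting on this one */
    int nsucc;
} task_t;

/* Called when 't' finishes on some processing element. */
void task_completed(task_t *t, void (*enqueue_ready)(task_t *))
{
    for (int i = 0; i < t->nsucc; i++) {
        task_t *s = t->successors[i];
        if (--s->unmet_deps == 0)   /* dependency edge satisfied */
            enqueue_ready(s);       /* now runnable on a worker core */
    }
}
```

This bookkeeping is exactly the work that, per the abstract, runs on the PPE and becomes the scalability bottleneck; its cost is dominated by pointer-chasing over the graph, which is why out-of-order execution and memory latency matter so much.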
ePRO-MP: A tool for profiling and optimizing energy and performance of mobile multiprocessor applications
, 2009
"... Abstract. For mobile multiprocessor applications, achieving high performance with low energy consumption is a challenging task. In order to help programmers to meet these design requirements, system development tools play an important role. In this paper, we describe one such development tool, ePRO ..."
Abstract. For mobile multiprocessor applications, achieving high performance with low energy consumption is a challenging task. To help programmers meet these design requirements, system development tools play an important role. In this paper, we describe one such development tool, ePRO-MP, which profiles and optimizes both the performance and energy consumption of multi-threaded applications running on top of Linux for ARM11 MPCore-based embedded systems. One of the key features of ePRO-MP is that it can accurately estimate the energy consumption of multi-threaded applications without requiring power measurement equipment, using a regression-based energy model. We also describe another key benefit of ePRO-MP, an automatic optimization function, using two example problems. Using the automatic optimization function, ePRO-MP can achieve high performance and low power consumption without programmer intervention. Our experimental results show that ePRO-MP improves performance and energy consumption by 6.1% and 4.1%, respectively, over a baseline version in the co-running-applications optimization example. For the producer-consumer application optimization example, ePRO-MP improves performance and energy consumption by 60.5% and 43.3%, respectively, over a baseline version.
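As a hedged sketch of what a regression-based energy model looks like (the counter set and coefficient names below are assumptions for illustration, not the published ePRO-MP model), energy is estimated as a linear combination of hardware performance counter values, with coefficients fitted offline against a power meter once per platform.

```c
/* Hypothetical linear energy model over performance counters:
 *   E = c0 + c1*cycles + c2*instructions + c3*cache_misses.
 * Coefficients are fitted offline per platform; at run time no
 * power measurement equipment is needed, only counter readings. */
typedef struct {
    double c0, c_cycles, c_instr, c_misses;  /* fitted regression weights */
} energy_model_t;

double estimate_energy_joules(const energy_model_t *m,
                              double cycles, double instructions,
                              double cache_misses)
{
    return m->c0
         + m->c_cycles * cycles
         + m->c_instr  * instructions
         + m->c_misses * cache_misses;
}
```

Sampling the counters per thread makes the same model usable for multi-threaded applications, which is the case the abstract emphasizes.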
Towards a Generalised Performance Analysis of Parallel Processing
"... This paper proposes a generalised model of parallel performance. Our research investigates how far a generalised approach can obtain the same detailed results as current instantiated analysis of specific parallel algorithms. Performance is decomposed into the different overhead sources, that are exp ..."
This paper proposes a generalised model of parallel performance. Our research investigates how far a generalised approach can obtain the same detailed results as current instantiated analyses of specific parallel algorithms. Performance is decomposed into the different overhead sources, which are expressed analytically, and we show how to separate the dependency on the algorithm from the dependency on the system. We present a new method to overcome reported problems with the measurability of overheads, and this detailed analysis reveals 'hidden' sources of blocking overhead due to communication or extra computation. The presented approach can lead to an automated experimental analysis in order to support performance prediction and straightforward development of parallel programs, and can be used in partitioning and load balancing algorithms. It is therefore focused on the programmer's viewpoint of parallel processing.
2. PARALLEL OVERHEAD
Our approach starts by looking for the reasons for non-ideal performance: speedup less than the number of processors. Figure 1 shows the timeline of a typical parallel program on a message-passing architecture, with its different phases: the partitioning of the work, the communication of data, the useful work of the processor's job, the synchronisation, and the induced blocking.
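As a hedged reconstruction of the decomposition this snippet alludes to (the notation is ours, not necessarily the paper's), with sequential time T_s and parallel time T_p on p processors, the total overhead and the resulting speedup can be written as:

```latex
% Notation is an assumption, not necessarily the paper's own.
% T_s: sequential time; T_p: parallel time on p processors.
T_o \;=\; p\,T_p - T_s
    \;=\; T_{\text{part}} + T_{\text{comm}} + T_{\text{sync}}
        + T_{\text{block}} + T_{\text{extra}},
\qquad
S(p) \;=\; \frac{T_s}{T_p} \;=\; \frac{p}{1 + T_o / T_s}.
```

The five terms mirror the phases listed above (partitioning, communication, synchronisation, blocking, extra computation); speedup falls below p exactly when T_o is positive.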
Impact of changes in length of stay on the demand for residential care services in England: Estimates from a dynamic microsimulation model
, 2011
"... Impact of changes in length of stay on the demand for residential care services in England: Estimates from a dynamic microsimulation model ..."
Abstract
- Add to MetaCart
(Show Context)
Impact of changes in length of stay on the demand for residential care services in England: Estimates from a dynamic microsimulation model
SUMMARY
"... Performance engineering of parallel and distributed applications is a complex task that iterates through various phases, ranging from modeling and prediction, to performance measurement, experiment management, data collection, and bottleneck analysis. There is no evidence so far that all of these ph ..."
Performance engineering of parallel and distributed applications is a complex task that iterates through various phases, ranging from modeling and prediction to performance measurement, experiment management, data collection, and bottleneck analysis. There is no evidence so far that all of these phases should or can be integrated into a single monolithic tool. Moreover, the emergence of computational Grids as a common wide-area platform for high-performance computing raises the idea of providing performance tools, among others, as interacting Grid services that share resources, support interoperability among different users and tools, and, most importantly, provide omnipresent performance functionality over the Grid. We have developed the ASKALON tool set [18] to support performance-oriented development of parallel and distributed (Grid) applications. ASKALON comprises four tools, coherently integrated into a Grid-service-based distributed architecture. SCALEA is a performance instrumentation, measurement, and analysis tool for parallel and distributed applications. ZENTURIO is a general-purpose experiment management tool with advanced support for multi-experiment performance analysis and parameter studies. AKSUM provides semi-automatic high-level performance bottleneck detection through a special-purpose performance property specification language. The Performance Prophet enables the user to model and predict the performance of parallel applications at early development stages.
EVALUATION OF ASYNCHRONOUS SYSTEMS
, 2008
"... Designing and optimizing large-scale, asynchronous circuits is often an iterative process that cycles through synthesis, simulating, benchmarking, and program rewriting. Asynchronous circuits are usually specified by high-level, sequential or concurrent programs that prescribe the intended behavior. ..."
Designing and optimizing large-scale asynchronous circuits is often an iterative process that cycles through synthesis, simulation, benchmarking, and program rewriting. Asynchronous circuits are usually specified by high-level sequential or concurrent programs that prescribe the intended behavior. The self-timed nature of the interface gives designers much freedom to refine and rewrite equivalent specifications for improved circuit synthesis. However, at any step in the design cycle one faces a vast number of choices for program rewriting; one simply cannot afford to explore all possible transformations. Informed optimizations and design-space pruning can require detailed knowledge of the run-time behavior of the program, which is what our simulation trace analysis infrastructure provides. Tracing entire simulations gives users the opportunity to understand program execution in great detail. Most importantly, trace profiling captures typical run-time behavior and input-dependent behavior that cannot always be inferred from static analysis. Profiling provides valuable feedback for optimizing both high-level transformations and low-level netlist synthesis.
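As a hedged sketch of the profiling step described here (the one-record-per-line trace format is our assumption, not the infrastructure's actual format), the C program below aggregates time per event name over an entire simulation trace to expose which events dominate typical runs.

```c
/* Trace profiling sketch: read "<event-name> <duration>" records from
 * stdin (hypothetical format) and report count and total time per event. */
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 1024

int main(void)
{
    static char names[MAX_EVENTS][64];
    static double total[MAX_EVENTS];
    static long count[MAX_EVENTS];
    int nevents = 0;
    char name[64];
    double dur;

    while (scanf("%63s %lf", name, &dur) == 2) {
        int i;
        for (i = 0; i < nevents; i++)          /* linear lookup: fine for */
            if (strcmp(names[i], name) == 0)   /* a small event alphabet  */
                break;
        if (i == nevents) {
            if (nevents == MAX_EVENTS)
                continue;                      /* table full: skip event  */
            strcpy(names[nevents++], name);
        }
        total[i] += dur;
        count[i]++;
    }
    /* Hot events first would need a sort; raw totals suffice here. */
    for (int i = 0; i < nevents; i++)
        printf("%-24s count=%ld total=%.3f\n", names[i], count[i], total[i]);
    return 0;
}
```

Aggregating over the whole simulation, rather than a static inspection of the program, is what captures the input-dependent behavior the abstract stresses.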
MODELING THE PERFORMANCE OF COMMUNICATION SCHEMES ON NETWORK TOPOLOGIES (Parallel Processing Letters)
, 2007
"... This paper investigates the influence of the interconnection network topology of a parallel system on the delivery time of an ensemble of messages, called the communication scheme. More specifically, we focus on the impact on the performance of structure in network topology and communication scheme. ..."
This paper investigates the influence of the interconnection network topology of a parallel system on the delivery time of an ensemble of messages, called the communication scheme. More specifically, we focus on how structure in the network topology and in the communication scheme impacts performance. We introduce causal structure learning algorithms for modeling the communication time. The experimental data from which the models are learned automatically is retrieved from simulations. The qualitative models provide insight into which variables influence communication performance, and how. Next, a generic property is defined that characterizes the performance of individual communication schemes and network topologies. This property allows accurate quantitative prediction of the runtime of random communication on random topologies. However, when either the communication scheme or the network topology exhibits regularities, the prediction can become very inaccurate, and the causal models can also differ qualitatively and quantitatively. Each combination of communication-scheme regularity type (e.g., a one-to-all broadcast) and network-topology regularity type (e.g., a torus) may result in a different model based on different characteristics.
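For a concrete baseline of what "delivery time of an ensemble of messages on a topology" means, here is a deliberately simplified C sketch that scores a communication scheme by per-message hop counts from breadth-first search, ignoring the link contention and structure effects that the paper's learned models capture. The model and names are our assumptions, not the paper's.

```c
/* Naive delivery-time estimate for a communication scheme (a set of
 * src->dst messages) on an arbitrary topology: longest hop count from
 * BFS times a per-hop latency L. Contention-free, so an upper-quality
 * topology/scheme model would refine this considerably. */
#include <string.h>

#define MAXN 64

/* adj[i][j] != 0 iff nodes i and j are directly linked; n <= MAXN. */
static int hops(const int adj[MAXN][MAXN], int n, int src, int dst)
{
    int dist[MAXN], queue[MAXN], head = 0, tail = 0;
    memset(dist, -1, sizeof dist);   /* -1 marks unvisited nodes */
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        if (u == dst)
            return dist[u];
        for (int v = 0; v < n; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
    }
    return -1;  /* unreachable */
}

/* With unlimited bandwidth the scheme finishes when its longest
 * message arrives, so return the maximum hop count times L. */
double delivery_time(const int adj[MAXN][MAXN], int n,
                     const int *src, const int *dst, int nmsg, double L)
{
    int worst = 0;
    for (int i = 0; i < nmsg; i++) {
        int h = hops(adj, n, src[i], dst[i]);
        if (h > worst)
            worst = h;
    }
    return worst * L;
}
```

Regular schemes such as a one-to-all broadcast on a torus are precisely where this contention-free estimate breaks down, which matches the abstract's observation that regularities demand different models.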