Results 1-6 of 6
"Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems", Proceedings of the 3rd International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks (HeteroPar'04)
Cited by 14 (3 self)
Abstract — In this paper, the possibility of including automatic optimization techniques in the design of parallel dynamic programming algorithms for heterogeneous systems is analyzed. The main idea is to automatically approach the optimum values of a number of algorithmic parameters (number of processes, number of processors, processes per processor) and thus obtain low execution times. Hence, users can be provided with routines that execute efficiently, independently of the user's experience in heterogeneous computing and dynamic programming, and that adapt automatically to a new network of processors or a new network configuration.
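The parameter search the abstract describes can be sketched as a small exhaustive loop over candidate configurations, scored by a cost model. The model below (flop rate, startup latency, oversubscription penalty) is an illustrative assumption, not the authors' model:

```python
def modeled_time(n, processes, processors, flop_rate=1e9, latency=1e-4):
    """Toy execution-time model for an n x n dynamic programming table."""
    work = n * n / flop_rate / min(processes, processors)   # parallel compute time
    overhead = latency * processes                          # startup/communication cost
    oversub = 1.0 + 0.2 * max(0, processes - processors)    # oversubscription penalty
    return work * oversub + overhead

def autotune(n, max_procs=8):
    """Try small (processes, processors) configurations, return the cheapest."""
    best = None
    for processors in range(1, max_procs + 1):
        for processes in range(1, 2 * max_procs + 1):
            t = modeled_time(n, processes, processors)
            if best is None or t < best[0]:
                best = (t, processes, processors)
    return best

t, p, q = autotune(4096)
print(f"best: {p} processes on {q} processors, modeled time {t:.4f}s")
```

In a real routine the model's coefficients would come from measurements taken when the routine is installed on the platform, which is what makes the selection independent of the user's expertise.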
Optimal Number of Nodes for Computation in Grid Environments
In: 12th EuroMicro Conf. PDP'04 (2004)
Cited by 4 (1 self)
In this paper we show that there exists an optimal number of nodes to assign to jobs for execution in grid systems, which depends on the distributions of computation and communication service times. We also show that the analytical models proposed for parallel computers are not accurate for grid systems. We therefore extend to grid environments the definitions of speedup, efficiency and efficacy that are usually given for parallel systems. We also adopt a queueing network model with three different types of workload to prove that in every case an optimal number of nodes exists and that the mean value of CPU and communication service times is just a scale factor for this optimum.
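The existence of an optimum, and its invariance under scaling of the mean service times, can be illustrated with a much simpler model than the paper's queueing network. Assume (purely for illustration) that computation time shrinks as W/n while communication grows as c·n; then T(n) = W/n + c·n has a minimum near sqrt(W/c), and multiplying both W and c by the same factor leaves the minimizer unchanged:

```python
def total_time(n, work, comm):
    """Toy grid job time: computation shrinks with n, communication grows."""
    return work / n + comm * n

def optimal_nodes(work, comm, max_nodes=64):
    """Pick the node count that minimizes the modeled total time."""
    return min(range(1, max_nodes + 1), key=lambda n: total_time(n, work, comm))

print(optimal_nodes(100.0, 1.0))   # minimum near sqrt(100/1) = 10
print(optimal_nodes(500.0, 5.0))   # same ratio, so the same optimum
```

Scaling both service times by 5 changes every T(n) by the same factor but not which n wins, mirroring the paper's "scale factor" observation.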
Tuning Basic Linear Algebra Routines for Hybrid CPU+GPU Platforms
Abstract — The introduction of autotuning techniques in linear algebra routines using hybrid combinations of multiple CPU and GPU computing resources is analyzed. Basic models of the execution time, together with information obtained during the installation of the routines, are used to optimize the execution time through a balanced assignment of the work to the computing components in the system. The study is carried out with a basic kernel (matrix-matrix multiplication) and a higher-level routine (LU factorization) using GPUs and the host multicore processor. Satisfactory results are obtained, with experimental execution times close to the lowest experimentally achievable.
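The balanced assignment can be sketched as follows: if installation-time benchmarks give each device a throughput, split the rows so both devices finish at the same moment. The throughput figures below are made-up placeholders, not measurements from the paper:

```python
def balanced_split(total_rows, cpu_rate, gpu_rate):
    """Rows for the CPU so that rows_cpu/cpu_rate == rows_gpu/gpu_rate,
    i.e. both devices finish their share of the multiplication together."""
    cpu_rows = round(total_rows * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_rows, total_rows - cpu_rows

# hypothetical rates (rows/second) obtained during routine installation
cpu_rows, gpu_rows = balanced_split(10000, cpu_rate=50.0, gpu_rate=450.0)
print(cpu_rows, gpu_rows)   # 1000 9000
```

With a 9x faster GPU the CPU keeps a 10% slice of the work; giving it nothing would waste the host cores, and giving it more would leave the GPU idle at the end.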
Author manuscript, published in "17th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing", 2008
On the use of performance models for adaptive algorithm selection on heterogeneous clusters
Processes Distribution of Homogeneous Parallel Linear Algebra Routines on Heterogeneous Clusters
This paper presents a self-optimization methodology for parallel linear algebra routines on heterogeneous systems. For each routine, a series of decisions is taken automatically in order to obtain an execution time close to the optimum (without rewriting the routine's code). Some of these decisions are: the number of processes to generate, the heterogeneous distribution of these processes over the network of processors, and the logical topology of the generated processes, among others. To reduce the search space of such decisions, different heuristics have been used. The experiments have been performed with a parallel LU factorization routine similar to the ScaLAPACK one, and good results have been obtained on different heterogeneous platforms.
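One plausible heuristic of the kind the abstract mentions is to apportion processes to processors in proportion to their relative speeds, so faster machines host more processes. This sketch uses largest-remainder rounding; the speed scores are hypothetical benchmark values, not data from the paper:

```python
def distribute_processes(num_processes, speeds):
    """Apportion processes proportionally to speeds (largest-remainder method)."""
    total = sum(speeds)
    shares = [num_processes * s / total for s in speeds]   # ideal fractional shares
    counts = [int(x) for x in shares]                      # floor each share
    leftovers = num_processes - sum(counts)
    # hand remaining processes to the largest fractional remainders
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftovers]:
        counts[i] += 1
    return counts

# three processors, the third twice as fast as the others (hypothetical scores)
print(distribute_processes(8, [1.0, 1.0, 2.0]))   # [2, 2, 4]
```

A heuristic like this shrinks the search to a single candidate per process count, instead of examining every possible assignment of processes to machines.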
Performance Prediction through Time Measurements
Abstract. In this article we address the problem of predicting the performance of linear algebra algorithms for small matrices. The approach is based on reducing performance prediction to modeling the execution time of the algorithms. The execution time of higher-level algorithms such as the LU factorization is predicted by modeling the computational time of the kernel linear algebra operations, such as the BLAS subroutines. As the time measurements confirmed, the execution time of the BLAS subroutines has a piecewise-polynomial behavior. Therefore, the subroutines' time is modeled by taking only a few samples and then applying polynomial interpolation. The approach is validated by comparing the predicted execution time of the unblocked LU factorization, which is built on top of two BLAS subroutines, with the separately measured one. The applicability of the approach is illustrated through performance experiments on Intel and AMD processors.
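The sample-then-interpolate idea can be sketched with Lagrange interpolation: time a kernel at a few sizes and evaluate the interpolating polynomial elsewhere. Here the "measured" times come from a synthetic quadratic cost function standing in for a real BLAS call, so three samples recover it exactly within one polynomial piece:

```python
def lagrange_interpolate(samples, x):
    """Evaluate the Lagrange polynomial through (xi, yi) samples at x."""
    result = 0.0
    for i, (xi, yi) in enumerate(samples):
        term = yi
        for j, (xj, _) in enumerate(samples):
            if i != j:
                term *= (x - xj) / (xi - xj)
        result += term
    return result

def fake_kernel_time(n):
    """Stand-in for a timed BLAS call: quadratic cost plus fixed overhead."""
    return 2e-9 * n * n + 1e-6

# only three samples are needed to pin down a quadratic
samples = [(n, fake_kernel_time(n)) for n in (100, 500, 1000)]
predicted = lagrange_interpolate(samples, 750)
print(abs(predicted - fake_kernel_time(750)) < 1e-9)   # True
```

In the piecewise setting of the paper, one would detect the breakpoints (e.g. where cache effects change the behavior) and fit a separate low-degree polynomial per piece.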