Results 1 - 6 of 6
Matrix-Product on Heterogeneous Master-Worker Platforms
Abstract

Cited by 7 (6 self)
This paper is focused on designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative: Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes data and computations to the workers (while in ScaLAPACK, input and output matrices are supposed to be equally distributed among participating resources beforehand). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files). Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources. Limited memory. As we investigate the parallelization of large problems, we cannot assume that full matrix column blocks can be stored in the worker memories and be reused for subsequent updates (as in ScaLAPACK). We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on a platform at our site. The experiments show that our matrix-product algorithm has smaller execution times than existing ones, while it also uses fewer resources.
Revisiting matrix product on master-worker platforms
, 2006
Abstract

Cited by 2 (2 self)
This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are three key hypotheses that render our work original and innovative: Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes both data and computations to the workers (while in ScaLAPACK, input and output matrices are initially distributed among participating resources). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files). Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources. Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and reused for subsequent updates (as in ScaLAPACK). The amount of memory available in each worker is expressed as a given number m_i of buffers, where a buffer can store a square block of matrix elements. The size q of these square blocks is chosen so as to harness the power of Level 3 BLAS routines: q = 80 or 100 on most platforms. We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at École Normale Supérieure de Lyon and the University of Tennessee. However, we point out that in this first version of the report, experiments are limited to homogeneous platforms.
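The buffer-based memory model in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: it only shows a product computed one q x q block at a time, so that a worker never needs more than three block buffers at once (one each for A, B and C, which is an assumption of this sketch). The function names `buffer_bytes`, `get_block`, `block_update` and `blocked_product` are hypothetical, and 8-byte matrix elements are assumed.

```python
def buffer_bytes(m_i, q, elem_size=8):
    """Memory needed for m_i buffers, each holding one q x q block.

    elem_size=8 assumes double-precision elements (an assumption).
    """
    return m_i * q * q * elem_size


def get_block(M, bi, bj, q):
    """Copy the q x q block at block coordinates (bi, bj) into a buffer."""
    return [row[bj * q:(bj + 1) * q] for row in M[bi * q:(bi + 1) * q]]


def block_update(C_blk, A_blk, B_blk):
    """In-place update C_blk += A_blk * B_blk on q x q buffers."""
    q = len(C_blk)
    for i in range(q):
        for k in range(q):
            a = A_blk[i][k]
            for j in range(q):
                C_blk[i][j] += a * B_blk[k][j]


def blocked_product(A, B, q):
    """Return C = A * B for n x n matrices, with n a multiple of q.

    Only three q x q buffers are live at any time: one block each
    of A, B and C.
    """
    n = len(A)
    assert n % q == 0, "n must be a multiple of the block size q"
    nb = n // q
    C = [[0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):
            C_blk = [[0] * q for _ in range(q)]
            for bk in range(nb):
                block_update(C_blk,
                             get_block(A, bi, bk, q),
                             get_block(B, bk, bj, q))
            # Write the finished block back into the result matrix.
            for i in range(q):
                C[bi * q + i][bj * q:(bj + 1) * q] = C_blk[i]
    return C
```

Under these assumptions, three buffers with q = 80 occupy `buffer_bytes(3, 80)` = 153600 bytes. In practice the per-block update would be delegated to a Level 3 BLAS routine rather than written as explicit loops.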
1 LIP, CNRS-ENS Lyon-INRIA-UCBL
Abstract
This paper is aimed at designing efficient parallel matrix-product algorithms for homogeneous master-worker platforms. While matrix-product is well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm and ScaLAPACK outer product algorithm), there are two key hypotheses that render our work original and innovative: Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes both data and computations to the workers (while in ScaLAPACK, input and output matrices are initially distributed among participating resources). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files). Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and reused for subsequent updates (as in ScaLAPACK). The amount of memory available in each worker is expressed as a given number of buffers, where a buffer can store a square block of matrix elements. These square blocks are chosen so as to harness the power of Level 3 BLAS routines; they are of size 80 or 100 on most platforms. We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of MPI experiments conducted on a platform at the University of Tennessee. 1-4244-0910-1/07/$20.00 ©2007 IEEE.
unknown title
Abstract
Complexity analysis and performance evaluation of matrix product on multicore architectures
Multicore platform, Matrix product, Cache misses, Cache-aware algorithms.
unknown title
, 2009
Abstract
Complexity analysis and performance evaluation of matrix product on multicore architectures