## Matrix Product on Heterogeneous Master-Worker Platforms

Citations: 7 (6 self)

### BibTeX

```bibtex
@MISC{Dongarra_matrixproducton,
  author = {Jack Dongarra and Jean-François Pineau and Yves Robert and Frédéric Vivien},
  title  = {Matrix Product on Heterogeneous Master-Worker Platforms},
  year   = {}
}
```

### Abstract

This paper focuses on designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While the matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), three key hypotheses make our work original and innovative:

- Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes data and computations to the workers (whereas in ScaLAPACK, input and output matrices are assumed to be equally distributed among the participating resources beforehand). Typically, our approach is useful for speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files).
- Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers and the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources.
- Limited memory. As we investigate the parallelization of large problems, we cannot assume that full matrix column blocks can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK).

We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (for both input and result messages), and we report a set of numerical experiments on a platform at our site. The experiments show that our matrix-product algorithm achieves smaller execution times than existing ones while also using fewer resources.

### Citations

352 | ScaLAPACK Users' Guide
- Blackford, Choi, et al.
- 1997
Citation Context: ...kernel in many scientific applications and it has been extensively studied on parallel architectures. Two well-known parallel versions are Cannon's algorithm [6] and the ScaLAPACK outer product algorithm [5]. Typically, parallel implementations work well on 2D processor grids, because the input matrices are sliced horizontally and vertically into square blocks that are mapped one-to-one onto the physical...

291 | Automated empirical optimizations of software and the ATLAS project
- Whaley, Petitet, et al.
- 2000
Citation Context: ...coefficients but instead square blocks of size q × q (hence with q^2 coefficients). This is to harness the power of Level 3 BLAS routines [5]. Typically, q = 80 or 100 when using ATLAS-generated routines [7]. • The input matrix A is of size nA × nAB: we split A into r horizontal stripes Ai, 1 ≤ i ≤ r, where r = nA/q; we split each stripe Ai into t square q × q blocks Ai,k, 1 ≤ k ≤ t, where t = nAB/q...
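The stripe-and-block splitting described in this excerpt is easy to sketch in Python. The helper below is illustrative (the function name and toy matrix are not from the paper), assuming nA and nAB are multiples of the block size q:

```python
def split_into_blocks(A, q):
    """Split an nA x nAB matrix (a list of rows) into blocks[i][k]:
    r = nA/q horizontal stripes, each cut into t = nAB/q square
    q x q blocks, as in the citation context above."""
    nA, nAB = len(A), len(A[0])
    assert nA % q == 0 and nAB % q == 0, "dimensions must be multiples of q"
    r, t = nA // q, nAB // q
    return [[[row[k * q:(k + 1) * q] for row in A[i * q:(i + 1) * q]]
             for k in range(t)]
            for i in range(r)]

# A 2 x 4 matrix gives r = 1 stripe of t = 2 blocks of size 2 x 2.
A = [[0, 1, 2, 3],
     [4, 5, 6, 7]]
blocks = split_into_blocks(A, 2)
# blocks[0][0] is [[0, 1], [4, 5]]; blocks[0][1] is [[2, 3], [6, 7]]
```

In the paper's setting these q × q blocks, rather than single coefficients, are the unit of both communication and computation, so that workers can apply Level 3 BLAS routines to each block product.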

166 | I/O complexity: The red-blue pebble game
- Hong, Kung
- 1981
Citation Context: ...mentioned, the design of parallel algorithms for limited memory processors is very similar to the design of out-of-core routines for classical parallel machines. On the theoretical side, Hong and Kung [9] investigate the I/O complexity of several computational kernels in their pioneering paper. Toledo [17] proposes a nice survey on the design of out-of-core algorithms for linear algebra, including dense...

130 | A cellular computer to implement the Kalman filter algorithm
- Cannon
- 1969
Citation Context: ...Matrix product is a key computational kernel in many scientific applications and it has been extensively studied on parallel architectures. Two well-known parallel versions are Cannon's algorithm [6] and the ScaLAPACK outer product algorithm [5]. Typically, parallel implementations work well on 2D processor grids, because the input matrices are sliced horizontally and vertically into square blocks...

130 | Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems
- Maheswaran, Ali, et al.
- 1999
Citation Context: ..., this work will serve as the baseline reference. Then we will study hybrid algorithms, i.e., algorithms which use our memory layout and are based on classical principles such as round-robin, min-min [13], or a dynamic demand-driven approach. The first six algorithms below use our memory allocation; the only difference between them is the order in which the master sends blocks to workers. Homogeneous...

80 | Scheduling strategies for master-slave tasking on heterogeneous processor platforms
- Banino, Beaumont, et al.
- 2004
Citation Context: ...objective is to maximize the amount of work performed per time-unit. Altogether, we gather the linear program presented in Figure 1. The optimal solution for this system is a bandwidth-centric strategy [1]: we sort workers by non-decreasing values of 2ci and we enroll them as long as ∑i 2ci/wi ≤ 1. In this way, we can achieve the throughput ρ ≈ ∑i enrolled 1/wi. This solution seems to be close...
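The bandwidth-centric selection quoted in this excerpt can be sketched as follows, with ci the communication time and wi the computation time per block for worker i. The function name and the sample platform are illustrative, not from the paper:

```python
def bandwidth_centric_selection(workers):
    """Sort workers by non-decreasing 2*c_i and enroll them while
    sum(2*c_i / w_i) <= 1; the achieved throughput is then roughly
    sum(1 / w_i) over the enrolled workers, as stated above.
    workers: list of (c_i, w_i) pairs."""
    enrolled, load = [], 0.0
    for c, w in sorted(workers, key=lambda cw: 2 * cw[0]):
        if load + 2 * c / w > 1:
            break  # next worker would saturate the master's port
        load += 2 * c / w
        enrolled.append((c, w))
    throughput = sum(1 / w for _, w in enrolled)
    return enrolled, throughput

# Illustrative platform: the third worker would exceed the bound.
selected, rho = bandwidth_centric_selection([(1.0, 10.0), (2.0, 8.0), (4.0, 5.0)])
# selected is [(1.0, 10.0), (2.0, 8.0)] and rho is 0.225
```

The factor 2 reflects that each block product requires two communications (input and result), so a worker occupies the master's port for 2ci time units per wi units of useful work.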

72 | Efficient collective communication in distributed heterogeneous systems
- Bhat, Raghavendra, et al.
- 1999
Citation Context: ...the master has no processing capability (otherwise, add a fictitious extra worker paying no communication cost to simulate computation at the master). For the communication model, we adopt the one-port model [4], which is defined as follows: 1) the master can only send data to, and receive data from, a single worker at a given time-step; 2) a given worker cannot start execution before it has terminated the r...
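A toy illustration of the one-port model described in this excerpt, assuming the master sends one input block to each worker in turn (the function name and the timings are hypothetical):

```python
def one_port_finish_times(comm_times, comp_times):
    """Under the one-port model, the master's sends are serialized:
    worker i receives its data only after all earlier transfers have
    completed, and (per rule 2 above) starts computing only once its
    own reception has terminated."""
    finish, clock = [], 0.0
    for c, w in zip(comm_times, comp_times):
        clock += c                # master is busy sending to this worker
        finish.append(clock + w)  # worker then computes for w time units
    return finish

# Three workers: transfer times 1, 2, 1 and compute times 5, 3, 4.
times = one_port_finish_times([1.0, 2.0, 1.0], [5.0, 3.0, 4.0])
# times is [6.0, 6.0, 8.0]: the last worker waits for both earlier sends
```

This serialization is why the communication ordering chosen by the master matters as much as the set of enrolled workers.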

59 | A survey of out-of-core algorithms in numerical linear algebra
- Toledo
- 1999
Citation Context: ...introduce notations. Next, in Section 3, we proceed with the analysis of the total communication volume that is needed in the presence of memory constraints, and we improve a well-known bound by Toledo [17, 10]. In order to help the reader apprehend the solution for heterogeneous platforms, we first deal with homogeneous platforms in Section 4, and we propose a scheduling algorithm that includes resource se...

49 | A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers)
- Beaumont, Boudet, et al.
Citation Context: ...as a same matrix element may have to be sent several times to a same processor. In this paper, we are not interested in adapting 2D processor grid strategies to heterogeneous clusters, as proposed in [11, 2, 3]. Instead, we adopt a realistic application scenario, where input files are read from a fixed repository (such as a disk on a data server). Computations are delegated to available computational resour...

48 | Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems
- Zhuo, Prasanna
- 2007
Citation Context: ...implementations are introduced and analyzed by Dongarra et al. [8]. A similar thread of work, although in a different context, deals with reconfigurable architectures, either pipelined bus systems [12], or FPGAs [18]. In the latter approach, tradeoffs must be found to optimize the size of the on-chip memory and the available memory bandwidth, leading to partitioned algorithms that re-use data intensively. Please ...

38 | Understanding the behavior and performance of non-blocking communications in MPI
- Saif, Parashar
- 2004
Citation Context: ...ations, they claim that all these operations “are eventually serialized by the single hardware port to the network.” Experimental evidence of this fact has recently been reported by Saif and Parashar [16], who report that asynchronous MPI sends get serialized as soon as message sizes exceed a hundred kilobytes. Their results hold for two popular MPI implementations, MPICH on Linux clusters and IBM ...

36 | Matrix Multiplication on Heterogeneous Platforms
- Beaumont, Boudet, et al.
Citation Context: ...as a same matrix element may have to be sent several times to a same processor. In this paper, we are not interested in adapting 2D processor grid strategies to heterogeneous clusters, as proposed in [11, 2, 3]. Instead, we adopt a realistic application scenario, where input files are read from a fixed repository (such as a disk on a data server). Computations are delegated to available computational resour...

22 | Parallel matrix multiplication on a linear array with a reconfigurable pipelined bus system
- Li, Pan
- 1999
Citation Context: ...most implementations are introduced and analyzed by Dongarra et al. [8]. A similar thread of work, although in a different context, deals with reconfigurable architectures, either pipelined bus systems [12], or FPGAs [18]. In the latter approach, tradeoffs must be found to optimize the size of the on-chip memory and the available memory bandwidth, leading to partitioned algorithms that re-use data inten...

18 | Key concepts for parallel out-of-core LU factorization
- Dongarra, Hammarling, et al.
- 1996
Citation Context: ...including dense and sparse computations. We refer to [17] for a complete list of implementations. The design principles followed by most implementations are introduced and analyzed by Dongarra et al. [8]. A similar thread of work, although in a different context, deals with reconfigurable architectures, either pipelined bus systems [12], or FPGAs [18]. In the latter approach, tradeoffs must be found ...

13 | Heterogeneous distribution of computations solving linear algebra problems on networks of heterogeneous computers
- Kalinov, Lastovetsky
Citation Context: ...as a same matrix element may have to be sent several times to a same processor. In this paper, we are not interested in adapting 2D processor grid strategies to heterogeneous clusters, as proposed in [11, 2, 3]. Instead, we adopt a realistic application scenario, where input files are read from a fixed repository (such as a disk on a data server). Computations are delegated to available computational resour...

8 | Communication lower bounds for distributed-memory matrix multiplication
- Irony, Toledo, et al.
Citation Context: ...introduce notations. Next, in Section 3, we proceed with the analysis of the total communication volume that is needed in the presence of memory constraints, and we improve a well-known bound by Toledo [17, 10]. In order to help the reader apprehend the solution for heterogeneous platforms, we first deal with homogeneous platforms in Section 4, and we propose a scheduling algorithm that includes resource se...

2 | Revisiting matrix product on master-worker platforms
- Pineau, Robert, et al.
- 2006
Citation Context: ...size 8000 × 8000 for A and 8000 × 320000 for B. The results of these experiments are summarized in Figure 8. The results on the actual platform are similar to those obtained on homogeneous platforms [14]. All the algorithms but BMM have similar makespan. All algorithms making resource selection use eleven workers among the twenty available, which explains why they achieve similar relative work. The r...