## Revisiting matrix product on master-worker platforms (2006)

### Download Links

- [graal.ens-lyon.fr]
- [hal.inria.fr]
- [www.netlib.org]
- DBLP

### Other Repositories/Bibliography

Citations: 2 (2 self)

### BibTeX

@TECHREPORT{Pineau06revisitingmatrix,
  author      = {Jean-François Pineau and Yves Robert and Frédéric Vivien and Zhiao Shi and Jack Dongarra},
  title       = {Revisiting matrix product on master-worker platforms},
  institution = {},
  year        = {2006}
}

### Abstract

This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While the matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), three key hypotheses render our work original and innovative:

- Centralized data. We assume that all matrix files originate from, and must be returned to, the master. The master distributes both data and computations to the workers (whereas in ScaLAPACK, input and output matrices are initially distributed among participating resources). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files).
- Heterogeneous star-shaped platforms. We target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources.
- Limited memory. Because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and re-used for subsequent updates (as in ScaLAPACK). The amount of memory available in each worker is expressed as a given number mi of buffers, where a buffer can store a square block of matrix elements. The size q of these square blocks is chosen so as to harness the power of Level 3 BLAS routines: q = 80 or 100 on most platforms.

We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at École Normale Supérieure de Lyon and the University of Tennessee. However, we point out that in this first version of the report, experiments are limited to homogeneous platforms.
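A minimal sketch of the blocking scheme the abstract describes, assuming matrix dimensions that are exact multiples of q (the function name is mine, not from the report): matrices are handled as q × q blocks, so each worker's memory is simply a count of buffers, one per block.

```python
def split_into_blocks(M, q):
    """Split a matrix (list of equal-length rows) into a grid of
    q-by-q blocks; both dimensions are assumed to be multiples of q,
    as in the report's setup."""
    rows, cols = len(M), len(M[0])
    assert rows % q == 0 and cols % q == 0
    return [[[row[j:j + q] for row in M[i:i + q]]
             for j in range(0, cols, q)]
            for i in range(0, rows, q)]

# Toy example: a 4x4 matrix handled as four 2x2 blocks. The report
# uses q = 80 or 100 so that each block update runs at Level 3 BLAS
# speed, and a worker with m_i buffers can hold m_i such blocks.
A = [[4 * r + c for c in range(4)] for r in range(4)]
blocks = split_into_blocks(A, 2)
# blocks[1][1] is the bottom-right block: [[10, 11], [14, 15]]
```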

### Citations

8833 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1990
(Show Context)
Citation Context ...The following lemma is given in [38]: consider any algorithm that uses the standard way of multiplying matrices (this excludes Strassen’s or Winograd’s algorithm =-=[19]-=-, for instance). If NA elements of A, NB elements of B and NC elements of C are accessed, then no more than K computations can be done, where K = min{(NA + NB)√NC, (NA + NC)√NB, (NB + NC)√NA}... |
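The lemma quoted in this context is easy to evaluate numerically; a hedged sketch (the function name is mine):

```python
from math import sqrt

def max_computations(n_a, n_b, n_c):
    """Bound quoted above from Toledo's survey: if n_a elements of A,
    n_b of B and n_c of C are accessed, a standard (non-Strassen)
    matrix product performs at most
    K = min((n_a+n_b)*sqrt(n_c), (n_a+n_c)*sqrt(n_b),
            (n_b+n_c)*sqrt(n_a)) computations."""
    return min((n_a + n_b) * sqrt(n_c),
               (n_a + n_c) * sqrt(n_b),
               (n_b + n_c) * sqrt(n_a))

# Sanity check: holding one full q x q block of each matrix
# (n_a = n_b = n_c = q**2) bounds the work by 2*q**3, i.e. within a
# factor of two of the q**3 multiply-adds of one block update.
q = 100
bound = max_computations(q * q, q * q, q * q)   # 2000000.0
```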

387 | Automatically tuned linear algebra software
- Whaley, Dongarra
- 1998
(Show Context)
Citation Context ...icients but instead square blocks of size q × q (hence with q² coefficients). This is to harness the power of Level 3 BLAS routines [12]. Typically, q = 80 or 100 when using ATLAS-generated routines =-=[40]-=-. The input matrix A is of size nA × nAB: - we split A into r horizontal stripes Ai, 1 ≤ i ≤ r, where r = nA/q; - we split each stripe Ai into t square q × q blocks Ai,k, 1 ≤ k ≤ t, where t = nAB/q. ... |

363 | ScaLAPACK User’s Guide
- Blackford, Choi, et al.
- 1997
(Show Context)
Citation Context ... in many scientific applications, and it has been extensively studied on parallel architectures. Two well-known parallel versions are Cannon’s algorithm [14] and the ScaLAPACK outer product algorithm =-=[13]-=-. Typically, parallel implementations work well on 2D processor grids, because the input matrices are sliced horizontally and vertically into square blocks that are mapped one-to-one onto the physical... |

173 | I/O complexity: The red-blue pebble game
- Hong, Kung
- 1981
(Show Context)
Citation Context ...entioned, the design of parallel algorithms for limited memory processors is very similar to the design of out-of-core routines for classical parallel machines. On the theoretical side, Hong and Kung =-=[26]-=- investigate the I/O complexity of several computational kernels in their pioneering paper. Toledo [38] proposes a nice survey on the design of out-of-core algorithms for linear algebra, including den... |

155 | ScaLAPACK: a portable linear algebra library for distributed memory computers–design issues and performance
- Choi, Demmel, et al.
- 1996
(Show Context)
Citation Context ...proach. The atomic elements that we manipulate are not matrix coefficients but instead square blocks of size q × q (hence with q² coefficients). This is to harness the power of Level 3 BLAS routines =-=[12]-=-. Typically, q = 80 or 100 when using ATLAS-generated routines [40]. The input matrix A is of size nA × nAB: - we split A into r horizontal stripes Ai, 1 ≤ i ≤ r, where r = nA/q; - we split each stri... |

138 | Dynamic Matching and Scheduling of a Class of Independent Tasks onto Heterogenous Computing Systems
- Maheswaran, Ali, et al.
- 1999
(Show Context)
Citation Context ...are communication slots, and enroll a new worker (and send blocks to it) only if this does not delay previously enrolled workers. Min-min: This algorithm is based on the well-known min-min heuristic =-=[33]-=-. At each step, all tasks are considered. For each of them, we compute their possible starting date on each worker, given the files that have already been sent to this worker and all decisions taken p... |
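As a generic illustration of the min-min heuristic this context mentions (not the report's exact variant, which also accounts for already-sent files and communication slots; task costs and worker speeds below are illustrative): repeatedly commit the (task, worker) pair with the earliest achievable completion time.

```python
def min_min_schedule(tasks, workers):
    """tasks: {name: cost}; workers: {name: speed}. Repeatedly commit
    the (task, worker) pair with the earliest completion time, which
    is the core idea of the min-min heuristic."""
    ready_at = {w: 0.0 for w in workers}   # instant each worker is free
    remaining = dict(tasks)
    schedule = []
    while remaining:
        # Earliest completion time over all remaining (task, worker) pairs.
        finish, task, worker = min(
            (ready_at[w] + cost / speed, t, w)
            for t, cost in remaining.items()
            for w, speed in workers.items())
        ready_at[worker] = finish
        del remaining[task]
        schedule.append((task, worker, finish))
    return schedule

# Illustrative run: three tasks, a fast and a slow worker.
schedule = min_min_schedule({"t1": 4.0, "t2": 2.0, "t3": 2.0},
                            {"fast": 2.0, "slow": 1.0})
# All three tasks end up on the fast worker; the makespan is 4.0.
```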

136 | A cellular computer to implement the Kalman Filter Algorithm
- Cannon
- 1969
(Show Context)
Citation Context ...on Matrix product is a key computational kernel in many scientific applications, and it has been extensively studied on parallel architectures. Two well-known parallel versions are Cannon’s algorithm =-=[14]-=- and the ScaLAPACK outer product algorithm [13]. Typically, parallel implementations work well on 2D processor grids, because the input matrices are sliced horizontally and vertically into square bloc... |

85 | Stochastic scheduling
- Schopf, Berman
- 1999
(Show Context)
Citation Context ...rker techniques or paradigms based upon the idea “use the past to predict the future”, i.e. use the currently observed speed of computation of each machine to decide for the next distribution of work =-=[17, 18, 9]-=-. Dynamic strategies such as self-guided scheduling [34] could be useful too. There is a challenge in determining a trade-off between the ... |

84 | Scheduling strategies for master-slave tasking on heterogeneous processor platforms
- Banino, Beaumont, et al.
(Show Context)
Citation Context ...tion for yi is yi = 2xi/µi, so the problem can be reduced to: Maximize Σi xi subject to ∀i, xi ≤ 1/wi and Σi (2ci/µi) xi ≤ 1. The optimal solution for this system is a bandwidth-centric strategy =-=[8, 3]-=-; we sort workers by non-decreasing values of 2ci/µi and we enroll them as long as Σ 2ci/(µiwi) ≤ 1. In this way, we can achieve the throughput ρ ≈ Σi enrolled 1/wi. This solution seems to be close to... |
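Under the excerpt's notation (ci, µi, wi are per-worker communication, memory, and computation parameters; their exact definitions are in the report), the bandwidth-centric selection can be sketched as a greedy loop. The numbers below are illustrative only.

```python
def bandwidth_centric(workers):
    """workers: list of (c, mu, w) tuples per worker. As in the
    excerpt above: sort by non-decreasing 2c/mu and enroll greedily
    while the master's communication budget sum(2c/(mu*w)) stays <= 1;
    the throughput is then roughly sum(1/w) over enrolled workers."""
    enrolled, used, throughput = [], 0.0, 0.0
    for c, mu, w in sorted(workers, key=lambda t: 2 * t[0] / t[1]):
        share = 2 * c / (mu * w)
        if used + share > 1:
            break                      # later workers are even costlier
        used += share
        enrolled.append((c, mu, w))
        throughput += 1 / w
    return enrolled, throughput

# Illustrative platform: three workers; only the first two fit within
# the master's communication budget.
enrolled, rho = bandwidth_centric([(1, 4, 1), (1, 2, 2), (3, 2, 1)])
```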

78 | An enabling framework for master-worker applications on the computational grid
- Goux, Kulkarni, et al.
- 2000
(Show Context)
Citation Context ...imultaneous communications for a given node is not bounded. This approach has also been studied in [25]. Enabling frameworks to facilitate the implementation of master-worker tasking are described in =-=[23, 39]-=-. 10 Conclusion The main contributions of this paper are the following: 1. On the theoretical side, we have derived a new, tighter, bound on the minimal volume of communications needed to multiply two... |

74 | Customized dynamic load balancing for a network of workstations
- Zaki, Li, et al.
- 1996
(Show Context)
Citation Context ...rker techniques or paradigms based upon the idea “use the past to predict the future”, i.e. use the currently observed speed of computation of each machine to decide for the next distribution of work =-=[17, 18, 9]-=-. Dynamic strategies such as self-guided scheduling [34] could be useful too. There is a challenge in determining a trade-off between the ... |

73 | Efficient collective communication in distributed heterogeneous systems
- Bhat, Raghavendra, et al.
- 2003
(Show Context)
Citation Context ... capability (otherwise, add a fictitious extra worker paying no communication cost to simulate computation at the master). Next, we need to define the communication model. We adopt the one-port model =-=[10, 11]-=-, which is defined as follows: the master can only send data to, and receive data from, a single worker at a given time-step, ... |

60 | A Survey of Out-of-Core Algorithms in Numerical Linear Algebra
- Toledo
- 1999
(Show Context)
Citation Context ...cheduling problem. Next, in Section 4, we proceed with the analysis of the total communication volume that is needed in the presence of memory constraints, and we improve a well-known bound by Toledo =-=[38, 27]-=-. We deal with homogeneous platforms in Section 5, and we propose a scheduling algorithm that includes resource selection. Section 6 is the counterpart for heterogeneous platforms, but the algorithms ... |

56 | Efficient collective communication on heterogeneous networks of workstations
- Banikazemi, Moorthy, et al.
- 1998
(Show Context)
Citation Context ...e-port model. The one-port model fully accounts for the heterogeneity of the platform, as each link has a different bandwidth. It generalizes a simpler model studied by Banikazemi, Moorthy, and Panda =-=[1]-=-, Liu [32], and Khuller and Kim [30]. In this simpler model, the communication time only depends on the sender, not on the receiver. In other words, the communication speed from a processor to all its... |

54 | Adaptive scheduling for master-worker applications on the computational grid
- Heymann, Senar, et al.
- 2000
(Show Context)
Citation Context ...owing for fully simultaneous communications. Master-worker on the computational grid – Master-worker scheduling on the grid can be based on a network-flow approach [37, 36] or on an adaptive strategy =-=[24]-=-. Note that the network-flow approach of [37, 36] is possible only when using a full multiple-port model, where the number of simultaneous communications for a given node is not bounded. This approach ... |

53 | A Proposal for a Heterogeneous Cluster ScaLAPACK (Dense Linear Solvers
- Beaumont, Boudet, et al.
- 2001
(Show Context)
Citation Context ...proportional to its relative computing speed. There are many processor arrangements to consider, and determining the optimal one is a highly combinatorial problem, which has been proven NPcomplete in =-=[5]-=-. In fact, because of the geometric constraints imposed by the 2D processor grid, a perfect load-balancing can only be achieved in some very particular cases. The second approach is to relax the geome... |

52 | Scalable and modular algorithms for floating-point matrix multiplication on FPGAs
- Zhuo, Prasanna
- 2004

51 | Message multicasting in heterogeneous networks
- Bar-Noy, Guha, et al.
- 2000
(Show Context)
Citation Context ...l, the communication time only depends on the sender, not on the receiver. In other words, the communication speed from a processor to all its neighbors is the same. Finally, we note that some papers =-=[2, 4]-=- depart from the one-port model as they allow a sending processor to initiate another communication while a previous one is still on-going on the network. However, such models insist that there is an ... |

51 | Compiler optimizations for enhancing parallelism and their impact on architecture design
- Polychronopoulos
- 1988
(Show Context)
Citation Context ...of computation of each machine to decide for the next distribution of work [17, 18, 9]. Dynamic strategies such as self-guided scheduling =-=[34]-=- could be useful too. There is a challenge in determining a trade-off between the data distribution parameters and the process spawning and possible migration policies. Redundant computations might al... |

44 | Master/Slave Computing on the Grid
- Shao, Wolski, et al.
- 2000
(Show Context)
Citation Context ...other operation, so they are not allowing for fully simultaneous communications. Master-worker on the computational grid – Master-worker scheduling on the grid can be based on a network-flow approach =-=[37, 36]-=- or on an adaptive strategy [24]. Note that the network-flow approach of [37, 36] is possible only when using a full multiple-port model, where the number of simultaneous communications for a given nod... |

41 | Matrix Multiplication on Heterogeneous Platforms
- Beaumont, Boudet, et al.
(Show Context)
Citation Context ...mented sequentially. With this hypothesis, minimizing the total communication cost amounts to minimizing the total communication volume. Unfortunately, this problem has been shown NP-complete as well =-=[6]-=-. Note that even under the optimistic assumption that all communications at a given step of the algorithm can take place in parallel, the problem remains NP-complete [7]. In this paper, we do not try ... |

39 | Understanding the behavior and performance of non-blocking communications in MPI
- Saif, Parashar
- 2004
(Show Context)
Citation Context ...ations, they claim that all these operations “are eventually serialized by the single hardware port to the network.” Experimental evidence of this fact has recently been reported by Saif and Parashar =-=[35]-=-, who report that asynchronous MPI sends get serialized as soon as message sizes exceed a hundred kilobytes. Their results hold for two popular MPI implementations, MPICH on Linux clusters and IBM MPI ... |

37 | Block data decomposition for data-parallel programming on a heterogeneous workstation network
- Crandall, Quinn
- 1993
(Show Context)
Citation Context ...ithin each processor column independently; next the load is balanced between columns; this is the “heterogeneous block cyclic distribution” of [29]. Another approach is proposed by Crandall and Quinn =-=[20]-=-, who propose a recursive partitioning algorithm, and by Kaddoura, Ranka and Wang [28], who refine the latter algorithm and provide several variations. They report several numerical simulations. As po... |

35 | Compile-time scheduling algorithms for heterogeneous network of workstations
- Cierniak, Li, et al.
- 1997
(Show Context)
Citation Context ...rker techniques or paradigms based upon the idea “use the past to predict the future”, i.e. use the currently observed speed of computation of each machine to decide for the next distribution of work =-=[17, 18, 9]-=-. Dynamic strategies such as self-guided scheduling [34] could be useful too. There is a challenge in determining a trade-off between the ... |

33 | Heterogeneous distribution of computations while solving linear algebra problems on networks of heterogeneous computers
- Kalinov, Lastovetsky
(Show Context)
Citation Context ...erogeneous clusters – Several authors have dealt with the static implementation of matrix-multiplication algorithms on heterogeneous platforms. One simple approach is given by Kalinov and Lastovetsky =-=[29]-=-. Their idea is to achieve a perfect load-balance as follows: first they take a fixed layout of processors arranged as a collection of processor columns; then the load is evenly balanced within each p... |

33 | Broadcast scheduling optimization for heterogeneous cluster systems
- Liu
(Show Context)
Citation Context ...del. The one-port model fully accounts for the heterogeneity of the platform, as each link has a different bandwidth. It generalizes a simpler model studied by Banikazemi, Moorthy, and Panda [1], Liu =-=[32]-=-, and Khuller and Kim [30]. In this simpler model, the communication time only depends on the sender, not on the receiver. In other words, the communication speed from a processor to all its neighbors... |

33 | Adaptive Scheduling of Master/Worker Applications on Distributed Computational Resources
- Shao
- 2001
(Show Context)
Citation Context ...other operation, so they are not allowing for fully simultaneous communications. Master-worker on the computational grid – Master-worker scheduling on the grid can be based on a network-flow approach =-=[37, 36]-=- or on an adaptive strategy [24]. Note that the network-flow approach of [37, 36] is possible only when using a full multiple-port model, where the number of simultaneous communications for a given nod... |

29 | Communication modeling of heterogeneous networks of workstations for performance characterization of collective operations
- Banikazemi, Sampathkumar, et al.
- 1999
(Show Context)
Citation Context ...l, the communication time only depends on the sender, not on the receiver. In other words, the communication speed from a processor to all its neighbors is the same. Finally, we note that some papers =-=[2, 4]-=- depart from the one-port model as they allow a sending processor to initiate another communication while a previous one is still on-going on the network. However, such models insist that there is an ... |

27 | Centralized versus distributed schedulers for multiple bag-of-task applications
- Beaumont, Carter, et al.
- 2006
(Show Context)
Citation Context ...tion for yi is yi = 2xi , so the problem can be reduced to : µi ⎧ ⎪⎨ ⎪⎩ Maximize � i xi subject to ∀i, xi ≤ 1 � i wi 2ci µi xi ≤ 1 The optimal solution for this system is a bandwidth-centric strategy =-=[8, 3]-=-; we sort workers by non-decreasing values of 2ci µi and we enroll them as long as � 2ci ≤ 1. In this way, we µiwi can achieve the throughput ρ ≈ � i enrolled 1 wi . This solution seems to be close to... |

24 | Array decomposition for nonuniform computational environments
- Kaddoura, Ranka, et al.
- 1995
(Show Context)
Citation Context ...this is the “heterogeneous block cyclic distribution” of [29]. Another approach is proposed by Crandall and Quinn [20], who propose a recursive partitioning algorithm, and by Kaddoura, Ranka and Wang =-=[28]-=-, who refine the latter algorithm and provide several variations. They report several numerical simulations. As pointed out in the introduction, theoretical results for matrix multiplication and LU de... |

23 | Self-adapting software for numerical linear algebra and LAPACK for clusters
- Chen, Dongarra, et al.
- 2003
(Show Context)
Citation Context ...pproach. Recent papers aim at making easier the process of tuning linear algebra kernels on heterogeneous systems. Self-optimization methodologies are described by Cuenca et al [21] and by Chen et al =-=[16]-=-. Along the same line, Chakravarti et al. [15] describe an implementation of Cannon’s algorithm using self-organizing agents on a peer-to-peer network. Models for heterogeneous platforms – In the lite... |

23 | On broadcasting in heterogeneous networks
- Khuller, Kim
- 2004
(Show Context)
Citation Context ...lly accounts for the heterogeneity of the platform, as each link has a different bandwidth. It generalizes a simpler model studied by Banikazemi, Moorthy, and Panda [1], Liu [32], and Khuller and Kim =-=[30]-=-. In this simpler model, the communication time only depends on the sender, not on the receiver. In other words, the communication speed from a processor to all its neighbors is the same. This would r... |

22 | Parallel matrix multiplication on a linear array with a reconfigurable pipelined bus system
- Li, Pan
- 1999

21 | Data Partitioning with a Realistic Performance Model of Networks of Heterogeneous Computers
- Lastovetsky, Reddy
- 2004
(Show Context)
Citation Context ...trix multiplication and LU decomposition on 2D-grids of heterogeneous processors are reported in [5], while extensions to general 2D partitioning are considered in [6]. See also Lastovetsky and Reddy =-=[31]-=- for another partitioning approach. Recent papers aim at making easier the process of tuning linear algebra kernels on heterogeneous systems. Self-optimization methodologies are described by Cuenca et... |

20 | Applicationspecific scheduling for the organic grid
- Chakravarti, Baumgartner, et al.
(Show Context)
Citation Context ...e process of tuning linear algebra kernels on heterogeneous systems. Self-optimization methodologies are described by Cuenca et al [21] and by Chen et al [16]. Along the same line, Chakravarti et al. =-=[15]-=- describe an implementation of Cannon’s algorithm using self-organizing agents on a peer-to-peer network. Models for heterogeneous platforms – In the literature, one-port models come in two variants. ... |

18 | Key concepts for parallel out-of-core LU factorization
- Dongarra, Hammarling, et al.
- 1996
(Show Context)
Citation Context ...including dense and sparse computations. We refer to [38] for a complete list of implementations. The design principles followed by most implementations are introduced and analyzed by Dongarra et al. =-=[22]-=-. Linear algebra algorithms on heterogeneous clusters – Several authors have dealt with the static implementation of matrix-multiplication algorithms on heterogeneous platforms. One simple approach is... |

17 | Scheduling multi-component applications in heterogeneous wide-area networks
- Weissman
- 2000
(Show Context)
Citation Context ...imultaneous communications for a given node is not bounded. This approach has also been studied in [25]. Enabling frameworks to facilitate the implementation of master-worker tasking are described in =-=[23, 39]-=-. 10 Conclusion The main contributions of this paper are the following: 1. On the theoretical side, we have derived a new, tighter, bound on the minimal volume of communications needed to multiply two... |

16 | Bandwidth-aware resource allocation for heterogeneous computing systems to maximize throughput
- Hong, Prasanna
(Show Context)
Citation Context ...low approach of [37, 36] is possible only when using a full multiple-port model, where the number of simultaneous communications for a given node is not bounded. This approach has also been studied in =-=[25]-=-. Enabling frameworks to facilitate the implementation of master-worker tasking are described in [23, 39]. 10 Conclusion The main contributions of this paper are the following: 1. On the theoretical s... |

12 | Partitioning a square into rectangles: NP-completeness and approximation algorithms
- Beaumont, Rastello, et al.
(Show Context)
Citation Context ...been shown NP-complete as well [6]. Note that even under the optimistic assumption that all communications at a given step of the algorithm can take place in parallel, the problem remains NP-complete =-=[7]-=-. In this paper, we do not try to adapt the 2D processor grid strategy to heterogeneous clusters. Instead, we adopt a realistic application scenario, where input files are read from a fixed repository... |

8 | Communication lower bounds for distributed-memory matrix multiplication
- Irony, Toledo, et al.
(Show Context)
Citation Context ...cheduling problem. Next, in Section 4, we proceed with the analysis of the total communication volume that is needed in the presence of memory constraints, and we improve a well-known bound by Toledo =-=[38, 27]-=-. We deal with homogeneous platforms in Section 5, and we propose a scheduling algorithm that includes resource selection. Section 6 is the counterpart for heterogeneous platforms, but the algorithms ... |

4 | Processes distribution of homogeneous parallel linear algebra routines on heterogeneous clusters
- Cuenca, Garcia, et al.
- 2005
(Show Context)
Citation Context ... another partitioning approach. Recent papers aim at making easier the process of tuning linear algebra kernels on heterogeneous systems. Self-optimization methodologies are described by Cuenca et al =-=[21]-=- and by Chen et al [16]. Along the same line, Chakravarti et al. [15] describe an implementation of Cannon’s algorithm using self-organizing agents on a peer-to-peer network. Models for heterogeneous ... |
