## Performance Analysis of MPI Collective Operations (2005)

### Other Repositories/Bibliography

Venue: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05), Workshop 15

Citations: 51 (6 self)

### BibTeX

@INPROCEEDINGS{Angskun05performanceanalysis,
  author = {Thara Angskun and George Bosilca and Graham E. Fagg and Edgar Gabriel and Jack J. Dongarra},
  title = {Performance Analysis of MPI Collective Operations},
  booktitle = {Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15},
  year = {2005}
}

### Abstract

Previous studies of application usage show that the performance of collective communications is critical for high-performance computing, yet it is often overlooked in favor of point-to-point performance. In this paper we attempt to analyze and improve collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP. The predictions from the models were compared to experimentally gathered data, and our findings were used to optimize the implementation of collective operations in the FT-MPI library.

### Citations

721 | High-performance, portable implementation of the MPI message passing interface standard
- Gropp, Lusk, et al.
- 1996
Citation Context ... RAM, connected via Gigabit Ethernet. Model Parameters. We measured model parameters using different MPI implementations. Most of the collected data was generated using FT-MPI [7], MPICH-1, and MPICH-2 [8]. Parameter values measured using MPICH-1 had higher latency and gap values with lower bandwidth than both FT-MPI and MPICH-2. FT-MPI and MPICH-2 had similar values for these parameters. Hockney mod...

497 | LogP: Towards a realistic model of parallel computation
- Culler, Karp, et al.
- 1993
Citation Context ... and the rank of the root process. There are many parallel communication models that predict the performance of any given collective operation based on standardized system parameters. The Hockney [9], LogP [5], LogGP [1], and PLogP [10] models are frequently used to analyze parallel algorithm performance. Assessing the parameters for these models within a local area network is relatively straightforward and ...

235 | LogGP: Incorporating long messages into the LogP model for parallel computation
- Alexandrov, Ionescu, et al.
- 1997
Citation Context ...nk of the root process. There are many parallel communication models that predict the performance of any given collective operation based on standardized system parameters. The Hockney [9], LogP [5], LogGP [1], and PLogP [10] models are frequently used to analyze parallel algorithm performance. Assessing the parameters for these models within a local area network is relatively straightforward and the methods...

152 | Open MPI: Goals, concept, and design of a next generation MPI implementation
- Gabriel, Fagg, et al.
- 2004
Citation Context ...on function, but can be used as a library for any MPI implementation. For example, this work is currently being used to produce a new tuned collective module in the open source Open MPI implementation [19]. In FT-MPI, experimental and analytical analysis of collective algorithm performance was used to determine switching points between available methods. At run time, based on a static table of values, a...
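
The run-time selection mentioned in this context, a lookup in a static table of switching points, might look like the following minimal sketch. The table layout, the threshold values, and the function name are hypothetical and are not taken from FT-MPI; only the algorithm names "linear" and "pipeline" appear elsewhere on this page.

```python
import math

# Hypothetical switching points, checked top to bottom.
# Each row: (max communicator size, max message size in bytes, algorithm).
# The numbers are placeholders, not measured values.
DECISION_TABLE = [
    (8, math.inf, "linear"),           # small communicators
    (math.inf, 1024, "linear"),        # small messages on any communicator
    (math.inf, math.inf, "pipeline"),  # everything else
]


def select_broadcast_algorithm(comm_size, msg_bytes, table=DECISION_TABLE):
    """Return the first algorithm whose limits cover the given call."""
    for max_procs, max_bytes, algorithm in table:
        if comm_size <= max_procs and msg_bytes <= max_bytes:
            return algorithm
    raise ValueError("decision table has no catch-all row")
```

In a scheme like this the table would be filled offline, from model predictions or measurements, so that the per-call cost is a short scan rather than a benchmark.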

150 | MagPIe: MPI’s collective communication operations for clustered wide area systems
- Kielmann, Hofman, et al.
- 1999
Citation Context ... of different parallel communication models. Thakur et al. [14] and Rabenseifner et al. [13] use the Hockney model to analyze the performance of different collective operation algorithms. Kielmann et al. [11] use the PLogP model to find the optimal algorithm and parameters for collective operations incorporated in the MagPIe library. Bell et al. [2] use extensions of LogP and LogGP models to evaluate high perform...

103 | LogP: Towards a realistic model of parallel computation
- Culler, Karp, et al.
- 1993
Citation Context ...nodes took more than three hours [2]. There are many parallel communication models that predict the performance of any given collective operation based on standardized system parameters. The Hockney [3], LogP [4], LogGP [5], and PLogP [6] models are frequently used to analyze parallel algorithm performance. Assessing the parameters for these models within a local area network is relatively straightforward and t...

84 | Efficient algorithms for all-to-all communications in multiport message-passing systems
- Bruck, Ho, et al.
- 1997
Citation Context ...the remaining nodes have at least entered the barrier. We implemented four different algorithms for the Barrier collective: flat-tree/linear fan-in-fan-out, double ring, recursive doubling, and the Bruck [4] algorithm. In the flat-tree/linear fan-in-fan-out algorithm, all nodes report to a preselected root; once everyone has reported to the root, the root sends a releasing message to everyone. In the double r...
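
The flat-tree/linear fan-in-fan-out barrier described in this context can be sketched as a message schedule. This is a plain-Python illustration that makes no MPI calls; the function name and the `(source, destination)` tuple representation are assumptions made for the example, not part of FT-MPI.

```python
def flat_tree_barrier_schedule(num_procs, root=0):
    """Return (fan_in, fan_out) lists of (source, destination) messages.

    Fan-in: every non-root rank reports to the root.
    Fan-out: once all reports have arrived, the root releases every rank.
    """
    others = [rank for rank in range(num_procs) if rank != root]
    fan_in = [(rank, root) for rank in others]
    fan_out = [(root, rank) for rank in others]
    return fan_in, fan_out
```

For P processes the schedule holds 2 · (P − 1) messages, all of which pass through the root, which is why the root becomes the bottleneck as the communicator grows.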

81 | Reproducible measurements of MPI performance characteristics. EuroPVM/MPI'99, September
- Gropp, Lusk
- 1999
Citation Context ...gP models on both clusters. Table 5 summarizes the parameter values for the LogP/LogGP model. Performance tests. Our performance measuring methodology follows the recommendations given by Gropp et al. in [18] to ensure the reproducibility of the measured results. We minimize the effects of pipelining by forcing a “report-to-root” step after each collective operation. Each of the collected data points is a...

50 | Assessing Fast Network Interfaces
- Culler, Liu, et al.
- 1996
Citation Context ...erformance. Assessing the parameters for these models within a local area network is relatively straightforward and the methods to approximate them have already been established and are well understood [6], [10]. The major contribution of this paper is the direct comparison of Hockney, LogP, LogGP, and ... (Footnote 1: We define “optimal implementation” in the following way: given a set of available algorithms for the...)

47 | Fast measurement of LogP parameters for message passing platforms
- Kielmann, Bal, et al.
- 2000
Citation Context ...process. There are many parallel communication models that predict the performance of any given collective operation based on standardized system parameters. The Hockney [9], LogP [5], LogGP [1], and PLogP [10] models are frequently used to analyze parallel algorithm performance. Assessing the parameters for these models within a local area network is relatively straightforward and the methods to approximate ...

47 | LogP: Towards a realistic model of parallel computation
- Culler, Karp, et al.
- 1993
Citation Context ...e, and the rank of the root process. There are many parallel communication models that predict the performance of any given collective operation based on standardized system parameters. The Hockney [3], LogP [4], LogGP [5], and PLogP [6] models are frequently used to analyze parallel algorithm performance. Assessing the parameters for these models within a local area network is relatively straightforward and t...

46 | Automatically Tuned Collective Communications
- Vadhiyar, Fagg, et al.
- 2000
Citation Context ...of tests over a parameter space for the collective on a dedicated system. However, running such detailed tests even on relatively small clusters (32 to 64 nodes) can take a substantial amount of time [15]. If one were to analyze all of the MPI collectives in a similar manner, the tuning process could take days. Still, many current MPI implementations use “extensive” testing to determine switchin...

40 | The communication challenge for MPP, Intel Paragon and Meiko CS-2
- Hockney
- 1994
Citation Context ...sage size, and the rank of the root process. There are many parallel communication models that predict the performance of any given collective operation based on standardized system parameters. The Hockney [9], LogP [5], LogGP [1], and PLogP [10] models are frequently used to analyze parallel algorithm performance. Assessing the parameters for these models within a local area network is relatively straightfo...

34 | Network performance-aware collective communication for clustered wide area systems. Parallel Computing, 2001. Accepted for publication
- Kielmann, Bal, et al.
Citation Context (a flattened table of broadcast models, truncated at both ends in the source):

| Broadcast | Model | Duration | Related work |
|---|---|---|---|
| Linear | Hockney | $T = n_s \cdot (P-1) \cdot (\alpha(m_s) + m_s \cdot \beta(m_s))$ | [9], [15] |
| Linear | LogP/LogGP | $T = L + 2 \cdot o - g + n_s \cdot (P-1) \cdot ((m_s-1)G + g)$ | |
| Linear | PLogP | $T = L + n_s \cdot (P-1) \cdot g(m_s)$ | [16] |
| Pipeline | Hockney | $T = (P + n_s - 2) \cdot (\alpha(m_s) + m_s \cdot \beta(m_s))$ | |
| Pipeline | LogP/LogGP | $T = (P-1) \cdot (L + 2 \cdot o + (m_s-1)G) + (n_s-1) \cdot (g + (m_s-1)G)$ | |
| Pipeline | PLogP | $T = (P-1) \cdot (L + g(m_s)) + (n_s-1) \cdot g(\ldots$ | |
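
The broadcast duration formulas quoted in this context can be evaluated directly. The sketch below covers the Hockney and LogP/LogGP rows under some assumptions: `ns` is read as the number of segments and `ms` as the segment size, the Hockney `alpha` and `beta` are treated as constants rather than functions of `ms`, and the function names are hypothetical. It illustrates the arithmetic only and is not the authors' code.

```python
def hockney_linear(P, ns, ms, alpha, beta):
    """Linear broadcast, Hockney: T = ns * (P - 1) * (alpha + ms * beta)."""
    return ns * (P - 1) * (alpha + ms * beta)


def hockney_pipeline(P, ns, ms, alpha, beta):
    """Pipeline broadcast, Hockney: T = (P + ns - 2) * (alpha + ms * beta)."""
    return (P + ns - 2) * (alpha + ms * beta)


def loggp_linear(P, ns, ms, L, o, g, G):
    """Linear broadcast, LogP/LogGP:
    T = L + 2*o - g + ns * (P - 1) * ((ms - 1) * G + g)."""
    return L + 2 * o - g + ns * (P - 1) * ((ms - 1) * G + g)


def loggp_pipeline(P, ns, ms, L, o, g, G):
    """Pipeline broadcast, LogP/LogGP:
    T = (P - 1) * (L + 2*o + (ms - 1) * G) + (ns - 1) * (g + (ms - 1) * G)."""
    return (P - 1) * (L + 2 * o + (ms - 1) * G) + (ns - 1) * (g + (ms - 1) * G)
```

With made-up values P = 4, ns = 3, ms = 8, alpha = 2.0, and beta = 0.5, the Hockney estimates are 54.0 for the linear algorithm and 30.0 for the pipeline, which shows how such a model can rank algorithms without running them.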

31 | An evaluation of current high-performance networks
- Bell, Bonachea, et al.
- 2003
Citation Context ...e of different collective operation algorithms. Kielmann et al. [11] use the PLogP model to find the optimal algorithm and parameters for collective operations incorporated in the MagPIe library. Bell et al. [2] use extensions of LogP and LogGP models to evaluate high performance networks. Bernaschi et al. [3] analyze the efficiency of the reduce-scatter collective using the LogGP model. Vadhiyar et al. [15] used a ...

25 | Improving the Performance of Collective Operations in MPICH
- Thakur, Gropp
- 2003
Citation Context ... area of research in recent years. An important aspect of collective algorithm optimizations is understanding the algorithm performance in terms of different parallel communication models. Thakur et al. [14] and Rabenseifner et al. [13] use the Hockney model to analyze the performance of different collective operation algorithms. Kielmann et al. [11] use the PLogP model to find the optimal algorithm and parameters f...

25 | Introduction to Parallel Computing, Second Edition, E-Book
- Grama, Gupta, et al.
- 2003
Citation Context ... of research in recent years. An important aspect of collective algorithm optimizations is understanding the algorithm performance in terms of different parallel communication models. Grama et al. in [9] use the Hockney model to perform cost analysis of different collective algorithms on various network topologies (such as torus, hypercube, etc.). In [10], Thakur et al. discuss optimizations of their MPIC...

24 | Automatic MPI counter profiling of all users: First results on a CRAY T3E 900-512
- Rabenseifner
- 1999
Citation Context ...s in the FT-MPI library. 1 Introduction. Previous studies of application usage show that the performance of collective communications is critical to high performance computing (HPC). A profiling study [12] showed that applications spend more than eighty percent of a transfer time in collective operations. Given this fact, it is essential for MPI implementations to provide high-performance collective op...

17 | Fault tolerant communication library and applications for high performance
- Fagg, Gabriel, et al.
Citation Context ... hours [15]. PLogP-based parallel communication models applied to MPI collective operations. Indirectly, this work was used to implement and optimize the collective operation subsystem of the FT-MPI [7] library. The rest of this paper proceeds as follows. Section 2 discusses related work. Section 3 examines parallel communication models of interest; Section 4 discusses the Optimized Collective Commu...

14 | On optimizing collective communication
- Chan, Heimlich, et al.
- 2004
Citation Context ...roadcast and flat-tree barrier algorithm over various communicator and message sizes. We chose to fit ... (the text is interrupted here by a flattened table of broadcast models, truncated in the source):

| Broadcast | Model | Duration | Related work |
|---|---|---|---|
| Linear | Hockney | $T = n_s \cdot (P-1) \cdot (\alpha(m_s) + m_s \cdot \beta(m_s))$ | [9], [15] |
| Linear | LogP/LogGP | $T = L + 2 \cdot o - g + n_s \cdot (P-1) \cdot ((m_s-1)G + g)$ | |
| Linear | PLogP | $T = L + n_s \cdot (P-1) \cdot g(m_s)$ | [16] |
| Pipeline | Hockney | $T = (P + n_s - 2) \cdot (\alpha(m_s) + m_s \cdot \beta(m_s))$ | |
| Pipeline | | $(P-1) \cdot (L + 2 \ldots$ | |

10 | More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems
- Rabenseifner, Träff
Citation Context ...ears. An important aspect of collective algorithm optimizations is understanding the algorithm performance in terms of different parallel communication models. Thakur et al. [14] and Rabenseifner et al. [13] use the Hockney model to analyze the performance of different collective operation algorithms. Kielmann et al. [11] use the PLogP model to find the optimal algorithm and parameters for collective operations inco...

9 | Efficient implementation of reduce-scatter in MPI
- Bernaschi, Iannello, et al.
- 2003
Citation Context ... algorithm and parameters for collective operations incorporated in the MagPIe library. Bell et al. [2] use extensions of LogP and LogGP models to evaluate high performance networks. Bernaschi et al. [3] analyze the efficiency of the reduce-scatter collective using the LogGP model. Vadhiyar et al. [15] used a modified LogP model which took into account the number of pending requests that had been queued. 3 P...

4 | More efficient reduction algorithms for non-power-of-two number of processors in message-passing parallel systems
- Rabenseifner, Träff
- 2004

3 | On optimizing collective communication
- Chan, Heimlich, et al.
- 2004

3 | Fast tuning of intra-cluster collective communications
- Barchet-Estefanel, Mounié
- 2004
Citation Context ...ems. Across high-latency, wide-area links MagPIe selects segmented linear algorithms for collectives, while various tree-based algorithms are used in low-latency environments. Barchet-Estefanel et al. [14] use the PLogP model to evaluate the performance of broadcast and scatter operations on intra-cluster communication. Bell et al. [15] use extensions of LogP and LogGP models to evaluate performance of small and...

1 | On optimizing collective communication
- Chan, Heimlich, et al.
- 2004