### Citations

522 | The Omega test: a fast and practical integer programming algorithm for dependence analysis.
- Pugh
- 1991
Citation Context ...te Job Data Flow representation (JDF). In the case of the sequential pseudo-code, it is translated in the JDF format, using a provided tool (H2J) based on the integer programming framework Omega-Test [19] to extract [G. Bosilca et al. / Parallel Computing 38 (2012) 37–51, p. 41] itself the scheduling functions, thus alleviating the need for a centralized approach of scheduling. To handle load imbalance betwe...

229 | A Taxonomy of Workflow Management Systems for Grid Computing
- Yu, Buyya
- 2005
Citation Context ...e specifically, in a distributed context, the dataflow execution model is iconic of DAG based approaches [3], which have mostly been applied, in the last two decades, to grid and peer-to-peer systems [4,5]. Recently, several projects [6–10], mostly in the field of linear algebra, have proposed to revive the general use of DAGs, as an approach to tackle the challenges of harnessing the power of multi-co...

216 | Operating Systems Theory.
- Denning
- 1973
Citation Context ...rmalize the requirements for two tasks to be executed in parallel, have been isolated [1]; Directed Acyclic Graphs are a convenient abstraction of these conditions, with a large variety of applications [2]. More specifically, in a distributed context, the dataflow execution model is iconic of DAG based approaches [3], which have mostly been applied, in the last two decades, to grid and peer-to-peer sys...

170 | ScaLAPACK, a portable linear algebra library for distributed memory computers-design issues and performance
- Choi, Demmel, et al.
- 1996
Citation Context ...ation happens on the padding but complete tiles are transferred over the network nonetheless. P and Q control the process grid used to map the block cyclic distribution of the matrices. According to [31] and to our experiments, a close to square process grid, with P ≤ Q, minimizes the communications while balancing computations. Consequently, for all the results presented in this paper, the process gr...

169 | A class of parallel tiled linear algebra algorithms for multicore architectures
- Buttari, Langou, et al.
- 2009
Citation Context ...g strategies. Tile algorithms have a long history of research in the domain of linear algebra [23], and their use for multicore shared memory architectures led to significant performance improvements [24]. 4.2. Matrix factorizations The QR factorization (or QR decomposition) offers a numerically stable way of solving full rank underdetermined, overdetermined, and regular square linear systems of equat...

140 | Grid’5000: A large scale and highly reconfigurable experimental grid testbed
- Bolze
- 2006
Citation Context ...he same parameters. 1 Available online at http://www8.cs.umu.se/larsk/index.html. 5.1. Experimental conditions Platforms. The Griffon cluster is one of the clusters of the Grid’5000 experimental grid [28]. It is a 648 core machine composed of 81 dual socket Intel Xeon L5420 quad core processors at 2.5 GHz with 16 GB of memory, interconnected by a 20 Gb/s Infiniband network. Linux 2.6.24 (Debian Sid) is ...

123 | The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience
- Dongarra, Luszczek, et al.
- 2003
Citation Context ... state of the art implementation of an LU factorization. It is used as the prominent metric in the evaluation of the performance level of the most powerful machines in the world issued by the Top 500 [30]. The algorithm and programming paradigm are very similar to the ScaLAPACK version of the LU factorization, but extremely tuned. Tuning. Parallel factorizations are controlled by several parameters: N...

115 | Analysis of programs for parallel processing.
- Bernstein
- 1966
Citation Context ...on 6 provides the conclusion and future work. 2. Related work As early as 1966, the Bernstein conditions, which formalize the requirements for two tasks to be executed in parallel, have been isolated [1]; Directed Acyclic Graphs are a convenient abstraction of these conditions, with a large variety of applications [2]. More specifically, in a distributed context, the dataflow execution model is iconic ...

112 | The data locality of work stealing.
- Acar, Blelloch, et al.
- 2000
Citation Context ... been proposed in the past, in this work we implemented one that focuses solely on improving cache locality, as it has been demonstrated to be a major criterion for performant multithreaded execution [20]. Considering all of the many other approaches and selecting the most appropriate is out of the scope of this paper; the performance section will demonstrate that our approach is sufficient to demonst...

99 | NetPIPE: A network protocol independent performance evaluator
- Snell, Mikler, et al.
- 1996
Citation Context ...ethernet, then with Myricom 10 Gb/s, and finally with Infiniband 20 Gb/s. From this time t we compute the average latency of the DAGuE engine. In Fig. 5 we compare these measurements with the NetPIPE [32] benchmark using the same MPI library. For all network types, a high overhead on small message latency is observed for DAGuE: from a factor of 10 on the double-1G Ethernet network to a factor of 90 on...

81 | Parallel tiled QR factorization for multicore architectures
- Buttari, Langou, et al.
- 2008
Citation Context ...ation of an m × n real matrix A has the form A = QR, where Q is an m × m real orthogonal matrix and R is an m × n real upper triangular matrix. A detailed tile QR algorithm description can be found in [25]. Fig. 2 shows the pseudocode of the Tile QR factorization. It relies on four basic operations implemented by four computational kernels for which reference implementations are freely available as par...

81 | A storage-efficient WY representation for products of Householder transformations
- Schreiber, Van Loan
- 1988
Citation Context ... lower triangular matrix V containing the Householder reflectors. The kernel also produces the upper triangular matrix T as defined by the compact WY technique for accumulating Householder reflectors [26]. The R factor overrides the upper triangular portion of the input and the reflectors override the lower triangular portion of the input. The T matrix is stored separately. DTSQRT: The kernel perfor...

66 | The international Exascale software project roadmap.
- Dongarra, Beckman, et al.
- 2011

65 | Numerical linear algebra on emerging architectures: The plasma and magma projects.
- Agullo, Demmel, et al.
- 2009
Citation Context ...e approaches to building and managing the DAG during execution: [5] reads a concise representation of the DAG (in XML), and unrolls it in memory before scheduling it. Perez et al. [11], Agullo et al. [8], Song et al. [12], and Augonnet et al. [13] modify the sequential code with pragmas, to isolate tasks that will be run as an atomic entity, and run the sequential code to discover the DAG. Optiona...

58 | hwloc: a generic framework for managing hardware affinities in HPC applications
- Broquedis, Clet-Ortega, et al.
Citation Context ...aling and thus poor cache and NUMA locality. While this can be subject to further tuning, the queue size of 48 used currently in the engine performs well. The DAGuE environment uses the HWLOC library [21] to discover the NUMA architecture of the machine at runtime and discover architectural proximity. The JDF language and its internal representation at runtime, are specifically optimized to handle DAG...

34 | Updating an LU factorization with pivoting
- Quintana-Ortí, van de Geijn
Citation Context ...atrix, U is an n × n real upper triangular matrix and P is a permutation matrix. It relies on four kernels: DGETRF, DTSTRF, DGESSM and DSSSSM. A detailed description of this algorithm can be found in [24,27]. 4.3. Expressivity and generality To avoid the pitfalls associated with general automatic parallelization, the DAGuE approach has to trade some generality to gain in performance. In most programming ...

34 | ScaLAPACK: a linear algebra library for message-passing computers
- Blackford
- 1997
Citation Context ... the OpenMPI 1.4.1, Plasma 2.1.0 and Intel Math Library MKL-10.1.0.015. Competing Benchmarks. We compare the performances of the DAGuE based factorizations with three other implementations. ScaLAPACK [29] is the reference implementation for distributed parallel machines of some of the LAPACK routines. Like LAPACK, ScaLAPACK routines are based on block partitioned algorithms to improve cache reuse and ...

33 | Supermatrix: a multithreaded runtime scheduling system for algorithms-by-blocks, in: PPoPP ’08
- Chan, Van Zee, et al.

24 | Multi-threading and one-sided communication in parallel LU factorization
- Husbands, Yelick
- 2007
Citation Context ...nent: in [15,16], the authors propose a centralized approach to schedule computational tasks on clusters of SMPs using a PTG representation and RPC calls based on the PM2 project. Husbands and Yelick [17] propose an implementation of a tiled algorithm based on dynamic scheduling for the LU factorization on top of UPC. Gustavson et al. [18] use a static scheduling of the Cholesky factorization on top...

11 | Workflow global computing with YML
- Delannoy, Emad, et al.
- 2006
Citation Context ...e specifically, in a distributed context, the dataflow execution model is iconic of DAG based approaches [3], which have mostly been applied, in the last two decades, to grid and peer-to-peer systems [4,5]. Recently, several projects [6–10], mostly in the field of linear algebra, have proposed to revive the general use of DAGs, as an approach to tackle the challenges of harnessing the power of multi-co...

11 | Distributed SBP Cholesky Factorization Algorithms with Near-Optimal Scheduling.
- Gustavson, Karlsson, et al.
- 2009
Citation Context ...n and RPC calls based on the PM2 project. Husbands and Yelick [17] propose an implementation of a tiled algorithm based on dynamic scheduling for the LU factorization on top of UPC. Gustavson et al. [18] use a static scheduling of the Cholesky factorization on top of MPI to evaluate the impact of data representation structures. All of these projects address a single problem and propose ad hoc soluti...

11 | Minimal data copy for dense linear algebra factorization
- Gustavson, Gunnels, et al.
- 2006
Citation Context .... Tile algorithms We decided to consider the tile version of the linear algebra algorithms. Tile algorithms are based on the idea of processing the matrix by square sub-matrices, referred to as tiles [22], of relatively small size, which makes the operations efficient in terms of cache and TLB use. More importantly in the context of DAGuE, tile algorithms provide more task parallelism than traditional...

4 | Automatic multithreaded parallel program generation for message passing multiprocessors using parameterized task graphs
- Jeannot
- 2001
Citation Context ...erminism introduced by communications; and in addition to the dependencies themselves, data movements must be tracked between nodes. In the context of linear algebra, three projects are prominent: in [15,16], the authors propose a centralized approach to schedule computational tasks on clusters of SMPs using a PTG representation and RPC calls based on the PM2 project. Husbands and Yelick [17] propose an...

4 | Matrix algorithms, Society for Industrial and Applied Mathematics
- Stewart
- 2001
Citation Context ...elism than traditional approaches for linear algebra operations, and thus is well suited for DAG scheduling strategies. Tile algorithms have a long history of research in the domain of linear algebra [23], and their use for multicore shared memory architectures led to significant performance improvements [24]. 4.2. Matrix factorizations The QR factorization (or QR decomposition) offers a numerically s...

3 | The impact of multicore on math software, in: Applied Parallel Computing
- Buttari, Dongarra, et al.
- 2006