
## Run-time Compilation for Parallel Sparse Matrix Computations (1996)

Venue: Proceedings of the ACM International Conference on Supercomputing

Citations: 18 (10 self)

### Citations

262 | A supernodal approach to sparse partial pivoting
- Demmel, Eisenstat, et al.
- 1999

Citation Context: ...and the multi-processor speedups are good considering the current status of parallel sparse LU factorization research. This work is still preliminary and we are comparing it with other sparse LU code [5].

| Matrix | P=1 | P=2 | P=4 | P=8 | P=16 |
| --- | --- | --- | --- | --- | --- |
| sherman5 | 2.55 | 4.41 | 7.93 | 14.1 | 17.2 |
| lnsp3937 | 2.33 | 4.08 | 7.04 | 13.6 | 15.0 |
| lns3937 | 2.01 | 3.60 | 6.17 | 12.0 | 14.6 |
| sherman3 | 3.77 | 6.50 | 11.1 | 19.3 | 20.3 |
| jpwh991 | 3.01 | 5.15 | 9.18 | 16.7 | 16.0 |

or... (excerpt truncated)
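The excerpt above quotes per-matrix performance figures for P = 1 through P = 16 processors. Assuming the columns are absolute rates (the truncated excerpt does not say whether they are megaflop rates or speedups over a different baseline), the relative scaling from one to sixteen processors can be derived as a simple ratio. A hypothetical sketch, using only the numbers quoted in the excerpt:

```python
# Relative scaling from the figures quoted above, assuming each
# column is an absolute rate (the excerpt is ambiguous on this).
rates = {
    "sherman5": {1: 2.55, 16: 17.2},
    "lnsp3937": {1: 2.33, 16: 15.0},
    "lns3937":  {1: 2.01, 16: 14.6},
    "sherman3": {1: 3.77, 16: 20.3},
    "jpwh991":  {1: 3.01, 16: 16.0},
}

def relative_scaling(r, p_lo=1, p_hi=16):
    """Ratio of the P=p_hi figure to the P=p_lo figure, per matrix."""
    return {m: round(v[p_hi] / v[p_lo], 2) for m, v in r.items()}

print(relative_scaling(rates))
```

Under that reading, the quoted matrices scale by roughly a factor of 5-7 from one to sixteen processors.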

209 | DSC: scheduling parallel tasks on an unbounded number of processors
- Yang, Gerasoulis
- 1994

Citation Context: ...first stage, we cluster tasks to a set of threads (or directly call them clusters) to reduce communication and exploit the data locality. Two clustering strategies are used: 1) Use the DSC algorithm [21]. 2) Form clusters based on the data accessing patterns. If tasks write or modify the same data object, they will be assigned into one cluster. This data-driven approach is essentially following the o...
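The second, data-driven clustering strategy quoted above — tasks that write or modify the same data object end up in one cluster — amounts to a union-find grouping over shared written objects. A minimal sketch under that reading; the task and object names are illustrative, not from the paper:

```python
def cluster_by_written_data(task_writes):
    """task_writes: dict mapping task name -> set of data objects it writes.
    Returns a list of clusters; two tasks share a cluster iff they are
    connected through some data object both of them write."""
    parent = {t: t for t in task_writes}

    def find(t):  # union-find with path halving
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    owner = {}  # data object -> first task seen writing it
    for task, objs in task_writes.items():
        for obj in objs:
            if obj in owner:
                # another task already writes obj: merge the two clusters
                parent[find(task)] = find(owner[obj])
            else:
                owner[obj] = task

    clusters = {}
    for t in task_writes:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())
```

For example, if T1 and T2 both write object `x` while T3 writes only `z`, the sketch groups T1 and T2 together and leaves T3 alone.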

144 | Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures
- Das, Uysal, et al.
- 1994

Citation Context: ...r structured codes have been shown successful in many application domains. However it is still difficult to parallelize unstructured codes, which can be found in many scientific applications [18]. In [4] an important class of unstructured and sparse problems which involve iterative computations is identified and has been successfully parallelized using the inspector/executor approach. The cost of opt...
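The inspector/executor approach mentioned above splits an irregular loop into an inspector, which analyzes the index patterns once at run time, and an executor, which reuses that analysis across iterations so the analysis cost is amortized over the iterative computation. A minimal illustrative sketch — the function names are ours, not the API of the library in [4]:

```python
# Illustrative inspector/executor split for an irregular gather,
# y[index[i]] += x[i], repeated over several iterations.

def inspector(index):
    """Run once: precompute, for each destination slot, the list of
    source slots that feed it (the expensive irregular analysis)."""
    gather = {}
    for src, dst in enumerate(index):
        gather.setdefault(dst, []).append(src)
    return gather

def executor(x, gather, iterations=3):
    """Run every iteration: apply the precomputed access pattern
    without re-analyzing the index array."""
    for _ in range(iterations):
        y = [0.0] * len(x)
        for dst, srcs in gather.items():
            y[dst] = sum(x[s] for s in srcs)
        x = y
    return x
```

Because the index pattern is fixed across iterations, the inspector's cost is paid once while the executor runs many times — the amortization argument the excerpt alludes to.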

141 | A General Approach to Mapping of Parallel Computation upon Multiprocessor Architectures, available at ftp://ftp.cs.utexas.edu/pub/techreports/tr88-04.pdf
- Kim
- 1988

Citation Context: ...llelism and also determine the execution order of commuting operations so as to minimize parallel time. Algorithms for static scheduling of DAGs have been extensively studied in the literature, e.g. [10, 15, 20]. The main optimizations are eliminating unnecessary communication to exploit data locality, overlapping communication with computation to hide communication latency, and exploiting task concurrency t...

134 | Parallel algorithms for sparse linear systems, in Parallel Algorithms for Matrix Computations
- Heath, Ng, et al.
- 1990

Citation Context: ...own in Figure 2, matrix A is partitioned into N × N submatrices. A partitioning example is shown in Figure 4. Notice that this submatrix partitioning is not uniform due to supernode partitioning [9, 14]. We assume that the nonzero structure information is available after symbolic factorization and supernode partitioning. These operations are performed before task specification. Each data object is d...
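The block Cholesky structure the excerpts describe — factor a diagonal block, solve the subdiagonal blocks against it, then update the trailing submatrix — can be sketched on a dense matrix. This is a sketch of the loop structure only; the sparse, non-uniform supernodal blocking the excerpt mentions is not modeled here:

```python
import math

def cholesky_blocked(A, b):
    """Right-looking blocked Cholesky of a dense SPD matrix A
    (list of lists), block size b. Returns the lower factor L."""
    n = len(A)
    # lower-triangular working copy; strictly-upper entries zeroed
    L = [[A[i][j] if j <= i else 0.0 for j in range(n)] for i in range(n)]
    for K in range(0, n, b):
        e = min(K + b, n)
        # 1) factor the diagonal block L[K:e, K:e] (unblocked Cholesky)
        for k in range(K, e):
            L[k][k] = math.sqrt(L[k][k])
            for i in range(k + 1, e):
                L[i][k] /= L[k][k]
            for j in range(k + 1, e):
                for i in range(j, e):
                    L[i][j] -= L[i][k] * L[j][k]
        # 2) triangular solve for the subdiagonal block column
        for i in range(e, n):
            for k in range(K, e):
                s = sum(L[i][p] * L[k][p] for p in range(K, k))
                L[i][k] = (L[i][k] - s) / L[k][k]
        # 3) rank-b update of the trailing submatrix
        for j in range(e, n):
            for i in range(j, n):
                L[i][j] -= sum(L[i][p] * L[j][p] for p in range(K, e))
    return L
```

In the sparse setting, steps 1-3 become the tasks of the DAG: each operates on whole submatrices, and dependences run from a diagonal factorization to its block column and from block columns into trailing updates.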

127 | Partitioning and scheduling parallel programs for execution on multiprocessors
- Sarkar
- 1987

Citation Context: ...s to demonstrate how we are able to deliver good performance for sparse codes and our future work will address the automatic generation of inspector specification code and automatic task partitioning [2, 11, 15]. Cholesky factorization is performed on a symmetric positive definite matrix A of size n × n. In a block sparse Cholesky algorithm as shown in Figure 2, matrix A is partitioned into N × N s...

117 | Parallel programming and compilers
- Polychronopoulos
- 1988

Citation Context: ...s to demonstrate how we are able to deliver good performance for sparse codes and our future work will address the automatic generation of inspector specification code and automatic task partitioning [2, 11, 15]. Cholesky factorization is performed on a symmetric positive definite matrix A of size n × n. In a block sparse Cholesky algorithm as shown in Figure 2, matrix A is partitioned into N × N s...

91 | PYRROS: static task scheduling and code generation for message passing multiprocessors
- Yang, Gerasoulis
- 1992

Citation Context: ...ization dominates the computation at each iteration, an effective run-time optimization at the inspector stage could improve the code performance at the executor stage substantially. Previous results [1, 8, 17, 20] have demonstrated that graph scheduling can effectively exploit irregular task parallelism if task dependencies are given explicitly. We generalize the previous work and discuss a run-time library sy...

60 | What’s in a Name? or the Value of Renaming for Parallelism Detection and Storage Allocation
- Cytron, Ferrante

Citation Context: ...bjects produced by its predecessors arrive at the local processor. With the presence of anti and output dependencies, run-time synchronization becomes more complicated. We can use renaming techniques [3] to remove these dependencies; however, this needs additional memory optimization. We use the following simple strategy to remove output and anti dependences. 1) First we delete all the redundant edges fo...
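Renaming as in [3] removes output and anti dependences by giving every write a fresh version of the object, so only true (flow) dependences remain — at the cost of the extra storage the excerpt alludes to. A minimal sketch; the task tuples and the `name_version` naming scheme are illustrative, not the paper's:

```python
def rename_writes(tasks):
    """tasks: list of (name, reads, writes) tuples in program order.
    Each write gets a fresh versioned object name, so no two tasks
    ever write the same name (no output dependence) and no task
    overwrites a name a predecessor still reads (no anti dependence)."""
    version = {}  # base object name -> latest version number
    renamed = []
    for name, reads, writes in tasks:
        # reads bind to the most recent version (flow dependence only)
        r = [f"{o}_{version.get(o, 0)}" for o in reads]
        w = []
        for o in writes:
            version[o] = version.get(o, 0) + 1  # fresh version per write
            w.append(f"{o}_{version[o]}")
        renamed.append((name, r, w))
    return renamed
```

For example, if T1 and T3 both write `x` while T2 reads it in between, renaming turns the writes into `x_1` and `x_2`, so T3 no longer conflicts with either predecessor and only T2's flow dependence on T1 survives.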

54 | Automatic task graph generation techniques
- Cosnard, Loi
- 1995

Citation Context: ...s to demonstrate how we are able to deliver good performance for sparse codes and our future work will address the automatic generation of inspector specification code and automatic task partitioning [2, 11, 15]. Cholesky factorization is performed on a symmetric positive definite matrix A of size n × n. In a block sparse Cholesky algorithm as shown in Figure 2, matrix A is partitioned into N × N s...

48 | Highly parallel sparse Cholesky factorization
- Gilbert
- 1990

Citation Context: ...own in Figure 2, matrix A is partitioned into N × N submatrices. A partitioning example is shown in Figure 4. Notice that this submatrix partitioning is not uniform due to supernode partitioning [9, 14]. We assume that the nonzero structure information is available after symbolic factorization and supernode partitioning. These operations are performed before task specification. Each data object is d...

43 | Experience with Active Messages on the Meiko CS-2
- Schauser, Scheiman
- 1995

Citation Context: ...only requirement is that a processor needs to allocate a dedicated space for each data object it needs. RMA can be implemented in modern multi-processor architectures such as the Cray T3D and Meiko CS-2 [16]. We have implemented our system on the Meiko CS-2, which provides Direct Memory Access (DMA) as the major way to access non-local memory. Eliminating redundant communication and synchronization. A task may...

41 | Exploiting the Memory Hierarchy in Sequential and Parallel Sparse Cholesky Factorization
- Rothberg
- 1993

Citation Context: ...cases pivoting is needed to maintain numerical stability. Because partial pivoting operations make the data structures change dynamically, it has been an open problem to... [footnote 1:] In comparison, Rothberg [13] reported that the sequential performance of his code achieved 77% of LINPACK for BCSSTK15 and 71% on average for other matrices on an IBM RS/6000 Model 320. Matrix P=2 P=4 P=8 P=16 P=32 BCSSTK15 4% 4% 1...

22 | Multiprocessor Runtime Support for Fine-Grained Irregular DAGs
- Chong, Sharma, et al.
- 1995

Citation Context: ...ization dominates the computation at each iteration, an effective run-time optimization at the inspector stage could improve the code performance at the executor stage substantially. Previous results [1, 8, 17, 20] have demonstrated that graph scheduling can effectively exploit irregular task parallelism if task dependencies are given explicitly. We generalize the previous work and discuss a run-time library sy...

22 | Run-time support for portable distributed data structures
- Wen, Chakrabarti, et al.
- 1995

Citation Context: ...niques for structured codes have been shown successful in many application domains. However it is still difficult to parallelize unstructured codes, which can be found in many scientific applications [18]. In [4] an important class of unstructured and sparse problems which involve iterative computations is identified and has been successfully parallelized using the inspector/executor approach. The cos...

15 | Scheduling of Structured and Unstructured Computation
- Gerasoulis, Jiao, et al.
- 1995

Citation Context: ...ization dominates the computation at each iteration, an effective run-time optimization at the inspector stage could improve the code performance at the executor stage substantially. Previous results [1, 8, 17, 20] have demonstrated that graph scheduling can effectively exploit irregular task parallelism if task dependencies are given explicitly. We generalize the previous work and discuss a run-time library sy...

7 | Efficient Run-time Support for Irregular Task Computations with Mixed Granularities
- Fu, Yang
- 1996

Citation Context: ...ing stage does not suffice to produce efficient code. A careful design in the task communication model is further required to execute the pre-optimized task computation and communication schedule. In [6] we have developed an efficient run-time task communication protocol for executing general irregular task computations with mixed granularities. The distinguishing feature of the protocol is that we t...

7 | Performance of distributed sparse Cholesky factorization with pre-scheduling
- Venugopal, Naik, et al.
- 1992

5 | Communication optimizations for parallel computing using data access information
- Rinard
- 1995

Citation Context: ...task communication protocol. The design of library functions is based on three concepts: distributed shared data objects, tasks, and access specifications. Similar concepts have been proposed in JADE [12], which extracts task dependence and schedules tasks dynamically. Such an approach has the flexibility to handle problems with adaptive structures; however, it is still an open problem to balance the ben...