Results 11-20 of 67
Predicting Multiprocessor Memory Access Patterns with Learning Models
1997
Cited by 11 (0 self)
Abstract:
Machine learning techniques are applicable to computer system optimization. We show that shared memory multiprocessors can successfully utilize machine learning algorithms for memory access pattern prediction. In particular, three different online machine learning prediction techniques were tested to learn and predict repetitive memory access patterns for three typical parallel processing applications (the 2D relaxation algorithm, matrix multiplication, and the Fast Fourier Transform) on a shared memory multiprocessor. The predictions were then used by a routing control algorithm to reduce control latency in the interconnection network by configuring the network to provide needed memory access paths before they were requested. Three trainable prediction techniques were tested: (1) a Markov predictor, (2) a linear predictor, and (3) a time-delay neural network (TDNN) predictor. Different predictors performed best on different applications, but the TDNN produced uniformly go...
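As a sketch of the first of these techniques, a first-order Markov predictor can be written in a few lines: it records how often each address follows each other address and predicts the most frequent successor of the current one. The class name and the access trace below are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter, defaultdict

class MarkovPredictor:
    """First-order Markov predictor: predicts the address most often
    observed to follow the current address (illustrative sketch only)."""
    def __init__(self):
        self.transitions = defaultdict(Counter)
        self.prev = None

    def observe(self, addr):
        if self.prev is not None:
            self.transitions[self.prev][addr] += 1
        self.prev = addr

    def predict(self):
        successors = self.transitions.get(self.prev)
        if not successors:
            return None
        return successors.most_common(1)[0][0]

# A repetitive access pattern, as e.g. a stencil sweep might produce.
predictor = MarkovPredictor()
for addr in [0x10, 0x20, 0x30, 0x10, 0x20, 0x30, 0x10, 0x20]:
    predictor.observe(addr)
print(hex(predictor.predict()))  # after 0x20 the pattern continues with 0x30
```

A routing controller could use such a prediction to pre-configure the interconnect path to the predicted address before the request arrives.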
Efficient Matrix Chain Ordering in Polylog Time
In Proc. of Int'l Parallel Processing Symposium, 1998
Cited by 10 (3 self)
Abstract:
The matrix chain ordering problem (MCOP) is to find the cheapest way to multiply a chain of n matrices, where the matrices are pairwise compatible but of varying dimensions. Here we give several new parallel algorithms, including O(lg^3 n)-time, (n/lg n)-processor algorithms for solving the matrix chain ordering problem and for solving an optimal triangulation problem of convex polygons on the common CRCW PRAM model. Next, by using efficient algorithms for computing the row minima of totally monotone matrices, this complexity is improved to O(lg^2 n) time with n processors on the EREW PRAM, and to O(lg^2 n lg lg n) time with n/lg lg n processors on a common CRCW PRAM. A new algorithm for computing the row minima of totally monotone matrices improves our parallel MCOP algorithm to O(n lg^1.5 n) work and polylog time on a CREW PRAM. Optimal log-time algorithms for computing the row minima of totally monotone matrices will improve our algorithm and enable it to have the same work as the sequential algorithm of Hu and
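For context, the sequential baseline these parallel algorithms compete with is the classic O(n^3) dynamic program for the matrix chain ordering problem (the Hu result mentioned at the end refers to an even faster sequential method). A minimal sketch of the standard dynamic program:

```python
def matrix_chain_order(dims):
    """Classic O(n^3) dynamic program for the matrix chain ordering
    problem; matrix i has shape dims[i-1] x dims[i]."""
    n = len(dims) - 1  # number of matrices in the chain
    # cost[i][j]: minimal scalar multiplications to compute A_i ... A_j
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):          # subchain length
        for i in range(1, n - length + 2):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                for k in range(i, j)        # split point
            )
    return cost[1][n]

# Six matrices of shapes 30x35, 35x15, 15x5, 5x10, 10x20, 20x25.
print(matrix_chain_order([30, 35, 15, 5, 10, 20, 25]))  # 15125
```

The parallel algorithms in the paper parallelize exactly this table computation, exploiting the total monotonicity of the underlying cost matrices.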
Secure File Transfer: A Computational Analog to the Furniture Moving Paradigm
Parallel and Distributed Computing Practices, 1999
Cited by 9 (8 self)
Abstract:
One of the most compelling illustrations of the power of parallelism is the furniture-moving paradigm. In it, a large item of furniture needs to be moved from one place to another. A single mover, working alone, must take the item apart, move each piece separately, and then reassemble the item at the new location, taking a long time to complete the job. By contrast, four movers can simply lift the item and quickly move it to its new location. Thus, the time required to accomplish the task is reduced by a factor significantly larger than four. This paper describes a computational analog to the furniture-moving paradigm. The computation in question transfers a computer file from one computer system to another over an insecure communications channel. The file contains private or sensitive information whose secrecy and integrity need to be maintained. Cryptography is used to obtain a digital signature of the file, thereby protecting its integrity, and the...
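The integrity-protection step can be illustrated with a small sketch: digest the file, split it into pieces (which could travel over separate parallel channels), and verify the digest after reassembly. A bare SHA-256 hash stands in here for the digital signature the paper describes; the function names and chunk size are illustrative assumptions:

```python
import hashlib

CHUNK = 4  # tiny chunk size, for illustration only

def prepare(data: bytes):
    """Digest the whole file, then split it into indexed pieces that
    could be sent over separate parallel channels. A real system would
    use an asymmetric signature rather than a bare hash."""
    digest = hashlib.sha256(data).hexdigest()
    pieces = [(i, data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
    return digest, pieces

def reassemble(digest, pieces):
    """Reorder the pieces by index and check the file's integrity."""
    data = b"".join(chunk for _, chunk in sorted(pieces))
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("integrity check failed")
    return data

sig, parts = prepare(b"sensitive payload")
parts.reverse()  # pieces may arrive in any order
print(reassemble(sig, parts))  # b'sensitive payload'
```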
Complexity results for collective communications on heterogeneous platforms
Int. Journal of High Performance Computing Applications
Cited by 9 (2 self)
Abstract:
In this paper, we consider the communications involved in the execution of a complex application deployed on a heterogeneous platform. Such applications extensively use macro-communication schemes, for example to broadcast data items either to all resources (broadcast) or to a restricted set of targets (multicast). Rather than aiming at minimizing the execution time of a single collective communication, we focus on steady-state operation. We assume that there is a large number of messages to be broadcast or multicast in pipelined fashion, and we aim at maximizing the throughput, i.e., the (rational) number of messages which can be broadcast or multicast every time-step. We target heterogeneous platforms, modeled by a graph where resources have different communication and computation speeds. Achieving the best throughput may well require that the target platform be used in its totality: different messages may need to be transferred along different paths. The main focus of the paper is on complexity results. We aim at presenting a unified framework for analyzing the complexity of collective communication schemes. We concentrate on classification (whether maximizing the throughput is a polynomial or NP-hard problem) rather than on actually providing efficient polynomial algorithms (when such algorithms are known, we give bibliographical pointers). Key words: scheduling, collective communications, NP-completeness, broadcast, heuristics, heterogeneous clusters, grids
Automated Performance Prediction for Scalable Parallel Computing
Parallel Computing, 1997
Cited by 9 (0 self)
Abstract:
Performance prediction is necessary in order to deal with multidimensional performance effects on parallel systems. The compiler-generated analytical model developed in this paper accounts for the effects of cache behavior, CPU execution time, and message passing overhead for real programs written in high-level data-parallel languages. The performance prediction technique is shown to be effective in analyzing several nontrivial data-parallel applications as the problem size and number of processors vary. We leverage technology from the Maple symbolic manipulation system and the S-PLUS statistical package in order to present users with critical performance information necessary for performance debugging, architectural enhancement, and procurement of parallel systems. The usability of these results is improved by specifying confidence intervals as well as predicted execution times for data-parallel applications.
Parallel Two Level Block ILU Preconditioning Techniques for Solving Large Sparse Linear Systems
Parallel Computing, 2000
Cited by 8 (4 self)
Abstract:
We discuss issues related to domain decomposition and multilevel preconditioning techniques, which are often employed for solving large sparse linear systems in parallel computations. We introduce a class of parallel preconditioning techniques for general sparse linear systems based on a two-level block ILU factorization strategy. We give new data structures and strategies to construct the local coefficient matrix and the local Schur complement matrix on each processor. The preconditioner constructed is fast and robust for solving certain large sparse matrices. Numerical experiments show that our domain-based two-level block ILU preconditioners are more robust and more efficient than some published ILU preconditioners based on Schur complement techniques for parallel sparse matrix solutions.
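The role of the local Schur complement can be illustrated on a two-block partition of a small dense system. The sketch below uses dense NumPy solves where the paper's method applies a block ILU factorization, and the matrix and partition sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2  # interior unknowns, interface unknowns
# A diagonally dominant test matrix, partitioned as [[B, E], [F, C]].
A = rng.random((n + m, n + m)) + (n + m) * np.eye(n + m)
B, E = A[:n, :n], A[:n, n:]
F, C = A[n:, :n], A[n:, n:]
b = rng.random(n + m)

# The Schur complement S = C - F B^{-1} E couples only the interface
# unknowns; applications of B^{-1} are local to a processor (here a
# dense solve stands in for the block ILU factorization).
S = C - F @ np.linalg.solve(B, E)
y = np.linalg.solve(S, b[n:] - F @ np.linalg.solve(B, b[:n]))  # interface
x = np.linalg.solve(B, b[:n] - E @ y)                          # interior
sol = np.concatenate([x, y])

print(np.allclose(A @ sol, b))  # block elimination solves the full system
```

In the parallel setting each processor holds one such local block, so the only globally coupled work is the (much smaller) Schur complement system.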
MSP: a class of parallel multistep successive sparse approximate inverse preconditioning strategies
SIAM J. Sci. Comput., 2002
Cited by 7 (4 self)
Abstract:
We develop a class of parallel multistep successive preconditioning strategies to enhance the efficiency and robustness of standard sparse approximate inverse preconditioning techniques. The key idea is to compute a series of simple sparse matrices to approximate the inverse of the original matrix. Studies are conducted to show the advantages of such an approach, in terms of both improved preconditioning accuracy and reduced computational cost, compared to standard sparse approximate inverse preconditioners. Numerical experiments using a prototype implementation to solve a few sparse matrices on a distributed-memory parallel computer are reported.
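The key idea, approximating the inverse by a product of simple sparse factors computed one after another, can be illustrated with a two-step toy example: Jacobi scaling as the first factor and a truncated Neumann correction as the second. This is not the MSP algorithm itself, only a sketch of the underlying principle:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
I = np.eye(2)

# Step 1: a very simple sparse approximate inverse (Jacobi scaling).
M1 = np.diag(1.0 / np.diag(A))

# Step 2: a second simple factor approximating the inverse of the
# already-preconditioned matrix M1 A (truncated Neumann correction).
R1 = I - M1 @ A
M2 = I + R1

# The combined preconditioner satisfies I - M2 M1 A = R1^2, so the
# residual of the two-step product is much smaller than either step's.
r_one = np.linalg.norm(I - M1 @ A)
r_two = np.linalg.norm(I - M2 @ M1 @ A)
print(r_two < r_one)  # True
```

Each factor is cheap and sparse on its own; the accuracy comes from composing them, which is the trade-off the abstract describes.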
Network Performance Modeling for PVM Clusters
In Proceedings of Supercomputing '96, Nov. 1996
Cited by 6 (1 self)
Abstract:
The advantages of workstation clusters as a parallel computing platform include a superior price-performance ratio, availability, scalability, and ease of incremental growth. However, the performance of traditional LAN technologies such as Ethernet and FDDI rings is insufficient for many parallel applications. This paper describes APACHE (Automated Pvm Application CHaracterization Environment) [16], an automated analysis system that uses an application-independent model for predicting the impact of ATM on the execution time of iterative parallel applications. APACHE has been used to predict the performance of several core applications that form the basis for many real scientific and engineering problems. We present a comparison of the performance predicted by APACHE with observed execution times to demonstrate the accuracy of our model. Finally, we present a method for a simple cost-benefit analysis that can be used to determine whether an investment in ATM equipment is justified f...
A Portable 3D FFT Package for Distributed-Memory Parallel Architectures
Proceedings of the 7th SIAM Conference on Parallel Processing, 1995
Cited by 6 (1 self)
Abstract:
A parallel algorithm for 3D FFTs is implemented as a series of local 1D FFTs combined with data transposes. This allows the use of vendor-supplied (often fully optimized) sequential 1D FFTs. The FFTs are carried out in-place by using an in-place data transpose across the processors.

1 Introduction

Multidimensional FFTs are used frequently in engineering and scientific calculations, especially in image processing. Parallel implementations of the FFT generally follow two approaches. One is the binary-exchange approach [1,2], where data exchanges take place between all pairs of processors whose processor numbers differ by one bit. The other is the transpose approach [2] for multidimensional FFTs, where a 3D FFT is carried out as 3 successive 1D local sequential FFTs with data transposes occurring in between. Interprocessor communication takes place only in these data transposes. One advantage of this approach is that we can use the vendor-supplied 1D FFTs, which are often fully optimized. Furth...
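The decomposition into axis-wise 1D FFTs can be checked directly with NumPy: applying a 1-D FFT along each axis in turn reproduces the full 3-D transform. The array shape is an arbitrary example; in the parallel version each axis pass is local to a processor and the data is transposed between passes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((4, 4, 4))

# A 3-D FFT decomposes into 1-D FFTs along each axis in turn; between
# passes the parallel algorithm transposes the data so that the next
# axis is local, and every processor runs only sequential 1-D FFTs.
y = x.astype(complex)
for axis in range(3):
    y = np.fft.fft(y, axis=axis)

print(np.allclose(y, np.fft.fftn(x)))  # True
```

This is exactly why vendor-supplied sequential 1-D FFTs can be reused: all parallelism lives in the transposes, not in the transform itself.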
A dynamic co-allocation service in multicluster systems
2004
Cited by 6 (0 self)
Abstract:
In multicluster systems, and more generally in grids, jobs may require co-allocation, i.e., the simultaneous allocation of resources such as processors in multiple clusters, to improve their performance. In previous work, we have studied processor co-allocation through simulations. Here, we extend this work with the design and implementation of a dynamic processor co-allocation service in multicluster systems. While an implementation of basic co-allocation mechanisms has existed for some years in the form of the DUROC component of the Globus Toolkit, DUROC does not provide resource-brokering functionality or fault tolerance in the face of job submission or completion failures. Our design adds these two elements in the form of a software layer on top of DUROC. We have performed experiments showing that our co-allocation service works reliably.