Scalable Problems and Memory-Bounded Speedup
1992
Cited by 57 (15 self)
In this paper three models of parallel speedup are studied. They are fixed-size speedup, fixed-time speedup and memory-bounded speedup. The latter two consider the relationship between speedup and problem scalability. Two sets of speedup formulations are derived for these three models. One set considers uneven workload allocation and communication overhead and gives more accurate estimation. Another set considers a simplified case and provides a clear picture on the impact of the sequential portion of an application on the possible performance gain from parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases. This study leads to a better understanding of parallel processing.
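The relationship between the three simplified models can be illustrated with a short sketch. The abstract does not state the formulas, so the forms below are the commonly cited textbook versions; the names `f` (sequential fraction), `p` (processor count) and `g` (growth of the parallel workload as memory scales with `p`) are assumptions chosen for this example.

```python
def amdahl_speedup(f, p):
    """Fixed-size speedup (Amdahl's law) for sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)


def gustafson_speedup(f, p):
    """Fixed-time (scaled) speedup: the parallel part grows linearly with p."""
    return f + (1.0 - f) * p


def memory_bounded_speedup(f, p, g):
    """Simplified memory-bounded speedup.

    g is an assumed function giving the factor by which the parallel
    workload grows when the available memory grows with p.
    """
    scaled_parallel = (1.0 - f) * g(p)
    return (f + scaled_parallel) / (f + scaled_parallel / p)


if __name__ == "__main__":
    f, p = 0.1, 16
    print("Amdahl:   ", amdahl_speedup(f, p))
    print("Gustafson:", gustafson_speedup(f, p))
    # g(p) = 1 (no workload growth) reduces to Amdahl's law;
    # g(p) = p (linear workload growth) reduces to Gustafson's scaled speedup.
    print("g(p)=1:   ", memory_bounded_speedup(f, p, lambda n: 1.0))
    print("g(p)=p:   ", memory_bounded_speedup(f, p, lambda n: float(n)))
```

Choosing `g` between these two extremes, or faster than linear, gives the intermediate and super-linear workload growth cases that a memory-bounded model is meant to cover.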
An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning
1996
Cited by 51 (8 self)
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real-world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data, especially for applications in data mining. One approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results. Moreover, data can be inherently distributed across multiple sites on the network and merging all the data in one location can be expensive or prohibitive. In this thesis we propose, investigate, and evaluate a meta-learning approach to integrating the results of mul...
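The partition, learn, and combine pattern described above can be sketched in a few lines. The abstract is truncated before it describes the thesis's own combining strategies, so this example substitutes plain majority voting as a stand-in; the toy threshold learner, the data, and all function names are hypothetical.

```python
from collections import Counter


def partition(data, k):
    """Split data into k disjoint subsets (round-robin assignment)."""
    return [data[i::k] for i in range(k)]


def train_threshold(subset):
    """Toy base learner on a 1-D feature.

    Assumes class 1 has the larger mean; predicts 1 above the midpoint
    between the two class means.
    """
    pos = [x for x, y in subset if y == 1]
    neg = [x for x, y in subset if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1 if x > threshold else 0


def combine_by_vote(classifiers):
    """Combine base classifiers by majority vote (a stand-in combiner)."""
    def predict(x):
        votes = Counter(clf(x) for clf in classifiers)
        return votes.most_common(1)[0][0]
    return predict


# Hypothetical labelled data: (feature, label) pairs.
data = [(float(x), 0) for x in range(0, 10)] + [(float(x), 1) for x in range(20, 30)]

# Each base learner sees only its own subset, never the full data set.
ensemble = combine_by_vote([train_threshold(s) for s in partition(data, 3)])

if __name__ == "__main__":
    print(ensemble(3.0), ensemble(25.0))
```

Because each learner touches only one subset, the subsets could live on different sites and only the trained classifiers would need to be exchanged.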
Unstructured Tree Search on SIMD Parallel Computers
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 41 (15 self)
In this paper, we present new methods for load balancing of unstructured tree computations on large-scale SIMD machines, and analyze the scalability of these and other existing schemes. An efficient formulation of tree search on a SIMD machine comprises two major components: (i) a triggering mechanism, which determines when the search space redistribution must occur to balance search space over processors; and (ii) a scheme to redistribute the search space. We have devised a new redistribution mechanism and a new triggering mechanism. Either of these can be used in conjunction with triggering and redistribution mechanisms developed by other researchers. We analyze the scalability of these mechanisms, and verify the results experimentally. The analysis and experiments show that our new load balancing methods are highly scalable on SIMD architectures. Their scalability is shown to be no worse than that of the best load balancing schemes on MIMD architectures. We verify our theoretical...
Scalability of parallel algorithms for the all-pairs shortest path problem
Proceedings of the International Conference on Parallel Processing, 1991
Cited by 40 (14 self)
This paper uses the isoefficiency metric to analyze the scalability of several parallel algorithms for finding shortest paths between all pairs of nodes in a densely connected graph. Parallel algorithms analyzed in this paper have either been previously presented elsewhere or are small variations of them. Scalability is analyzed with respect to mesh, hypercube and shared-memory architectures. We demonstrate that isoefficiency functions are a compact and useful predictor of performance. In fact, previous comparative predictions of some of the algorithms based on experimental results are shown to be incorrect whereas isoefficiency functions predict correctly. We find the classic tradeoffs of hardware cost vs. time and memory vs. time to be represented here as tradeoffs of hardware cost vs. scalability and memory vs. scalability.
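As a rough illustration of the isoefficiency idea, the sketch below uses the usual definition of efficiency, E = W / (W + T_o), where W is the problem size (work) and T_o is the total parallel overhead. Holding E fixed then requires W = E / (1 - E) * T_o. The overhead function `example_overhead` (p log2 p) is purely hypothetical and is not taken from the paper's analysis of any shortest-path algorithm; it also assumes the overhead does not itself depend on W.

```python
import math


def efficiency(work, overhead):
    """Parallel efficiency E = W / (W + T_o)."""
    return work / (work + overhead)


def required_work(target_eff, overhead):
    """Work needed to hold efficiency at target_eff: W = E / (1 - E) * T_o.

    Valid as written only when the overhead does not depend on W.
    """
    return target_eff / (1.0 - target_eff) * overhead


def example_overhead(p):
    """Hypothetical total overhead T_o(p) = p * log2(p)."""
    return p * math.log2(p)


if __name__ == "__main__":
    for p in (4, 16, 64):
        w = required_work(0.8, example_overhead(p))
        # The printed efficiency stays at the target as p grows,
        # provided the work grows at the rate shown in the second column.
        print(p, w, efficiency(w, example_overhead(p)))
```

The rate at which the second column grows with `p` is the isoefficiency function for this assumed overhead; a slower-growing function indicates a more scalable algorithm-architecture pair.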
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
MICRO-43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010
Cited by 33 (2 self)
To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also diminish given the scarcity of projected bandwidth in the future. To understand the relative merits between different approaches in the face of technology constraints, this work builds on prior modeling of heterogeneous multicores to support U-cores. Unlike prior models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation of future designs from scaling trends predicted by the ITRS road map. The predictive power of our model depends upon U-core-specific parameters derived by measuring performance and power of tuned applications on today’s state-of-the-art multicores, GPUs, FPGAs, and ASICs. Our results reinforce some current-day understandings of the potential and limitations of U-cores and also provide new insights on their relative merits.
Selected problems of scheduling tasks in multiprocessor computing systems
PhD thesis, Instytut Informatyki, Politechnika Poznanska, 1997
Performance and scalability of preconditioned conjugate gradient methods on parallel computers
Department of Computer Science, University of Minnesota, 1995
Parallel program performance prediction using deterministic task graph analysis
ACM Trans. Comput. Syst., 2004
Cited by 28 (3 self)
In this paper, we consider analytical techniques for predicting detailed performance characteristics of a single shared memory parallel program for a particular input. Analytical models for parallel programs have been successful at providing simple qualitative insights and bounds on program scalability, but have been less successful in practice for providing detailed insights and metrics for program performance (leaving these to measurement or simulation). We develop a conceptually simple modeling technique called deterministic task graph analysis that provides detailed performance prediction for shared-memory programs with arbitrary task graphs, a wide variety of task scheduling policies, and significant communication and resource contention. Unlike many previous models that are stochastic models, our model assumes deterministic task execution times (while retaining the use of stochastic models for communication and resource contention). This assumption is supported by a previous study of the influence of nondeterministic delays in parallel programs. We evaluate our model in three ways. First, an experimental evaluation shows that our analysis technique is accurate and efficient for a variety of shared-memory programs, including programs with large and/or complex task graphs, sophisticated task scheduling, highly nonuniform task
Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures
Journal of Computational and Graphical Statistics, 2010
Cited by 28 (7 self)
This paper describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large data sets. An example context concerns common biological studies using high-throughput technologies generating many, very large data sets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large data sets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms software design can lead to vast speedup and, critically, enable statistical analyses that presently will not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data and details that will enable readers to implement and explore the GPU approach in this mixture modelling context.
Scalability Issues Affecting the Design of a Dense Linear Algebra Library
Journal of Parallel and Distributed Computing, 1994
Cited by 25 (12 self)
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block-cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highl...
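The block-cyclic data distribution mentioned in the abstract can be illustrated in one dimension. This is a minimal sketch of the index mapping only, not ScaLAPACK's API: the function name and parameters are hypothetical, the first block is assumed to live on process 0, and a two-dimensional layout would apply the same mapping independently to row and column indices over a process grid.

```python
def block_cyclic_owner(g, block_size, num_procs):
    """Map global index g to (process, local index) in a 1-D block-cyclic layout.

    Blocks of block_size consecutive elements are dealt out to processes
    in round-robin order, starting with process 0 (an assumption here).
    """
    block = g // block_size            # which global block g falls in
    proc = block % num_procs           # blocks are dealt out cyclically
    local_block = block // num_procs   # how many earlier blocks this process holds
    return proc, local_block * block_size + g % block_size


if __name__ == "__main__":
    # Ten elements, blocks of two, three processes.
    for g in range(10):
        print(g, block_cyclic_owner(g, block_size=2, num_procs=3))
```

Dealing blocks out cyclically, rather than giving each process one contiguous slab, keeps every process busy as a factorization works its way through the matrix and the active region shrinks.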