Results 11 – 20 of 314
Scalable Problems and Memory-Bounded Speedup
, 1992
Cited by 65 (16 self)
Abstract:
In this paper three models of parallel speedup are studied. They are fixed-size speedup, fixed-time speedup and memory-bounded speedup. The latter two consider the relationship between speedup and problem scalability. Two sets of speedup formulations are derived for these three models. One set considers uneven workload allocation and communication overhead and gives a more accurate estimation. The other set considers a simplified case and provides a clear picture of the impact of the sequential portion of an application on the possible performance gain from parallel processing. The simplified fixed-size speedup is Amdahl's law. The simplified fixed-time speedup is Gustafson's scaled speedup. The simplified memory-bounded speedup contains both Amdahl's law and Gustafson's scaled speedup as special cases. This study leads to a better understanding of parallel processing.
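The three simplified models named in this abstract can be written as one family. A minimal sketch, assuming a serial fraction f, p processors, and a memory-driven workload scaling function G(p) as in Sun and Ni's formulation; setting G(p) = 1 recovers Amdahl's law and G(p) = p recovers Gustafson's scaled speedup:

```python
def amdahl(f, p):
    """Fixed-size speedup: the problem size stays constant as p grows."""
    return 1.0 / (f + (1.0 - f) / p)

def gustafson(f, p):
    """Fixed-time speedup: the parallel part of the work scales with p."""
    return f + p * (1.0 - f)

def memory_bounded(f, p, G):
    """Memory-bounded speedup: the parallel workload scales by G(p), the
    growth permitted by the aggregate memory of p processors."""
    scaled = (1.0 - f) * G(p)
    return (f + scaled) / (f + scaled / p)
```

With G(p) = 1 the memory-bounded form reduces to Amdahl's law, and with G(p) = p to Gustafson's scaled speedup, matching the abstract's claim that both are special cases.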
An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning
, 1996
Cited by 51 (8 self)
Abstract:
Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real-world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data, especially for applications in data mining. One approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results. Moreover, data can be inherently distributed across multiple sites on the network and merging all the data in one location can be expensive or prohibitive. In this thesis we propose, investigate, and evaluate a meta-learning approach to integrating the results of mul...
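The partition-train-combine flow the abstract describes can be sketched in a few lines. This is a toy illustration only: the majority-label `train_stub` "learner" and the plain-voting combiner are hypothetical stand-ins, not the thesis's actual meta-learning strategies.

```python
from collections import Counter

def partition(data, k):
    """Split a data set into k disjoint subsets (round-robin here)."""
    return [data[i::k] for i in range(k)]

def train_stub(subset):
    """Toy 'learner': always predicts the majority label of its subset.
    A stand-in for a real inductive learner."""
    majority = Counter(label for _, label in subset).most_common(1)[0][0]
    return lambda x: majority

def meta_predict(models, x):
    """Combine the base models' predictions by plain voting (the thesis
    studies richer combining and arbiter strategies)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

Each subset can be learned at a different site, and only the small trained models need to travel to the combining site, which is the point of the approach when moving the raw data is expensive.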
Parallel Programming
 in C with MPI and OpenMP. McGraw-Hill Inc
Unstructured Tree Search on SIMD Parallel Computers
 IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 42 (16 self)
Abstract:
In this paper, we present new methods for load balancing of unstructured tree computations on large-scale SIMD machines, and analyze the scalability of these and other existing schemes. An efficient formulation of tree search on a SIMD machine comprises two major components: (i) a triggering mechanism, which determines when search-space redistribution must occur to balance the search space over processors; and (ii) a scheme to redistribute the search space. We have devised a new redistribution mechanism and a new triggering mechanism. Either of these can be used in conjunction with triggering and redistribution mechanisms developed by other researchers. We analyze the scalability of these mechanisms, and verify the results experimentally. The analysis and experiments show that our new load balancing methods are highly scalable on SIMD architectures. Their scalability is shown to be no worse than that of the best load balancing schemes on MIMD architectures. We verify our theoretical...
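The two components the abstract names can be illustrated with a toy host-side sketch. The fixed-threshold trigger and the halving redistribution below are hypothetical stand-ins for illustration, not the paper's actual mechanisms.

```python
def should_trigger(num_active, num_procs, threshold=0.5):
    """Fire a redistribution when the fraction of processors that still
    have tree nodes to expand falls below `threshold` (illustrative
    triggering policy)."""
    return num_active / num_procs < threshold

def redistribute(work):
    """Toy redistribution over per-processor node counts: each donor
    with more than one node gives half of its work to one idle
    processor, largest donors first."""
    donors = sorted((i for i, w in enumerate(work) if w > 1),
                    key=lambda i: -work[i])
    idle = [i for i, w in enumerate(work) if w == 0]
    for d, r in zip(donors, idle):
        half = work[d] // 2
        work[d] -= half
        work[r] = half
    return work
```

On a real SIMD machine both steps must themselves run in lockstep across all processing elements, which is why the choice of trigger and redistribution scheme dominates the scalability analysis.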
Evaluation of Design Choices for Gang Scheduling using Distributed Hierarchical Control
 Journal of Parallel and Distributed Computing
, 1996
Cited by 42 (12 self)
Abstract:
Gang scheduling, the scheduling of a number of related threads to execute simultaneously on distinct processors, appears to meet the requirements of interactive, multi-user, general-purpose parallel systems. Distributed Hierarchical Control (DHC) has been proposed as an efficient mechanism for coping with the dynamic processor partitioning necessary to support gang scheduling on massively parallel machines. In this paper, we compare and evaluate different algorithms that can be used within the DHC framework. Regrettably, gang scheduling can leave processors idle if the sizes of the gangs do not match the number of available processors. We show that in DHC this effect can be reduced by reclaiming the leftover processors when the gang size is smaller than the allocated block of processors, and by adjusting the scheduling time quantum to control the adverse effect of badly-matched gangs. Consequently, the online mapping and scheduling algorithms developed for DHC are optimal in the sense that asymptotically they achieve performance commensurate with offline algorithms. Keywords: Distributed Hierarchical Control, fragmentation, gang scheduling, load balancing, mapping, processor utilization, variable partitioning. Parts of this research have been presented at conferences [18, 19]. Part of this work was done while at the IBM T. J. Watson Research Center, Yorktown Heights, NY 10598.
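Under a binary (buddy-style) controller hierarchy such as DHC's, a gang occupies the smallest power-of-two block of processors that fits it, so a badly-matched gang strands the rest of its block. A minimal sketch of that fragmentation, with function names of our own choosing, not the paper's:

```python
def block_size(gang_size):
    """Smallest power-of-two block that can hold the gang, matching a
    binary controller hierarchy."""
    b = 1
    while b < gang_size:
        b *= 2
    return b

def leftover(gang_size):
    """Processors inside the allocated block that the gang leaves idle;
    these are the candidates for the reclamation the paper proposes."""
    return block_size(gang_size) - gang_size
```

A gang of 5 threads, for example, is mapped onto a block of 8 processors and strands 3 of them unless they are reclaimed.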
Scalability of parallel algorithms for the all-pairs shortest path problem
 in the Proceedings of the International Conference on Parallel Processing
, 1991
Cited by 40 (14 self)
Abstract:
This paper uses the isoefficiency metric to analyze the scalability of several parallel algorithms for finding shortest paths between all pairs of nodes in a densely connected graph. Parallel algorithms analyzed in this paper have either been previously presented elsewhere or are small variations of them. Scalability is analyzed with respect to mesh, hypercube and shared-memory architectures. We demonstrate that isoefficiency functions are a compact and useful predictor of performance. In fact, previous comparative predictions of some of the algorithms based on experimental results are shown to be incorrect whereas isoefficiency functions predict correctly. We find the classic tradeoffs of hardware cost vs. time and memory vs. time to be represented here as tradeoffs of hardware cost vs. scalability and memory vs. scalability.
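For readers unfamiliar with the metric, isoefficiency can be stated in a few lines; this is the standard textbook formulation (as in Grama, Gupta, and Kumar), with W the problem size and T_o(W, p) the total parallel overhead:

```latex
E = \frac{T_{\mathrm{seq}}}{p\,T_{\mathrm{par}}}
  = \frac{1}{1 + T_o(W,p)/W}
\quad\Longrightarrow\quad
W = \frac{E}{1-E}\,T_o(W,p)
```

Solving the right-hand relation for W as a function of p gives the isoefficiency function: the rate at which the problem must grow to hold efficiency E constant. Slower required growth means a more scalable algorithm, which is what makes the functions a compact basis for the comparisons in the paper.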
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
 In MICRO-43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture
, 2010
Cited by 36 (3 self)
Abstract:
To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also diminish given the scarcity of projected bandwidth in the future. To understand the relative merits between different approaches in the face of technology constraints, this work builds on prior modeling of heterogeneous multicores to support U-cores. Unlike prior models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation of future designs from scaling trends predicted by the ITRS road map. The predictive power of our model depends upon U-core-specific parameters derived by measuring performance and power of tuned applications on today's state-of-the-art multicores, GPUs, FPGAs, and ASICs. Our results reinforce some current-day understandings of the potential and limitations of U-cores and also provide new insights on their relative merits.
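The family of models this work extends are Amdahl-style: a conventional core runs the serial fraction while an accelerator runs the parallel fraction. A sketch in that spirit, with parameters of our own choosing; the paper's actual model additionally accounts for power, area, and bandwidth constraints:

```python
def hetero_speedup(f_parallel, perf_serial, perf_ucore):
    """Amdahl-style sketch of a heterogeneous chip: the serial fraction
    runs on a conventional core with relative performance perf_serial,
    and the parallel fraction f_parallel runs on a U-core with relative
    performance perf_ucore. Illustrative only."""
    serial = (1.0 - f_parallel) / perf_serial
    parallel = f_parallel / perf_ucore
    return 1.0 / (serial + parallel)
```

Even a 1000x U-core yields less than 10x overall when 10% of the work stays serial, the diminishing-returns effect that makes the serial core and the bandwidth budget first-order concerns in such models.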
Understanding GPU programming for statistical computation: Studies in massively parallel massive mixtures
 Journal of Computational and Graphical Statistics
, 2010
Cited by 32 (9 self)
Abstract:
This paper describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large data sets. An example context concerns common biological studies using high-throughput technologies generating many, very large data sets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in the ability to analyze large data sets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms and software design can lead to vast speedup and, critically, enable statistical analyses that presently will not be performed due to compute-time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data and details that will enable readers to implement and explore the GPU approach in this mixture modelling context.
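The massively parallel step in such mixture models is computing per-observation component probabilities, which is independent across observations and therefore fans out naturally over thousands of GPU threads. A pure-Python sketch of that step for a one-dimensional Gaussian mixture (illustrative; the paper works with much larger multivariate models in actual GPU kernels):

```python
import math

def responsibilities(x, components):
    """Posterior component probabilities for one observation x of a
    one-dimensional Gaussian mixture; components is a list of
    (weight, mean, sd) triples. On a GPU, one thread would evaluate
    this for one observation."""
    dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in components]
    total = sum(dens)
    return [d / total for d in dens]
```

Because millions of observations each need this small, identical computation per sampler iteration, moving it onto the GPU is where the speed-ups the abstract reports come from.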
Parallel program performance prediction using deterministic task graph analysis
 ACM Trans. Comput. Syst
, 2004
Cited by 31 (4 self)
Abstract:
In this paper, we consider analytical techniques for predicting detailed performance characteristics of a single shared-memory parallel program for a particular input. Analytical models for parallel programs have been successful at providing simple qualitative insights and bounds on program scalability, but have been less successful in practice at providing detailed insights and metrics for program performance (leaving these to measurement or simulation). We develop a conceptually simple modeling technique called deterministic task graph analysis that provides detailed performance prediction for shared-memory programs with arbitrary task graphs, a wide variety of task scheduling policies, and significant communication and resource contention. Unlike many previous models that are stochastic models, our model assumes deterministic task execution times (while retaining the use of stochastic models for communication and resource contention). This assumption is supported by a previous study of the influence of nondeterministic delays in parallel programs. We evaluate our model in three ways. First, an experimental evaluation shows that our analysis technique is accurate and efficient for a variety of shared-memory programs, including programs with large and/or complex task graphs, sophisticated task scheduling, highly nonuniform task...
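With deterministic task times, the simplest core of the technique is computing each task's finish time as its duration plus the latest finish of its predecessors; the program's completion time on unlimited processors is then the longest path through the task graph. A stripped-down sketch (the paper's model additionally handles scheduling policies, communication, and contention):

```python
def finish_times(tasks, deps):
    """tasks: {name: duration}; deps: {name: [predecessor names]}.
    Returns each task's finish time assuming deterministic durations
    and unlimited processors (illustrative core of the analysis)."""
    done = {}
    def finish(t):
        if t not in done:
            done[t] = tasks[t] + max((finish(p) for p in deps.get(t, [])),
                                     default=0)
        return done[t]
    for t in tasks:
        finish(t)
    return done
```

For example, a task of duration 1 that waits on two independent tasks of durations 2 and 3 finishes at time 4, the critical-path value a stochastic model would only bound.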
Performance and scalability of preconditioned conjugate gradient methods on parallel computers
 Department of Computer Science, University of Minnesota
, 1995