Results 1 - 10 of 16
Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers
- International Conference for High Performance Computing, Networking and Storage
, 2005
Abstract - Cited by 11 (1 self)
In developing High-Performance Computing (HPC) software, time to solution is an important metric. This metric comprises two main components: the human effort required to develop the software, plus the amount of machine time required to execute it. To date, little empirical work has been done to study the first component: the human effort required and the effects of approaches and practices that may be used to reduce it. In this paper, we describe a series of studies that address this problem. We instrumented the development process used in multiple HPC classroom environments. We analyzed data within and across such studies, varying factors such as the parallel programming model used and the application being developed, to understand their impact on the development process.
Experiments with List Ranking for Explicit Multi-Threaded (XMT) Instruction Parallelism
- Proc. 3rd Workshop on Algorithms Engineering (WAE-99)
, 1999
Abstract - Cited by 6 (4 self)
Algorithms for the problem of list ranking are empirically studied for the Explicit Multi-Threaded (XMT) platform for instruction parallelism. The main goal of this study is to understand the differences between XMT and more traditional parallel computing implementation platforms/models as they pertain to the well-studied list ranking problem. The two main findings are: (i) Good speedups for much smaller inputs are possible. (ii) In part, this finding is based on competitive performance by a new variant of [Vi84], called the No-Cut algorithm. The paper incorporates analytic (non-asymptotic) performance analysis into experimental performance analysis for relatively small inputs. This provides an interesting example where experimental research and theoretical analysis complement one another. Explicit Multi-Threading (XMT) is a fine-grained computation framework. XMT covers the spectrum from algorithms through architecture to implementation; the main innovation in XMT (in...
Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform
Empirical Study Design in the Area of High Performance Computing (HPC)
- In Proceedings of International Symposium on Empirical Software Engineering, Noosa Heads
Abstract - Cited by 4 (4 self)
The development of High-Performance Computing (HPC) programs is crucial to progress in many fields of scientific endeavor. We have run initial studies of the productivity of HPC developers and of techniques for improving that productivity, which have not previously been the subject of significant study. Because of key differences between development for HPC and for more conventional software engineering applications, this work has required the tailoring of experimental designs and protocols. A major contribution of our work is to begin to quantify the code development process in a specialized area that has previously not been extensively studied. Specifically, we present an analysis of the domain of High-Performance Computing for the aspects that would impact experimental design; show how those aspects are reflected in experimental design for this specific area; and demonstrate how we are using such experimental designs to build up a body of knowledge specific to the domain. Results to date build confidence in our approach by showing that there are no significant differences across studies comparing subjects with similar experience tackling similar problems, while there are significant differences in performance and effort among the different parallel models applied.
A No-Busy-Wait Balanced Tree Parallel Algorithmic Paradigm
, 2000
Abstract - Cited by 4 (3 self)
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy-wait on another thread. A thread can proceed until its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve and still be competitive in the total number of operations they perform with the fastest serial algorithm for the same problem? Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. The main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallel algorithms. Applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage); the latter is known to be applicable to evaluating a family of general arithmetic expressions. For putting things in context, we also discuss our "PRAM-o...
Evaluating Multi-threading in the Prototype XMT Environment
- In Proc. 4th Workshop on Multi-Threaded Execution, Architecture and Compilation (MTEAC2000)
, 2000
Abstract - Cited by 3 (1 self)
XMT is a multi-threaded programming model designed to exploit explicit specification of parallel threads. Its main features are a simple thread execution model and an efficient prefix-sum instruction for synchronizing shared data accesses. This paper presents and evaluates the performance of multithreading in the XMT programming environment. A prototype XMT compiler converts parallel regions into procedure calls, which are then executed efficiently in XMT hardware. An architecture simulator similar to SimpleScalar is used to evaluate the performance of the XMT system for twelve benchmark codes. Results show the XMT architecture generally succeeds in providing low-overhead parallel threads and uniform access times on-chip. However, compiler optimizations to cluster (coarsen) threads are still needed for very fine-grained threads.
Keywords: Parallel programming, compilers, processor architectures.
1. Introduction: Conditional branches, variable memory access latencies, and other barrie...
Arbitrate-and-move primitives for high throughput on-chip interconnection networks
- Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Volume II
Abstract - Cited by 2 (2 self)
An n-leaf pipelined balanced binary tree is used for arbitration of order and movement of data from n input ports to one output port. A novel arbitrate-and-move primitive circuit for every node of the tree, which is based on a concept of reduced synchrony that benefits from attractive features of both asynchronous and synchronous designs, is presented. The design objective of the pipelined binary tree is to provide a key building block in a high-throughput mesh-of-trees interconnection network for the Explicit Multi-Threading (XMT) architecture, a recently introduced parallel computation framework. The proposed reduced synchrony circuit was compared with asynchronous and synchronous designs of arbitrate-and-move primitives. Simulations with 0.18 µm technology show that compared to an asynchronous design, the proposed reduced synchrony implementation achieves a higher throughput, up to 2 Giga-Requests per second on an 8-leaf binary tree. Our circuit also consumes less power than the synchronous design, and requires less silicon area than both the synchronous and asynchronous designs.
XMT-M: A Scalable Decentralized Processor
, 1999
Abstract - Cited by 2 (2 self)
A defining challenge for research in computer science and engineering has been the ongoing quest for reducing the completion time of a single computation task. Even outside the parallel processing communities, there is little doubt that the key to further progress in this quest is to do parallel processing of some kind. A recently proposed parallel processing framework that spans the entire spectrum from (parallel) algorithms to architecture to implementation is the explicit multi-threading (XMT) framework. This framework provides: (i) simple and natural parallel algorithms for essentially every general-purpose application, including notoriously difficult irregular integer applications, and (ii) a multi-threaded programming model for these algorithms which allows an "independence-of-order" semantics: every thread can proceed at its own speed, independent of other concurrent threads. To the extent possible, the XMT framework uses established ideas in parallel processing. This paper pre...
Dynamic Load Balancing Issues in the EARTH Runtime System
, 1999
Abstract - Cited by 2 (0 self)
Multithreading is a promising approach to address the problems inherent in multiprocessor systems, such as network and synchronization latencies. Moreover, the benefits of multithreading are not limited to loop-based algorithms but apply also to irregular parallelism. EARTH (Efficient Architecture for Running THreads) is a multithreaded model supporting fine-grain, non-preemptive threads. This model is supported by a C-based runtime system which provides the multithreaded environment for the execution of concurrent programs. This thesis describes the design and implementation of a set of dynamic load balancing algorithms, and an in-depth study of their behavior with divide-and-conquer, regular, and irregular classes of applications. The results described in this thesis are based on EARTH-SP2, an implementation of the EARTH program execution model on the IBM SP-2, a distributed memory multiprocessor system. The main results of this study are as follows:
• A randomizing load balance...

