Results 1–10 of 16
Reducing I/O complexity by simulating coarse grained parallel algorithms
 In Proc. IPPS/SPDP
, 1999
Abstract

Cited by 10 (5 self)
Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a deterministic simulation technique which transforms parallel algorithms into (parallel) external memory algorithms. Specifically, we present a deterministic simulation technique which transforms Coarse Grained Multicomputer (CGM) algorithms into external memory algorithms for the Parallel Disk Model. Our technique optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel external memory algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including 3D convex hulls (2D Voronoi diagrams), and various graph problems. All of the (parallel) external memory algorithms obtained via simulation are analyzed with respect to the computation time, communication time and the number of I/Os. Our results answer the challenge posed by the ACM working group on storage I/O for large-scale computing [8].
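As a rough illustration of the Parallel Disk Model costs this abstract refers to, the textbook I/O bound for sorting N items with internal memory M, block size B, and D disks is Θ((N/(DB)) · log_{M/B}(N/B)). The sketch below is a back-of-the-envelope calculator under assumed parameter values, not the paper's own analysis:

```python
import math

def pdm_sort_ios(N, M, B, D):
    """Asymptotic I/O count for sorting N items in the Parallel Disk
    Model (memory M, block size B, D independent disks):
    Theta((N / (D*B)) * log_{M/B}(N/B)).  Constant factors ignored."""
    passes = math.ceil(math.log(N / B) / math.log(M / B))  # merge passes
    return (N // (D * B)) * passes

# Hypothetical instance: 2^30 items, 2^24 words of memory,
# 2^10-word blocks, 4 disks -> 2 passes over 2^18 block-stripes.
print(pdm_sort_ios(2**30, 2**24, 2**10, 4))
```

Fully parallel disk I/O shows up as the D in the denominator: each pass moves N items in stripes of D·B words per parallel read/write.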
The Design and Analysis of Bulk-Synchronous Parallel Algorithms
, 1998
Abstract

Cited by 10 (1 self)
The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. This thesis presents a systematic approach to the design and analysis of BSP algorithms. We introduce an extension of the BSP model, called BSPRAM, which reconciles shared-memory style programming with efficient exploitation of data locality. The BSPRAM model can be optimally simulated by a BSP computer for a broad range of algorithms possessing certain characteristic properties: obliviousness, slackness, granularity. We use BSPRAM to design BSP algorithms for problems from three large, partially overlapping domains: combinatorial computation, dense matrix computation, graph computation. Some of the presented algorithms are adapted from known BSP algorithms (butterfly dag computation, cube dag computation, matrix multiplication). Other algorithms are obtained by application of established non-BSP techniques (sorting, randomised list contraction, Gaussian elimination without pivoting and with column pivoting, algebraic path computation), or use original techniques specific to the BSP model (deterministic list contraction, Gaussian elimination with nested block pivoting, communication-efficient multiplication of Boolean matrices, synchronisation-efficient shortest paths computation). The asymptotic BSP cost of each algorithm is established, along with its BSPRAM characteristics. We conclude by outlining some directions for future research.
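The BSP cost accounting the abstract refers to charges each superstep w + h·g + l, where w is the maximum local work, h the maximum number of words any processor sends or receives, g the per-word communication gap, and l the barrier-synchronisation latency. A minimal sketch of this bookkeeping, with made-up machine parameters:

```python
def bsp_cost(supersteps, g, l):
    """Total BSP cost of a program given as a list of supersteps.
    Each superstep is a pair (w, h): w = max local work on any
    processor, h = max words communicated by any processor.
    Per-superstep cost is w + h*g + l."""
    return sum(w + h * g + l for (w, h) in supersteps)

# Hypothetical machine: g = 4 time units per word, l = 100 per barrier.
# Two supersteps: (1000 work, 50 words) and (500 work, 20 words).
print(bsp_cost([(1000, 50), (500, 20)], g=4, l=100))
```

Synchronisation-efficient algorithms, in these terms, are those that reduce the number of supersteps and hence the total l term.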
A unified model for multicore architectures
 In Proc. 1st International Forum on Next-Generation Multicore/Manycore Technologies
, 2008
Abstract

Cited by 9 (1 self)
With the advent of multicore and manycore architectures, we are facing a problem that is new to parallel computing, namely, the management of hierarchical parallel caches. One major limitation of all earlier models is their inability to model multicore processors with varying degrees of sharing of caches at different levels. We propose a unified memory hierarchy model that addresses these limitations and is an extension of the MHG model developed for a single processor with a multi-memory hierarchy. We demonstrate that our unified framework can be applied to a number of multicore architectures for a variety of applications. In particular, we derive lower bounds on memory traffic between different levels in the hierarchy for financial and scientific computations. We also give multicore algorithms for a financial ...
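For a sense of the kind of memory-traffic lower bound the abstract mentions, the classical Hong–Kung result for dense n×n matrix multiplication with a cache of Z words is Ω(n³/√Z) words moved between cache and the next level. The sketch below just evaluates that bound for assumed n and Z; it is an illustration of the style of result, not this paper's multicore derivation:

```python
import math

def matmul_traffic_lower_bound(n, Z):
    """Hong-Kung style lower bound on words moved between a cache of
    Z words and the level below it, for dense n x n matrix multiply:
    Omega(n**3 / sqrt(Z)).  Constant factor omitted."""
    return n**3 / math.sqrt(Z)

# Hypothetical: n = 1024, a 64 K-word cache level.
print(matmul_traffic_lower_bound(1024, 2**16))
```

In a hierarchy, one such bound applies between every adjacent pair of levels, each with its own Z; shared caches make Z effectively smaller per core.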
Bulk Synchronous Parallel Algorithms for the External Memory Model
, 2002
Abstract

Cited by 9 (2 self)
Blockwise access to data is a central theme in the design of efficient external memory (EM) algorithms. A second important issue, when more than one disk is present, is fully parallel disk I/O. In this paper we present a simple, deterministic simulation technique which transforms certain Bulk Synchronous Parallel (BSP) algorithms into efficient parallel EM algorithms. It optimizes blockwise data access and parallel disk I/O and, at the same time, utilizes multiple processors connected via a communication network or shared memory. We obtain new improved parallel EM algorithms for a large number of problems including sorting, permutation, matrix transpose, several geometric and GIS problems including three-dimensional convex hulls (two-dimensional Voronoi diagrams), and various graph problems. We show that certain parallel algorithms known for the BSP model can be used to obtain EM algorithms that meet well-known I/O complexity lower bounds for various problems, including sorting.
Towards a Scalable Parallel Object Database  The Bulk Synchronous Parallel Approach
, 1996
Abstract

Cited by 8 (2 self)
Parallel computers have been successfully deployed in many scientific and numerical application areas, although their use in non-numerical and database applications has been scarce. In this report, we first survey the architectural advancements beginning to make general-purpose parallel computing cost-effective, the requirements for non-numerical (or symbolic) applications, and the previous attempts to develop parallel databases. The central theme of the Bulk Synchronous Parallel model is to provide a high-level abstraction of parallel computing hardware whilst providing a realisation of a parallel programming model that enables architecture-independent programs to deliver scalable performance on diverse hardware platforms. Therefore, the primary objective of this report is to investigate the feasibility of developing a portable, scalable, parallel object database, based on the Bulk Synchronous Parallel model of computation. In particular, we devise a way of providing high-level abstra...
Software Issues In High-Performance Computing And A Framework For The Development Of HPC Applications
 In Computing, U. Vishkin, Ed.: ACM
, 1994
Abstract

Cited by 8 (5 self)
We identify the following key problems faced by HPC software: (1) the large gap between HPC design and implementation models in application development, (2) achieving high performance for a single application on different HPC platforms, and (3) accommodating constant changes in both problem specification and target architecture as computational methods and architectures evolve. To attack these problems, we suggest an application development methodology in which high-level architecture-independent specifications are elaborated, through an iterative refinement process which introduces architectural detail, into a form which can be translated to efficient low-level architecture-specific programming notations. A tree-structured development process permits multiple architectures to be targeted with implementation strategies appropriate to each architecture, and also provides a systematic means to accommodate changes in specification and target architecture. We describe the Pr...
Predicting the Running Times of Parallel Programs by Simulation
, 1998
Abstract

Cited by 8 (0 self)
Predicting the running time of a parallel program is useful for determining the optimal values for the parameters of the implementation and the optimal mapping of data on processors. However, deriving an explicit formula for the running time of a certain parallel program is a difficult task. We present a new method for the analysis of parallel programs: simulating the execution of parallel programs by following their control flow and by determining, for each processor, the sequence of send and receive operations according to the LogGP model. We developed two algorithms to simulate the LogGP communication between processors and we tested them on the blocked parallel version of the Gaussian Elimination algorithm on the Meiko CS-2 parallel machine. Our implementation showed that the LogGP simulation is able to detect the nonlinear behavior of the program running times, to indicate the differences in running times for different data layouts and to find the locally optimal value of the block ...
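In the LogGP model the abstract builds on, a single k-byte message costs sender overhead o, then (k−1) per-byte gaps G for the long-message extension, network latency L, and receiver overhead o (the parameter g, omitted here, governs the gap between consecutive messages, not bytes). A minimal sketch of that per-message timing, with invented parameter values:

```python
def loggp_message_time(k, L, o, G):
    """Delivery time of one k-byte message under LogGP:
    o (send overhead) + (k-1)*G (per-byte gap for long messages)
    + L (latency) + o (receive overhead).
    For k = 1 this reduces to the LogP cost o + L + o."""
    return o + (k - 1) * G + L + o

# Hypothetical parameters (microseconds): L=5, o=2, G=0.1 per byte.
print(loggp_message_time(1000, L=5, o=2, G=0.1))
```

A trace-driven simulator of the kind described would apply this formula to every send/receive pair encountered along the program's control flow, per processor.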
Modeling Performance of Parallel Programs
, 1995
Abstract

Cited by 6 (0 self)
The actual performance of parallel programs is often disappointing, especially in comparison to the peak performance offered by the underlying hardware. There are many sources of performance degradation and understanding these sources is necessary to improve application performance. In this paper we discuss performance modeling, an approach to understanding the performance of parallel systems. We present a survey of current approaches to modeling (both analytical modeling based on system parameters, and structural modeling based on the structure of the program), and propose a combination of these two approaches as a promising direction for new work. This combination is explored by evaluating and proposing improvements to lost cycles analysis, which already contains features from both approaches, and also combines measurement and modeling. (Supported by CNPq, Brazil, Grant No. 200.86293/6.)
A GeneralPurpose Model for Heterogeneous Computation
, 2000
Abstract

Cited by 5 (2 self)
Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications. Such environments consist of a diverse set of machines and offer considerably more computational power at a lower cost than a parallel computer. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. For example, faster machines should possess more data items than their slower counterparts and communication should be minimized over slow network links. Current parallel applications are not designed with such heterogeneity in mind. Thus, a new approach is necessary for designing efficient heterogeneous parallel programs. We propose
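The abstract's observation that "faster machines should possess more data items" amounts to speed-proportional data distribution. A hedged sketch of one simple way to do this (the function name and largest-remainder rounding are illustrative choices, not this paper's scheme):

```python
def distribute(n_items, speeds):
    """Split n_items across machines in proportion to their relative
    speeds, so faster machines receive proportionally more data.
    Largest-remainder rounding keeps the total exactly n_items."""
    total = sum(speeds)
    shares = [n_items * s / total for s in speeds]
    counts = [int(x) for x in shares]
    # hand leftover items to the machines with the largest remainders
    leftover = n_items - sum(counts)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Hypothetical cluster: one machine 3x faster than the other two.
print(distribute(100, [3, 1, 1]))
```

Balancing this way equalises per-machine computation time; minimising traffic over slow links is a separate placement decision on top of it.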
The Proteus System for the Development of Parallel Applications
, 1994
Abstract

Cited by 4 (2 self)
 Add to MetaCart
Target Language In our methodology we have identified a small set of specifications that comprise the abstract target language (ATL) of the refinement system. These are specifications of types such as arrays, lists, tuples, integers, characters, etc., that commonly appear in programming languages. The refinement expresses a system as a definitional extension the ATL specs. Thus by associating a modela concrete type in a specific programming languagewith each ATL specification the complete system specification is compiled. 5.2.3 Proteus to DPL Translation The translation of Proteus to DPL consists of a series of major steps: 1. Expansion of iterator expressions into image and filter expressions. 2. Conversion to dataparallel form. 3. An interpretation of sequences into the nested sequence vocabulary of DPL. 4. Addition of storage management code. 5. Conversion into C. Source Mediating Target CORESEQ SEQASARRAY ARRAY SEQ Component 1 System Component 2 CORESEQ SEQASARRAY ...