Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors
, 1999
Cited by 206 (4 self)
Devices]: Modes of Computation - Parallelism and concurrency. General Terms: Algorithms, Design, Performance, Theory. Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs. This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST 734/96E, HKUST 6076/97E, and HKU 7124/99E. Authors' addresses: Y.K. Kwok, Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong; email: ykwok@eee.hku.hk; I. Ahmad, Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and / or a fee. © 2000 ACM 0360-0300/99/1200-0406 $5.00. ACM Computing Surveys, Vol. 31, No. 4, December 1999.
A Petri Net Approach for Performance Oriented Parallel Program Design
, 1992
Cited by 35 (6 self)
Performance orientation in the development process of parallel software is motivated by outlining the misconception of current approaches, where performance activities come in at the very end of the development, mainly in terms of measurement or monitoring after the implementation phase. At that time a major part of the development work is already done, and performance pitfalls are very hard to repair, if this is possible at all. A development process for parallel programs that launches performance engineering in the early design phase is proposed, based on a Petri net specification methodology for the performance-critical parts of a parallel system. The Petri net formalism is used to define Program-Resource-Mapping net (PRM-net) models that serve as an integrated performance model of parallel processing systems, combining performance characteristics of parallel programs (P-net), parallel hardware (R-net) and the assignment of programs to hardware (Mapping) into a single performance model...
Parallelization of the Vehicle Routing Problem with Time Windows
, 2001
Cited by 24 (1 self)
Routing with time windows (VRPTW) has been an area of research that has
attracted many researchers within the last 10-15 years. In this period a number
of papers and technical reports have been published on the exact solution of the
VRPTW.
The VRPTW is a generalization of the well-known capacitated routing problem
(VRP or CVRP). In the VRP a fleet of vehicles must visit (service) a number
of customers. All vehicles start and end at the depot. For each pair of customers
or customer and depot there is a cost. The cost denotes how much it costs a
vehicle to drive from one customer to another. Every customer must be visited
exactly once. Additionally each customer demands a certain quantity of goods
delivered (known as the customer demand). For the vehicles we have an upper
limit on the amount of goods that can be carried (known as the capacity). In
the most basic case all vehicles are of the same type and hence have the same
capacity. The problem is now for a given scenario to plan routes for the vehicles
in accordance with the mentioned constraints such that the cost accumulated
on the routes, the fixed costs (how much does it cost to maintain a vehicle) or
a combination thereof is minimized.
In the more general VRPTW each customer has a time window, and between
all pairs of customers or a customer and the depot we have a travel time. The
vehicles now have to comply with the additional constraint that servicing of the
customers can only be started within the time windows of the customers. It
is legal to arrive before a time window "opens" but the vehicle must wait and
service will not start until the time window of the customer actually opens.
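The capacity and time-window rules described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the names `Customer`, `route_schedule`, the `service` duration field, and the convention that the depot has index 0 and routes start at time 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    demand: float          # quantity of goods to deliver
    ready: float           # time at which the time window opens
    due: float             # latest time at which service may start
    service: float = 0.0   # service duration (assumed; not in the abstract)


def route_schedule(route, customers, travel, capacity, depot=0):
    """Return the service start time per customer, or None if infeasible.

    route     -- customer ids in visiting order (depot excluded)
    customers -- dict mapping customer id to Customer
    travel    -- travel[i][j] is the travel time from i to j
    """
    # Capacity constraint: total demand on the route must fit in the vehicle.
    if sum(customers[c].demand for c in route) > capacity:
        return None

    time, prev, starts = 0.0, depot, []
    for c in route:
        arrival = time + travel[prev][c]
        # Arriving early is legal, but the vehicle waits until the window opens.
        start = max(arrival, customers[c].ready)
        if start > customers[c].due:
            return None        # window already closed
        starts.append(start)
        time = start + customers[c].service
        prev = c
    return starts


customers = {1: Customer(demand=4, ready=10, due=20),
             2: Customer(demand=3, ready=0, due=15)}
travel = [[0, 5, 8],
          [5, 0, 4],
          [8, 4, 0]]
print(route_schedule([1, 2], customers, travel, capacity=10))
```

In the example the vehicle reaches customer 1 at time 5 and waits until the window opens at time 10, which is the waiting behaviour the abstract describes.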
For solving the problem exactly, four general types of solution methods have
evolved in the literature: dynamic programming, Dantzig-Wolfe (column generation),
Lagrange decomposition and solving the classical model formulation
directly.
Presently the algorithms that use Dantzig-Wolfe give the best results
(Desrochers, Desrosiers and Solomon, and Kohl), but the Ph.D. thesis of Kontoravdis
shows promising results for using the classical model formulation directly.
In this Ph.D. project we have used the Dantzig-Wolfe method. In the
Dantzig-Wolfe method the problem is split into two problems: a "master problem"
and a "subproblem". The master problem is a relaxed set partitioning
problem that guarantees that each customer is visited exactly once, while the
subproblem is a shortest path problem with additional constraints (capacity and
time window). Using the master problem the reduced costs are computed for
each arc, and these costs are then used in the subproblem in order to generate
routes from the depot and back to the depot again. The best (improving) routes
are then returned to the master problem and entered into the relaxed set partitioning
problem. As the set partitioning problem is relaxed by removing the
integer constraints the solution is seldom integral; therefore the Dantzig-Wolfe
method is embedded in a separation-based solution technique.
In this Ph.D. project we have been trying to exploit structural properties in
order to speed up execution times, and we have been using parallel computers
to be able to solve problems faster or solve larger problems.
The thesis starts with a review of previous work within the field of VRPTW
both with respect to heuristic solution methods and exact (optimal) methods.
Through a series of experimental tests we seek to define and examine a number
of structural characteristics.
The first series of tests examines the use of dividing time windows as the
branching principle in the separation-based solution technique. Instead of using
the methods previously described in the literature for dividing a problem into
smaller problems we use a method developed for a variant of the VRPTW. The
results are unfortunately not positive.
Instead of dividing a problem into two smaller problems and trying to solve
these we can try to get an integer solution without having to branch. A cut is an
inequality that separates the (non-integral) optimal solution from all the integer
solutions. By finding and inserting cuts we can try to avoid branching. For the
VRPTW Kohl has developed the 2-path cuts. In the separation algorithm for
detecting 2-path cuts a number of tests are made. By structuring the order in
which we try to generate cuts we achieved very positive results.
In the Dantzig-Wolfe process a large number of columns may be generated,
but a significant fraction of the columns introduced will not be interesting with
respect to the master problem. It is a priori not possible to determine which
columns are attractive and which are not, but if a column does not become part
of the basis of the relaxed set partitioning problem we consider it to be of no
benefit for the solution process. These columns are subsequently removed from
the master problem. Experiments demonstrate a significant reduction in running
time.
Positive results were also achieved by stopping the route-generation process
prematurely in the case of time-consuming shortest path computations. Often
this leads to stopping the shortest path subroutine in cases where the information
(from the dual variables) leads to "bad" routes. The premature exit
from the shortest path subroutine restricts the generation of "bad" routes
significantly. This produces very good results and has made it possible to solve
problem instances not solved to optimality before.
The parallel algorithm is based upon the sequential Dantzig-Wolfe based
algorithm developed earlier in the project. In an initial (sequential) phase unsolved
problems are generated, and when there are enough unsolved problems
to start work on every processor the parallel solution phase is initiated. In the
parallel phase each processor runs the sequential algorithm. To distribute the
workload well, a strategy based on balancing the load between neighbouring processors is
implemented. The resulting algorithm is efficient and capable of attaining good
speedup values. The load-balancing strategy shows an even distribution of work
among the processors. Due to the large demand for using the IBM SP2 parallel
computer at UNIC it has unfortunately not been possible to run as many tests
as we would have liked. We have nevertheless managed to solve one problem not
solved before using our parallel algorithm.
Architectures for Efficient Implementation of Particle Filters
, 2004
Cited by 18 (0 self)
Particle filters are sequential Monte Carlo methods that are used in numerous problems where time-varying signals must be presented in real time and where the objective is to estimate various unknowns of the signal and/or detect events described by the signals. The standard solutions of such problems in many applications are based on the Kalman filters or extended Kalman filters. In situations when the problems are nonlinear or the noise that distorts the signals is non-Gaussian, the Kalman filters provide a solution that may be far from optimal. Particle filters are an intriguing alternative to the Kalman filters due to their excellent performance in very difficult problems including communications, signal processing, navigation, and computer vision. Hence, particle filters have been the focus of wide research recently and immense literature can be found on their theory. Most of these works recognize the complexity and computational intensity of these filters, but there has been no effort directed toward the implementation of these filters in hardware. The objective of this dissertation is to develop, design, and build efficient hardware for particle filters, and thereby bring them closer to practical applications. The fact that particle filters outperform most of the traditional filtering methods in many complex practical scenarios, coupled with the challenges related to decreasing their computational complexity and improving real-time performance, makes this work worthwhile. The main
Iterative Compilation and Performance Prediction for Numerical Applications
, 2004
Cited by 12 (9 self)
As the current rate of improvement in processor performance far exceeds the rate of memory performance, memory latency is the dominant overhead in many performance-critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits from different optimisations and there are no simple criteria to stop optimisations, i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform-independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring using a new reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of
A Scalable Tuple Space Model for Structured Parallel Programming
 In Proc. of the 1995 2nd Int’l Conf. on Programming Models for Massively Parallel Computers
, 1995
Cited by 6 (0 self)
The paper proposes and analyses a scalable model of an associative distributed shared memory for massively parallel architectures. The proposed model is hierarchical and fits the modern style of structured parallel programming. If parallel applications are composed of a set of modules with a well-defined scope of interaction, the proposed model can induce a memory access latency time that only logarithmically increases with the number of nodes. Experimental results show the effectiveness of the model with a transputer-based implementation. 1. Introduction. The lack of any globally shared resources makes massively parallel architectures intrinsically scalable. However, the need for a global space of interaction is difficult to replace for many reasons. In particular: global computational models are intrinsically simpler than local ones; even applications based on local computation models need shared resources, for instance to create a unique naming system. The above problem...
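To make the idea of an associative, hierarchical shared memory concrete, here is a minimal single-process Python sketch of a Linda-style tuple space. It is not the model from the paper: the class `TupleSpace`, the methods `out`, `rd` and `take`, the use of `None` as a wildcard, and the rule that a lookup falls back to the parent space are all assumptions made for illustration, and the operations are non-blocking for simplicity.

```python
class TupleSpace:
    """A toy associative store; each space may have a parent (assumed design)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.tuples = []

    def out(self, tup):
        """Deposit a tuple in this space."""
        self.tuples.append(tuple(tup))

    @staticmethod
    def _matches(pattern, tup):
        # None in the pattern matches any value at that position.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup))

    def rd(self, pattern):
        """Return a matching tuple without removing it, or None."""
        for tup in self.tuples:
            if self._matches(pattern, tup):
                return tup
        # Not found locally: widen the scope to the enclosing space.
        return self.parent.rd(pattern) if self.parent else None

    def take(self, pattern):
        """Remove and return a matching tuple, or None."""
        for i, tup in enumerate(self.tuples):
            if self._matches(pattern, tup):
                return self.tuples.pop(i)
        return self.parent.take(pattern) if self.parent else None


root = TupleSpace()                 # global space, e.g. for unique names
module = TupleSpace(parent=root)    # space private to one module
root.out(("name", 1))
module.out(("job", 7, "x"))
print(module.rd(("job", None, None)), module.rd(("name", None)))
```

Tuples deposited by a module stay in its own space, so only lookups that cannot be satisfied locally travel up the hierarchy; this is one plausible reading of how a well-defined scope of interaction keeps access latency low.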
A Parallel Algorithm for CompileTime Scheduling of Parallel Programs on Multiprocessors
 PACT'97
, 1997
Cited by 6 (1 self)
In this paper, we propose a parallel randomized algorithm, called Parallel Fast Assignment using Search Technique (PFAST), for scheduling parallel programs represented by directed acyclic graphs (DAGs) during compile time. The PFAST algorithm has time complexity O(e), where e is the number of edges in the DAG. This linear-time algorithm works by first generating an initial solution and then refining it using a parallel random search. Using a prototype computer-aided parallelization and scheduling tool called CASCH, the algorithm is found to outperform numerous previous algorithms while taking dramatically smaller execution times. The distinctive feature of this research is that, instead of simulations, our proposed algorithm is evaluated and compared with other algorithms using the CASCH tool with real applications running on the Intel Paragon. The PFAST algorithm is also evaluated with randomly generated DAGs for which optimal schedules are known. The algorithm generated optimal solutions for a majority of the test cases and close-to-optimal solutions for the others. The proposed algorithm is the fastest scheduling algorithm known to us and is an attractive choice for scheduling under running time constraints.
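The pattern of building an initial schedule and then refining it by random search can be sketched in Python as follows. This is not PFAST itself: it is a plain sequential hill-climbing search, and the function names `makespan` and `random_search`, the data layout, and the move "reassign one random task to a random processor" are assumptions chosen for illustration.

```python
import random


def makespan(order, proc_of, weight, comm, preds):
    """Schedule length when tasks run in topological `order` on their processors."""
    finish, proc_free = {}, {}
    for t in order:
        p = proc_of[t]
        ready = 0
        for u in preds[t]:
            # Communication cost is paid only across different processors.
            delay = 0 if proc_of[u] == p else comm[(u, t)]
            ready = max(ready, finish[u] + delay)
        start = max(ready, proc_free.get(p, 0))
        finish[t] = start + weight[t]
        proc_free[p] = finish[t]
    return max(finish.values())


def random_search(order, proc_of, weight, comm, preds, n_procs, steps=200, seed=0):
    """Refine an initial assignment by accepting only improving random moves."""
    rng = random.Random(seed)
    best = dict(proc_of)
    best_len = makespan(order, best, weight, comm, preds)
    for _ in range(steps):
        cand = dict(best)
        cand[rng.choice(order)] = rng.randrange(n_procs)
        cand_len = makespan(order, cand, weight, comm, preds)
        if cand_len < best_len:
            best, best_len = cand, cand_len
    return best, best_len


# Fork-join DAG: a -> b, a -> c, b -> d, c -> d.
order = ["a", "b", "c", "d"]
weight = {"a": 2, "b": 3, "c": 3, "d": 2}
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
initial = {t: 0 for t in order}      # initial solution: everything on processor 0
best, best_len = random_search(order, initial, weight, comm, preds, n_procs=2)
print(makespan(order, initial, weight, comm, preds), best_len)
```

Each evaluation of `makespan` visits every task and every edge once, so a single search step costs time linear in the size of the DAG.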
Emulation of a Virtual Shared Memory Architecture
 Department of Computer Science, University of Bristol, Bristol
, 1993
Cited by 5 (0 self)
In designing a multiprocessor architecture, the motivating factors are that the architecture should be general purpose, easier to program and at the same time scalable. The Data Diffusion Machine (DDM) seeks to fulfil such criteria. The DDM provides shared-data access on distributed memory hardware, allowing data to freely migrate to processors on demand. The DDM concept was originally proposed in terms of a hierarchy of buses, but has since been elaborated for different interconnects. This thesis presents a link-based realisation of the architecture and a link-based coherence protocol which is central in maintaining coherence of data. The link-based protocol exploits the combining properties of the DDM network to minimise traffic in the DDM hierarchy. The protocol also contains efficient and general support for synchronisation. To evaluate the design and performance of new architectures, trace-driven simulation is often used. This thesis presents a novel prototyping and performance ev...
Evaluating Parallel Algorithms: Theoretical and Practical Aspects
, 1990
Cited by 5 (4 self)
The motivation for the work reported in this thesis has been to lessen the gap between theory and practice within the field of parallel computing. When looking for new and faster parallel algorithms for use in massively parallel systems, it is tempting to investigate promising alternatives from the large body of research done on parallel algorithms within the field of theoretical computer science. These algorithms are mainly described for the PRAM (Parallel Random Access Machine) model of computation. This thesis proposes a method for evaluating the practical value of PRAM algorithms. The approach is based on implementing PRAM algorithms for execution on a CREW (Concurrent Read Exclusive Write) PRAM simulator. Measurement and analysis of implemented algorithms on finite problems provide new and more practically oriented results than those traditionally obtained by asymptotic analysis (O-notation). The evaluation method is demonstrated by investigating the practical value of a new and important parallel sorting algorithm from theoretical
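As a rough illustration of what running an algorithm on a CREW PRAM simulator involves, the following Python sketch executes synchronous steps in which all processors read a shared snapshot and write conflicts are rejected. It is not the simulator from the thesis: the functions `crew_step` and `prefix_sums` are hypothetical, and a parallel prefix sum is used in place of the sorting algorithm the abstract mentions, because it is short enough to check by hand.

```python
def crew_step(memory, programs):
    """Run one synchronous step: concurrent reads, exclusive writes.

    programs -- one callable per processor; each receives a read-only
                snapshot of memory and returns (address, value) or None.
    """
    snapshot = tuple(memory)          # every processor reads the same state
    writes = {}
    for prog in programs:
        result = prog(snapshot)
        if result is None:
            continue
        addr, value = result
        if addr in writes:            # two processors wrote the same cell
            raise RuntimeError(f"write conflict at cell {addr}")
        writes[addr] = value
    for addr, value in writes.items():
        memory[addr] = value


def prefix_sums(values):
    """Inclusive prefix sums with one processor per cell; returns (result, steps)."""
    memory = list(values)
    n, stride, steps = len(memory), 1, 0
    while stride < n:
        programs = [
            (lambda snap, i=i, s=stride:
                (i, snap[i] + snap[i - s]) if i >= s else None)
            for i in range(n)
        ]
        crew_step(memory, programs)
        stride *= 2
        steps += 1
    return memory, steps


print(prefix_sums([1, 2, 3, 4, 5]))
```

Counting the steps on concrete inputs, rather than quoting only the asymptotic bound, is the kind of finite-problem measurement the abstract argues for.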