Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors
, 1999
Cited by 142 (4 self)
Devices]: Modes of Computation---Parallelism and concurrency
General Terms: Algorithms, Design, Performance, Theory
Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs
This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST 734/96E, HKUST 6076/97E, and HKU 7124/99E.
Authors' addresses: Y.-K. Kwok, Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong; email: ykwok@eee.hku.hk; I. Ahmad, Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.
ACM Computing Surveys, Vol. 31, No. 4, December 1999
A Petri Net Approach for Performance Oriented Parallel Program Design
, 1992
Cited by 35 (6 self)
Performance orientation in the development process of parallel software is motivated by outlining the misconception of current approaches, where performance activities come in at the very end of the development, mainly in terms of measurement or monitoring after the implementation phase. At that time a major part of the development work is already done, and performance pitfalls are very hard to repair, if this is possible at all. A development process for parallel programs that launches performance engineering in the early design phase is proposed, based on a Petri net specification methodology for the performance-critical parts of a parallel system. The Petri net formalism is used to define Program Resource Mapping-net (PRM-net) models that serve as an integrated performance model of parallel processing systems, combining performance characteristics of parallel programs (P-net), parallel hardware (R-net) and the assignment of programs to hardware (Mapping) into a single performance model...
Parallelization of the Vehicle Routing Problem with Time Windows
, 2001
Cited by 23 (1 self)
Routing with time windows (VRPTW) has been an area of research that has
attracted many researchers within the last 10-15 years. In this period a number
of papers and technical reports have been published on the exact solution of the
VRPTW.
The VRPTW is a generalization of the well-known capacitated routing problem
(VRP or CVRP). In the VRP a fleet of vehicles must visit (service) a number
of customers. All vehicles start and end at the depot. For each pair of customers
or customer and depot there is a cost. The cost denotes how much it costs a
vehicle to drive from one customer to another. Every customer must be visited
exactly once. Additionally each customer demands a certain quantity of goods
delivered (known as the customer demand). For the vehicles we have an upper
limit on the amount of goods that can be carried (known as the capacity). In
the most basic case all vehicles are of the same type and hence have the same
capacity. The problem is now for a given scenario to plan routes for the vehicles
in accordance with the mentioned constraints such that the cost accumulated
on the routes, the fixed costs (how much it costs to maintain a vehicle) or
a combination thereof is minimized.
In the more general VRPTW each customer has a time window, and between
all pairs of customers or a customer and the depot we have a travel time. The
vehicles now have to comply with the additional constraint that servicing of the
customers can only be started within the time windows of the customers. It
is legal to arrive before a time window "opens" but the vehicle must wait and
service will not start until the time window of the customer actually opens.
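
The waiting rule just described can be illustrated with a minimal Python sketch. The function name `service_start_times`, the data layout, and the sample numbers are illustrative assumptions, not taken from the thesis; service durations are assumed to be folded into the travel times.

```python
from typing import List, Optional, Tuple


def service_start_times(
    travel: List[float],
    windows: List[Tuple[float, float]],
    depart: float = 0.0,
) -> Optional[List[float]]:
    """Return the service start time at each customer, or None if infeasible.

    travel[i]  : travel time from the previous stop to customer i
    windows[i] : (earliest, latest) time at which service may start
    """
    t = depart
    starts = []
    for leg, (earliest, latest) in zip(travel, windows):
        t += leg               # arrival time at the customer
        t = max(t, earliest)   # early arrival: wait until the window opens
        if t > latest:         # window already closed: route is infeasible
            return None
        starts.append(t)
    return starts


# The vehicle arrives at time 5, waits until 10, then reaches the next customer at 13.
print(service_start_times([5, 3], [(10, 20), (0, 14)]))
```

A route is rejected as soon as one customer's window has closed on arrival, which is the constraint the shortest path subproblem has to respect.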
To solve the problem exactly, four general types of solution methods have
evolved in the literature: dynamic programming, Dantzig-Wolfe (column generation),
Lagrange decomposition and solving the classical model formulation
directly.
Presently the algorithms that use Dantzig-Wolfe give the best results
(Desrochers, Desrosiers and Solomon, and Kohl), but the Ph.D. thesis of Kontoravdis
shows promising results for using the classical model formulation directly.
In this Ph.D. project we have used the Dantzig-Wolfe method. In the
Dantzig-Wolfe method the problem is split into two problems: a "master problem"
and a "subproblem". The master problem is a relaxed set partitioning
problem that guarantees that each customer is visited exactly once, while the
subproblem is a shortest path problem with additional constraints (capacity and
time window). Using the master problem the reduced costs are computed for
each arc, and these costs are then used in the subproblem in order to generate
routes from the depot and back to the depot again. The best (improving) routes
are then returned to the master problem and entered into the relaxed set partitioning
problem. As the set partitioning problem is relaxed by removing the
integer constraints, the solution is seldom integral; therefore the Dantzig-Wolfe
method is embedded in a separation-based solution technique.
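
As a rough illustration of how reduced costs link the master problem to the subproblem, the following Python sketch prices the arcs and one candidate route. It assumes a common convention in which the dual value of customer i is subtracted from every arc leaving i and the depot (node 0) has dual 0; the function names and the numbers are hypothetical and not taken from the thesis.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, int]


def arc_reduced_costs(cost: Dict[Arc, float], duals: Dict[int, float]) -> Dict[Arc, float]:
    """Reduced cost of arc (i, j): original cost minus the dual of its tail node i."""
    return {(i, j): c - duals.get(i, 0.0) for (i, j), c in cost.items()}


def route_reduced_cost(route: List[int], reduced: Dict[Arc, float]) -> float:
    """Sum the reduced arc costs along a route given as a node sequence."""
    return sum(reduced[(a, b)] for a, b in zip(route, route[1:]))


# Depot is node 0; customers are 1 and 2 (illustrative data).
cost = {(0, 1): 4.0, (1, 2): 3.0, (2, 0): 5.0}
duals = {1: 6.0, 2: 7.0}   # dual values from the relaxed master problem

reduced = arc_reduced_costs(cost, duals)
rc = route_reduced_cost([0, 1, 2, 0], reduced)
print(rc)  # a negative value marks an improving route (column)
```

Under this convention the route's reduced cost equals its cost minus the duals of the customers it visits, so a shortest path over the reduced arc costs finds the most promising route to return to the master problem.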
In this Ph.D. project we have been trying to exploit structural properties in
order to speed up execution times, and we have been using parallel computers
to be able to solve problems faster or solve larger problems.
The thesis starts with a review of previous work within the field of VRPTW
both with respect to heuristic solution methods and exact (optimal) methods.
Through a series of experimental tests we seek to define and examine a number
of structural characteristics.
The first series of tests examines the use of dividing time windows as the
branching principle in the separation-based solution technique. Instead of using
the methods previously described in the literature for dividing a problem into
smaller problems we use a method developed for a variant of the VRPTW. The
results are unfortunately not positive.
Instead of dividing a problem into two smaller problems and trying to solve
these we can try to get an integer solution without having to branch. A cut is an
inequality that separates the (non-integral) optimal solution from all the integer
solutions. By finding and inserting cuts we can try to avoid branching. For the
VRPTW Kohl has developed the 2-path cuts. In the separation algorithm for
detecting 2-path cuts a number of tests are made. By structuring the order in
which we try to generate cuts we achieved very positive results.
In the Dantzig-Wolfe process a large number of columns may be generated,
but a significant fraction of the columns introduced will not be interesting with
respect to the master problem. It is a priori not possible to determine which
columns are attractive and which are not, but if a column does not become part
of the basis of the relaxed set partitioning problem we consider it to be of no
benefit for the solution process. These columns are subsequently removed from
the master problem. Experiments demonstrate a significant reduction of the running
time.
Positive results were also achieved by stopping the route-generation process
prematurely in the case of time-consuming shortest path computations. Often
this leads to stopping the shortest path subroutine in cases where the information
(from the dual variables) leads to "bad" routes. The premature exit
from the shortest path subroutine restricts the generation of "bad" routes significantly.
This produces very good results and has made it possible to solve
problem instances not solved to optimality before.
The parallel algorithm is based upon the sequential Dantzig-Wolfe based
algorithm developed earlier in the project. In an initial (sequential) phase unsolved
problems are generated, and when there are enough unsolved problems
to start work on every processor the parallel solution phase is initiated. In the
parallel phase each processor runs the sequential algorithm. To get a good workload
distribution a strategy based on balancing the load between neighbouring processors is
implemented. The resulting algorithm is efficient and capable of attaining good
speedup values. The load-balancing strategy shows an even distribution of work
among the processors. Due to the large demand for using the IBM SP2 parallel
computer at UNIC it has unfortunately not been possible to run as many tests
as we would have liked. We have nevertheless managed to solve one problem not
solved before using our parallel algorithm.
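
The idea of balancing load between neighbouring processors can be sketched as follows. The abstract does not state the topology or the transfer rule, so the ring layout, the "move half of the difference" rule, and the name `balance_step` are all assumptions made for illustration; the simulation is sequential and only models the number of unsolved subproblems per processor.

```python
from typing import List


def balance_step(load: List[int]) -> List[int]:
    """One sweep of neighbour balancing on a ring (assumed topology).

    Each processor p compares its queue with its right-hand neighbour and
    the more loaded of the two hands over half of the difference.
    """
    n = len(load)
    new = list(load)
    for p in range(n):
        q = (p + 1) % n              # right-hand neighbour of processor p
        diff = new[p] - new[q]
        move = abs(diff) // 2        # number of subproblems to transfer
        if diff > 0:
            new[p] -= move
            new[q] += move
        else:
            new[p] += move
            new[q] -= move
    return new


# All unsolved subproblems start on processor 0 (illustrative numbers).
after = balance_step([8, 0, 0, 0])
print(after)
```

No subproblem is created or lost by a sweep, and repeated sweeps spread the work more evenly across the ring.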
Architectures for Efficient Implementation of Particle Filters
, 2004
Cited by 13 (0 self)
Particle filters are sequential Monte Carlo methods that are used in numerous problems where time-varying signals must be presented in real time and where the objective is to estimate various unknowns of the signal and/or detect events described by the signals. The standard solutions of such problems in many applications are based on the Kalman filters or extended Kalman filters. In situations when the problems are nonlinear or the noise that distorts the signals is non-Gaussian, the Kalman filters provide a solution that may be far from optimal. Particle filters are an intriguing alternative to the Kalman filters due to their excellent performance in very difficult problems including communications, signal processing, navigation, and computer vision. Hence, particle filters have been the focus of wide research recently and immense literature can be found on their theory. Most of these works recognize the complexity and computational intensity of these filters, but there has been no effort directed toward the implementation of these filters in hardware. The objective of this dissertation is to develop, design, and build efficient hardware for particle filters, and thereby bring them closer to practical applications. The fact that particle filters outperform most of the traditional filtering methods in many complex practical scenarios, coupled with the challenges related to decreasing their computational complexity and improving real-time performance, makes this work worthwhile. The main
Iterative Compilation and Performance Prediction for Numerical Applications
, 2004
Cited by 7 (5 self)
As the current rate of improvement in processor performance far exceeds that of memory performance, memory latency is the dominant overhead in many performance-critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefits of different optimisations, and there are no simple criteria for when to stop optimising, i.e. when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform-independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring, using a new, reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space, by means of
A Scalable Tuple Space Model for Structured Parallel Programming
- In Proc. of the 1995 2nd Int’l Conf. on Programming Models for Massively Parallel Computers
, 1995
Cited by 6 (0 self)
The paper proposes and analyses a scalable model of an associative distributed shared memory for massively parallel architectures. The proposed model is hierarchical and fits the modern style of structured parallel programming. If parallel applications are composed of a set of modules with a well-defined scope of interaction, the proposed model can induce a memory access latency that increases only logarithmically with the number of nodes. Experimental results show the effectiveness of the model with a transputer-based implementation.
1. Introduction
The lack of any globally shared resources makes massively parallel architectures intrinsically scalable. However, a global space of interaction is difficult to do without, for many reasons. In particular:
- global computational models are intrinsically simpler than local ones;
- even applications based on local computation models need shared resources, for instance to create a unique naming system.
The above problem...
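
To make the notion of an associative shared memory concrete, here is a minimal, single-process Python sketch of a Linda-style tuple space, in which tuples are retrieved by content rather than by address. It is a generic illustration and not the hierarchical model of the paper: the class `TupleSpace`, its method names, and the use of `None` as a wildcard are assumptions, and the operations are non-blocking for simplicity.

```python
from typing import List, Optional


class TupleSpace:
    """A flat, non-distributed tuple space with pattern matching."""

    def __init__(self) -> None:
        self._tuples: List[tuple] = []

    def out(self, tup: tuple) -> None:
        """Deposit a tuple into the space."""
        self._tuples.append(tup)

    @staticmethod
    def _matches(pattern: tuple, tup: tuple) -> bool:
        # None in the pattern acts as a wildcard for that field.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def rd(self, pattern: tuple) -> Optional[tuple]:
        """Return the first matching tuple without removing it, or None."""
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def take(self, pattern: tuple) -> Optional[tuple]:
        """Return and remove the first matching tuple, or None."""
        found = self.rd(pattern)
        if found is not None:
            self._tuples.remove(found)
        return found


space = TupleSpace()
space.out(("task", 1, "sort"))
space.out(("task", 2, "merge"))
print(space.rd(("task", None, "merge")))   # look up by content
print(space.take(("task", 1, None)))       # consume a tuple
```

In a scalable design the lookup would be confined to the part of the hierarchy that a module actually interacts with, instead of scanning one global list as this sketch does.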
Interconnection Networks for Parallel Computers
, 1998
Cited by 4 (4 self)
ded into SIMD and MIMD machines. In Single-Instruction-Stream-Multiple-Data-Stream (SIMD) parallel computers (4), each processor executes the same instruction stream, which is distributed to each processor from a single control unit. All processors operate synchronously and will also generate messages to be transferred over the network synchronously. Thus, the network in SIMD machines has to support synchronous data transfers. In a Multiple-Instruction-Stream-Multiple-Data-Stream (MIMD) parallel computer (5), all processors operate asynchronously on their own instruction stream. The network in MIMD machines therefore has to support asynchronous data transfers. The interconnection network is an essential part of any parallel computer. Only if fast and reliable communication over the network is guaranteed will the parallel system exhibit high performance. Many different interconnection networks for parallel computers were pro
Emulation of a Virtual Shared Memory Architecture
- Department of Computer Science, University of Bristol, Bristol
, 1993
Cited by 4 (0 self)
In designing a multiprocessor architecture, the motivating factors are that the architecture should be general purpose, easier to program and at the same time scalable. The Data Diffusion Machine (DDM) seeks to fulfil such criteria. The DDM provides shared-data access on distributed memory hardware, allowing data to freely migrate to processors on demand. The DDM concept was originally proposed in terms of a hierarchy of buses, but has since been elaborated for different interconnects. This thesis presents a link-based realisation of the architecture and a link-based coherence protocol which is central in maintaining coherence of data. The link-based protocol exploits the combining properties of the DDM network to minimise traffic in the DDM hierarchy. The protocol also contains efficient and general support for synchronisation. To evaluate the design and performance of new architectures, trace-driven simulation is often used. This thesis presents a novel prototyping and performance ev...
Interconnection Topologies and Routing for Parallel Processing Systems
, 1992
Cited by 4 (0 self)
The major aim of this work is to give a comparative survey of static interconnection topologies, and to discuss their properties with respect to their use as interconnection topologies in message-passing multi-computer systems, i.e. each processing element has its own local memory, there is no common memory, and the processing elements communicate via message passing. To this end it was necessary to recall relevant measures on graphs from graph theory, for example the average distance or the network diameter, and requirements from the parallel processing area, such as reliability or extensibility. Special emphasis has been given to presenting the construction rules for various graphs, because these seemed -- along with the network characteristics -- most relevant for interconnecting processing elements in reconfigurable multi-computer systems. Critical to applications in this kind of parallel system is the possibility of exchanging local data among cooperating processing element...
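
The two graph measures named in this abstract, network diameter and average distance, can be computed for a given topology with a breadth-first search from every node. The sketch below is a generic illustration, not code from the surveyed work: the helper names are hypothetical, the hypercube is chosen only as a familiar example, and the graph is assumed to be connected.

```python
from collections import deque
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]


def hypercube(d: int) -> Graph:
    """d-dimensional hypercube: nodes are neighbours if they differ in one bit."""
    return {v: [v ^ (1 << b) for b in range(d)] for v in range(2 ** d)}


def bfs_distances(adj: Graph, src: int) -> Dict[int, int]:
    """Hop counts from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def diameter_and_average_distance(adj: Graph) -> Tuple[int, float]:
    """Largest and mean shortest-path length over all ordered pairs of distinct nodes."""
    diameter, total, pairs = 0, 0, 0
    for s in adj:
        dist = bfs_distances(adj, s)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1       # exclude the pair (s, s)
    return diameter, total / pairs


diam, avg = diameter_and_average_distance(hypercube(3))
print(diam, avg)
```

The same two functions apply unchanged to rings, meshes, or any other static topology expressed as an adjacency list, which makes side-by-side comparison straightforward.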

