A Resource Query Interface for Network-Aware Applications
Cluster Computing, 1999
Cited by 55 (15 self)
Development of portable network-aware applications demands an interface to the network that allows an application to obtain information about its execution environment. This paper motivates and describes the design of Remos, an API that allows network-aware applications to obtain relevant information. The major challenges in defining a uniform interface are network heterogeneity, diversity in traffic requirements, variability of the information, and resource sharing in the network. Remos addresses these issues with two abstraction levels, explicit management of resource sharing, and statistical measurements. The flows abstraction captures the communication between nodes, and the topologies abstraction provides a logical view of network connectivity. Remos measurements are made at network level, and therefore information to manage sharing of resources is available. Remos is designed to deliver best effort information to applications, and it explicitly adds statistical reliability and va...
ReMoS: A Resource Monitoring System for Network-Aware Applications
1997
Cited by 41 (8 self)
Development of portable network-aware applications demands an interface to the network that allows an application to obtain information about its execution environment. This paper motivates and describes the design of Remos, an API that allows network-aware applications to obtain relevant information. The major challenges in defining a uniform interface are network heterogeneity, diversity in traffic requirements, variability of the information, and resource sharing in the network. Remos addresses these issues with two abstraction levels, explicit management of resource sharing, and statistical measurements. The flows abstraction captures the communication between nodes, and the topologies abstraction provides a logical view of network connectivity. Remos measurements are made at network level, and therefore information to manage sharing of resources is available. Remos is designed to deliver best effort information to applications, and it explicitly adds statistical reliability and variability measures to the core information. The paper also presents preliminary results and experience with a prototype Remos implementation for a high speed IP-based network testbed.
Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations
Journal of Parallel and Distributed Computing, 2000
Cited by 21 (0 self)
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The parameters of the performance for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of o...
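The distinction between latency and throughput drawn in this abstract can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the paper: each stage of the chain has a fixed per-data-set processing time, stages run concurrently on disjoint processors, and communication cost is ignored. The function name `pipeline_metrics` and the stage times are hypothetical.

```python
def pipeline_metrics(stage_times):
    """Return (latency, throughput) for a linear pipeline.

    Simplified model (an assumption, not the paper's algorithm):
    - latency is the time for one data set to pass through every stage,
      i.e. the sum of the stage times;
    - throughput is limited by the slowest stage, i.e. 1 / max stage time.
    """
    if not stage_times or any(t <= 0 for t in stage_times):
        raise ValueError("stage_times must be a non-empty list of positive numbers")
    latency = sum(stage_times)
    throughput = 1.0 / max(stage_times)
    return latency, throughput


# Hypothetical three-stage chain with per-data-set times in seconds.
latency, throughput = pipeline_metrics([2.0, 4.0, 1.0])
print(f"latency = {latency} s, throughput = {throughput} data sets/s")
```

In this model, giving more processors to the slowest stage raises throughput, while latency depends on the time of every stage, which is why the two criteria can pull a mapping in different directions.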
Parallel Breadth-First BDD Construction
In Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1997
Cited by 14 (2 self)
With the increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel algorithm for BDD construction targeted at shared memory multiprocessors and distributed shared memory systems. This algorithm focuses on improving memory access locality through specialized memory managers and partial breadth-first expansion, and on improving processor utilization through dynamic load balancing. The results on a shared memory system show speedups of over two on four processors and speedups of up to four on eight processors. The measured results clearly identify the main source of bottlenecks and point out some interesting directions for further improvements.
CPR: Mixed Task and Data Parallel Scheduling for Distributed Systems
In Proceedings of the 15th International Parallel and Distributed Symposium, 2001
Cited by 14 (5 self)
It is well known that mixing task and data parallelism to solve large computational applications often yields better speedups than applying either pure task parallelism or pure data parallelism. Typically, the applications are modeled in terms of a dependence graph of coarse-grain data-parallel tasks, called a data-parallel task graph. In this paper we present a new compile-time heuristic, named Critical Path Reduction (CPR), for scheduling data-parallel task graphs. Experimental results based on graphs derived from real problems as well as synthetic graphs show that CPR achieves higher speedup than other well-known existing scheduling algorithms, at the expense of some higher cost. These results are also confirmed by performance measurements of two real applications (i.e., complex matrix multiplication and Strassen matrix multiplication) running on a cluster of workstations.
A Coordination Language for Mixed Task and Data Parallel Programs
In Proceedings of 3rd Annual ACM Symposium on Applied Computing (SAC'99), 1999
Cited by 8 (0 self)
We present a coordination model to derive efficient implementations using mixed task and data parallelism. The model provides a specification language in which the programmer defines the available degree of parallelism and a coordination language in which the programmer determines how the potential parallelism is exploited for a specific implementation. Specification programs depend only on the algorithm whereas coordination programs may be different for different target machines in order to obtain the best performance. The transformation of a specification program into a coordination program is performed in well-defined steps where each step selects a specific implementation detail. Therefore, the transformation can be automated, thus guaranteeing a correct target program. We demonstrate the usefulness of the model by applying it to solution methods for differential equations.
A Data and Task Parallel Image Processing Environment
Parallel Computing, 2001
Cited by 7 (1 self)
The paper presents a data and task parallel environment for parallelizing low-level image processing applications on distributed memory systems. Image processing operators are parallelized by data decomposition using algorithmic skeletons. At the application level we use task decomposition, based on the Image Application Task Graph.
COLT_HPF, a Run-Time Support for the High-Level Coordination of HPF Tasks
Concurrency: Practice and Experience, 1999
Cited by 7 (3 self)
ions (SDAs), using a syntax similar to that of HPF. Each instance of an SDA encapsulates distributed data and methods, where methods have exclusive access to encapsulated data. Data parallel tasks are thus started by creating instances of specific SDAs, while the inter-task co-operation takes place by means of remote synchronous (or asynchronous) method invocations. Note that SDA instances are started dynamically by a so called coordination task, so that the run-time that implements inter-task communication has to control passing distributed data structures from one task to another, including any possible remapping that might be needed. The run-time accomplishes this through a handshaking protocol, which exchanges the distribution information about the actual argument (on the caller SDA) and the formal one (on the callee SDA) of a given method. Note that this handshaking protocol is very similar to the COLT HPF protocol to create a channel between two tasks. Finally, even though in th...
Exploiting Processor Groups to Extend Scalability of the GA Shared Memory Programming Model
In Proceedings of ACM Computing Frontiers, 2005
Cited by 5 (2 self)
Exploiting processor groups is becoming increasingly important for programming next-generation high-end systems composed of tens or hundreds of thousands of processors. This paper discusses the requirements, functionality and development of multilevel parallelism based on processor groups in the context of the Global Array (GA) shared memory programming model. The main effort involves management of shared data, rather than interprocessor communication. Experimental results for the NAS NPB Conjugate Gradient (CG) benchmark and a molecular dynamics (MD) application are presented for a Linux cluster with Myrinet and illustrate the value of the proposed approach for improving scalability. While the original GA version of the CG benchmark lagged MPI, the processor-group version outperforms MPI in all cases, except for a few points on the smallest problem size. Similarly, processor groups were very effective in improving scalability of the MD application.
Library Support for Hierarchical Multi-Processor Tasks
In Proc. of the Supercomputing 2002, 2002
Cited by 5 (2 self)
The paper considers the modular programming with hierarchically structured multi-processor tasks on top of SPMD tasks for distributed memory machines. The parallel execution requires a corresponding decomposition of the set of processors into a hierarchical group structure onto which the tasks are mapped. This results in a multi-level group SPMD computation model with varying processor group structures. The advantage of this kind of mixed task and data parallelism is a potential to reduce the communication overhead and to increase scalability. We present a runtime library to support the coordination of hierarchically structured multi-processor tasks. The library exploits an extended parallel group SPMD programming model and manages the entire task execution including the dynamic hierarchy of processor groups. The library is built on top of MPI, has an easy-to-use interface, and leads to only a marginal overhead while allowing static planning and dynamic restructuring.
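The hierarchical decomposition of a processor set described in this abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: it uses plain Python lists of processor ranks instead of MPI communicators, and the helper `split_group` is hypothetical and is not part of the library the paper presents.

```python
def split_group(ranks, parts):
    """Split a list of processor ranks into `parts` contiguous subgroups.

    Sizes differ by at most one; earlier subgroups take any leftover ranks.
    """
    if parts <= 0 or parts > len(ranks):
        raise ValueError("parts must be between 1 and the number of ranks")
    base, extra = divmod(len(ranks), parts)
    groups, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        groups.append(ranks[start:start + size])
        start += size
    return groups


# Two-level hierarchy over eight hypothetical processors:
# the top-level group is split in two, and each half is split again.
top = list(range(8))
level1 = split_group(top, 2)
level2 = [split_group(group, 2) for group in level1]
print(level1)
print(level2)
```

Each subgroup would run one multi-processor task in SPMD style, so collective communication inside a task involves only that subgroup's ranks.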

