MULTIPROCESSOR SCHEDULING TO ACCOUNT FOR INTERPROCESSOR COMMUNICATION
1991
Cited by 64 (11 self)
Interprocessor communication (IPC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essential for attaining efficient hardware utilization. This thesis introduces two new compile-time heuristics for scheduling precedence graphs onto multiprocessor architectures, which account for interprocessor communication overheads and interconnection constraints in the architecture. These algorithms perform scheduling and routing simultaneously to account for irregular interprocessor interconnections, and schedule all communications as well as all computations to eliminate shared resource contention. The first technique, called dynamic-level scheduling, modifies the classical HLFET list scheduling strategy to account for IPC and synchronization overheads. By using dynamically changing priorities to match nodes and processors at each step, this technique attains an equitable tradeoff between load balancing and interprocessor communication cost. This method is fast, flexible, widely targetable, and displays promising performance. The second technique, called declustering, establishes a parallelism hierarchy upon the precedence graph using graph-analysis techniques which explicitly address the tradeoff between exploiting parallelism and incurring communication cost. By systematically decomposing this hierarchy, the declustering process exposes parallelism instances in order of importance, assuring efficient use of the available processing resources. In contrast with traditional clustering schemes, this technique can adjust the level of cluster granularity to suit the characteristics of the specified architecture, leading to a more effective solution.
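The idea of matching nodes and processors with dynamically changing priorities can be illustrated with a small list scheduler. The following Python sketch is a simplified illustration, not the thesis's algorithm: it assumes identical processors, assumes a fixed communication cost per edge that is paid only when the two endpoints run on different processors, and ignores routing and link contention, which the abstract says the real heuristics handle. The priority used here (static level minus earliest possible start time) and all function names are assumptions made for the example.

```python
from collections import defaultdict


def static_levels(exec_time, succs):
    """Longest path, in summed execution time, from each node to an exit node."""
    memo = {}

    def level(n):
        if n not in memo:
            memo[n] = exec_time[n] + max((level(s) for s in succs[n]), default=0)
        return memo[n]

    return {n: level(n) for n in exec_time}


def dynamic_level_schedule(exec_time, edges, n_procs):
    """Greedy list scheduling with a priority recomputed at every step.

    exec_time: node -> execution time.
    edges: (u, v) -> communication cost (assumed zero if u and v share a processor).
    Returns node -> (processor, start, finish).
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for (u, v) in edges:
        succs[u].append(v)
        preds[v].append(u)
    sl = static_levels(exec_time, succs)
    proc_free = [0] * n_procs
    placed = {}
    while len(placed) < len(exec_time):
        ready = [n for n in exec_time
                 if n not in placed and all(p in placed for p in preds[n])]
        best = None
        for n in ready:
            for p in range(n_procs):
                # Time at which all inputs of n are available on processor p.
                data_ready = max(
                    (placed[u][2] + (0 if placed[u][0] == p else edges[(u, n)])
                     for u in preds[n]),
                    default=0,
                )
                start = max(data_ready, proc_free[p])
                score = sl[n] - start  # assumed priority: static level minus start time
                if best is None or score > best[0]:
                    best = (score, n, p, start)
        _, n, p, start = best
        placed[n] = (p, start, start + exec_time[n])
        proc_free[p] = placed[n][2]
    return placed


# Hypothetical fork-join graph: A feeds B and C, which both feed D.
times = {"A": 2, "B": 3, "C": 3, "D": 2}
cheap = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 1}
schedule = dynamic_level_schedule(times, cheap, n_procs=2)
```

With cheap communication the sketch spreads B and C over both processors; if every edge cost is raised to 10, it keeps the whole graph on one processor, which is the load-balancing versus communication tradeoff the abstract refers to.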
PARSE: Simulation of Message Passing Communication Networks
In Proceedings of the 27th Annual Simulation Symposium, 1994
Cited by 6 (1 self)
The number of design decisions for communication network hardware in message-passing distributed-memory systems is quite large, as illustrated by the many different implemented and proposed designs. Many of the decisions are driven, on one side, by the performance requirements of the applications targeted by the parallel system and, on the other side, by hardware implementation costs. To obtain cost-effective communication hardware matching a certain application domain, use of a parallel architecture simulation framework is inevitable. The simulator presented here, named PARSE, offers a base for such a framework. It can accurately simulate a wide range of communication architectures and, unlike analytic communication models, properly models all performance aspects. The parameters of simulated architectures are easy to change, and the object-oriented implementation of the simulator facilitates future extensions and enhancements.
1 Introduction
Given the many different communication hardware de...
Evaluating Virtual Channels for Cache-Coherent Shared-Memory Multiprocessors
Proc. 10th ACM Intl. Conf. on Supercomputing, 1996
Cited by 5 (1 self)
In this paper, the performance of a wormhole-routed 2-D torus network with virtual channels is evaluated for cache-coherent shared-memory multiprocessors using execution-driven simulation. The traffic in such systems is very different from the traffic in a message-passing environment. We show the impact of the number of virtual channels, the number of flit buffers per virtual channel, and the number of internal links. The study shows that 4 virtual channels per link is most efficient for 2-D torus networks. The number of flit buffers per virtual channel has a considerable impact, and 2 to 4 flit buffers are usually enough. The number of internal links affects performance for applications, such as MP3D, that generate heavy contention for shared variables.
1 Introduction
Large-scale shared-memory multiprocessors are difficult to design but they provide a unified view of the memory for easy programming. These systems are built using processor-memory nodes that are connected through an interconnection network...
Systolic Combining Switch Designs
1994
Cited by 4 (0 self)
Contents
1 Introduction
1.1 Contributions
1.1.1 Performance analysis of different switch types
1.1.2 An efficient CMOS implementation of systolic queues
1.1.3 Cost and performance of an implemented combining switch
1.1.4 Methods for providing greater combining capability
1.2 Related research
1.2.1 Interconnection network topology
1.2.2 Routing protocol
Performance Analysis Of Cluster-Based Multiprocessors
IEEE Transactions on Computers, 1994
Cited by 3 (0 self)
A queueing model for performance evaluation of cluster-based multiprocessors is proposed in this paper. Most system components are modelled as M/D/1/L queues to capture deterministic service time and finite buffer behavior. Various subsystems are analyzed independently and then integrated for the system-level analysis. Average delay, throughput, and processor utilization are the performance parameters studied in this analysis. The analytical results are first validated via simulation. Next, several design alternatives are discussed using the model. These include the effect of buffer length and identification of bottleneck centers for various design configurations.
* This research was supported in part by the National Science Foundation under grant MIP-9104485.
I. INTRODUCTION
Cluster-based multiprocessors, also known as hierarchical systems, are designed to reduce the network complexity by incorporating hierarchies of interconnection networks [1-3]. These systems take advantage of t...
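To make the M/D/1/L notation concrete, the Python sketch below simulates a single queue with Poisson arrivals, a fixed service time, one server, and a finite capacity, and reports the quantities the abstract mentions (delay, throughput, utilization) plus the blocking fraction. It is a minimal sketch and not the paper's analytical model. It assumes that L counts every customer in the system, including the one in service, and that arrivals finding the system full are simply lost; the paper may define L and the blocking behavior differently. The function name and parameters are hypothetical.

```python
import random
from collections import deque


def simulate_md1l(arrival_rate, service_time, capacity, n_arrivals, seed=0):
    """Simulate an M/D/1/L queue (FIFO, lost arrivals when full)."""
    rng = random.Random(seed)
    now = 0.0
    in_system = deque()  # departure times of customers still in the system
    accepted = blocked = 0
    total_delay = 0.0
    for _ in range(n_arrivals):
        now += rng.expovariate(arrival_rate)      # Poisson arrival process
        while in_system and in_system[0] <= now:  # drop customers that already left
            in_system.popleft()
        if len(in_system) >= capacity:            # finite buffer: arrival is lost
            blocked += 1
            continue
        # Service starts when the previous customer departs, or now if idle.
        start = in_system[-1] if in_system else now
        finish = start + service_time             # deterministic service time
        in_system.append(finish)
        accepted += 1
        total_delay += finish - now
    return {
        "blocking": blocked / n_arrivals,
        "mean_delay": total_delay / accepted,
        "throughput": accepted / now,
        # Approximate: work accepted per unit time, ignoring the final partial service.
        "utilization": accepted * service_time / now,
    }


small = simulate_md1l(arrival_rate=0.8, service_time=1.0, capacity=1, n_arrivals=20000)
large = simulate_md1l(arrival_rate=0.8, service_time=1.0, capacity=8, n_arrivals=20000)
```

Comparing `small` and `large` shows the buffer-length effect the abstract refers to: a longer buffer loses fewer arrivals but lets accepted customers wait longer.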
Performance Model for a Prioritized Multiple-Bus Multiprocessor System
1996
Cited by 3 (0 self)
The performance of a shared-memory multiprocessor system with a multiple-bus interconnection network is studied in this paper. The effect of bus and memory contention is modeled using a probabilistic model, and a closed-form solution for the acceptance probability of each processor is presented. It is assumed that each processor in the system has a distinct priority assigned to it and that arbitration is based on priority. Whenever a request from a processor is rejected due to bus or memory conflicts, the request is resubmitted until granted. Based on the model, individual processor acceptance probabilities are first estimated, from which the effective memory bandwidth is computed. The accuracy of the analytical model is verified against simulation results. Results from the model are compared against other approximate models previously reported in the literature. It is observed that the inaccuracy of the model measured in terms of error from simulation results is less than that in previous...
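The assumptions listed in this abstract (distinct priorities, priority arbitration, resubmission of rejected requests) can be explored with a small Monte Carlo simulation. The Python sketch below is not the paper's closed-form model. It assumes a synchronous cycle-based system, uniform random choice of memory module, resubmission to the same module, and that a lower processor index means higher priority; all names and parameters are hypothetical.

```python
import random


def simulate_priority_buses(n_procs, n_mems, n_buses, request_prob, n_cycles, seed=0):
    """Estimate per-processor acceptance probability and memory bandwidth.

    Returns (acceptance, bandwidth): acceptance[p] is granted/submitted for
    processor p, bandwidth is the mean number of grants per cycle.
    """
    rng = random.Random(seed)
    pending = [None] * n_procs  # memory module each processor is waiting for
    submitted = [0] * n_procs
    granted = [0] * n_procs
    for _ in range(n_cycles):
        # Idle processors issue a new request with probability request_prob.
        for p in range(n_procs):
            if pending[p] is None and rng.random() < request_prob:
                pending[p] = rng.randrange(n_mems)
        busy_mems = set()
        buses_left = n_buses
        # Serve processors in priority order (index 0 is highest priority).
        for p in range(n_procs):
            m = pending[p]
            if m is None:
                continue
            submitted[p] += 1
            if m not in busy_mems and buses_left > 0:
                busy_mems.add(m)
                buses_left -= 1
                granted[p] += 1
                pending[p] = None
            # Otherwise the request stays pending and is resubmitted next cycle.
    acceptance = [g / s if s else float("nan") for g, s in zip(granted, submitted)]
    bandwidth = sum(granted) / n_cycles
    return acceptance, bandwidth


acceptance, bandwidth = simulate_priority_buses(
    n_procs=8, n_mems=8, n_buses=4, request_prob=0.5, n_cycles=5000, seed=1
)
```

Under these assumptions the highest-priority processor is never rejected, and lower-priority processors see progressively more bus and memory conflicts, which is why a per-processor acceptance probability is the natural quantity to estimate.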
Effect of Virtual Channels and Memory Organization on Cache-Coherent Shared-Memory Multiprocessors
1996
Cited by 2 (1 self)
In this paper, the performance of a wormhole-routed 2-D torus network with virtual channels is evaluated for cache-coherent shared-memory multiprocessors using execution-driven simulation of various applications. The traffic in such systems is very different from the traffic in a message-passing environment and is characterized by traffic bursts, one-to-many and many-to-one traffic, and small fixed-length messages. We show the impact of various network parameters, such as the number of virtual channels, the number of flit buffers per virtual channel, and the number of internal links. We have also considered low-order and high-order interleaving of memory blocks on nodes to show its impact on network performance. The study shows that 4 virtual channels per link is most efficient for 2-D torus networks. The number of flit buffers per virtual channel also has a considerable impact, and 2 to 4 flit buffers are usually enough. The number of internal links also affects performance for applications, such as MP3D, that generate heavy contention for shared variables. A larger number of internal links is also useful in the case of high-order interleaved memory to reduce hot spots at the communication interface of favorite nodes.
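The difference between low-order and high-order interleaving can be shown with two address-mapping functions. The Python sketch below uses the standard textbook definitions (low-order bits select the node versus high-order bits select the node) at memory-block granularity; the exact mapping used in the paper is not given in the abstract, so the functions and the value of `blocks_per_node` are assumptions.

```python
from collections import Counter


def home_node_low_order(block, n_nodes):
    """Low-order interleaving: consecutive blocks go to consecutive nodes."""
    return block % n_nodes


def home_node_high_order(block, n_nodes, blocks_per_node):
    """High-order interleaving: each node holds one contiguous range of blocks."""
    node = block // blocks_per_node
    if node >= n_nodes:
        raise ValueError("block address outside the memory of the machine")
    return node


# A hypothetical application touching 8 consecutive blocks on a 4-node machine.
blocks = range(8)
low = Counter(home_node_low_order(b, n_nodes=4) for b in blocks)
high = Counter(home_node_high_order(b, n_nodes=4, blocks_per_node=64) for b in blocks)
```

With low-order interleaving the eight blocks are spread evenly over the four nodes, while high-order interleaving sends every access to node 0, which is the kind of hot spot at a favorite node that the abstract mentions.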
Generalizing Interconnection Network Models
1994
Cited by 1 (0 self)
Many different methods are used to evaluate the performance of the interconnection networks (INs) of multi-processors and multi-computers. These methods result in different models and different performance measures, which makes it difficult to compare the performance of the INs. In this report a generalized model of interconnection networks is proposed. This model captures, among others, the multiple-bus, crossbar, and hypercube networks. The model is based on the observation that most interconnection networks consisting of more than one dimension or stage have similarities in the way they are 'built' from one-dimensional INs. The model considers the different INs from a shared-memory point of view. The performance of the model is evaluated with a few different methods for both open and closed system models. A new method is proposed to increase the accuracy of the performance prediction. The results of the performance evaluation and cost calculation of the different INs captured by the genera...
Machine independent Analytical models for cost evaluation of template-based programs
1996
Cited by 1 (0 self)
Structured parallel programming is one of the possible solutions to exploit Programmability, Portability and Performance in the parallel programming world. The power of this approach lies in the possibility of building an optimizing template-based compiler using low time complexity algorithms. In order to optimize the code, this compiler needs formulas that describe the performance of language constructs over the target architecture. We propose a set of parameters able to describe current parallel systems and build deterministic analytical models for basic forms of parallelism. The analytical model describes construct performance in a parametric way. This can be done by knowing that the compiler exploits a template-based support and by giving template implementors guidelines to follow so that the actual implementation performs as predicted. ACM-CR Subject Classification: Keywords and phrases: Skeletons, performance modeling, parallel languages, template-based compilers Machine...
Effect of CC-NUMA Memory Management on the Performance of Interconnection Networks
1998
Cited by 1 (1 self)
CC-NUMA architectures have become extremely popular by providing fast and transparent access to data with multiple levels of caches and local and remote memories. However, the bottleneck remains the remote memory access, whose latency is several orders of magnitude higher than that of a cache access. Designing effective data allocation policies that provide local memory data access and limit the need to access remote memories remains a challenge. We study three different static memory management policies, namely buddy, round-robin and first-touch, and analyze their impact on data locality and application memory access patterns. Interconnection network performance depends heavily on the memory access patterns of the workload. Using these realistic memory management policies, we reevaluate the performance of a multistage interconnection network (MIN). Limited earlier work in the area of CC-NUMA memory management has assumed constant network delays and thus ignored the impact of switch design on per...
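Two of the three policies named in this abstract, round-robin and first-touch, are easy to compare on a toy access trace. The Python sketch below is a minimal illustration under stated assumptions: placement is done per page, round-robin assigns page number modulo node count, first-touch assigns a page to the first node that accesses it, and locality is measured as the fraction of accesses served by the home node. The buddy policy is not modeled because the abstract does not describe it. All names and the trace are hypothetical.

```python
def place_round_robin(trace, n_nodes):
    """Assign each page to a node by page number modulo the node count."""
    return {page: page % n_nodes for _, page in trace}


def place_first_touch(trace):
    """Assign each page to the first node that accesses it."""
    homes = {}
    for node, page in trace:
        homes.setdefault(page, node)
    return homes


def local_fraction(trace, homes):
    """Fraction of accesses whose page lives on the accessing node."""
    local = sum(1 for node, page in trace if homes[page] == node)
    return local / len(trace)


# Hypothetical trace of (node, page) accesses on a 2-node machine.
trace = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 2), (1, 2)]
rr_locality = local_fraction(trace, place_round_robin(trace, n_nodes=2))
ft_locality = local_fraction(trace, place_first_touch(trace))
```

On this trace first-touch keeps five of six accesses local while round-robin keeps only one, so the remote traffic offered to the interconnection network differs sharply between policies, which is the dependence on memory management that the abstract points to.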

