Limits on Interconnection Network Performance
IEEE Transactions on Parallel and Distributed Systems, 1991
Cited by 166 (4 self)
As the performance of interconnection networks becomes increasingly limited by physical constraints in high-speed multiprocessor systems, the parameters of high-performance network design must be reevaluated, starting with a close examination of assumptions and requirements. This paper models network latency, taking both switch and wire delays into account. A simple closed form expression for contention in buffered, direct networks is derived and is found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints (such as fixed bisection width, fixed channel width, and fixed node size) and under different workload parameters (such as packet size, degree of communication locality, and network request rate) reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network has the lowest latency only when switch delays and network contention are ignored, but...
MULTIPROCESSOR SCHEDULING TO ACCOUNT FOR INTERPROCESSOR COMMUNICATION
1991
Cited by 64 (11 self)
Interprocessor communication (IPC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essential for attaining efficient hardware utilization. This thesis introduces two new compile-time heuristics for scheduling precedence graphs onto multiprocessor architectures, which account for interprocessor communication overheads and interconnection constraints in the architecture. These algorithms perform scheduling and routing simultaneously to account for irregular interprocessor interconnections, and schedule all communications as well as all computations to eliminate shared resource contention. The first technique, called dynamic-level scheduling, modifies the classical HLFET list scheduling strategy to account for IPC and synchronization overheads. By using dynamically changing priorities to match nodes and processors at each step, this technique attains an equitable tradeoff between load balancing and interprocessor communication cost. This method is fast, flexible, widely targetable, and displays promising performance. The second technique, called declustering, establishes a parallelism hierarchy upon the precedence graph using graph-analysis techniques which explicitly address the tradeoff between exploiting parallelism and incurring communication cost. By systematically decomposing this hierarchy, the declustering process exposes parallelism instances in order of importance, assuring efficient use of the available processing resources. In contrast with traditional clustering schemes, this technique can adjust the level of cluster granularity to suit the characteristics of the specified architecture, leading to a more effective solution.
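The idea of matching nodes and processors with dynamically changing priorities can be illustrated with a small Python sketch. It is not the thesis's algorithm: the priority used here (a node's static level minus its earliest possible start time on a candidate processor) is an assumed formulation, communication is modeled as a fixed per-edge delay paid only when producer and consumer sit on different processors, and routing and link contention are ignored. The function names and the tiny example graph are hypothetical.

```python
def static_levels(exec_time, succs):
    """Longest execution-time path from each node to an exit node."""
    memo = {}

    def level(v):
        if v not in memo:
            below = max((level(s) for s, _ in succs.get(v, [])), default=0)
            memo[v] = exec_time[v] + below
        return memo[v]

    return {v: level(v) for v in exec_time}


def dynamic_level_schedule(exec_time, edges, num_procs):
    """Greedy list scheduling with an assumed dynamic-level priority.

    exec_time: {node: execution time}
    edges:     {(producer, consumer): communication delay}
    Returns    {node: (processor, start, finish)}.
    """
    succs, preds = {}, {v: [] for v in exec_time}
    for (u, v), cost in edges.items():
        succs.setdefault(u, []).append((v, cost))
        preds[v].append((u, cost))

    level = static_levels(exec_time, succs)
    proc_free = [0] * num_procs
    placed = {}

    while len(placed) < len(exec_time):
        best = None
        for v in exec_time:
            # Only consider unscheduled nodes whose predecessors are all placed.
            if v in placed or any(u not in placed for u, _ in preds[v]):
                continue
            for p in range(num_procs):
                # Data arrives later if the producer ran on another processor.
                data_ready = max(
                    (placed[u][2] + (0 if placed[u][0] == p else cost)
                     for u, cost in preds[v]),
                    default=0,
                )
                start = max(proc_free[p], data_ready)
                priority = level[v] - start  # assumed dynamic level
                if best is None or priority > best[0]:
                    best = (priority, v, p, start)
        _, v, p, start = best
        placed[v] = (p, start, start + exec_time[v])
        proc_free[p] = placed[v][2]
    return placed


# Hypothetical fork graph: a feeds b and c.
times = {"a": 2, "b": 3, "c": 3}
cheap = dynamic_level_schedule(times, {("a", "b"): 1, ("a", "c"): 1}, 2)
costly = dynamic_level_schedule(times, {("a", "b"): 10, ("a", "c"): 10}, 2)
```

With cheap communication the sketch spreads `b` and `c` over both processors; with expensive communication it keeps everything on one processor, which is the load-balancing versus communication tradeoff the abstract describes.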
On the Geographic Location of Internet Resources
IEEE Journal on Selected Areas in Communications, 2002
LoGPC: Modeling Network Contention in Message-Passing Programs
IEEE Transactions on Parallel and Distributed Systems, 1998
Cited by 35 (4 self)
In many real applications, for example those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP [9] and LogGP [4] models to account for the impact of network contention and network interface DMA behavior on the performance of message-passing programs. We validate LoGPC by analyzing three applications implemented with Active Messages [11, 18] on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50% of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify tradeoffs between synchronous and asynchronous message passing styles.
Reconfiguration With Time Division Multiplexed MINs for Multiprocessor Communications
IEEE Transactions on Parallel and Distributed Systems, 1994
Cited by 34 (29 self)
In this paper, time-division multiplexed multistage interconnection networks (TDM-MINs) are proposed for multiprocessor communications. Connections required by an application are partitioned into a number of subsets called mappings, such that connections in each mapping can be established in a MIN without conflict. Switch settings for establishing connections in each mapping are determined and stored in shift registers. By repeatedly changing switch settings, connections in each mapping are established for a time slot in a round-robin fashion. Thus, all connections required by an application may be established in a MIN in a time-division multiplexed way. TDM-MINs can emulate a completely connected network using N time slots. They can also emulate regular networks such as rings, meshes, Cube-Connected-Cycles (CCC), binary trees and n-dimensional hypercubes using 2, 4, 3, 4 and n time slots, respectively. The problem of partitioning an arbitrary set of requests into a minimal ...
The Impact of Communication Locality on Large-Scale Multiprocessor Performance
1992
Cited by 30 (0 self)
As multiprocessor sizes scale and computer architects turn to interconnection networks with non-uniform communication latencies, the lure of exploiting communication locality to increase performance becomes inevitable. Models that accurately quantify locality effects provide invaluable insight into the importance of exploiting locality as machine sizes and features change. This paper presents a framework for modeling the impact of communication locality on system performance. The framework provides a means for combining simple models of application, processor, and network behavior to obtain a combined model that accurately reflects feedback effects between processors and networks. We introduce a model that characterizes application behavior with three parameters that capture computation grain, sensitivity to communication latency, and amount of locality present at execution time. The combined model is validated with measurements taken from a detailed simulator for a complete multiprocessor system. Using the combined model, we show that exploiting communication locality provides gains which are at most linear in the factor by which average communication distance is reduced when the number of outstanding communication transactions per processor is bounded. The combined model is also used to obtain rough upper bounds on the performance improvement from exploiting locality to minimize communication distance.
Chaotic Routing - Design and Implementation of an Adaptive Multicomputer Network Router
1993
Cited by 26 (3 self)
By Kevin Bolding. Chairperson of Supervisory Committee: Professor Lawrence Snyder, Department of Computer Science and Engineering.
A crucial component of a massively parallel multicomputer is the interconnection network which links all of the nodes of the computer together. This network provides the primary method of communication between the hundreds or thousands of processing nodes and is, thus, critical to the successful operation of the multicomputer. Current state-of-the-art interconnection networks use simple, oblivious routing techniques which achieve very good performance when loading is light, but do not perform well in the presence of non-uniform congestion or faults. Chaotic routing, a non-minimal adaptive routing technique, provides a mechanism which takes into account the presence of congestion and faults when choosing a path for a message and can, thus, achieve better performance. Chaot...
Fast Algorithms for Routing Around Faults in Multibutterflies and Randomly-Wired Splitter Networks
IEEE Transactions on Computers, 1992
Cited by 25 (8 self)
This paper describes simple deterministic O(log N)-step algorithms for routing permutations of packets in multibutterflies and randomly-wired splitter networks. The algorithms are robust against faults (even in the worst case), and are efficient from a practical point of view. As a consequence, we find that the multibutterfly is an excellent candidate for a high-bandwidth low-diameter switching network underlying a shared-memory machine.
Index Terms: Fault tolerance, interconnection network, multibutterfly, multistage network, routing algorithm.
1 Introduction
Networks derived from hypercubes form the architectural basis of most parallel computers, including machines such as the BBN Butterfly, the Connection Machine, the IBM RP3 and GF11, the Intel iPSC, and the NCUBE. The butterfly, in particular, is quite popular, and has been demonstrated to perform reasonably well in practice. An example of an 8-input butterfly is illustrated in Figure 1. The nodes in this graph represent switches,...
Data Forwarding in Scalable Shared-Memory Multiprocessors
In Proceedings of the 1995 International Conference on Supercomputing, 1995
Cited by 23 (4 self)
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that a slightly optimistic support for forwarding speeds up five applications by, on average, 50% for large caches and 30% for small caches. For large caches, many sharing read misses can be eliminated, while for smaller caches, forwarding ...
The Performance of the Cedar Multistage Switching Network
IEEE Transactions on Parallel and Distributed Systems, 1997
Cited by 17 (0 self)
While multistage switching networks for vector multiprocessors have been studied extensively, detailed evaluations of their performance are rare. Indeed, analytical models, simulations with pseudo-synthetic loads, studies focused on average-value parameters, and measurements of networks disconnected from the machine, all provide limited information. In this paper, instead, we present an in-depth empirical analysis of a multistage switching network in a realistic setting: we use hardware probes to examine the performance of the omega network of the Cedar shared-memory machine executing real applications. The machine is configured with 16 vector processors. The analysis suggests that the performance of multistage switching networks is limited by traffic non-uniformities. We identify two major non-uniformities that degrade Cedar's performance and are likely to slow down other networks too. The first one is the contention caused by the return messages in a vector access as they converge fr...

