Results 1 - 10 of 114
Limits on Interconnection Network Performance
- IEEE Transactions on Parallel and Distributed Systems
, 1991
Cited by 166 (4 self)
As the performance of interconnection networks becomes increasingly limited by physical constraints in high-speed multiprocessor systems, the parameters of high-performance network design must be reevaluated, starting with a close examination of assumptions and requirements. This paper models network latency, taking both switch and wire delays into account. A simple closed-form expression for contention in buffered, direct networks is derived and is found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints (such as fixed bisection width, fixed channel width, and fixed node size) and under different workload parameters (such as packet size, degree of communication locality, and network request rate) reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network has the lowest latency only when switch delays and network contention are ignored, but...
Compiler-directed Data Prefetching in Multiprocessors with Memory Hierarchies
- In International Conference on Supercomputing
, 1990
Cited by 87 (7 self)
Memory hierarchies are used by multiprocessor systems to reduce large memory access times. It is necessary to automatically manage such a hierarchy, to obtain effective memory utilization. In this paper, we discuss the various issues involved in obtaining an optimal memory management strategy for a memory hierarchy. We present an algorithm for finding the earliest point in a program that a block of data can be prefetched. This determination is based on the control and data dependences in the program. Such a method is an integral part of more general memory management algorithms. We demonstrate our method's potential by using static analysis to estimate the performance improvement afforded by our prefetching strategy and to analyze the reference patterns in a set of Fortran benchmarks. We also study the effectiveness of prefetching in a realistic shared-memory system using an RTL-level simulator and real codes. This differs from previous studies by considering prefetching benefits in th...
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
, 1981
Cited by 84 (2 self)
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
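The coordination idea in this abstract can be illustrated with a minimal sketch. It assumes that replace-add atomically adds a value to a shared variable and returns the updated sum (that return convention is an assumption here), and it uses a software lock as a stand-in for the hardware mechanism the paper describes. The names `ReplaceAddCell`, `claim_slots`, and `run` are hypothetical.

```python
import threading


class ReplaceAddCell:
    """Shared integer with an add-and-return operation (software stand-in)."""

    def __init__(self, value=0):
        self._value = value
        # The lock only imitates atomicity; it is not the paper's hardware scheme.
        self._lock = threading.Lock()

    def replace_add(self, delta):
        # Atomically add delta and return the updated value (assumed convention).
        with self._lock:
            self._value += delta
            return self._value


def claim_slots(cell, count, out):
    # Each call hands this worker an index that no other worker receives.
    for _ in range(count):
        out.append(cell.replace_add(1) - 1)


def run(workers=8, per_worker=1000):
    cell = ReplaceAddCell()
    results = [[] for _ in range(workers)]
    threads = [
        threading.Thread(target=claim_slots, args=(cell, per_worker, results[i]))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Flatten and sort the claimed indices so they can be checked for gaps.
    return sorted(i for r in results for i in r)
```

Every worker obtains a unique slot index with a single operation and no retry loop, which is the property that makes such a primitive attractive for queues and barriers.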
A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures
- IEEE Trans. on Circuits and Systems
, 1990
MULTIPROCESSOR SCHEDULING TO ACCOUNT FOR INTERPROCESSOR COMMUNICATION
, 1991
Cited by 64 (11 self)
Interprocessor communication (IPC) overheads have emerged as the major performance limitation in parallel processing systems, due to the transmission delays, synchronization overheads, and conflicts for shared communication resources created by data exchange. Accounting for these overheads is essential for attaining efficient hardware utilization. This thesis introduces two new compile-time heuristics for scheduling precedence graphs onto multiprocessor architectures, which account for interprocessor communication overheads and interconnection constraints in the architecture. These algorithms perform scheduling and routing simultaneously to account for irregular interprocessor interconnections, and schedule all communications as well as all computations to eliminate shared resource contention. The first technique, called dynamic-level scheduling, modifies the classical HLFET list scheduling strategy to account for IPC and synchronization overheads. By using dynamically changing priorities to match nodes and processors at each step, this technique attains an equitable tradeoff between load balancing and interprocessor communication cost. This method is fast, flexible, widely targetable, and displays promising performance. The second technique, called declustering, establishes a parallelism hierarchy upon the precedence graph using graph-analysis techniques which explicitly address the tradeoff between exploiting parallelism and incurring communication cost. By systematically decomposing this hierarchy, the declustering process exposes parallelism instances in order of importance, assuring efficient use of the available processing resources. In contrast with traditional clustering schemes, this technique can adjust the level of cluster granularity to suit the characteristics of the specified architecture, leading to a more effective solution.
Parallel Database Systems: The Future of Database Processing or a Passing Fad?
- SIGMOD RECORD
, 1991
Cited by 46 (6 self)
Parallel database machine architectures have evolved from the use of exotic hardware to a software parallel dataflow architecture based on conventional shared-nothing hardware. These new designs provide impressive speedup and scaleup when processing relational database queries. This paper reviews the techniques used by such systems, and surveys current commercial and research systems.
Increasing the number of strides for conflict-free vector access
- In Proceedings of the 19th Annual International Symposium on Computer Architecture
, 1992
Cited by 42 (11 self)
Address transformation schemes, such as skewing and linear transformations, have been proposed to achieve conflict-free vector access for some strides in vector processors with multi-module memories. In this paper, we extend these schemes to achieve this conflict-free access for a larger number of strides. The basic idea is to perform an out-of-order access to vectors of fixed length, equal to that of the vector registers of the processor. Both matched and unmatched memories are considered; we show that the number of strides is even larger for the latter case. The hardware for address calculations and access control is described and shown to be similar in complexity to that required for in-order access.
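As a rough illustration of why an address transformation helps only for some strides, the sketch below compares plain interleaving with a simple row-rotation skew. The skew formula is a common textbook choice and is an assumption here, not necessarily the scheme used in the paper; the function names are hypothetical, and the paper's out-of-order access is not modeled.

```python
def plain_module(addr, modules):
    # Plain interleaving: consecutive addresses go to consecutive modules.
    return addr % modules


def skewed_module(addr, modules):
    # Row-rotation skew (assumed for illustration): each row of `modules`
    # addresses is shifted by one more module than the previous row.
    return (addr + addr // modules) % modules


def distinct_modules(stride, modules, mapping, base=0):
    # Count how many different modules a window of `modules` consecutive
    # vector elements touches; the window is conflict-free when the count
    # equals `modules`.
    return len({mapping(base + i * stride, modules) for i in range(modules)})
```

With 8 modules, a stride of 8 sends every element to the same module under plain interleaving, while the skewed mapping spreads the same window over all 8 modules. A stride of 16 still collides under this skew, which is the kind of gap the abstract's extension targets.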
A Unified Theory Of Interconnection Network Structure
- Theoretical Computer Science
, 1986
Cited by 41 (0 self)
The relationship between the topology of interconnection networks and their functional properties is examined. Graph theoretical characterizations are derived for delta networks, which have a simple routing scheme, and for bidelta networks, which have the delta property in both directions. Delta networks are shown to have a recursive structure. Bidelta networks are shown to have a unique topology. The definition of bidelta network is used to derive in a uniform manner the labeling schemes that define the omega networks, indirect binary cube networks, flip networks, baseline networks, modified data manipulators, and two new networks; these schemes are generalized to arbitrary radices. The labeling schemes are used to characterize networks with simple routing. In another paper, we characterize the networks with optimal performance/cost ratio. Only the multistage shuffle-exchange networks have both optimal performance/cost ratio and simple routing. This helps explain why few fundamentally...
Reconfiguration With Time Division Multiplexed MINs for Multiprocessor Communications
- IEEE Transactions on Parallel and Distributed Systems
, 1994
Cited by 34 (29 self)
In this paper, time-division multiplexed multistage interconnection networks (TDM-MINs) are proposed for multiprocessor communications. Connections required by an application are partitioned into a number of subsets called mappings, such that connections in each mapping can be established in a MIN without conflict. Switch settings for establishing connections in each mapping are determined and stored in shift registers. By repeatedly changing switch settings, connections in each mapping are established for a time slot in a round-robin fashion. Thus, all connections required by an application may be established in a MIN in a time-division multiplexed way. TDM-MINs can emulate a completely connected network using N time slots. They can also emulate regular networks such as rings, meshes, Cube-Connected-Cycles (CCC), binary trees and n-dimensional hypercubes using 2, 4, 3, 4 and n time slots, respectively. The problem of partitioning an arbitrary set of requests into a minimal ...
Relaxation-Based Electrical Simulation
- IEEE Tr. on Electronic Devices
, 1983
Cited by 27 (0 self)
Circuit simulation programs have proven to be most important computer-aided design tools for the analysis of the electrical performance of integrated circuits. One of the most common analyses performed by circuit simulators and the most expensive in terms of computer time is nonlinear time-domain transient analysis. Conventional circuit simulators were designed initially for the cost-effective analysis of circuits containing a few hundred transistors or less. Because of the need to verify the performance of larger circuits, many users have successfully simulated circuits containing thousands of transistors despite the cost. Recently, a new class of algorithms has been applied to the electrical IC simulation problem. New simulators using these methods provide accurate waveform information with up to two orders of magnitude speed improvement for large circuits. These programs use relaxation methods for the solution of the set of ordinary differential equations, which describe the circuit under analysis, rather than the direct sparse-matrix methods on which standard circuit simulators are based. In this paper, the techniques used in relaxation-based electrical simulation are presented in a rigorous and unified framework, and the numerical properties of the various methods are explored. Both the advantages and the limitations of these techniques for the analysis of large IC's are described.
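To give a feel for what a relaxation method does in place of a direct matrix solve, here is a minimal Gauss-Seidel sketch on a small linear system. It is only an illustration of the general idea: the paper concerns relaxation applied to circuit differential equations, and the function name, tolerance, and example matrix below are assumptions.

```python
def gauss_seidel(a, b, tol=1e-12, max_iters=1000):
    """Solve a x = b by sweeping one unknown at a time (no factorization)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iters):
        max_change = 0.0
        for i in range(n):
            # Use the newest available values of the other unknowns.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / a[i][i]
            max_change = max(max_change, abs(new - x[i]))
            x[i] = new
        if max_change < tol:
            break  # every unknown has settled to within the tolerance
    return x


# Hypothetical diagonally dominant system whose exact solution is [1, 2, 3].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
B = [2.0, 4.0, 10.0]
```

Each sweep updates one unknown from its neighbors only, so the work per iteration scales with the number of nonzero couplings. Convergence is guaranteed here because the example matrix is diagonally dominant; it is not guaranteed for arbitrary systems.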

