Results 1-10 of 59
Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors
, 1999
Cited by 206 (4 self)
Devices]: Modes of Computation - Parallelism and concurrency. General Terms: Algorithms, Design, Performance, Theory. Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs. This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST 734/96E, HKUST 6076/97E, and HKU 7124/99E. Authors' addresses: Y.K. Kwok, Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong; email: ykwok@eee.hku.hk; I. Ahmad, Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. © 2000 ACM 0360-0300/99/1200-0406 $5.00. ACM Computing Surveys, Vol. 31, No. 4, December 1999.
Programming Parallel Algorithms
, 1996
Cited by 193 (9 self)
In the past 20 years there has been tremendous progress in developing and analyzing parallel algorithms. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some of these algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding of parallelism but in several cases has led to improvements in sequential algorithms. Unfortunately there has been less success in developing good languages for programming parallel algorithms, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages...
Limits on Interconnection Network Performance
 IEEE Transactions on Parallel and Distributed Systems
, 1991
Cited by 176 (4 self)
As the performance of interconnection networks becomes increasingly limited by physical constraints in high-speed multiprocessor systems, the parameters of high-performance network design must be reevaluated, starting with a close examination of assumptions and requirements. This paper models network latency, taking both switch and wire delays into account. A simple closed-form expression for contention in buffered, direct networks is derived and is found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints (such as fixed bisection width, fixed channel width, and fixed node size) and under different workload parameters (such as packet size, degree of communication locality, and network request rate) reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network has the lowest latency only when switch delays and network contention are ignored, but...
Performance Tradeoffs In Multithreaded Processors
 IEEE Transactions on Parallel and Distributed Systems
, 1991
Cited by 120 (5 self)
... utilization. By maintaining multiple process contexts in hardware and switching among them in a few cycles, multithreaded processors can overlap computation with memory accesses and reduce processor idle time. This paper presents an analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects. The model is validated through our own simulations and by comparison with previously published simulation results. Our results indicate that processors can substantially benefit from multithreading, even in systems with small caches. Large caches yield close to full processor utilization with as few as two to four contexts, while small caches may require up to four times as many contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limit the best possible utilization.
A Unified Theory Of Interconnection Network Structure
 Theoretical Computer Science
, 1986
Cited by 42 (0 self)
The relationship between the topology of interconnection networks and their functional properties is examined. Graph-theoretical characterizations are derived for delta networks, which have a simple routing scheme, and for bidelta networks, which have the delta property in both directions. Delta networks are shown to have a recursive structure. Bidelta networks are shown to have a unique topology. The definition of bidelta network is used to derive in a uniform manner the labeling schemes that define the omega networks, indirect binary cube networks, flip networks, baseline networks, modified data manipulators, and two new networks; these schemes are generalized to arbitrary radices. The labeling schemes are used to characterize networks with simple routing. In another paper, we characterize the networks with optimal performance/cost ratio. Only the multistage shuffle-exchange networks have both optimal performance/cost ratio and simple routing. This helps explain why few fundamentally...
An Optical Multi-Mesh Hypercube: A Scalable Optical Interconnection Network for Massively Parallel Computing
 IEEE/OSA Journal of Lightwave Technology
, 1994
Cited by 25 (11 self)
A new interconnection network for massively parallel computing is introduced. This network is called an Optical Multi-Mesh Hypercube (OMMH) network. The OMMH integrates positive features of both hypercube (small diameter, high connectivity, symmetry, simple control and routing, fault tolerance, etc.) and mesh (constant node degree and scalability) topologies and at the same time circumvents their limitations (e.g., the lack of scalability of hypercubes, and the large diameter of meshes). The OMMH can maintain a constant node degree regardless of the increase in the network size. In addition, the flexibility of the OMMH network makes it well-suited for optical implementations. This paper presents the OMMH topology, analyzes its architectural properties and potentials for massively parallel computing, and compares it to the hypercube. Moreover, it also presents a three-dimensional optical design methodology based on free-space optics. The proposed optical implementation has totally space...
Routing in Multihop Packet Switching Networks: Gbps Challenge
 IEEE Network
, 1995
Cited by 25 (1 self)
The paper is a survey of networking solutions that have been proposed for high-speed packet-switched applications. Using these solutions as examples, we identify the specific problems resulting from very high transmission rates and explain how these problems influence the design of high-speed networks and protocols. We conclude that the solutions based on deflection routing are the most promising ones and we suggest a number of directions for their evolution.
1 Introduction
Not so long ago, computer networks with high transmission rates (e.g. several Mb/s) were naturally confined to local domains. Although such (and higher) transmission rates were available in telephony on long distances, they were used on a point-to-point basis. Concepts of highly connected fast networks spanning geographical areas larger than the acreage typically covered by a single institution are relatively new and, besides the emerging ATM technology, there are no standard commercially available solutions that c...
Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks
 in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE
, 2006
Cited by 16 (7 self)
Dedicated, spatially configured FPGA interconnect is efficient for applications that require high-throughput connections between processing elements (PEs) but with a limited degree of PE interconnectivity (e.g. wiring up gates and datapaths). Applications which virtualize PEs may require a large number of distinct PE-to-PE connections (e.g. using one PE to simulate 100s of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PE's operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources for all possible connections is costly and inefficient. Alternatively, we can time-share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g. area, route latency, route quality) between time-multiplexed and packet-switched networks overlaid on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166 MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. When applying designs to equivalent area, packet-switching is up to 2× faster for small-area designs while time-multiplexing is up to 5× faster for larger-area designs. When limited to the capacity of a XC2V6000, if all communication is known, time-multiplexed routing outperforms packet-switching; however, when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time-multiplexing.
Personal communication
, 2007
Cited by 13 (1 self)
The nodes of a rotator graph are the permutations of n, and an arc is directed from u to v if the first r symbols of u can be rotated one position to the left to obtain v. Restricted rotator graphs restrict the allowable rotations to r ∈ R for some R ⊆ {2, 3,..., n}. Incomplete rotator graphs only include nodes whose final symbol is i ≤ m for a fixed maximum value m ∈ {1, 2,..., n}. Restricted rotator graphs are directed Cayley graphs, whereas incomplete rotator graphs are not Cayley graphs. Hamilton cycles exist for rotator graphs (Corbett 1992), restricted rotator graphs with R = {n−1, n} (Ruskey and Williams 2010), and incomplete rotator graphs for all m (Ponnuswamy and Chaudhary 1994). These previous results are based on sequence-building operations that we name ‘reusing’, ‘recycling’, and ‘rewinding’. In this article, we combine these operations to create Hamilton cycles in rotator graphs that are (1) restricted by R = {2, 3, n}, (2) restricted by R = {2, 3, n−1, n} and incomplete for any m, and (3) restricted by R = {n−2, n−1, n} and incomplete for any m. Result (1) is ‘optimal’ since restricted rotator graphs are not strongly connected for R = {3, n} when n is odd, and do not have Hamilton cycles for R = {2, n} when n is even (Rankin 1944, Swan 1999). Similarly, we prove (3) is ‘optimal’. Our Hamilton cycles can be easily implemented for potential applications, and we provide O(1)-time algorithms that generate successive rotations for (1)–(3).
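The rotator-graph definition in this abstract can be made concrete with a small sketch. The Python below is an illustration only, not the authors' algorithm: the function names (`rotate_prefix`, `rotator_graph`, `reachable`, `is_strongly_connected`) are hypothetical, it covers only the restricted case (incomplete rotator graphs are not modelled), and it enumerates all n! nodes, so it is usable only for small n. It builds the arcs for a chosen set R and checks the strong-connectivity remark about R = {3, n} with n odd.

```python
from collections import deque
from itertools import permutations


def rotate_prefix(u, r):
    """Rotate the first r symbols of the tuple u one position to the left."""
    return u[1:r] + u[:1] + u[r:]


def rotator_graph(n, allowed=None):
    """Map each permutation of 1..n to its list of out-neighbours.

    `allowed` is the set R of permitted prefix lengths (a name assumed here);
    the default {2, ..., n} gives the unrestricted rotator graph.
    """
    if allowed is None:
        allowed = range(2, n + 1)
    return {
        u: [rotate_prefix(u, r) for r in sorted(allowed)]
        for u in permutations(range(1, n + 1))
    }


def reachable(adjacency, start):
    """Return the set of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(graph):
    """Check that one node reaches every node along forward and reversed arcs."""
    reverse = {u: [] for u in graph}
    for u, neighbours in graph.items():
        for v in neighbours:
            reverse[v].append(u)
    start = next(iter(graph))
    return len(reachable(graph, start)) == len(graph) == len(reachable(reverse, start))


if __name__ == "__main__":
    print(rotate_prefix((1, 2, 3, 4), 3))                       # (2, 3, 1, 4)
    print(is_strongly_connected(rotator_graph(4, {2, 3, 4})))   # True
    print(is_strongly_connected(rotator_graph(5, {3, 5})))      # False
```

The last call returns False because rotating a prefix of odd length is an even permutation, so with R = {3, 5} only half of the 120 nodes are reachable from any start node.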