Separating Data and Control Transfer in Distributed Operating Systems
- In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1994
Cited by 69 (2 self)
Advances in processor architecture and technology have resulted in workstations in the 100+ MIPS range. As well, newer local-area networks such as ATM promise a ten- to hundred-fold increase in throughput, much reduced latency, greater scalability, and greatly increased reliability, when compared to current LANs such as Ethernet. We believe that these new network and processor technologies will permit tighter coupling of distributed systems at the hardware level, and that distributed systems software should be designed to benefit from that tighter coupling. In this paper, we propose an alternative way of structuring distributed systems that takes advantage of a communication model based on remote network access (reads and writes) to protected memory segments. A key feature of the new structure, directly supported by the communication model, is the separation of data transfer and control transfer. This is in contrast to the structure of traditional distributed systems, which are typical...
On Active Networking and Congestion
, 1996
Cited by 42 (2 self)
Active networking offers a change in the usual network paradigm: from passive carrier of bits to a more general computation engine. The implementation of such a change is likely to enable radical new applications that cannot be foreseen today. Large-scale deployment, however, involves significant challenges in interoperability and security. Less clear, perhaps, are the "immediate" benefits of such a paradigm shift, and how they might be used to justify migration towards active networking. In this paper, we focus on the benefits of active networking with respect to a problem that is unlikely to disappear in the near future: network congestion. In particular, we consider application-specific processing of user data within the network at congested nodes. Given an architecture in which applications can specify intra-network processing, the bandwidth allocated to each application's packets can be reduced in a manner that is tailored to the application, rather than being applied generically....
Evolution of the Virtual Interface Architecture
- IEEE Computer
, 1998
Cited by 22 (1 self)
this article, we describe the architectural issues and
Cranium: An Interface for Message Passing on Adaptive Packet Routing Networks
- Proceedings of Parallel Computer Routing and Communication Workshop
, 1994
Cited by 20 (2 self)
Cranium is a processor-network interface for an interconnection network based on adaptive packet routing. Adaptive networks relax the restriction that packet order is preserved; packets may be delivered to their destinations in an arbitrary sequence. Cranium uses two mechanisms: an automatic-receive interface for packet serialization and high performance, and a processor-initiated interface for flexibility. To minimize software overhead, Cranium is directly accessible by user-level programs. Protection for user-level message passing is implemented by mapping user-level handles into physical node identifiers and buffer addresses. 1 Introduction Scalable multicomputer architectures have been converging on a standard organization with four elements: a workstation microprocessor, main memory based on dynamic RAM, a point-to-point interconnection network and a processor-network interface. Both the microprocessors and DRAM chips have become inexpensive and widely available. Multicomputer ...
Multiprocessor Runtime Support for Fine-Grained, Irregular DAGs
- In Rajiv K. Kalia and Priya Vashishta, editors, Toward Teraflop Computing and New Grand Challenge Applications
, 1995
Cited by 20 (1 self)
We examine multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of 34 for 128 processors, resulting in an absolute performance of over 33 million double-precision floating point operations per second. We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and...
A Streaming Multi-Threaded Model
- In Proceedings of the Third Workshop on Media and Stream Processors
, 2001
Cited by 15 (0 self)
We present SCORE (Stream Computations Organized for Reconfigurable Execution), a multi-threaded model that relies on streams to expose thread parallelism and to enable efficient scheduling, low-overhead communication, and scalability. We present work to date on SCORE for scalable reconfigurable logic, as well as implementation ideas for SCORE for processor architectures. We demonstrate that streams can be exposed as a clean architectural feature that supports forward compatibility to larger, more parallel hardware. 1. OVERVIEW For the past several decades, the predominant architectural abstraction for programmable computation systems has been the instruction set architecture (ISA). An ISA defines an instruction set and semantics for executing it. A key benefit of the ISA model is that those semantics decouple software from hardware development. A piece of software, written and compiled once, is guaranteed to run on any ISA-compatible device. This guarantee allows hardware to evolve...
The Sensitivity of Communication Mechanisms to Bandwidth and Latency
Cited by 13 (0 self)
The goal of this paper is to gain insight into the relative performance of communication mechanisms as bisection bandwidth and network latency vary. We compare shared memory with and without prefetching, message passing with interrupts and with polling, and bulk transfer via DMA. We present two sets of experiments involving four irregular applications on the MIT Alewife multiprocessor. First, we introduce I/O cross-traffic to vary bisection bandwidth. Second, we change processor clock speeds to vary relative network latency. We establish a framework from which to understand a range of results. On Alewife, shared memory provides good performance, even on producer-consumer applications with little data reuse. On machines with lower bisection bandwidth and higher network latency, however, message-passing mechanisms become important. In particular, the high communication volume of shared memory threatens to become difficult to support on future machines without expensive, high-dimensional networks. Furthermore, the round-trip nature of shared memory may not be able to tolerate the latencies of future networks.
Metro: A Router Architecture for High-Performance, Short-Haul Routing Networks
- Computer Architecture News (Special Issue ISCA'21 Proceedings)
, 1994
Cited by 11 (3 self)
The Multipath Enhanced Transit Router Organization (metro) is a flexible routing architecture for high-performance, tightly-coupled multiprocessors and routing hubs. A metro router is a dilated crossbar routing component supporting half-duplex bidirectional, pipelined, circuit-switched connections. Each metro router is self-routing and supports dynamic message traffic. The routers work in conjunction with source-responsible network interfaces to achieve reliable end-to-end data transmission in the presence of heavy network congestion and dynamic faults. metro separates the fundamental architectural characteristics from implementation parameters. Simplicity of routing function coupled with freedom in the implementation parameters allows metro implementations to fully exploit available technology to achieve low latency and high bandwidth. We illustrate the effects of this implementation freedom by summarizing the performance which various metro configurations can extract from some modern CMOS technologies. Included in our illustrations is metrojr-orbit, a minimal instance of the metro architecture we constructed in a 1:2 gate-array technology.
Parallel Timing Simulation on a Distributed Memory Multiprocessor
- In International Conference on CAD
, 1993
Cited by 9 (6 self)
We present a parallel timing simulator, PARSWEC, that exploits speculative parallelism and runs on a distributed memory multiprocessor. It is based on an event-driven timing simulator called SWEC. Our approach uses optimistic scheduling to take advantage of the latency of digital signals. Using data from trace-driven analysis, we demonstrate that optimistic scheduling exploits more parallelism than conservative scheduling for circuits with feedback signal paths. We then describe the PARSWEC implementation and discuss several design trade-offs. Speedups over SWEC on large circuits are as high as 55 on a 64-node CM5 multiprocessor. These results indicate the feasibility of using distributed memory multiprocessors for large-scale circuit simulation. 1 Introduction We present a parallel timing simulator, PARSWEC, developed for distributed memory multiprocessors. PARSWEC is a parallelization of SWEC [1], an event-driven timing simulator. SWEC employs a stepwise linear waveform and device mo...
Analyzing the Performance of MPI in a Cluster of Workstations based on Fast Ethernet
- In Fourth European PVM/MPI User's Group Meeting
, 1997
Cited by 8 (3 self)
Recent improvements in LANs make network of workstations a good alternative to traditional parallel computers in some applications.

