The Tera computer system
- In International Conference on Supercomputing
, 1990
Cited by 351 (2 self)
The Tera architecture was designed with several major goals in mind. First, it needed to be suitable for very high speed implementations, i.e., admit a short clock period and be scalable to many processors. This ...
Limits on Interconnection Network Performance
- IEEE Transactions on Parallel and Distributed Systems
, 1991
Cited by 166 (4 self)
As the performance of interconnection networks becomes increasingly limited by physical constraints in high-speed multiprocessor systems, the parameters of high-performance network design must be reevaluated, starting with a close examination of assumptions and requirements. This paper models network latency, taking both switch and wire delays into account. A simple closed form expression for contention in buffered, direct networks is derived and is found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints (such as fixed bisection width, fixed channel width, and fixed node size) and under different workload parameters (such as packet size, degree of communication locality, and network request rate) reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network has the lowest latency only when switch delays and network contention are ignored, but...
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
- In Proceedings of Workshop on Scalable Shared Memory Multiprocessors
, 1991
Cited by 138 (22 self)
The Alewife multiprocessor project focuses on the architecture and design of a large-scale parallel machine. The machine uses a low-dimensional direct interconnection network to provide scalable communication bandwidth, while allowing the exploitation of locality. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. A new scalable cache-coherence scheme called LimitLESS directories allows the use of caches for reducing communication latency and network bandwidth requirements. Alewife also employs run-time and compile-time methods for partitioning and placement of data and processes to enhance communication locality. While the above methods attempt to minimize communication latency, communication with distant processors cannot be completely avoided. Alewife's processor, Sparcle, is designed to tolerate these latencies by rapidly switching between threads of computation. This paper describe...
Processor Coupling: Integrating Compile Time and Runtime Scheduling for Parallelism
- In 19th Annual International Symposium on Computer Architecture
, 1992
Cited by 76 (9 self)
The technology to implement a single-chip node composed of 4 high-performance floating-point ALUs will be available by 1995. This paper presents processor coupling, a mechanism for controlling multiple ALUs to exploit both instruction-level and inter-thread parallelism, by using compile time and runtime scheduling. The compiler statically schedules individual threads to discover available intra-thread instruction-level parallelism. The runtime scheduling mechanism interleaves threads, exploiting inter-thread parallelism to maintain high ALU utilization. ALUs are assigned to threads on a cycle-by-cycle basis, and several threads can be active concurrently. We provide simulation results demonstrating that, on four simple numerical benchmarks, processor coupling achieves better performance than purely statically scheduled or multi-processor machine organizations. We examine how performance is affected by restricted communication between ALUs and by long memory latencies. We also present an implementation and feasibility study of a processor coupled node.
The Impact of Synchronization and Granularity on Parallel Systems
- In International Symposium on Computer Architecture
, 1990
Cited by 40 (4 self)
In this paper, we study the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. We find that even though there can be a lot of parallelism at the fine-grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration level parallelism seems to be a more appropriate level when those factors are considered. We also study barrier synchronization and data synchronization at the loop-iteration level and find that both schemes are needed for better performance.
Memtracker: Efficient and programmable support for memory access monitoring and debugging
- In IEEE 13th International Symposium on High Performance Computer Architecture (HPCA 2007), Feb. 2007
, 2007
Cited by 28 (3 self)
Memory bugs are a broad class of bugs that is becoming increasingly common with increasing software complexity, and many of these bugs are also security vulnerabilities. Unfortunately, existing software and even hardware approaches for finding and identifying memory bugs have considerable performance overheads, target only a narrow class of bugs, are costly to implement, or use computational resources inefficiently. This paper describes MemTracker, a new hardware support mechanism that can be configured to perform different kinds of memory access monitoring tasks. MemTracker associates each word of data in memory with a few bits of state, and uses a programmable state transition table to react to different events that can affect this state. The number of state bits per word, the events to which MemTracker reacts, and the transition table are all fully programmable. MemTracker’s rich set of states, events, and transitions can be used to implement different monitoring and debugging checkers with minimal performance overheads, even when frequent state updates are needed. To evaluate MemTracker, we map three different checkers onto it, as well as a checker that combines all three. For the most demanding (combined) checker, we observe performance overheads of only 2.7% on average and 4.8% worst-case on SPEC 2000 applications. Such low overheads allow continuous (always-on) use of MemTracker-enabled checkers even in production runs.
Multithreaded Architectures: Principles, Projects and Issues
, 1994
Cited by 23 (12 self)
The architecture of future high performance computer systems will respond to the possibilities offered by technology and to the increasing demand for attention to issues of programmability. Multithreaded processing element architectures are a promising alternative to RISC architecture and its multiple-instruction-issue extensions such as VLIW, superscalar, and superpipelined architectures. This paper presents an overview of multithreaded computer architectures and the technical issues affecting their prospective evolution. We introduce the basic concepts of multithreaded computer architecture and describe several architectures representative of the design space for multithreaded, parallel computers. We review design issues for multithreaded processing elements intended for use as the node processor of parallel computers for scientific computing. These include the question of choosing an appropriate program execution model, the organization of the processing element to achieve good utilization of major resources, support for fine-grain interprocessor communication and global memory access, compiling machine code for multithreaded processors, and the challenge of implementing virtual memory in large-scale multiprocessor systems.
MIMD-Style Parallel Programming Based on Continuation-Passing Threads
- Massachusetts Institute of Technology, Laboratory for Computer Science
, 1994
Cited by 18 (3 self)
Today's message passing architectures are characterized by high communication costs and they typically lack hardware support for synchronization and scheduling. These deficiencies present a severe obstacle to obtaining efficient implementations of parallel applications whose communication patterns are either highly irregular or dependent on dynamic information. In this paper we present a model based on continuation-passing threads in which we try to overcome these difficulties. The model incorporates two effective software mechanisms targeted towards lengthening sequential threads in order to offset the costs of dynamic scheduling, and towards preserving the locality of computations to reduce the network traffic. The model is currently implemented as a C language extension along with a runtime system implemented on the CM-5 that embodies a work stealing scheduler. Real world applications written in this package, such as ray-tracing and protein folding, have shown impressive speedup res...
On Memory Models and Cache Management for Shared-Memory Multiprocessors
, 1995
Cited by 8 (4 self)
A popular approach to designing shared-memory computer systems is to specify a memory model upon which a variety of program execution models may be implemented. Alternatively, one may choose a desired program execution model (PXM) and specify a memory model suited to the PXM. We argue that this second approach is to be preferred because it avoids the trap of specifying features of the memory model (consistency, for example) that may not be needed to implement a desired program execution model. If the PXM is a dataflow model (one based on or equivalent to recursive dataflow program graphs), then no cache consistency problem need arise if the memory model supports synchronizing memory operations. Then why use a memory consistency model as a basis for designing shared-memory multiprocessors? One argument is that a general memory model can support a variety of PXMs. However, many good PXMs, object-oriented programming, for example, may be built on top of a basic program model that d...
Superthreading: Integrating compilation technology and processor architecture for cost-effective concurrent multithreading
- Journal of Information Science and Engineering
, 1998
Cited by 4 (1 self)
This thesis presents a concurrent multiple-threaded architectural model, called superthreading, for exploiting fine-grained thread-level parallelism on a processor. This architectural model adopts a thread pipelining execution model that allows threads with data dependences and control dependences to be executed in parallel. The basic idea of thread pipelining is to compute and forward recurrence data and possible dependent store addresses to the next thread as soon as possible, so the next thread can start execution and perform run-time data dependence checking. Thread pipelining also forces contiguous threads to perform their memory write-backs in order, which enables the compiler to fork threads with control speculation. With run-time support for data dependence checking and control speculation, the superthreaded architecture can exploit loop-level and instruction-level parallelism from a broad range of applications. In this thesis we also present the compiler techniques for superthreaded processors. Many existing compiler techniques used in traditional parallelizing compilers for multiprocessors, as well as some specific compiler techniques for superthreaded processors, are needed for generating superthreaded codes and enhancing parallelism between threads. We evaluate the performance of the superthreaded architecture with a trace-driven, cycle-by-cycle superthreaded processor simulator by using codes transformed by hand and codes generated by our superthreading compiler prototype. The simulation results show that a superthreaded processor can achieve good performance by exploiting both thread-level and instruction-level parallelism in programs.

