Results 1 - 10 of 16
Synchronization and Communication in the T3E Multiprocessor
1996
Cited by 127 (1 self)
This paper describes the synchronization and communication primitives of the Cray T3E multiprocessor, a shared memory system scalable to 2048 processors. We discuss what we have learned from the T3D project (the predecessor to the T3E) and the rationale behind changes made for the T3E. We include performance measurements for various aspects of communication and synchronization. The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers). E-registers are used as the source or target for all remote communication. They provide a highly pipelined interface to global memory that allows dozens of requests per processor to be outstanding. Through E-registers, the T3E provides a rich set of atomic memory operations and a flexible, user-level messaging facility. The T3E also provides a set of virtual hardware barrier/eureka networks that can be arbitrarily embedded into the 3D torus interconnect.
The Multiscalar Architecture
1993
Cited by 113 (8 self)
The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) dependent tasks in parallel using multiple processing elements. Splitting the instruction stream at statically determined boundaries allows the compiler to pass substantial information about the tasks to the hardware. The processing paradigm can be viewed as an extension of the superscalar and multiprocessing paradigms, and shares a number of properties of the sequential processing model and the dataflow processing model. The multiscalar paradigm is easily realizable, and we describe an implementation of the multiscalar paradigm, called the multiscalar processor. The central idea here is to connect multiple sequential processors, in a decoupled and decentralized manner, to achieve overall multiple issue. The multiscalar processor supports speculative execution, allows arbitrary dynamic code motion (facilitated by an efficient hardware memory disambiguation mechanism), exploits communication localities, and does all of these with hardware that is fairly straightforward to build. Other desirable aspects of the
Algorithm + Strategy = Parallelism
Journal of Functional Programming, 1998
Cited by 51 (18 self)
The process of writing large parallel programs is complicated by the need to specify both the parallel behaviour of the program and the algorithm that is to be used to compute its result. This paper introduces evaluation strategies, lazy higher-order functions that control the parallel evaluation of non-strict functional languages. Using evaluation strategies, it is possible to achieve a clean separation between algorithmic and behavioural code. The result is enhanced clarity and shorter parallel programs. Evaluation strategies are a very general concept: this paper shows how they can be used to model a wide range of commonly used programming paradigms, including divide-and-conquer, pipeline parallelism, producer/consumer parallelism, and data-oriented parallelism. Because they are based on unrestricted higher-order functions, they can also capture irregular parallel structures. Evaluation strategies are not just of theoretical interest: they have evolved out of our experience in parallelising several large-scale applications, where they have proved invaluable in helping to manage the complexities of parallel behaviour. These applications are described in detail here. The largest application we have studied to date, Lolita, is a 60,000 line natural language parser. Initial results show that for these applications we can achieve acceptable parallel performance, while incurring minimal overhead for using evaluation strategies.
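The separation between algorithmic and behavioural code described in this abstract can be illustrated with a loose analogy in Python. This is only a sketch under stated assumptions: the paper's strategies are lazy higher-order functions in a non-strict functional language, whereas here a "strategy" is simply a function that decides how a list of zero-argument callables is evaluated. The names `seq_list`, `par_list`, and `sum_of_squares` are hypothetical and do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def seq_list(thunks):
    """Behavioural code: evaluate every thunk one after another."""
    return [thunk() for thunk in thunks]


def par_list(thunks, workers=4):
    """Behavioural code: evaluate the thunks on a thread pool.

    Results are collected in submission order, so the caller sees
    the same list that seq_list would produce.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(thunk) for thunk in thunks]
        return [future.result() for future in futures]


def sum_of_squares(xs, strategy):
    """Algorithmic code: says what to compute, not how to schedule it."""
    # x=x binds the current value; a bare closure would see only the last x.
    thunks = [lambda x=x: x * x for x in xs]
    return sum(strategy(thunks))


if __name__ == "__main__":
    data = list(range(10))
    print(sum_of_squares(data, seq_list))
    print(sum_of_squares(data, par_list))
```

Swapping `seq_list` for `par_list` changes only how the work is scheduled; the algorithm and its result stay the same, which is the kind of separation the abstract claims for evaluation strategies.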
Building Multithreaded Architectures with Off-the-Shelf Microprocessors
1993
Cited by 34 (22 self)
Current strategies for supporting high-performance parallel computing often face the problem of large software overheads for process switching and interprocessor communication. This document presents the design of the Multi-Threaded Architecture (MTA) model, a multiprocessor architecture designed for the efficient parallel execution of both numerical and non-numerical programs. The basic MTA design begins with a conventional processor, and adds what we believe to be minimal external hardware necessary for efficient support of multithreaded programs. The presentation begins with the basic processor design and the program execution model. The latter includes a description of activation frames and thread synchronization. This is followed by a detailed description of the instruction set. Major features of the MTA include the Register-Use Cache for exploiting temporal locality in multiple register set microprocessors more effectively, support for programs requiring nondeterminism and specul...
Communication Studies of DMP and SMP Machines
1997
Cited by 9 (0 self)
Understanding the interplay between machines and problems is key to obtaining high performance on parallel machines. This paper investigates the interplay between programming paradigms and communication capabilities of parallel machines. In particular, we explicate the communication capabilities of the IBM SP-2 distributed-memory multiprocessor and the SGI PowerCHALLENGEarray symmetric multiprocessor. Two benchmark problems of bitonic sorting and Fast Fourier Transform are selected for experiments. Communication-efficient algorithms are developed to exploit the overlapping capabilities of the machines. Programs are written in Message-Passing Interface for portability and identical codes are used for both machines. Various data sizes and message sizes are used to test the machines' communication capabilities. Experimental results indicate that the communication performance of the multiprocessors is consistent with the size of messages. The SP-2 is sensitive to message size but yields ...
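For readers unfamiliar with the first benchmark, the following is a minimal sequential sketch of textbook bitonic sorting. It is not the communication-efficient MPI algorithm developed in the paper, and it assumes the input length is a power of two. The function names are illustrative only.

```python
def bitonic_merge(values, ascending):
    """Sort a bitonic sequence with compare-exchange steps."""
    n = len(values)
    if n <= 1:
        return list(values)
    vals = list(values)
    half = n // 2
    # Compare-exchange elements that are half a sequence apart.
    # In a parallel version, each such pair lives on different processors,
    # which is where the communication cost arises.
    for i in range(half):
        if (vals[i] > vals[i + half]) == ascending:
            vals[i], vals[i + half] = vals[i + half], vals[i]
    return bitonic_merge(vals[:half], ascending) + bitonic_merge(vals[half:], ascending)


def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two (an assumption of this sketch)."""
    n = len(values)
    if n <= 1:
        return list(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    half = n // 2
    # Build a bitonic sequence: first half ascending, second half descending.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return bitonic_merge(first + second, ascending)


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

The fixed, data-independent pattern of compare-exchange steps is what makes the algorithm a convenient probe of a machine's communication behaviour.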
Data and Workload Distribution in a Multithreaded Architecture
1996
Cited by 7 (1 self)
Matching data distribution to workload distribution is important to improve the performance of distributed-memory multiprocessors. While data and workload distribution can be tailored to fit a particular problem to a particular distributed-memory architecture, it is often difficult to do so for various reasons including complexity of address computation, runtime data movement, and irregular resource usage. This report presents our study on multithreading for distributed-memory multiprocessors. Specifically, we investigate the effects of multithreading on data distribution and workload distribution with variable thread granularity. Various types of workload distribution strategies are defined along with thread granularity. Several types of data distribution strategies are investigated. These include row-wise cyclic, k-way partial-row cyclic, and blocked distribution. To investigate the performance of multithreading, two problems are selected: highly sequential Gaussian Elimination with P...
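Two of the data distribution strategies named above can be sketched as owner-computation functions that map a matrix row to the processor holding it. This is a minimal illustration using the common textbook definitions of row-wise cyclic and blocked distribution, which are assumed here rather than taken from the report. The k-way partial-row cyclic scheme is omitted because the abstract does not define it. Function names are hypothetical.

```python
def row_cyclic_owner(row, num_procs):
    """Row-wise cyclic: rows are dealt out to processors round-robin."""
    return row % num_procs


def blocked_owner(row, num_rows, num_procs):
    """Blocked: each processor holds one contiguous chunk of rows."""
    # Ceiling division, so trailing processors may hold fewer rows.
    block_size = -(-num_rows // num_procs)
    return row // block_size


if __name__ == "__main__":
    rows, procs = 8, 4
    print([row_cyclic_owner(r, procs) for r in range(rows)])
    print([blocked_owner(r, rows, procs) for r in range(rows)])
```

In an elimination-style computation the set of active rows shrinks over time, so a blocked layout tends to leave early processors idle while a cyclic layout keeps the remaining work spread out. That trade-off is a general observation, not a result quoted from the report.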
Performance of Shared Caches on Multithreaded Architectures
Journal of Information Science and Engineering, 1996
Cited by 4 (0 self)
A multithreaded computer maintains multiple program counters and register files to support concurrent or overlapped execution of multiple threads of context, and to provide fast context switching for tolerance of memory latency. In this paper, we apply trace-driven simulation to study the performance impact of a multithreaded architecture on the storage hierarchy. In particular, we examined the effects of different multithread scheduling techniques on cache performance. Using several program traces representing typical server/workstation workload mix, we found that the cache performance can be improved over the traditional round-robin scheduling method when the thread with the MRU hit is given a higher priority. With a direct-mapped cache, the absolute hit ratio can be improved by more than 7%. We also studied the performance effects of the multithreading degree, i.e., the number of threads coexisting in the processor at the same time, on cache memory. The results show that bot...
Fine-grain Multi-thread Processor Architecture for Massively Parallel Processing
Proc. IEEE HPCA'95, 1995
Cited by 2 (1 self)
Latency, caused by remote memory access and remote procedure call, is one of the most serious problems in massively parallel computers. In order to eliminate the processors' idle time caused by these latencies, processors must perform fast context switching among fine-grain concurrent processes. In this paper, we propose a processor architecture, called Datarol-II, that promotes efficient fine-grain multi-thread execution by performing fast context switching among fine-grain concurrent processes. In the Datarol-II processor, an implicit register load/store mechanism is embedded in the execution pipeline in order to reduce memory access overhead caused by context switching. In order to reduce local memory access latency, a two-level hierarchical memory system and a load control mechanism are also introduced. We describe the Datarol-II processor architecture, and show its evaluation results. 1 Introduction In order to develop a very high performance computer, a massively parallel machine...
The Efficient Implementation of Sequential Loops in Multithreaded Computation
In Proc. of the 7th IASTED-ISMM Int'l Conf. on Parallel and Distributed Computing and Systems, 1995
Cited by 2 (1 self)
In multithreaded computers, the per-iteration cycle cost varies widely with the loop implementation scheme. In particular, when sequential loops with many loop-carried dependences in their bodies are unfolded, many values must be moved between frames and many synchronizations must be performed between threads, causing considerable overhead. However, how to transform sequential loops efficiently into multithreaded code has been overlooked. In this paper, we present a scheme to execute sequential loops efficiently in multithreaded computers and compare it with other schemes such as recursion and k-bounded loop by simulation over several benchmarks. Simulation results show that the proposed scheme gives the best performance under the same conditions. Keywords: multithreading, loop unfolding, sequential loops, synchronization, communication 1 Introduction Recently, several multithreaded architectures such as *T[1], TAM[2], P-RISC[3], and DAVRID[4] have b...
A Practical Processor Design for Multithreading
In Proc. of Frontiers '96: The Sixth Symp. on the Frontiers of Massively Parallel Computation, 1996
Cited by 1 (0 self)
High-speed message handling is one of the most important problems for efficient multithread processing. We have proposed a processor architecture called Datarol-II that promotes efficient fine-grain multithreaded execution by performing fast context switching among fine-grain concurrent processes. We are developing a prototype multithread machine, KUMP/D (Kyushu University Multi-media Processor on Datarol-II). The processing element of KUMP/D is designed on the basis of a fine-grain message driven (FMD) execution model, in which fine-grain multithreaded executions are driven and controlled by simple fine-grain message communications. In the design of the KUMP/D, we used the off-the-shelf Pentium microprocessor for its processing element, and designed a co-processor, called FMP (Fine grain Message Processor), for fine-grained message handling and communication control. In this paper, we propose the FMD model and introduce the processing element construction in the KUMP/D machine, wh...

