Performance Tradeoffs In Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems, 1991
Cited by 111 (5 self)
... utilization. By maintaining multiple process contexts in hardware and switching among them in a few cycles, multithreaded processors can overlap computation with memory accesses and reduce processor idle time. This paper presents an analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects. The model is validated through our own simulations and by comparison with previously published simulation results. Our results indicate that processors can substantially benefit from multithreading, even in systems with small caches. Large caches yield close to full processor utilization with as few as two to four contexts, while small caches may require up to four times as many contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limit the best possible utilization.
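The tradeoff described in this abstract can be illustrated with a much simpler, idealized model. The sketch below is not the paper's model: it assumes a fixed run length between long-latency accesses, a fixed memory latency, and a fixed switch cost, and it ignores the cache interference, network contention, and data-sharing effects the paper accounts for. The function name and all parameter names are hypothetical.

```python
def utilization(n_contexts, run_length, latency, switch_cost):
    """Idealized processor utilization with n_contexts hardware contexts.

    Assumptions (illustrative only, not taken from the paper):
    - every context runs for run_length cycles, then issues one
      long-latency access that takes latency cycles;
    - each context switch costs switch_cost cycles (also charged when
      n_contexts == 1, which is a simplification);
    - no cache interference and no network contention.
    """
    if n_contexts < 1:
        raise ValueError("need at least one context")
    # Linear region: latency is not fully hidden, so each context adds
    # run_length useful cycles per (run + switch + latency) period.
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    # Saturated region: the latency is fully overlapped and only the
    # switching overhead is lost.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(utilization(n, run_length=10, latency=40, switch_cost=2), 3))
```

Under these assumptions, utilization grows linearly with the number of contexts until the latency is fully hidden, after which the switch cost alone caps it. This mirrors, in a simplified form, the abstract's remark that context-switching overhead limits the best possible utilization.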
StarT the Next Generation: Integrating Global Caches and Dataflow Architecture
CSG Memo 354, Computation Structures Group, MIT Lab. for Comp. Sci., 1994
Cited by 39 (1 self)
The implicitly parallel programming model provides an attractive approach to deal with the complexity of parallel programming. Implementing this model efficiently, especially on stock processors, remains a big challenge, partly because of the fine granularity of the parallelism exploited. The Monsoon[27] project was designed to address and investigate support for fine-grain parallelism, and has yielded very encouraging results[13]. Our experience with Monsoon and *T[24, 28], a follow-up project after Monsoon, suggests that provision for global shared memory is an area where both the Monsoon and *T architectures can be improved. Starting with the split-phase approach used in Monsoon and *T, we propose to augment global memory access by including coherent global caches. The rapid improvements in stock microprocessors, and the high cost and effort required to develop a competitive microprocessor, present practical constraints on what can be built in any experimental architecture project. ...
Code Generations, Evaluations, and Optimizations in Multithreaded Executions
1995
Cited by 10 (0 self)
Efficient large-scale parallel processing can result only from proper handling of latency. Latency arises either from remote memory accesses or synchronizations. Multithreading is an execution model that can effectively deal with latency by switching among a set of ready threads. This model has been proposed in a variety of forms: a unit of storage can be based on either a collection of threads or a single thread, threads can be either blocking or non-blocking, and synchronization can be either implicit or explicit. This dissertation describes research in the evaluation and optimization of various issues in multithreading. Issues of particular interest are the development of a multithreaded execution model to be used as a test-bed and a hybrid code generation scheme where threads are generated in a top-down manner and then optimized in a bottom-up fashion. Various forms of locality are also ide...
Adding Fast Interrupts to Superscalar Processors
Tech. Rep. Memo-366, MIT Computation Structures Group, 1994
Cited by 10 (0 self)
The hardware cost of taking an interrupt is increasing as processors become more superscalar. Using FLIP, an aggressively superscalar processor which we have designed and tested in Verilog, we demonstrate that interrupts can be fast and inexpensive. We trace individual signals through FLIP's pipeline stages to show that fast interrupts require negligible new hardware. Except for linkage information, interrupts reuse existing branch mechanisms. An asynchronous interrupt acts as an immediate jump instruction, while a synchronous interrupt acts as a mispredicted branch. Although we concentrate on user-level interrupts, we show that kernel-level interrupts can be handled identically with the addition of protection mode bits to identify the protection mode of every outstanding instruction. In blending fast interrupts into the superscalar processor, we address two new problems. The first problem arises from fast synchronous interrupts. Because most instructions can cause an interrupt, the pr...
MIDC Language Manual
1996
Cited by 5 (0 self)
Contents: 1 Introduction; 2 Execution Model; 3 MIDC 1 Description; 4 MIDC 2 Description; 5 MIDC 3 Description; 6 Data Types and Memory Management; 7 MIDC 1 Operators; 8 MIDC 3 Operators; 9 Pragmas (9.1 Node Pragmas, 9.2 Instruction Pragmas); 10 Examples (10.1 Binary Integration, 10.2 Vector Dot Product); 11 Generating MIDC 1; 12 Generating MIDC 2 (Optimizer); 13 Generating MIDC 3 (Frame Code).
1 Introduction: With the proliferation of multithreaded architectures ranging from the Tera [ACC +
Efficient Implementation of Sequential Loops in Dataflow Computation
In Proceedings of the 6th Conference on Functional Programming and Computer Architecture, 1993
Cited by 4 (1 self)
The implementation of sequential loops in dataflow computation had traditionally not received very much attention, as it was assumed that most loops would be executed in parallel. This assumption was valid for earlier dataflow machines such as the MIT Tagged Token Dataflow Architecture (TTDA)[2] and Sigma-1[9], but not for the newest generation of dataflow machines, including Monsoon[6], EM-4[11] and Epsilon-2[7]. On the latter machines, sequential loops use less memory and can execute in fewer instructions, albeit with lower parallelism, than the parallel versions. This characterisation of sequential and parallel loops suggests that programs should have parallel outer loops and sequential inner loops. The run time of sequential loops therefore becomes significant in the overall run time. We also found that previous implementations of sequential loops can incur fairly high overheads. In this paper, we present two new ways of implementing sequential loops that have lower overhead than previous...
The DUDE Runtime System: An Object-Oriented Macro-Dataflow Approach To Integrated Task and Object Parallelism
1995
Cited by 4 (0 self)
Modern parallel programming languages allow programmers to specify parallelism using implicitly parallel constructs, such as data-parallel or object-parallel methods, and explicitly parallel constructs, such as doall, doacross, parallel section or programmer-level threads. In this paper, we present the design of a runtime system that executes data-parallel (or object-parallel) code in the presence of explicit parallelism. This facilitates load balancing between data-parallel computations running in threads of distinct parallel sections, as well as inter-loop load balancing. Although sufficient runtime structure is provided for most extant languages, the runtime system is extensible, allowing compilers to customize the runtime system. To motivate why such a runtime system is desirable, we show performance improvements for programs with complex data dependence relations, such as multigrid solvers.
1 Introduction: Most efforts on simplifying or improving parallel programs have focused ei...
The Design of I-Structure Software Cache System
In Workshop on Multithreaded Execution, Architecture and Compilation, 1998
Cited by 3 (3 self)
The I-Structure memory system has been adopted in some non-blocking multithreaded systems for its ability to hide the latency of accessing I-Structure memory from useful computation by split-phased transaction mechanisms. However, a major drawback of split-phased transactions is that data locality of remote data is not utilized by a conventional on-board cache system. In this paper, we describe the design of our I-Structure Software Cache (ISSC), which takes advantage of data locality while maintaining the capability for latency tolerance of I-Structure memory systems. Keywords: Multithreaded architecture, Distributed memory system, I-Structure cache, Split-Phase transaction, ISSC.
1 Introduction: A split-phased transaction [8, 14] is an asynchronous memory access scheme used in some message-passing multiprocessor systems. Remote memory requests are structured into two phases so that multiple requests may be in progress at the same time: an instruction issues a request to the processor or mem...
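As a rough illustration of the split-phased access idea, the sketch below models an I-structure as a write-once array whose reads take a continuation instead of blocking. It is a single-process toy and does not represent the ISSC design: the class name, method names, and the use of Python callables as continuations are all assumptions made for this example, and no caching or remote communication is modeled.

```python
class IStructure:
    """Toy write-once array with deferred (split-phased) reads.

    Hypothetical illustration only: a read issues its request and
    returns immediately; the supplied continuation runs once the slot
    holds a value.
    """

    _EMPTY = object()  # sentinel marking a slot that has not been written

    def __init__(self, size):
        self._cells = [self._EMPTY] * size
        self._deferred = [[] for _ in range(size)]

    def read(self, index, continuation):
        """Phase 1: request the value; never blocks the caller."""
        value = self._cells[index]
        if value is self._EMPTY:
            # Slot is still empty: queue the reader until a write arrives.
            self._deferred[index].append(continuation)
        else:
            # Phase 2 can happen at once because the value is present.
            continuation(value)

    def write(self, index, value):
        """Fill a slot exactly once and resume every deferred reader."""
        if self._cells[index] is not self._EMPTY:
            raise RuntimeError("I-structure slot written twice")
        self._cells[index] = value
        waiting, self._deferred[index] = self._deferred[index], []
        for continuation in waiting:
            # Phase 2: deliver the value to readers that arrived early.
            continuation(value)


if __name__ == "__main__":
    results = []
    cells = IStructure(4)
    cells.read(2, results.append)   # deferred: slot 2 is still empty
    cells.write(2, 42)              # resumes the deferred read
    cells.read(2, results.append)   # satisfied immediately
    print(results)
```

Because each read returns as soon as the request is issued, several requests can be outstanding at once, which is the property the abstract credits with hiding latency.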

