A Tightly-Coupled Processor-Network Interface
- In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), 1992
Cited by 72 (3 self)
Careful design of the processor-network interface can dramatically reduce the software overhead of interprocessor communication. Our interface architecture reduces communication overhead fivefold in our benchmarks. Most of our performance gain comes from simple, low-cost hardware mechanisms for fast dispatching on, forwarding of, and replying to messages. The remaining improvement can be gained by implementing the network interface as part of the processor's register file. For example, using our hardware mechanisms, a register-mapped interface can receive, process, and reply to a remote read request in a total of two RISC instructions. We have implemented an RTL model of an off-chip memory-mapped interface which provides our hardware mechanisms. Our industrial partner, Motorola, is implementing a similar network interface on-chip in an experimental version of the 88110 processor.
Smartest Recompilation
- In ACM Symp. on Principles of Programming Languages, 1993
Cited by 60 (3 self)
To separately compile a program module in traditional statically-typed languages, one has to manually write down an import interface which explicitly specifies all the external symbols referenced in the module. Whenever the definitions of these external symbols are changed, the module has to be recompiled. In this paper, we present an algorithm which can automatically infer the "minimum" import interface for any module in languages based on the Damas-Milner type discipline (e.g., ML). By "minimum", we mean that the interface specifies a set of assumptions (for external symbols) that are just enough to make the module type-check and compile. By compiling each module using its "minimum" import interface, we get a separate compilation method that can achieve the following optimal property: A compilation unit never needs to be recompiled unless its own implementation changes.
Compiler-Controlled Multithreading for Lenient Parallel Languages
- 1991
Cited by 56 (6 self)
Tolerance to communication latency and inexpensive synchronization are critical for general-purpose computing on large multiprocessors. Fast dynamic scheduling is required for powerful non-strict parallel languages. However, machines that support rapid switching between multiple execution threads remain a design challenge. This paper explores how multithreaded execution can be addressed as a compilation problem, to achieve switching rates approaching what hardware mechanisms might provide. Compiler-controlled multithreading is examined through compilation of a lenient parallel language, Id90, for a threaded abstract machine, TAM. A key feature of TAM is that synchronization is explicit and occurs only at the start of a thread, so that a simple cost model can be applied. A scheduling hierarchy allows the compiler to schedule logically related threads closely together in time and to use registers across threads. Remote communication is via message sends and split-phase memory accesses....
The Design, Implementation and Evaluation of Jade, a Portable, Implicitly Parallel Programming Language
- Dept. of Computer Science, Stanford Univ., 1994
Properties of a First-order Functional Language with Sharing
- Theoretical Computer Science, 1994
Cited by 19 (3 self)
A calculus and a model for a first-order functional language with sharing are presented. In most implementations of...
Order-of-evaluation Analysis for Destructive Updates in Strict Functional Languages with Flat Aggregates
- In Conference on Functional Programming Languages and Computer Architecture, 1993
Cited by 19 (1 self)
The aggregate update problem in functional languages is concerned with detecting cases where a functional array update operation can be implemented destructively in constant time. Previous work on this problem has assumed a fixed order of evaluation of expressions. In this paper, we devise a simple analysis, for strict functional languages with flat aggregates, that derives a good order of evaluation for making the updates destructive. Our work improves Hudak's work [14] on abstract reference counting, which assumes fixed order of evaluation and uses the domain of sticky reference counts. Our abstract reference counting uses a 2-point domain. We show that for programs with no aliasing, our analysis is provably more precise than Hudak's approach (even if the fixed order of evaluation chosen by Hudak happens to be the right order). We also show that our analysis algorithm runs in polynomial time. To the best of our knowledge, no previous work shows polynomial time complexity. We suggest ...
A Syntactic Approach to Program Transformations
- 1991
Cited by 16 (6 self)
Kid, a language for expressing compiler optimizations for functional languages, is introduced. The language is λ-calculus based but treats let-blocks as first-class objects. Let-blocks and associated rewrite rules provide the basis to capture the sharing of subexpressions precisely. The language goes beyond λ-calculus by including I-structures, which are essential to express efficient translations of list and array comprehensions. A calculus and a parallel interpreter for Kid are developed. Many commonly known program transformations are also presented. A partial evaluator for Kid is developed, and a notion of correctness of Kid transformations based on the syntactic structure of terms and printable answers is presented.
Generation and Quantitative Evaluation of Dataflow Clusters
- 1993
Cited by 11 (7 self)
Multithreaded or hybrid von Neumann/dataflow execution models have an advantage over the fine-grain dataflow model in that they significantly reduce the run-time overhead incurred by matching. In this paper, we look at two issues related to the evaluation of a coarse-grain dataflow model of execution. The first issue concerns the compilation into a coarse-grain code from a fine-grain one. In this study, the concept of coarse-grain code is captured by clusters, which can be thought of as mini-dataflow graphs which execute strictly, deterministically and without blocking. We look at two bottom-up algorithms, the basic block and the dependence sets methods, to partition dataflow graphs into clusters. The second issue is the actual performance of the cluster-based execution as several architecture parameters are varied (e.g. number of processors, matching cost, network latency, etc.). From the extensive simulation data we evaluate (1) the potential speedup over the fine-grain execution and (2...
Code Generations, Evaluations, and Optimizations in Multithreaded Executions
- 1995
Cited by 10 (0 self)
Efficient large-scale parallel processing can result only from proper handling of latency. Latency arises either from remote memory accesses or synchronizations. Multithreading is an execution model that can effectively deal with latency by switching among a set of ready threads. This model has been proposed in a variety of forms: a unit of storage can be based on either a collection of threads or a single thread, threads can be either blocking or non-blocking, and synchronization can be either implicit or explicit. This dissertation describes research in the evaluation and optimization of various issues in multithreading. Issues of particular interest are the development of a multithreaded execution model to be used as a test-bed and a hybrid code generation scheme where threads are generated in a top-down manner and then optimized in a bottom-up fashion. Various forms of locality are also ide...

