Results 1 - 10 of 24
Monsoon: an explicit token-store architecture
- In Proc. of the 17th Annual Int. Symp. on Comp. Arch
, 1990
Cited by 148 (12 self)
Dataflow architectures tolerate long unpredictable communication delays and support generation and coordination of parallel activities directly in hardware, rather than assuming that program mapping will cause these issues to disappear. However, the proposed mechanisms are complex and introduce new mapping complications. This paper presents a greatly simplified approach to dataflow execution, called the explicit token store (ETS) architecture, and its current realization in Monsoon. The essence of dynamic dataflow execution is captured by a simple transition on state bits associated with storage local to a processor. Low-level storage management is performed by the compiler in assigning nodes to slots in an activation frame, rather than dynamically in hardware. The processor is simple, highly pipelined, and quite general. It may be viewed as a generalization of a fairly primitive von Neumann architecture. Although the addressing capability is restrictive, there is exactly one instruction executed for each action on the dataflow graph. Thus, the machine-oriented ETS model provides new understanding of the merits and the real cost of direct execution of dataflow graphs.
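The "simple transition on state bits" in the abstract above can be illustrated with a small sketch. This is not Monsoon's actual instruction set or token format. It assumes a two-state presence bit per frame slot and a two-input (dyadic) operator, and the names `Frame`, `Instruction`, and `deliver` are hypothetical. The first operand to arrive is parked in the slot the compiler assigned to the node. The second finds the slot occupied, empties it, and fires the operator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

EMPTY, PRESENT = 0, 1  # assumed two-state presence bit per frame slot


@dataclass
class Instruction:
    op: Callable[[float, float], float]  # dyadic operator (left, right)
    slot: int                            # frame offset, fixed at compile time


class Frame:
    """Activation frame: one presence bit and one value per slot."""

    def __init__(self, size: int) -> None:
        self.state: List[int] = [EMPTY] * size
        self.value: List[Optional[float]] = [None] * size


def deliver(frame: Frame, instr: Instruction, port: str, operand: float) -> Optional[float]:
    """Deliver one operand token to a dyadic instruction.

    Returns None if the operand had to wait for its partner,
    or the operator's result if this token completed the match.
    """
    s = instr.slot
    if frame.state[s] == EMPTY:
        # First arrival: store the operand and flip the bit.
        frame.value[s] = operand
        frame.state[s] = PRESENT
        return None
    # Second arrival: fetch the partner, reset the slot, execute.
    partner = frame.value[s]
    frame.state[s] = EMPTY
    frame.value[s] = None
    left, right = (operand, partner) if port == "L" else (partner, operand)
    return instr.op(left, right)
```

For example, delivering the right operand `3.0` and then the left operand `10.0` to a subtraction node returns `None` on the first call and `7.0` on the second, leaving the slot empty for reuse.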
Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine
- in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1991
Cited by 136 (6 self)
In this paper, we present a relatively primitive execution model for fine-grain parallelism, in which all synchronization, scheduling, and storage management is explicit and under compiler control. This is defined by a threaded abstract machine (TAM) with a multilevel scheduling hierarchy. Considerable temporal locality of logically related threads is demonstrated, providing an avenue for effective register use under quasi-dynamic scheduling. A prototype TAM instruction set, TL0, has been developed, along with a translator to a variety of existing sequential and parallel machines. Compilation of Id, an extended functional language requiring fine-grain synchronization, under this model yields performance approaching that of conventional languages on current uniprocessors. Measurements suggest that the net cost of synchronization on conventional multiprocessors can be reduced to within a small factor of that on machines with elaborate hardware support, such as proposed dataflow architectures. This brings into question whether tolerance to latency and inexpensive synchronization require specific hardware support or merely an appropriate compilation strategy and program representation.
Dynamic Dependency Analysis of Ordinary Programs
- In Proceedings of the 19th Annual International Symposium on Computer Architecture
, 1992
Cited by 83 (9 self)
A quantitative analysis of program execution is essential to the computer architecture design process. With the current trend in architecture of enhancing the performance of uniprocessors by exploiting fine-grain parallelism, first-order metrics of program execution, such as operation frequencies, are not sufficient; characterizing the exact nature of dependencies between operations is essential. This paper presents a methodology for constructing the dynamic execution graph that characterizes the execution of an ordinary program (an application program written in an imperative language such as C or FORTRAN) from a serial execution trace of the program. It then uses the methodology to study parallelism in the SPEC benchmarks. We see that the parallelism can be bursty in nature (periods of lots of parallelism followed by periods of little parallelism), but the average parallelism is quite high, ranging from 13 to 23,302 operations per cycle. Exposing this parallelism requires renaming of...
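The idea of deriving average parallelism from a serial trace can be sketched as follows. This is a simplified illustration, not the paper's methodology. It assumes unit-latency operations, unlimited resources, and perfect renaming (only true read-after-write dependencies are kept), and the trace format of `(destination, [sources])` pairs and the function name `average_parallelism` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

TraceEntry = Tuple[str, List[str]]  # (destination location, source locations)


def average_parallelism(trace: Iterable[TraceEntry]) -> float:
    """Operations divided by critical-path length of the dynamic graph.

    Each operation is placed one level after the latest producer of any
    of its sources. Overwriting a location does not delay the writer,
    which models renaming: write-after-read and write-after-write
    orderings are ignored.
    """
    ready: Dict[str, int] = {}  # location -> level of its most recent writer
    depth = 0
    count = 0
    for dest, srcs in trace:
        level = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = level
        depth = max(depth, level)
        count += 1
    return count / depth if depth else 0.0
```

For a five-operation trace in which `c` depends on `a` and `b`, `a` is then overwritten, and `d` reads the new `a`, the critical path has two levels, so the sketch reports 2.5 operations per level. A strictly serial chain reports 1.0.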
Multithreading: A Revisionist View of Dataflow Architectures
- In Proceedings of the 18th Annual International Symposium on Computer Architecture
, 1991
Cited by 67 (1 self)
Although they are powerful intermediate representations for compilers, pure dataflow graphs are incomplete, and perhaps even undesirable, machine languages. They are incomplete because it is hard to encode critical sections and imperative operations which are essential for the efficient execution of operating system functions, such as resource management. They may be undesirable because they imply a uniform dynamic scheduling policy for all instructions, preventing a compiler from expressing a static schedule which could result in greater run time efficiency, both by reducing redundant operand synchronization, and by using high speed registers to communicate state between instructions. In this paper, we develop a new machine-level programming model which builds upon two previous improvements to the dataflow execution model: sequential scheduling of instructions, and multiported registers for expression temporaries. Surprisingly, these improvements have required almost no architectural...
Analysis of Multithreaded Architectures for Parallel Computing
Cited by 57 (4 self)
Multithreading has been proposed as an architectural strategy for tolerating latency in multiprocessors and, through limited empirical studies, shown to offer promise. This paper develops an analytical model of multithreaded processor behavior based on a small set of architectural and program parameters. The model gives rise to a large Markov chain, which is solved to obtain a formula for processor efficiency in terms of the number of threads per processor, the remote reference rate, the latency, and the cost of switching between threads. It is shown that a multithreaded processor exhibits three operating regimes: linear (efficiency is proportional to the number of threads), transition, and saturation (efficiency depends only on the remote reference rate and switch cost). Formulae for regime boundaries are derived. The model is embellished to reflect cache degradation due to multithreading, using an analytical model of cache behavior, demonstrating that returns diminish as the number of threads becomes large. Predictions from the embellished model correlate well with published empirical measurements. Prescriptive use of the model under various scenarios indicates that multithreading is effective, but the number of useful threads per processor is fairly small.
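The linear and saturation regimes named in the abstract above can be illustrated with a deterministic back-of-the-envelope sketch. This is not the paper's Markov-chain formula and it omits the transition regime and cache effects. It assumes every thread runs for a fixed `run_length` cycles between remote references (the reciprocal of the remote reference rate), each reference takes a fixed `latency`, and each switch costs `switch_cost` cycles. The function and parameter names are hypothetical.

```python
def efficiency(n_threads: int, run_length: float, latency: float, switch_cost: float) -> float:
    """Fraction of cycles spent on useful work under fixed-cost assumptions.

    Linear regime: too few threads to cover the latency, so useful work
    grows in proportion to the thread count.
    Saturation: latency is fully hidden, so only the switch overhead
    per run remains and adding threads no longer helps.
    """
    linear = n_threads * run_length / (run_length + switch_cost + latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


def saturation_point(run_length: float, latency: float, switch_cost: float) -> float:
    """Thread count at which the two regimes meet in this simplified model."""
    return 1 + latency / (run_length + switch_cost)
```

With `run_length=10`, `latency=80`, and `switch_cost=10`, this sketch gives an efficiency of 0.1 for one thread, rising linearly until five threads, where it levels off at 0.5. In the saturated case the result depends only on the run length and the switch cost, which matches the abstract's description of that regime.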
Compiler-Controlled Multithreading for Lenient Parallel Languages
, 1991
Cited by 56 (6 self)
Tolerance to communication latency and inexpensive synchronization are critical for general-purpose computing on large multiprocessors. Fast dynamic scheduling is required for powerful non-strict parallel languages. However, machines that support rapid switching between multiple execution threads remain a design challenge. This paper explores how multithreaded execution can be addressed as a compilation problem, to achieve switching rates approaching what hardware mechanisms might provide. Compiler-controlled multithreading is examined through compilation of a lenient parallel language, Id90, for a threaded abstract machine, TAM. A key feature of TAM is that synchronization is explicit and occurs only at the start of a thread, so that a simple cost model can be applied. A scheduling hierarchy allows the compiler to schedule logically related threads closely together in time and to use registers across threads. Remote communication is via message sends and split-phase memory accesses....
Advances in dataflow programming languages
- ACM Comput. Surv
, 2004
Cited by 42 (0 self)
Abstract. Many developments have taken place within dataflow programming languages in the past decade. In particular, there has been a great deal of activity and advancement in the field of dataflow visual programming languages. The motivation for this article is to review the content of these recent developments and how they came
The Impact of Synchronization and Granularity on Parallel Systems
- In Int'l. Symp. on Computer Architecture
, 1990
Cited by 40 (4 self)
In this paper, we study the impact of synchronization and granularity on the performance of parallel systems using an execution-driven simulation technique. We find that even though there can be a lot of parallelism at the fine-grain level, synchronization and scheduling strategies determine the ultimate performance of the system. Loop-iteration level parallelism seems to be a more appropriate level when those factors are considered. We also study barrier synchronization and data synchronization at the loop-iteration level and find that both schemes are needed for better performance.
Multiprocessor Runtime Support for Fine-Grained, Irregular DAGs
- In Rajiv K. Kalia and Priya Vashishta, editors, Toward Teraflop Computing and New Grand Challenge Applications
, 1995
Cited by 20 (1 self)
We examine multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of 34 for 128 processors, resulting in an absolute performance of over 33 million double-precision floating point operations per second. We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and...
Feedback Directed Implicit Parallelism
Cited by 15 (0 self)
In this paper we present an automated way of using spare CPU resources within a shared memory multi-processor or multi-core machine. Our approach is (i) to profile the execution of a program, (ii) from this to identify pieces of work which are promising sources of parallelism, (iii) recompile the program with this work being performed speculatively via a work-stealing system and then (iv) to detect at run-time any attempt to perform operations that would reveal the presence of speculation. We assess the practicality of the approach through an implementation based on GHC 6.6 along with a limit study based on the execution profiles we gathered. We support the full Concurrent Haskell language compiled with traditional optimizations and including I/O operations and synchronization as well as pure computation. We use 20 of the larger programs from the ‘nofib’ benchmark suite. The limit study shows that programs vary a lot in the parallelism we can identify: some have none, 16 have a potential 2x speed-up, 4 have 32x. In practice, on a 4-core processor, we get 10-80% speed-ups on 7 programs. This is mainly achieved at the addition of a second core rather than beyond this. This approach is therefore not a replacement for manual parallelization, but rather a way of squeezing extra performance out of the threads of an already-parallel program or out of a program that has not yet been parallelized.

