Results 1 to 10 of 75
Programmable stream processors
 IEEE Computer
, 2003
"... The Imagine Stream Processor is a singlechip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on singleprecision data and 32 GOPS on 16bit fixedpoint data. The scalability of Imagine’s programming model and architecture enable ..."
Abstract

Cited by 134 (12 self)
The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on single-precision data and 32 GOPS on 16-bit fixed-point data. The scalability of Imagine's programming model and architecture enables it to achieve such high arithmetic rates. Imagine executes applications that have been mapped to the stream programming model. The stream model decomposes applications into a set of computation kernels that operate on data streams. This mapping exposes the inherent locality and parallelism in the application, and Imagine exploits that locality and parallelism to provide a scalable architecture that supports 48 ALUs on a single chip. This paper presents the Imagine architecture and programming model in the first half, and explores the scalability of the Imagine architecture in the second half.
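The stream model described in this abstract — kernels applied over data streams — can be sketched in a few lines. This is an illustrative sketch only, not the Imagine toolchain; the `kernel` wrapper and the example pipeline are invented names for illustration.

```python
# Hedged sketch of the stream programming model: applications are decomposed
# into kernels, each of which consumes an input stream record-by-record and
# produces an output stream. Names here are illustrative, not Imagine's API.

def kernel(fn):
    """Wrap a per-record function as a 'kernel' that maps over a whole stream."""
    def run(stream):
        return [fn(x) for x in stream]
    return run

# Two illustrative kernels chained into a pipeline, as a media application
# might scale samples and then clamp them to a maximum value.
scale = kernel(lambda x: 2 * x)
clamp = kernel(lambda x: min(x, 100))

stream = [10, 60, 3]
out = clamp(scale(stream))
print(out)  # [20, 100, 6]
```

Because each kernel touches only its own stream records, the locality and parallelism the abstract mentions are explicit: every record of a kernel's input can, in principle, be processed by a different ALU.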
Dissemination Of Information In Interconnection Networks (Broadcasting & Gossiping)
, 1996
"... this article follows the aims stated above. The first section introduces this research area. The basic definitions are given and the fundamental, simple observations concerning the relations among the complexity measures defined are carefully explained. This section is ..."
Abstract

Cited by 101 (7 self)
this article follows the aims stated above. The first section introduces this research area. The basic definitions are given and the fundamental, simple observations concerning the relations among the complexity measures defined are carefully explained. This section is
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
, 1981
"... In this paper we implement several basic operating system primitives by using a "replaceadd" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential pr ..."
Abstract

Cited by 88 (2 self)
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test-and-set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: if every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
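The semantics of replace-add (atomically add to a shared variable and return the updated value, now usually called fetch-and-add) can be sketched in software. This is a minimal model using a lock, not the combining hardware the paper proposes; the `ReplaceAdd` class and the ticket example are illustrative.

```python
import threading

class ReplaceAdd:
    """Software model of the 'replace-add' primitive: atomically add v to a
    shared variable and return the new value. The hardware combining network
    described in the paper is approximated here with an ordinary lock."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def replace_add(self, v):
        with self._lock:
            self._value += v
            return self._value

# Classic coordination use: concurrent workers each draw a distinct ticket,
# because every replace-add on the same variable returns a different value.
counter = ReplaceAdd()
tickets = []
tickets_lock = threading.Lock()

def worker():
    t = counter.replace_add(1)
    with tickets_lock:
        tickets.append(t)

threads = [threading.Thread(target=worker) for _ in range(8)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(sorted(tickets))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The property the abstract highlights — many simultaneous replace-adds on one variable all succeeding with distinct results — is exactly what makes the primitive suitable for queue pointers and work distribution.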
A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures
 IEEE Trans. on Circuits and Systems
, 1990
"... ..."
(Show Context)
Solution of Partial Differential Equations on Vector Computers
 Proc. 1977 Army Numerical Analysis and Computers Conference
, 1977
"... In this paper we review the present status of numerical methods for partial differential equations on vector and parallel computers. A discussion of the relevant aspects of these computers and a brief review of their development is included, with particular attention paid to those characteristics t ..."
Abstract

Cited by 53 (0 self)
In this paper we review the present status of numerical methods for partial differential equations on vector and parallel computers. A discussion of the relevant aspects of these computers and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations, as well as explicit and implicit methods for initial-boundary value problems. The intent is to point out attractive methods, as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. A brief discussion of application areas utilizing these computers is included.
Design and performance of a scalable parallel community climate model
 Parallel Computing 21
, 1995
"... We describe the design of a parallel global atmospheric circulation model, PCCM2. This parallel model is functionally equivalent to the National Center for Atmospheric Research's Community Climate Model, CCM2, but is structured to exploit distributed memory multicomputers. PCCM2 incorporates p ..."
Abstract

Cited by 31 (15 self)
We describe the design of a parallel global atmospheric circulation model, PCCM2. This parallel model is functionally equivalent to the National Center for Atmospheric Research's Community Climate Model, CCM2, but is structured to exploit distributed memory multicomputers. PCCM2 incorporates parallel spectral transform, semi-Lagrangian transport, and load balancing algorithms. We present detailed performance results on the IBM SP2 and Intel Paragon. These results provide insights into the scalability of the individual parallel algorithms and of the parallel model as a whole.
Multidigit Multiplication For Mathematicians
, 2001
"... This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the splitradix FFT trick, Good's trick, the SchönhageStr ..."
Abstract

Cited by 31 (9 self)
This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the split-radix FFT trick, Good's trick, the Schönhage-Strassen trick, Schönhage's trick, Nussbaumer's trick, the cyclic Schönhage-Strassen trick, and the Cantor-Kaltofen theorem. It emphasizes the underlying ring homomorphisms.
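The first technique in the survey's list, Karatsuba multiplication, replaces four half-size products with three. A minimal integer sketch (the survey treats the general ring setting via homomorphisms; this function name and base case are illustrative):

```python
def karatsuba(x, y):
    """Karatsuba multiplication of nonnegative integers: split each operand
    at B = 2**n and compute three half-size products instead of four."""
    if x < 10 or y < 10:
        return x * y                      # small operands: multiply directly
    n = max(x.bit_length(), y.bit_length()) // 2
    B = 1 << n
    xh, xl = divmod(x, B)                 # x = xh*B + xl
    yh, yl = divmod(y, B)                 # y = yh*B + yl
    hi = karatsuba(xh, yh)
    lo = karatsuba(xl, yl)
    # The middle term reuses hi and lo, so only one extra product is needed:
    mid = karatsuba(xh + xl, yh + yl) - hi - lo
    return hi * B * B + mid * B + lo

print(karatsuba(1234, 5678))  # 7006652
```

Counting products gives the familiar recurrence T(n) = 3 T(n/2) + O(n), hence O(n^log2(3)) instead of the schoolbook O(n^2).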
Eager Sharing for Efficient Massive Parallelism
 In Proceedings of the 1992 International Conference on Parallel Processing
, 1992
"... Workstation networks can become teraFLOPS supercomputers by adding highspeed interfaces supporting selective eager sharing. For Gaussian elimination and fast Fourier transform, selective eager sharing is much more efficient than global sharing of all data changes, and average efficiency remains abov ..."
Abstract

Cited by 29 (9 self)
Workstation networks can become teraFLOPS supercomputers by adding high-speed interfaces supporting selective eager sharing. For Gaussian elimination and fast Fourier transform, selective eager sharing is much more efficient than global sharing of all data changes, and average efficiency remains above 60% for thousands of processors. Prototype SESAME interfaces will share data at 50 megabytes/second among more than 100 workstations. Propagation delays are typically 0.8 microseconds and overlap computations. All shared-data reads are quick local accesses. Eager sharing supports diffuse non-local accesses in fine-grained parallel programs much more efficiently than demand-driven cache protocols. Future massively parallel supercomputers should offer eager-sharing coherence mechanisms. Key Words: Distributed Shared Memory, Write Consistency, Eager Sharing, Cache Update, Gaussian Elimination.
The parallel scalability of the spectral transform method
, 1992
"... This paper investigates the suitability of the spectral transform method for parallel implementation. The spectral transform method is a natural candidate for general circulation models designed to run on largescale parallel computers due to the large number of existing serial and moderately parall ..."
Abstract

Cited by 22 (8 self)
This paper investigates the suitability of the spectral transform method for parallel implementation. The spectral transform method is a natural candidate for general circulation models designed to run on large-scale parallel computers due to the large number of existing serial and moderately parallel implementations. We present analytic and empirical studies that allow us to quantify the parallel performance, and hence the scalability, of the spectral transform method on different parallel computer architectures. We consider both the shallow-water equations and complete GCMs. Our results indicate that for the shallow-water equations parallel efficiency is generally poor because of high communication requirements. We predict that for complete global climate models, the parallel efficiency will be significantly better; nevertheless, projected Teraflop computers will have difficulty achieving acceptable throughput necessary for long-term regional climate studies.
The Genesis Distributed Memory Benchmarks, Part I: Methodology and General Relativity benchmark with results for the SUPRENUM Computer
 Parallel Computing
, 1991
"... this paper, Roger Hockney is responsible for sections 1, 2, 3 and 4.5, and Nigel Bishop for the remaining parts of section 4 on the General Relativity benchmark. Professor Tony Hey is in overall charge of the project. The Genesis benchmarks are offered as open benchmarks to the parallel computing co ..."
Abstract

Cited by 17 (2 self)
In this paper, Roger Hockney is responsible for sections 1, 2, 3 and 4.5, and Nigel Bishop for the remaining parts of section 4 on the General Relativity benchmark. Professor Tony Hey is in overall charge of the project. The Genesis benchmarks are offered as open benchmarks to the parallel computing community, and will be made available as a series of releases over netlib. Enquiries regarding current availability, or problems with the benchmarks, should be sent by email to `icw@ecs.soton.ac.uk'. In the absence of an accepted programming standard for expressing message passing, there is currently no way of making DMMP programs easily portable across all manufacturers' computers. However, our policy is to seek maximum portability by making available a version of each benchmark using the Argonne/GMD PARMACS communication macros [8], which have now been implemented on a wide range of different computers. In this method, macro statements describing the required communication are expanded into Fortran calls to the native communication primitives by a preprocessor. Currently the PARMACS macro preprocessor is available on the SUPRENUM, Cray YMP, Intel iPSC/2 and iPSC/860, nCUBE2, and Meiko Transputer systems, and on networks of Suns running CSTools. In addition, implementations are planned for the Alliant FX/2800, networks of IBM RS/6000, and PARSYTEC computers. Because it is implemented on top of the native communication libraries of the above computers, PARMACS carries with it an overhead of varying severity. Consequently some benchmarks are also available using the native communication facilities of the computers. These mostly use the SUPRENUM extension of Fortran 90, which adds SEND and RECEIVE statements to the language with a syntax similar to the Fortran READ and WRITE, CS...