Results 1-10 of 53
Programmable stream processors
IEEE Computer, 2003
Abstract

Cited by 126 (12 self)
The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this translates to a peak arithmetic rate of 16 GFLOPS on single-precision data and 32 GOPS on 16-bit fixed-point data. The scalability of Imagine's programming model and architecture enables it to achieve such high arithmetic rates. Imagine executes applications that have been mapped to the stream programming model. The stream model decomposes applications into a set of computation kernels that operate on data streams. This mapping exposes the inherent locality and parallelism in the application, and Imagine exploits the locality and parallelism to provide a scalable architecture that supports 48 ALUs on a single chip. This paper presents the Imagine architecture and programming model in the first half, and explores the scalability of the Imagine architecture in the second half.
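The decomposition into kernels operating on streams can be illustrated with a rough sketch. This is not Imagine's actual programming interface; the kernel names (`scale_kernel`, `pairwise_sum_kernel`) and the use of Python generators as streams are assumptions made purely for illustration.

```python
from typing import Iterable, Iterator


def scale_kernel(stream: Iterable[float], gain: float) -> Iterator[float]:
    """Hypothetical kernel: multiply every stream element by a constant."""
    for sample in stream:
        yield sample * gain


def pairwise_sum_kernel(stream: Iterable[float]) -> Iterator[float]:
    """Hypothetical kernel: consume two elements, produce their sum."""
    it = iter(stream)
    # zip(it, it) pulls consecutive pairs from the same iterator.
    for a, b in zip(it, it):
        yield a + b


def run_pipeline(samples: Iterable[float]) -> list:
    """Chain the kernels: the output stream of one feeds the next."""
    return list(pairwise_sum_kernel(scale_kernel(samples, 2.0)))
```

Each kernel touches only its input and output streams, which is the kind of locality the abstract says the stream model exposes; independent elements within a stream could in principle be processed in parallel.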
Dissemination Of Information In Interconnection Networks (Broadcasting & Gossiping)
1996
Abstract

Cited by 99 (7 self)
this article follows the aims stated above. The first section introduces this research area. The basic definitions are given and the fundamental, simple observations concerning the relations among the complexity measures defined are carefully explained. This section is
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
1981
Abstract

Cited by 89 (2 self)
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
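A minimal software sketch of how such a primitive can coordinate workers follows. It assumes that replace-add atomically adds an increment to a shared variable and returns the updated value; the abstract does not spell out the semantics, so treat that as an assumption. The `SharedCounter` class and `claim_slots` helper are hypothetical, and a lock stands in for the hardware support the paper describes.

```python
import threading


class SharedCounter:
    """Shared variable with an atomic add-and-return operation."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        # The lock is a stand-in for hardware atomicity (assumption).
        self._lock = threading.Lock()

    def replace_add(self, increment: int) -> int:
        """Atomically add `increment` and return the updated value."""
        with self._lock:
            self._value += increment
            return self._value


def claim_slots(counter: SharedCounter, n_workers: int) -> list:
    """Each worker claims a distinct slot number with one replace-add."""
    claimed = []
    claimed_lock = threading.Lock()

    def worker() -> None:
        slot = counter.replace_add(1)
        with claimed_lock:
            claimed.append(slot)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(claimed)
```

Because every worker receives a different return value, no further locking is needed to hand out unique work items. The lock here serializes requests, so this sketch does not reproduce the combining behavior claimed for the hardware implementation.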
A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures
IEEE Trans. on Circuits and Systems, 1990
Eager Sharing for Efficient Massive Parallelism
In Proceedings of the 1992 International Conference on Parallel Processing, 1992
Abstract

Cited by 29 (9 self)
Workstation networks can become teraFLOPS supercomputers by adding high-speed interfaces supporting selective eager sharing. For Gaussian elimination and fast Fourier transform, selective eager sharing is much more efficient than global sharing of all data changes, and average efficiency remains above 60% for thousands of processors. Prototype SESAME interfaces will share data at 50 megabytes/second among more than 100 workstations. Propagation delays are typically 0.8 microseconds and overlap computations. All shared data reads are quick local accesses. Eager sharing supports diffuse nonlocal accesses in fine-grained parallel programs much more efficiently than demand-driven cache protocols. Future massively parallel supercomputers should offer eager-sharing coherence mechanisms. Key Words: Distributed Shared Memory, Write Consistency, Eager Sharing, Cache Update, Gaussian Elimination. 1 Introduction Multiprocessors, in which many processors equally share the same memories, limit th...
Design and Performance of a Scalable Parallel Community Climate Model
1995
Abstract

Cited by 28 (14 self)
We describe the design of a parallel global atmospheric circulation model, PCCM2. This parallel model is functionally equivalent to the National Center for Atmospheric Research's Community Climate Model, CCM2, but is structured to exploit distributed memory multicomputers. PCCM2 incorporates parallel spectral transform, semi-Lagrangian transport, and load balancing algorithms. We present detailed performance results on the IBM SP2 and Intel Paragon. These results provide insights into the scalability of the individual parallel algorithms and of the parallel model as a whole. 1. Introduction. Computer models of the atmospheric circulation are used both to predict tomorrow's weather and to study the mechanisms of global climate change. Over the last several years, we have studied the numerical methods, algorithms, and programming techniques required to implement these models on so-called massively parallel processing (MPP) computers: that is, computers with hundreds or thousands of pro...
Multidigit Multiplication For Mathematicians
2001
Abstract

Cited by 27 (9 self)
This paper surveys techniques for multiplying elements of various commutative rings. It covers Karatsuba multiplication, dual Karatsuba multiplication, Toom multiplication, dual Toom multiplication, the FFT trick, the twisted FFT trick, the split-radix FFT trick, Good's trick, the Schönhage-Strassen trick, Schönhage's trick, Nussbaumer's trick, the cyclic Schönhage-Strassen trick, and the Cantor-Kaltofen theorem. It emphasizes the underlying ring homomorphisms.
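As a concrete illustration of the first technique listed, here is a rough sketch of Karatsuba multiplication. It is specialized to non-negative Python integers rather than the general commutative rings the paper treats, and the base-case threshold of 16 is an arbitrary choice for illustration.

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply non-negative integers with three recursive products."""
    # Small operands: fall back to built-in multiplication.
    if x < 16 or y < 16:
        return x * y

    # Split both operands at the same bit position.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask

    low = karatsuba(x_lo, y_lo)
    high = karatsuba(x_hi, y_hi)
    # (x_lo + x_hi)(y_lo + y_hi) - low - high = x_lo*y_hi + x_hi*y_lo
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi) - low - high

    return (high << (2 * half)) + (mid << half) + low
```

The point of the technique is that the cross term is recovered from one extra product instead of two, so each level of recursion performs three half-size multiplications rather than four.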
The Parallel Scalability of the Spectral Transform Method
Wea. Rev, 1992
Abstract

Cited by 22 (8 self)
This paper investigates the suitability of the spectral transform method for parallel implementation. The spectral transform method is a natural candidate for general circulation models designed to run on large-scale parallel computers due to the large number of existing serial and moderately parallel implementations. We present analytic and empirical studies that allow us to quantify the parallel performance, and hence the scalability, of the spectral transform method on different parallel computer architectures. We consider both the shallow-water equations and complete GCMs. Our results indicate that for the shallow-water equations parallel efficiency is generally poor because of high communication requirements. We predict that for complete global climate models, the parallel efficiency will be significantly better; nevertheless, projected Teraflop computers will have difficulty achieving acceptable throughput necessary for long-term regional climate studies. 1 Introduction Current ...
The Genesis Distributed Memory Benchmarks  I Methodology and General Relativity benchmark with results for the SUPRENUM Computer
Parallel Computing, 1991
Abstract

Cited by 17 (2 self)
this paper, Roger Hockney is responsible for sections 1, 2, 3 and 4.5, and Nigel Bishop for the remaining parts of section 4 on the General Relativity benchmark. Professor Tony Hey is in overall charge of the project. The Genesis benchmarks are offered as open benchmarks to the parallel computing community, and will be made available as a series of releases over netlib. Enquiries regarding current availability, or problems with the benchmarks, should be sent by email to `icw@ecs.soton.ac.uk'. In the absence of an accepted programming standard for expressing message passing, there is currently no way of making DMMP programs easily portable across all manufacturers' computers. However, our policy is to seek the maximum portability by making available a version of each benchmark using the Argonne/GMD PARMACS communication macros [8], which have now been implemented on a wide range of different computers. In this method, macro statements describing the required communication are expanded into Fortran calls to the native communication primitives by a preprocessor. Currently the PARMACS macro preprocessor is available on the SUPRENUM, Cray Y-MP, Intel iPSC/2 and iPSC/860, nCUBE2, and Meiko Transputer systems and networks of Suns running CSTools. In addition, implementations are planned for the Alliant FX/2800, networks of IBM RS/6000, and PARSYTEC computers. Because it is implemented on top of the native communication libraries of the above computers, PARMACS carries with it an overhead of varying severity. Consequently, some benchmarks are also available using the native communication facilities of the computers. These mostly use the SUPRENUM extension of Fortran90 which adds SEND and RECEIVE statements to the language with a syntax similar to the Fortran READ and WRITE, CS...
The Cube-Connected Cycles Network Is A Subgraph Of The Butterfly Network
1992
Abstract

Cited by 15 (2 self)
We prove the following results: (a) The Cube-Connected Cycles network of dimension n is a subgraph of the Butterfly network of dimension n. (b) The Shuffle-Exchange network of dimension n is a subgraph of the DeBruijn network of dimension n. Keywords: networks, embedding 1. Introduction Parallel computer architectures as well as data structures can be described by networks in a very natural manner. One important way to compare these networks is their ability to simulate each other. To describe such a simulation one network is embedded into another one with respect to some cost measures. There are various ways to compare networks [1]. An "optimal" simulation is possible, if one network is a subgraph of the other one. Such embeddings will be provided for four important networks. We show that the Cube-Connected Cycles of dimension n is a subgraph of the n-dimensional Butterfly network [2, 3] and that the n-dimensional Shuffle-Exchange network is a subgraph of the n-dimensional DeBruij...