## Programmable Active Memories: Reconfigurable Systems Come of Age (1996)

Venue: IEEE Transactions on VLSI Systems

Citations: 143 (5 self)

### BibTeX

@ARTICLE{Vuillemin96programmableactive,
  author = {J. Vuillemin and P. Bertin and D. Roncin and M. Shand and H. Touati and P. Boucard},
  title = {Programmable Active Memories: Reconfigurable Systems Come of Age},
  journal = {IEEE Transactions on VLSI Systems},
  year = {1996},
  volume = {4},
  pages = {56--69}
}

### Abstract

Programmable Active Memories (PAM) are a novel form of universal reconfigurable hardware co-processor. Based on Field-Programmable Gate Array (FPGA) technology, a PAM is a virtual machine, controlled by a standard microprocessor, which can be dynamically and indefinitely reconfigured into a large number of application-specific circuits. PAMs offer a new mixture of hardware performance and software versatility. We review the important architectural features of PAMs, through the example of DECPeRLe-1, an experimental device built in 1992. PAM programming is presented, in contrast to classical gate-array and full custom circuit design. Our emphasis is on large, code-generated synchronous systems descriptions

### Citations

4431 | Computer Architecture: A Quantitative Approach
- Hennessy, Patterson
- 2007
Citation Context ...quivalent to multiplication. Due to the great variety of the operations required by each application, quantitative performance comparison between different computer architectures is a challenging art [50]. The millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOPS) are more traditional units for measuring computing power. By our definition, a 32-bit stand...

3303 | A method for obtaining digital signatures and public-key cryptosystems
- Rivest, Shamir, et al.
Citation Context ... conventional multipliers, even for short 32-bit operands. B. RSA cryptography To further investigate the tradeoffs in our hybrid hardware and software system, we have focused on the RSA cryptosystem [23]. Both encryption and decryption involve computing modular exponentials, which can be decomposed as sequences of long modular multiplications, with operand sizes ranging from 256 bits to 1k bits. Star...

108 | Programmable Active Memories: a Performance Assessment
- Bertin, Roncin, et al.
- 1993
Citation Context ...onvolution codes, by recompiling a new P1 configuration on-the-fly for each code. VI. The computing power of PAM Let us now quantify the computing power of a PAM processor. Following earlier reports [48] [7], we define the virtual computing power of a PAM with n PABs which operate at f Hertz as the product P = n × f. The resulting power P is expressed in boolean operations per second (BOPS). Fo...

89 | Fast implementations of RSA cryptography
- Shand, Vuillemin
- 1993
Citation Context ...ers 4; Precompute powers 1.25; Hensel's division 1.5; Carry completions 2; Quotient pipelining 4. The resulting P1 design for RSA cryptography combines all of the above techniques (see Shand and Vuillemin [25] for details). It delivers an RSA secret decryption rate in excess of 600 kb/s for 512-bit keys, and 165 kb/s for 1 kbit keys. This is an order of magnitude faster than any previously reported running ...

47 | P-NAC: A Systolic Array for Comparing Nucleic Acid Sequences
- Lopresti
Citation Context ...T using 12 Transputers; it has only half of the performance obtained by a system previously developed at IRISA based on 28 custom VLSI chips and two printed-circuit boards. The DNA matching algorithm [26] is the driving application for the PAM developed at the Supercomputing Research Center in Maryland [12]: the reported performance is, here again, in excess of that obtained with existing supercompute...

36 | Two's Complement Pipeline Multipliers
- Lyon
- 1976
Citation Context ...cation PAMs may be configured as long integer multipliers [18]. They compute the product P = A × B + S where A is an n-bit long multiplier, and B, S are arbitrary size multiplicands and summands [19]; n may be up to 2k for the P1 implementation. Our multipliers are interfaced with the public domain arbitrary-precision arithmetic package BigNum [20]: programs based on that software automatically ...

32 | Bignum: a portable efficient package for arbitrary-precision arithmetic
- Hervé, Serpette, et al.
- 1989
Citation Context ...S are arbitrary size multiplicands and summands [19]; n may be up to 2k for the P1 implementation. Our multipliers are interfaced with the public domain arbitrary-precision arithmetic package BigNum [20]: programs based on that software automatically benefit from the PAM, by simply linking with an appropriately modified BigNum library. P1 computes product bits at 66 Mb/s (using radix 4 operations a...

21 | Hardware Speedups in Long Integer Multiplication
- Shand, Bertin, et al.
- 1991
Citation Context ...metic [Fig. 7. Long multiplication.] PAMs may be configured as long integer multipliers [18]. They compute the product P = A × B + S where A is an n-bit long multiplier, and B, S are arbitrary size multiplicands and summands [19]; n may be up to 2k for the P1 implementation. Our multip...

16 | A multiprecise integer arithmetic package
- Buell, Ward
- 1989
Citation Context ...rary. P1 computes product bits at 66 Mb/s (using radix 4 operations at 33 MHz), which is faster than all previously published benchmarks. This is 16 times over the figures reported by Buell and Ward [21] for the Cray II and Cyber 170/750. P1's multiplier can compute a 50-coefficient 16-bit polynomial convolution (FIR filter) at 16 times audio real time (2 × 24-bit samples at 48 kHz). The first...

16 | A combinatorial limit to the computing power of VLSI circuits
- Vuillemin
- 1983
Citation Context ...d logical operations are bit-wise equivalent to addition. × One (n × m ↦ n + m)-bit multiplication each nanosecond is worth nm GBOPS. Division, integer shifts and transitive (see Vuillemin [49]) bit permutations are bit-wise equivalent to multiplication. Due to the great variety of the operations required by each application, quantitative performance comparison between different computer ar...

13 | A User Programmable Reconfigurable Logic Array
- Carter, Duong, et al.
- 1986
Citation Context ...l times. We assess, before the conclusion, the computing power of PAM technology, for the present and the future times. II. Virtual circuits The first commercial FPGA was introduced in 1986 by Xilinx [1]. This revolutionary component has a large internal configuration memory, and two modes of operation: in download mode, the configuration memory can be written, as a whole, through some external devic...

13 | A hardware emulator for binary neural networks
- Skubiszewski
- 1990
Citation Context ...retization grids in order to rapidly advance in simulated time, then "zooms back in" to full resolution, in order to accurately smooth out the desired final result. E. Neural networks M. Skubiszewski [33] [34] has implemented a hardware emulator for binary neural networks, based on the Boltzmann machine model. The Boltzmann machine is a probabilistic algorithm which minimizes quadratic forms over bina...

13 | High energy physics on DECPeRLe-1 programmable active memory
- Moll, Vuillemin, et al.
- 1995
Citation Context ... quantitative analysis of the computing power required for this task: the PAM is the only structure found to meet this bound. This algorithm was implemented by P. Boucard and J. Vuillemin on P1 [37] [38]. Using the external I/O capabilities described in section III-C, data is input from the detectors through two off-the-shelf HIPPI-to-TURBOchannel interface boards plugged directly onto P1. The data...

12 | Mémoires actives programmables: conception, réalisation et programmation. (Programmable Active Memories
- Bertin
- 1993
Citation Context ...ic matching algorithms. More systems exist than just the ones mentioned here. A thorough presentation of the issues involved in PAM design, with alternative implementation choices, is given by Bertin [16]. IV. PAM programming A PAM program consists of three parts: 1. The driving software, which runs on the host and controls the PAM hardware. 2. The logic equations describing the synchronous hardware i...

11 | The Programmable Gate Array Data Book, Xilinx
- Xilinx
- 1992
Citation Context ...he FPGA is a universal such structure: any synchronous digital circuit can be emulated, through a suitable configuration, on a large enough FPGA, for a slow enough clock. Some vendors, such as Xilinx [2] or AT&T [3], form their PABs from both configurable routing and logic blocks. Other early ones, such as Algotronix [4] (now with Xilinx) or Concurrent Logic [5] (now with Atmel), combine routing and ...

11 | Measuring system performance with reprogrammable hardware
- Shand
- 1992
Citation Context ...atsanevas et al. [40] for details). H. Image acquisition P1's TURBOchannel adapter (see section III-C), being built around a single XC3090, is a PAM in its own right, albeit a small one. M. Shand [41] describes a number of experiments based on this board, including an interface ... [Footnote 3: In high-energy physics terminology, this is the first level trigger.]

10 | Multigrid methods on parallel computers-a survey of recent developments, Impact Comput
- McBryan, Frederickson, et al.
- 1991
Citation Context ... same results as floating-point operations. The performance achieved by this first 24-bit P1 design thus exceeds those reported by McBryan et al. [30] [31], for solving the same problems with the help of supercomputers. [Fig. 10. Heat and Laplace equations.] A sequential computer must execute 20 billion instructions per second in order to reproduce the same computation. S. Hadinger and P. Ra...

10 | An exact hardware implementation of the Boltzmann machine
- Skubiszewski
- 1992
Citation Context ...ation grids in order to rapidly advance in simulated time, then "zooms back in" to full resolution, in order to accurately smooth out the desired final result. E. Neural networks M. Skubiszewski [33] [34] has implemented a hardware emulator for binary neural networks, based on the Boltzmann machine model. The Boltzmann machine is a probabilistic algorithm which minimizes quadratic forms over binary va...

9 | Introduction to Programmable Active Memories
- Bertin, Roncin, et al.
- 1989
Citation Context ...ycles. Tricky, but doable. E. Other Reconfigurable Systems Besides our PAMs, which were built first at INRIA in 1987 up to Perle-0, whose architecture is described in some detail in an earlier report [9], then at DEC-PRL, other successful implementations of reconfigurable systems have been reported, in particular at the universities of Edinburgh [10] and Zurich [11], and at the Supercomputer Research...

9 | A gate array for systolic computing
- Furtek
- 1993
Citation Context ...; yet, the MPEG decoder still requires about as much hardware as the following DCT 3D. A detailed FPGA mapping of the motion estimation algorithm (the core of the MPEG standard) is given by Furtek [35]. Mapping this fully laid-out design onto P1 would be a straightforward task. DCT 3D J. Vuillemin, D. Martineau, and J. Barraquand from PRL have used P1 to experiment with DCT 3D, a 3-D version of ...

8 | Fast linear Hough transform
- Vuillemin
- 1994
Citation Context ...ifferent slopes crossing a 128 × 96 grid at the required 100 kHz rate, with a latency of 2 images (20 µs). It needs more than twice the computing power of P1 to achieve this result. J. Vuillemin [39] describes an O(N² log N) algorithm to compute the Hough transform, in a recursive way analogous to the Fast Fourier Transform (figure 14). The resulting gain in the processing power needed by the c...

7 | ENABLE: A Systolic 2nd Level Trigger Processor for Track Finding and e/pi Discrimination for ATLAS/LHC
- Klefenz, Noffz, et al.
- 1993
Citation Context ...urgh [10] and Zurich [11], and at the Supercomputer Research Center in Maryland [12]. The ENABLE machine is a system, built from FPGAs and SRAM, specifically constructed at the university of Mannheim [13] for solving the TRT problem of section V-G.2. Many similar application-specific systems have been built in the recent years: the reconfigurable nature is only exploited while developing and debugging the ap...

6 | Evaluating parallel architectures for two real-time applications with 100 kHz repetition rate
- Busson, Charlot, et al.
- 1993
Citation Context ...ftware, we rate this algorithm, which requires a lot of data movement, at 15 GIPS. G. High-energy physics G.1 Image classification The calorimeter is part of a series of benchmarks proposed by CERN [36]. The goal is to measure the performance of various computer architectures, in order to build the electronics required for the Large Hadron Collider (LHC), before the turn of the millennium. The calor...

5 | A new architecture for high-performance FPGAs
- Hill, Britton, et al.
- 1992
Citation Context ... universal such structure: any synchronous digital circuit can be emulated, through a suitable configuration, on a large enough FPGA, for a slow enough clock. Some vendors, such as Xilinx [2] or AT&T [3], form their PABs from both configurable routing and logic blocks. Other early ones, such as Algotronix [4] (now with Xilinx) or Concurrent Logic [5] (now with Atmel), combine routing and computing fu...

5 | The Use of FPGA’s in a Novel Computing Subsystem
- Kean, Buchanan
- 1992
Citation Context ...re is described in some detail in an earlier report [9], then at DEC-PRL, other successful implementations of reconfigurable systems have been reported, in particular at the universities of Edinburgh [10] and Zurich [11], and at the Supercomputer Research Center in Maryland [12]. The ENABLE machine is a system, built from FPGAs and SRAM, specifically constructed at the university of Mannheim [13] for ...

5 | Connection Machine application performance
- McBryan
Citation Context ...s the same results as floating-point operations. The performance achieved by this first 24-bit P1 design thus exceeds those reported by McBryan et al. [30] [31], for solving the same problems with the help of supercomputers. [Fig. 10. Heat and Laplace equations.] A sequential computer must execute 20 billion instructions per second in order to reproduce the same computation. S. Hadinger and ...

5 | Implantation d'un algorithme de stéréovision par corrélation sur mémoire active programmable PeRLe-1, rapport de stage, Ecole des Mines de Paris, Centre de Mathématiques Appliquées, 06904 Sophia-Antipolis
- Moll
- 1993
Citation Context ...dware implementation using four digital signal processors (DSP), developed jointly by INRIA and Matra MSII, performs the same task in 9.6 s. A P1 implementation of the very same algorithm by L. Moll [45] runs over thirty times faster, in 0.28 s: a key step towards real-time stereo matching. This design uses the full 100 MB/s bandwidth available between P1 and its host. It also relies on fast reconfi...

4 | Résolution numérique des équations de Laplace et de la chaleur, rapport d'option, Ecole Polytechnique, 91128 Palaiseau Cedex
- Hadinger, Raynaud-Richard
- 1993
Citation Context ...upercomputers. A sequential computer must execute 20 billion instructions per second in order to reproduce the same computation. S. Hadinger and P. Raynaud-Richard further improved the implementation [32]. Refining the statistical analysis, they show that the datapath width can be reduced to 16 bits provided the rounding-off of the low-order bit is done randomly, with all deterministic round-off sche...

4 | Implementation of long constraint length Viterbi decoders using Programmable Active Memories
- Keaney, Lee, et al.
- 1993
Citation Context ...veloped [46]. R. Keaney and D. Skellern from Macquarie University (Sydney, Australia), together with M. Shand and J. Vuillemin from PRL, have implemented a Viterbi decoder for the Galileo code on P1 [47]. Using on-board RAM to trace through the 2^14 possible states of the encoder, this design computes 4 states in parallel at each 40 ns clock cycle, for an overall decoding speed of 2 kb/s. The coding ...

3 | The Bioccelerator, product brief
- Compugen
- 1993
Citation Context ...figuration is done, once and for all, until the next "hardware release". Commercial products already exist: QuickTurn [14] sells large configurable systems, dedicated to hardware emulation. Compugen [15] sells a modular PAM-like hardware, together with several configurations focusing on genetic matching algorithms. More systems exist than just the ones mentioned here. A thorough presentation of the i...

3 | Contribution à la résolution numérique des équations de Laplace et de la chaleur. To appear
- Vuillemin
- 1993
Citation Context ... circuit technology, fluid dynamics, electrostatics, optics and finance [27]. The classical finite difference method [28] provides computational solutions to the heat and Laplace equations. Vuillemin [29] shows how to speed up this computation with help from special purpose hardware. A first implementation of the method on P1, by Vuillemin and Rocheteau [29], operates with a pipeline depth of 128 op...

3 | Large-scale photospheric motions: first results from an extraordinary eleven-hour granulation observation
- Simon, Brandt, et al.
- 1994
Citation Context ...tinuously. These attributes prove essential to one use of this interface: the principal image acquisition system at the Swedish Vacuum Solar Telescope where the system has been in use since May 1993 [43]. The success of this small PAM (or PAMette) has led us to develop a new PAM board, I/O-oriented and of small size, to explore this new kind of applications. M. Shand, in collaboration with G. Scharm...

3 | A long constraint length VLSI Viterbi decoder for the DSN
- Statman, Zimmerman, et al.
- 1988
Citation Context ...es with K = 7 and K = 8. NASA's Galileo space probe is equipped with a constraint length 15 rate 1/4 encoder, for which a Viterbi decoder based on an array of 256 custom VLSI chips is being developed [46]. R. Keaney and D. Skellern from Macquarie University (Sydney, Australia), together with M. Shand and J. Vuillemin from PRL, have implemented a Viterbi decoder for the Galileo code on P1 [47]. Using ...

2 | The Configurable Logic Data Book
- Ltd
- 1990
Citation Context ...n, on a large enough FPGA, for a slow enough clock. Some vendors, such as Xilinx [2] or AT&T [3], form their PABs from both configurable routing and logic blocks. Other early ones, such as Algotronix [4] (now with Xilinx) or Concurrent Logic [5] (now with Atmel), combine routing and computing functions into a single primitive: this is the fine grain school. An idealized implementation of this fine g...

2 | A variable precision multiplier for field-programmable gate arrays
- Louie, Ercegovac
- 1994
Citation Context ...s, which refined the design on the basis of actual performance measurements, were each developed in less than 5 man-days. A more aggressive multiplier design is later reported by Louie and Ercegovac [22]: using radix 16 and deep pipeline, this multiplier operates at 79 MHz, which is 2.5 times faster than ours within 3 times the area. At that speed, this design is faster than conventional multipliers, even ...

2 | DECPeRLe-1 implementation of NESTOR's first level trigger
- Katsanevas, Shand, et al.
- 1993
Citation Context ...hter-board connectors (see section III-C). Provided the peak data rate can be accommodated (which is the case with the P1 solution), subsequent processing is straightforward (see Katsanevas et al. [40] for details). H. Image acquisition P1's TURBOchannel adapter (see section III-C), being built around a single XC3090, is a PAM in its own right, albeit a small one. M. Shand [41] describes a numb...

1 | On computing power, Programming Languages and System
- Vuillemin
- 1994
Citation Context ...implementations, let us, from now on, choose as our reference unit any active bit with one 4-input boolean function (configurable or not) and one internal bit of state (see section VI and Vuillemin [7]). With its five 5-input functions, the PAB from figure 1 counts for ten or so such units. The FPGA is a virtual circuit which can behave like a number of different ASICs: all it takes to emulate a pa...

1 | Chameleon, a workstation of a different color, Field Programmable Gate Arrays: Architecture and Tools for Rapid Prototyping
- Heeb, Pfister
- 1993
Citation Context ...in some detail in an earlier report [9], then at DEC-PRL, other successful implementations of reconfigurable systems have been reported, in particular at the universities of Edinburgh [10] and Zurich [11], and at the Supercomputer Research Center in Maryland [12]. The ENABLE machine is a system, built from FPGAs and SRAM, specifically constructed at the university of Mannheim [13] for solving the TRT ...

1 | Splash II
- Arnold, Buell, et al.
- 1992
Citation Context ...ther successful implementations of reconfigurable systems have been reported, in particular at the universities of Edinburgh [10] and Zurich [11], and at the Supercomputer Research Center in Maryland [12]. The ENABLE machine is a system, built from FPGAs and SRAM, specifically constructed at the university of Mannheim [13] for solving the TRT problem of section V-G.2. Many similar application-specific...

1 | A survey of hardware implementations of RSA, Crypto '89
- Brickell
- 1990
Citation Context ... boards, all operating in parallel with the host. At 200 kb/s decoding speed, this was faster than all existing 512-bit RSA implementations, regardless of technology, in 1990. A survey by E. Brickell [24] grants the previous speed record for 512-bit keys RSA decryption to a VLSI from AT&T, at 19 kb/s. [Fig. 8. RSA cryptography.] Table I recalls the various o...

1 | Programmable Active Memories in real-time tasks: implementing data-driven triggers for LHC experiments
- Belosloudtsev, Bertin, et al.
- 1995
Citation Context ...urate quantitative analysis of the computing power required for this task: the PAM is the only structure found to meet this bound. This algorithm was implemented by P. Boucard and J. Vuillemin on P1 [37] [38]. Using the external I/O capabilities described in section III-C, data is input from the detectors through two off-the-shelf HIPPI-to-TURBOchannel interface boards plugged directly onto P1. The...

1 | Vuillemin is a graduate from Ecole Polytechnique. He received a Ph. D. from Stanford University in 1972, and one from Paris University in 1974. He taught Computer Science at the University of California, Berkeley in 1975, and Université d'Orsay from 1976
- Thacker
- 1993
Citation Context ...AMs, with applications in a large number of domains. Table II updates what is feasible within 1994 technology. The technology curves for PAM cost/performance derive from those for FPGA and static RAM [51]; we can use them as a basis for extrapolation, from now into the future. Let us compare the respective merits of three possible implementation technologies, for a given specific high-performance syste...