## Design and Implementation of the MorphoSys Reconfigurable Computing Processor (2000)

Venue: | Journal of VLSI and Signal Processing-Systems for Signal, Image and Video Technology |

Citations: | 13 - 0 self |

### BibTeX

@INPROCEEDINGS{Lee00designand,

author = {Ming-hau Lee and Hartej Singh and Guangming Lu and Nader Bagherzadeh and Fadi J. Kurdahi},

title = {Design and Implementation of the MorphoSys Reconfigurable Computing Processor},

booktitle = {Journal of VLSI and Signal Processing-Systems for Signal, Image and Video Technology},

year = {2000},

publisher = {Kluwer Academic Publishers}

}

### Abstract

In this paper, we describe the implementation of MorphoSys, a reconfigurable processing system targeted at data-parallel and computation-intensive applications. The MorphoSys architecture consists of a reconfigurable component (an array of reconfigurable cells) combined with a RISC control processor and a high-bandwidth memory interface. We briefly discuss the system-level model, array architecture, and control processor. Next, we present the detailed design implementation and the various aspects of the physical layout of the different sub-blocks of MorphoSys. The physical layout was constrained for 100 MHz operation, with low power consumption, and was implemented using 0.35 µm, four-metal-layer CMOS (3.3 V) technology. We provide simulation results for the MorphoSys architecture (based on a VHDL model) for some typical data-parallel applications (video compression and automatic target recognition). The results indicate that the MorphoSys system can achieve significantly better performance...

### Citations

410 | Digital Integrated Circuits: A Design Perspective, Upper Saddle River
- Rabaey
- 1996
Citation Context ...8 bits ALU from SPICE simulation is within 3 ns. It consumes 15 mW of power at 25 °C for 100 MHz operation. Shifter. It is a logarithmic shifter with a maximum shift width of 16 bits. As depicted in [19], for large shift values, the logarithmic shifter is effective both in terms of area and speed, therefore, it is used for MorphoSys implementation. Critical path. The critical path of the RC, which is... |
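The logarithmic shifter mentioned in this context can be illustrated with a short software model. This is a minimal sketch, not the MorphoSys circuit: the function name `log_shift_left`, the 32-bit word width, and the left-shift direction are assumptions made for illustration. Only the 16-bit maximum shift comes from the excerpt. Each stage shifts by a fixed power of two and is enabled by one bit of the shift amount, so the number of stages grows with the logarithm of the maximum shift.

```python
def log_shift_left(value, amount, width=32, max_shift=16):
    """Model a logarithmic left shifter (width and direction are assumptions).

    Stage k shifts by 2**k when bit k of `amount` is set, so a maximum
    shift of 16 needs five stages: 1, 2, 4, 8, and 16.
    """
    if not 0 <= amount <= max_shift:
        raise ValueError("shift amount out of range")
    mask = (1 << width) - 1
    result = value & mask
    stage = 0
    while (1 << stage) <= max_shift:
        if (amount >> stage) & 1:
            # This stage is enabled: apply its fixed power-of-two shift.
            result = (result << (1 << stage)) & mask
        stage += 1
    return result


if __name__ == "__main__":
    # A shift by 5 passes through the 1-bit and 4-bit stages only.
    print(hex(log_shift_left(0x1234, 5)))
```

A shift by any amount therefore passes through the same fixed number of stages, which is why the delay does not grow linearly with the shift value.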

342 | Garp: A MIPS Processor with a Reconfigurable Coprocessor
- Hauser, Wawrzynek
- 1997
Citation Context ...rmance for word-level datapath operations. Hence, many researchers have proposed prototypes of reconfigurable computing systems that employ non-FPGA reconfigurable components such as DPGA [4], Garp [5], PADDI [6], MATRIX [7], RaPiD [8], REMARC [9], and RAW [10]. In this paper, we describe the implementation of MorphoSys, which is based on a novel model of a reconfigurable computing system. This mod... |

233 | Computer Arithmetic Algorithms
- Koren
Citation Context ...udget allows approximately 3 ns for ALU operations. The carry-ripple adder is too slow to accomplish 28 bits addition/subtraction operations in 3 ns. Both carry-lookahead adder and carry-select adder [18] are well-known schemes for high speed adder design, however, they require twice as much area as the carry-ripple adder. Consequently, we use carry-skip [18] scheme (that uses almost the same area as t... |

214 | A suggestion for a fast multiplier
- Wallace
- 1964
Citation Context ... CSA designs. From the data in Table 3 the CSA design using CPL2 has the lowest delay-power product, hence, it is used in current implementation. Several researchers [14] have shown that both Wallace [15] and Dadda [16] algorithms are efficient for array type multipliers and can be implemented using the minimum number of CSAs. However, we use a regular array structure instead, which requires more CSAs... |

172 | Fast Computational Algorithm for the Discrete Cosine Transform
- Chen, Smith, et al.
- 1977
Citation Context ... Compression: Discrete Cosine Transform (DCT) for MPEG The forward and inverse DCT are used in MPEG encoders and decoders. In the following analysis, we consider an algorithm for fast 8-point 1-D DCT [24]. It involves 16 multiplications and 26 additions, leading to 256 multiplications and 416 additions for a 2-D implementation. The 1-D algorithm is first applied to the rows (columns) of an input 8 × 8... |
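The operation counts quoted in this context follow from the row-column (separable) structure of the 2-D transform: an 8 × 8 block needs eight 1-D transforms over the rows and eight over the columns. The sketch below only reproduces that arithmetic. The function and constant names are illustrative assumptions, not taken from the paper.

```python
N = 8            # block size (8-point DCT, from the excerpt)
MULTS_1D = 16    # multiplications per 1-D transform (from the excerpt)
ADDS_1D = 26     # additions per 1-D transform (from the excerpt)


def separable_2d_counts(n, mults_1d, adds_1d):
    """Return (multiplications, additions) for a row-column 2-D transform."""
    passes = 2 * n  # n row transforms followed by n column transforms
    return passes * mults_1d, passes * adds_1d


if __name__ == "__main__":
    mults_2d, adds_2d = separable_2d_counts(N, MULTS_1D, ADDS_1D)
    print(mults_2d, adds_2d)
```

With sixteen 1-D passes, the totals come to 16 × 16 = 256 multiplications and 16 × 26 = 416 additions, matching the figures in the excerpt.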

107 | Some schemes for parallel multipliers
- Dadda
- 1965
Citation Context ...rom the data in Table 3 the CSA design using CPL2 has the lowest delay-power product, hence, it is used in current implementation. Several researchers [14] have shown that both Wallace [15] and Dadda [16] algorithms are efficient for array type multipliers and can be implemented using the minimum number of CSAs. However, we use a regular array structure instead, which requires more CSAs and has a long... |

88 | MATRIX: A reconfigurable computing architecture with configurable instruction distribution and deployable resources
- Mirsky, DeHon
- 1996
Citation Context ... datapath operations. Hence, many researchers have proposed prototypes of reconfigurable computing systems that employ non-FPGA reconfigurable components such as DPGA [3], Garp [4], PADDI [5], MATRIX [6], RaPiD [7], REMARC [8], and RAW [9]. In this paper, we describe the implementation of MorphoSys, which is based on a novel model of a reconfigurable computing system. This model is aimed at applicati... |

76 | Building and using a highly parallel programmable logic array
- Gokhale, Holmes, et al.
- 1991
Citation Context ...h widths of a few bits) and coarse-grain (basic processing elements have data-paths of eight or sixteen bits or more). Research prototypes with fine-grain granularity include Splash [12], DPGA [3] and Garp [4]. Reconfigurable processors with coarse-grain granularity are PADDI [5], MATRIX [6], RaPiD [7], and REMARC [8]. MorphoSys is a coarse-grain architecture, since the target applic... |

71 | A Two’s Complement Parallel Array Multiplication Algorithm
- Baugh, Wooley
- 1973
Citation Context ...tant design consideration is to find an efficient algorithm for 2’s complement multiplication. The regular CSA array structure leads us to the decision of sign extension and array reduction algorithm [17] because the partial products can be generated without any recoding. Also, the summation of the partial products can be carried out by carry-save adders directly without any modification. Figure 5 sho... |

54 | A family of VLSI designs for the motion compensation block-matching algorithms
- Yang, Sun, et al.
- 1989
Citation Context ...iption of the mapping can be found in [11]. For a reference block size of 16 × 16 and image size of 352 × 288 pixels at 30 frames per second (MPEG-2 main profile, low level), the processing of an entire image takes about 21... |

Table 7. Performance comparison for motion estimation.

| | MorphoSys | ASIC [21] | ASIC [22] | Pentium MMX |
|---|---|---|---|---|
| # of clock cycles | 1020 | 581 | 1159 | 29000 |
| Processing time | 10.2 µs | 2.9 µs | 5.8 µs | 145 µs |

46 | A Reconfigurable Multiprocessor IC for Rapid Prototyping of Algorithmic-Specific High-Speed DSP Data Paths
- Chen, Rabaey
- 1992
Citation Context ...r word-level datapath operations. Hence, many researchers have proposed prototypes of reconfigurable computing systems that employ non-FPGA reconfigurable components such as DPGA [3], Garp [4], PADDI [5], MATRIX [6], RaPiD [7], REMARC [8], and RAW [9]. In this paper, we describe the implementation of MorphoSys, which is based on a novel model of a reconfigurable computing system. This model is aimed ... |

44 | Architecture of FPGAs and CPLDs: A Tutorial
- Brown, Rose
- 1996
Citation Context ...nce typical of ASIC devices and also provides the flexibility of a general-purpose processor (i.e. it can execute a wide range of applications). Conventionally, field programmable gate arrays (FPGAs) [2] are the most common devices used for implementing reconfigurable components. This is because FPGAs allow designers to manipulate gate-level devices such as flip-flops, memory and other logic gates. Ho... |

44 | A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications
- Miyamori, Olukotun
- 1998
Citation Context ...ence, many researchers have proposed prototypes of reconfigurable computing systems that employ non-FPGA reconfigurable components such as DPGA [3], Garp [4], PADDI [5], MATRIX [6], RaPiD [7], REMARC [8], and RAW [9]. In this paper, we describe the implementation of MorphoSys, which is based on a novel model of a reconfigurable computing system. This model is aimed at applications that feature high d... |

40 | VLSI Architecture for Block-Matching Motion Estimation Algorithm
- Hsieh, Lin
- 1992
Citation Context ...imulations. 6.1. Video Compression: Motion Estimation for MPEG Motion Estimation is the most computation-intensive algorithm in MPEG. Among the different algorithms, full search block matching (FSBM) [21] involves the maximum computations, however, gives an optimal solution with low control overhead. The detail description of the mapping can be found in [11]. For a reference block size of 16 × 16 and i... |

39 | The RAW benchmark suite: computation structures for general purpose computing
- Babb, Frank, et al.
- 1997
Citation Context ...archers have proposed prototypes of reconfigurable computing systems that employ non-FPGA reconfigurable components such as DPGA [4], Garp [5], PADDI [6], MATRIX [7], RaPiD [8], REMARC [9], and RAW [10]. In this paper, we describe the implementation of MorphoSys, which is based on a novel model of a reconfigurable computing system. This model is aimed at applications that feature high data-paralleli... |

19 | Automated Target Recognition on SPLASH 2
- Rencher, Hutchings
- 1997
Citation Context ...26] developed at Sandia National Laboratory has been mapped to MorphoSys [12]. For performance analysis, we chose the system parameters that were used in [26]. The ATR systems implemented in [26] and [27] were used for comparison. Two Xilinx 4013 FPGAs (one dynamic FPGA for most of the computations and one static FPGA for control) are used in Mojave [26], and Splash 2 system (consisting of 16 Xilinx 4... |

16 | A MIPS Processor with a Reconfigurable Coprocessor
- Hauser, Wawrzynek
Citation Context ...formance for word-level datapath operations. Hence, many researchers have proposed prototypes of reconfigurable computing systems that employ non-FPGA reconfigurable components such as DPGA [3], Garp [4], PADDI [5], MATRIX [6], RaPiD [7], REMARC [8], and RAW [9]. In this paper, we describe the implementation of MorphoSys, which is based on a novel model of a reconfigurable computing system. This mode... |

16 | Configurable Computing: The Catalyst for High-Performance Architectures
- Ebeling, Cronquist, et al.
- 1997
Citation Context ...perations. Hence, many researchers have proposed prototypes of reconfigurable computing systems that employ non-FPGA reconfigurable components such as DPGA [3], Garp [4], PADDI [5], MATRIX [6], RaPiD [7], REMARC [8], and RAW [9]. In this paper, we describe the implementation of MorphoSys, which is based on a novel model of a reconfigurable computing system. This model is aimed at applications that fe... |

13 | A 3.8-ns CMOS 16 × 16-b Multiplier Using Complementary Pass-Transistor Logic
- Yano
- 1990
Citation Context .... A 16 × 12 multiplier is implemented in RC. This is the component that requires the maximum area and has the longest delay in the RC. Therefore, we use complementary pass-transistor logic (CPL) circuit [13] for designing the multiplier. CPL allows the realization of complex logic functions with minimum number of transistors. It also features high speed operation and low power consumption. Figure 3(a) (C... |

3 | Automatic Target Recognition on SPLASH
- Rencher, Hutchings
- 1997
Citation Context ...26] developed at Sandia National Laboratory has been mapped to MorphoSys [12]. For performance analysis, we chose the system parameters that were used in [26]. The ATR systems implemented in [26] and [27] were used for comparison. Two Xilinx 4013 FPGAs (one dynamic FPGA for most of the computations and one static FPGA for control) are used in Mojave [26], and Splash 2 system (consisting of 16 Xilinx 4... |

3 | The Power Consumption of CMOS Adders and Multipliers
- Callaway, Swartzlander
- 1998
Citation Context ...ndard CMOS and CPL Carry-Save Adder Design

| | Standard CMOS | CPL1 | CPL2 |
|---|---|---|---|
| Number of transistors | 40 | 28 | 30 |
| Delay (0.35 µm, 3.3 V) | 0.54 | 0.22 | 0.20 |
| Power (100 MHz, 25 °C) | 0.21 mW | 0.36 mW | 0.18 mW |

Several researchers [14] have shown that both Wallace [15] and Dadda [16] algorithms are efficient for array type multipliers and can be implemented using the minimum number of CSAs. However, we use a regular array structure... |
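The choice of CPL2 reported in the citation contexts rests on the delay-power product of the three carry-save adder designs. The following sketch recomputes that product from the figures quoted in this context. The dictionary layout and function name are illustrative assumptions, and the delay unit is not stated in the excerpt, so the products are only meaningful relative to one another.

```python
# Figures as quoted in the excerpt; the delay unit is not given there.
designs = {
    "Standard CMOS": {"transistors": 40, "delay": 0.54, "power_mw": 0.21},
    "CPL1": {"transistors": 28, "delay": 0.22, "power_mw": 0.36},
    "CPL2": {"transistors": 30, "delay": 0.20, "power_mw": 0.18},
}


def delay_power_product(design):
    """Delay multiplied by power; lower is better."""
    return design["delay"] * design["power_mw"]


# Pick the design with the smallest delay-power product.
best = min(designs, key=lambda name: delay_power_product(designs[name]))

if __name__ == "__main__":
    for name, design in designs.items():
        print(f"{name}: {delay_power_product(design):.4f}")
    print("lowest delay-power product:", best)
```

CPL1 uses the fewest transistors, but its higher power draw gives it a larger product than CPL2, which is consistent with the selection described in the excerpt.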

1 | Intel Application Notes for Pentium MMX, http://developer.intel.com/drg/mmx/appnotes/
Citation Context ...s is the absolute difference and accumulation unit. Based on our simulation, the two ASIC systems can operate at about 200 MHz in 0.35 µm technology. We used 233 MHz for the Pentium MMX implementation [23], which is the highest clock rate for a 0.35 µm Pentium processor. Taking into account the clock rate, we depict the performance comparison in Table 7. The result shows that MorphoSys can deliver an or... |