CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
- In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000
- Cited by 76 (1 self)
Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically-scheduled superscalar processor. Chimaera is capable of performing 9-input/1-output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control-flow into RFU operations, and (3) supporting a more powerful fine-grain data-parallel model than that supported by current multimedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications we show that even with simple optimizations, the Chimaera C compiler is able to map 22% of all instructions to the RFU on the average. A variety of computations are mapped into RFU operations ranging from as simple as add/sub-shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4-way out-of-order superscalar processor Chimaera results in average performance improvements of 21%, assuming a very aggressive core processor design (most pessimistic RFU latency model) and communication overheads from and to the RFU.
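To make the first two capabilities in the abstract above concrete, here is a minimal Python sketch of what "collapsing" an add/shift pair and converting a small branch into a single operation could look like. It is only a behavioral model: the names `rfu_op`, `add_shift_two_ops`, and `abs_diff_branchy` are hypothetical, the 32-bit register width is an assumption, and nothing here reflects the actual Chimaera compiler or hardware interface. Only the 9-input/1-output limit comes from the abstract.

```python
MASK32 = 0xFFFFFFFF  # assumption: model 32-bit integer registers


def add_shift_two_ops(a, b, sh):
    """Baseline: an add followed by a shift, as two separate instructions."""
    t = (a + b) & MASK32
    return (t << sh) & MASK32


def abs_diff_branchy(a, b):
    """Baseline: a small computation that needs a branch in scalar code."""
    if a > b:
        return a - b
    return b - a


def rfu_op(kind, regs):
    """Hypothetical fused operation: reads up to 9 registers, writes 1 result."""
    if len(regs) > 9:
        raise ValueError("an RFU operation reads at most 9 inputs")
    if kind == "add_shift":
        # The add/shift pair collapsed into one operation.
        a, b, sh = regs
        return ((a + b) << sh) & MASK32
    if kind == "abs_diff":
        # Control flow converted into a single select-style operation.
        a, b = regs
        return (a - b) if a > b else (b - a)
    raise ValueError("unknown operation: " + kind)
```

In this model each fused operation produces the same result as its multi-instruction baseline, which is the property a mapping compiler would have to preserve.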
User Transparency: A Fully Sequential Programming Model for Efficient Data Parallel Image Processing
- Science, University of Amsterdam, The Netherlands, 2002
- Cited by 15 (8 self)
Although many image processing applications are ideally suited for parallel implementation, most researchers in imaging do not benefit from high performance computing on a daily basis. Essentially, this is due to the fact that no parallelization tools exist that truly match the image processing researcher's frame of reference. As it is unrealistic to expect imaging researchers to become experts in parallel computing, tools must be provided to allow them to develop high performance applications in a highly familiar manner. In an attempt to provide such a tool, we have designed a software architecture that allows transparent (i.e., sequential) implementation of data parallel imaging applications for execution on homogeneous distributed memory MIMD-style multicomputers. This paper presents an extensive overview of the design rationale behind the software architecture, and gives an assessment of the architecture's effectiveness in providing significant performance gains. In particular, we describe the implementation and automatic parallelization of three well-known example applications that contain many fundamental imaging operations: (1) template matching, (2) multi-baseline stereo vision, and (3) line detection. Based on experimental results we conclude that our software architecture constitutes a powerful and user-friendly tool for obtaining high performance in many important image processing research areas.
A Preliminary Study on the Vectorization of Multimedia Applications for Multimedia Extensions
- In 16th International Workshop of Languages and Compilers for Parallel Computing, 2003
- Cited by 12 (1 self)
In 1994, the first multimedia extension, MAX-1, was introduced to general-purpose processors by HP. Almost ten years have passed, yet the present means of accessing the computing power of multimedia extensions are still limited mostly to assembly programming and the use of system libraries and intrinsic functions. Because of the similarity between multimedia extensions and vector processors, it is believed that traditional vectorization can be used to compile for multimedia extensions. Can traditional vectorization effectively vectorize for multimedia extensions? If not, what additional techniques are needed? This paper tries to answer these two questions. Based on a code study of the Berkeley Multimedia Workload, we identify several new challenges that arise in vectorizing for multimedia extensions, and provide some solutions to these challenges.
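As a rough illustration of the kind of operation a multimedia extension offers, and of why it differs from an ordinary scalar add, the following Python sketch models four unsigned 8-bit lanes packed into one 32-bit word. The helper names (`pack`, `unpack`, `packed_add_saturate_u8`, `packed_add_wrap_u8`) are hypothetical and do not correspond to any specific instruction set; the lane count and width are assumptions chosen for the example, not details taken from the paper.

```python
LANES = 4  # assumption: four 8-bit lanes in a 32-bit word


def pack(lanes):
    """Pack unsigned 8-bit values into one word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(lanes):
        if not 0 <= v <= 0xFF:
            raise ValueError("lane value out of 8-bit range")
        word |= v << (8 * i)
    return word


def unpack(word):
    """Split a word back into its 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def packed_add_saturate_u8(x, y):
    """Lane-wise add that clamps at 255 instead of overflowing."""
    return pack([min(a + b, 0xFF) for a, b in zip(unpack(x), unpack(y))])


def packed_add_wrap_u8(x, y):
    """Lane-wise add with modulo-256 wraparound, for comparison."""
    return pack([(a + b) & 0xFF for a, b in zip(unpack(x), unpack(y))])
```

A vectorizer targeting such hardware has to recognize, from scalar source code, which of these two behaviors the programmer intended for each narrow-integer addition.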
Subword Sorting with Versatile Permutation Instructions
- 2002
- Cited by 9 (7 self)
Subword parallelism has succeeded in accelerating many multimedia applications. Subword permutation instructions have been proposed to efficiently rearrange subwords in or among registers. Bit-level permutation instructions have also been proposed recently for their importance in cryptography. However, some important algorithms, especially ones with lots of conditional control dependencies such as sorting, have not exploited the advantage of subword parallel instructions. In this paper, we show how one of the bit permutation instructions, GRP, can be used for fast sorting. In the process, we demonstrate the versatility of this permutation instruction for uses other than bit permutations. This versatility is important in considering the addition of a new instruction to a general-purpose processor. The results show that our sorting methods have a significant speedup even when compared with the fastest sorting algorithms. We also discuss the hardware implementation of the GRP instruction and compare its latency to a typical processor's cycle time.
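The connection between a grouping permutation and sorting can be sketched in a few lines of Python. The sketch assumes the usual description of GRP from the bit-permutation literature: elements whose control bit is 0 move to the left group, elements whose control bit is 1 move to the right group, and relative order is kept inside each group. That definition, the function names `grp` and `radix_sort_with_grp`, and the use of whole keys instead of register subwords are all assumptions made for illustration; this is not the sorting method of the paper itself.

```python
def grp(src, ctl):
    """Group elements of src by control bit: 0s to the left, 1s to the right.

    Order within each group is preserved (a stable partition).
    """
    if len(src) != len(ctl):
        raise ValueError("src and ctl must have the same length")
    left = [s for s, c in zip(src, ctl) if c == 0]
    right = [s for s, c in zip(src, ctl) if c == 1]
    return left + right


def radix_sort_with_grp(keys, key_bits):
    """LSD radix sort of non-negative integers, one grp pass per key bit."""
    items = list(keys)
    for b in range(key_bits):
        # The control bits are bit b of every key, lowest bit first.
        ctl = [(k >> b) & 1 for k in items]
        items = grp(items, ctl)
    return items
```

Because each pass is a stable partition on one key bit, running the passes from the least significant bit upward yields an ascending order without any compare-and-branch step, which hints at why a grouping instruction suits control-heavy algorithms such as sorting.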
The vector floating-point unit in a synergistic processor element of a CELL processor
- Submitted to the IEEE 17th Symposium on Computer Arithmetic (ARITH-17), 2005
- Cited by 7 (1 self)
The floating-point unit in the Synergistic Processor Element of the 1st generation multi-core CELL Processor is described. The FPU supports 4-way SIMD single precision and integer operations and 2-way SIMD double precision operations. The design required high frequency, low latency, and power and area efficiency, with primary application to multimedia streaming workloads such as 3D graphics. The FPU has 3 different latencies, optimizing the performance-critical single precision FMA operations, which are executed with a 6-cycle latency at an 11FO4 cycle time. The latency includes the global forwarding of the result. These challenging performance, power, and area goals were achieved through the co-design of architecture and implementation with optimizations at all levels of the design. This paper focuses on the logical and algorithmic aspects of the FPU we developed to achieve these goals.
Fast Stereo Matching for the VIDET System using a General Purpose Processor with Multimedia Extensions
- In Fifth IEEE International Workshop on Computer Architecture for Machine Perception, 2000
- Cited by 7 (2 self)
The ever-increasing speed of current general purpose processors, together with architectural enhancements such as multimedia-oriented instruction set extensions, allows standard PC-based systems to be deployed in a number of computationally intensive computer vision tasks. This paper describes the PC-based real-time stereo vision system developed within the VIDET project, which is a research project aimed at the development of a mobility aid for the visually impaired. VIDET's approach consists in the conversion of depth data gathered through a stereo vision system into a 3D model perceivable by the user by means of a wire-actuated haptic interface. The developed stereo matching algorithm makes massive use of recursion and multimedia instructions to achieve the performance figures needed to sustain the user's real-time interaction with the 3D model through the haptic interface.
Bit Permutation Instructions: Architecture, Implementation and Cryptographic Properties
- 2004
Variable-Correction Truncated Floating Point Multipliers
- In Proceedings of the Thirty Fourth Asilomar Conference on Signals, Circuits and Systems, 2000
- Cited by 4 (1 self)
About half the hardware for floating point multipliers is needed only to guarantee correctly rounded results. For multimedia, graphics, and DSP systems, a significant reduction in area, delay, and power can be achieved by producing results that are not correctly rounded. This paper presents an efficient method for designing variable-correction truncated floating point multipliers that produce results with a maximum error of less than one unit in the last place. With this method, several of the less significant columns of the significand multiplier and the rounding logic for floating point multiplication are eliminated.
Technical areas: (13) DSP hardware, software, and coreware; (14) ASIC and FPGA algorithm/processor design.
POC: Michael Schulte, 19 Memorial Dr. West, EECS Dept., Lehigh University, Bethlehem, PA 18015. Email: mschulte@eecs.lehigh.edu, Phone: (610) 758-5036, FAX: (610) 758-6279.
Extended Abstract: Most modern processors perform floating point operations accord...
Retargeting Sequential Image-Processing Programs for Data-Parallel Execution
- IEEE Trans. on Software Engineering, 2005
- Cited by 3 (2 self)
New compact, low-power implementation technologies for processors and imaging arrays can enable a new generation of portable video products. However, software compatibility with large bodies of existing applications written in C prevents more efficient, higher performance data parallel architectures from being used in these embedded products. If this software could be automatically retargeted explicitly for data parallel execution, product designers could incorporate these architectures into embedded products. The key challenge is exposing the parallelism that is inherent in these applications but that is obscured by artifacts imposed by sequential programming languages. This paper presents a recognition-based approach for automatically extracting a data parallel program model from sequential image processing code and retargeting it to data parallel execution mechanisms. The explicitly parallel model presented, called multidimensional data flow (MDDF), captures a model of how operations on data regions (e.g., rows, columns, and tiled blocks) are composed and interact. To extract an MDDF model, a partial recognition technique is used that focuses on identifying array access patterns in loops, transforming only those program elements that hinder parallelization, while leaving the core algorithmic computations intact. The paper presents results of retargeting a set of production programs to a representative data parallel processor array to demonstrate the capacity to extract parallelism using this technique. The retargeted applications yield a potential execution throughput limited only by the number of processing elements, exceeding thousands of instructions per cycle in massively parallel implementations.
Index Terms: Reengineering, SIMD processors, data-level parallelization, explicitly parallel program representation, program recognition.

