Computer Vision Algorithms on Reconfigurable Logic Arrays
IEEE TRANS. ON PARALLEL AND DISTRIBUTED SYSTEMS, 1999
Cited by 15 (1 self)
Computer vision algorithms are natural candidates for high performance computing due to their inherent parallelism and intense computational demands. For example, a simple 3 x 3 convolution on a 512 x 512 gray scale image at 30 frames per second requires 67.5 million multiplications and 60 million additions to be performed in one second. Computer vision tasks can be classified into three categories based on their computational complexity and communication complexity: low-level, intermediate-level and high-level. Special-purpose hardware provides better performance compared to a general-purpose hardware for all the three levels of vision tasks. With recent advances in very large scale integration (VLSI) technology, an application specific integrated circuit (ASIC) can provide the best performance in terms of total execution time. However, long design cycle time, high development cost and inflexibility of a dedicated hardware deter design of ASICs. In contrast, field programmable gate arrays (FPGAs) support lower design verification time and easier design adaptability at a lower cost. Hence, FPGAs with an array of reconfigurable logic blocks can be very useful compute elements. FPGA-based custom computing machines are
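The operation counts quoted in the abstract above can be checked with a short calculation. The sketch below assumes one multiplication per kernel tap, one fewer addition than taps per output pixel, no special treatment of image borders, and that "million" is read as 2^20. These assumptions are ours and are not stated in the abstract; under them the figures 67.5 and 60 are reproduced exactly. The function name is hypothetical.

```python
def conv_ops_per_second(width, height, kernel, fps):
    """Multiplications and additions per second for a kernel x kernel convolution."""
    taps = kernel * kernel          # 9 multiplications per output pixel for 3 x 3
    pixels = width * height
    mults = pixels * taps * fps
    adds = pixels * (taps - 1) * fps  # summing 9 products takes 8 additions
    return mults, adds

mults, adds = conv_ops_per_second(512, 512, 3, 30)
MEGA = 2 ** 20  # assumption: "million" in the abstract means 2^20
print(mults / MEGA, adds / MEGA)  # 67.5 60.0
```

With a decimal million the same counts come out as roughly 70.8 and 62.9 million per second.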
Portable and scalable algorithms for irregular all-to-all communication
In 16th ICDCS, 1996
Cited by 11 (3 self)
In irregular all-to-all communication, messages are exchanged between every pair of processors. The message sizes vary from processor to processor and are known only at run time. This is a fundamental communication primitive in parallelizing irregularly structured scientific computations. Our algorithm reduces the total number of message startups. It also reduces node contention by smoothing out the lengths of the messages communicated. As compared to the earlier approaches, our algorithm provides deterministic performance and also reduces the buffer space at the nodes during message passing. The performance of the algorithm is characterised using a simple communication model of high-performance computing (HPC) platforms. We show the implementation on T3D and SP2 using C and the message passing interface standard. These can be easily ported to other HPC platforms. The results show the effectiveness of the proposed technique as well as the interplay among the machine size, the variance in message length, and the network
Parallel Object Recognition on an FPGA-based Configurable Computing Platform
In International Workshop on Computer Architectures for Machine Perception, 1997
Cited by 8 (4 self)
Object recognition involves identifying known objects in a given scene. It plays a key role in image understanding. Geometric hashing has been proposed as a technique for model-based object recognition in occluded scenes. However, parallel techniques are needed to realize real-time vision systems employing geometric hashing. In this paper, we develop a design technique for parallelizing geometric hashing on an FPGA-based platform. We first transform the hash table which contains symbolic data into a bit-level representation. By regularizing the data flow and exploiting bit-level parallelism in hardware, our design achieves high performance. Using our approach, given a scene consisting of 256 feature points, a probe can be performed in 1.65 milliseconds on an FPGA-based platform having 32 Xilinx 4062s. In earlier implementations, the same probe operation was performed in 240 milliseconds on a 32K-node CM2 and in 382 milliseconds on a 32-node CM5. Also, the same operation takes 40 millis...
A Fast Asynchronous Algorithm for Linear Feature Extraction on IBM SP2
Proc. of the Computer Architectures for Machine Perception, 1995
In this paper, we present a fast parallel implementation of linear feature extraction on IBM SP2. We first analyze the machine features and the problem characteristics to understand the overheads in parallel solutions to the problem. Based on these, we propose an asynchronous algorithm which enhances processor utilization and overlaps communication with computation by maintaining algorithmic threads in each processing node. Our implementation shows that, given a 512 x 512 image, the linear feature extraction task can be performed in 0.065 seconds on a SP2 having 64 processing nodes. A serial implementation takes 3.45 seconds on a single processing node of SP2. A previous implementation on CM5 takes 0.1 second on a partition of 512 processing nodes. Experimental results on various sizes of images using 4, 8, 16, 32, and 64 processing nodes are also reported. 1 Introduction In distributed memory machines, the processing nodes are interconnected by an interconnection network an...
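The timings in the abstract above imply a speedup and a parallel efficiency that are easy to derive. The sketch below uses the usual definitions (speedup = serial time / parallel time, efficiency = speedup / number of nodes) and assumes the reported single-node time is the appropriate serial baseline. The function name is hypothetical.

```python
def speedup_and_efficiency(serial_s, parallel_s, nodes):
    """Return (speedup, efficiency) from wall-clock times in seconds."""
    speedup = serial_s / parallel_s
    return speedup, speedup / nodes

# Figures taken from the abstract: 3.45 s on 1 node, 0.065 s on 64 nodes of SP2.
s, e = speedup_and_efficiency(3.45, 0.065, 64)
print(f"speedup {s:.1f}x, efficiency {e:.0%}")  # speedup 53.1x, efficiency 83%
```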
Parallel Implementations of Perceptual Grouping Tasks on Distributed Memory Machines
International Conference on Pattern Recognition, 1994
In this paper, we propose parallel implementations for solving Perceptual Grouping tasks on distributed memory machines. Our implementations show that, given 7K line segments extracted from a 1K x 1K image, the Line Grouping task can be performed in 0.486 seconds using a partition of CM5 having 256 processing nodes and in 0.382 seconds using a 16-node Cray T3D. The serial implementation written in C takes 20.368 seconds and 4.181 seconds using 1-node CM5 and 1-node T3D respectively. Our code is written in C and MPI message passing standard and can be easily ported to other high performance computing platforms. 1 Introduction Many distributed memory machines are commercially available. These include IBM SP2, TMC CM5, Intel Paragon, Cray T3D, among others. The scalability of these machines as the machine size is varied and the flexibility of parallel program development using message-passing makes them suitable for solving computer vision problems efficiently [Wang, 1995]. Per...
Parallelization of Perceptual Grouping on Distributed Memory Machines
Proc. of Computer Architectures for Machine Perception, 1995
In this paper, we propose architecture-independent parallel algorithms for solving Perceptual Grouping tasks on distributed memory machines. Given an $n \times n$ image, using $P$ processors, we show that these tasks can be performed in $O(n^2/P)$ computation time and $20\sqrt{P}\,T_d + 8(\log P)\,T_d + (40n/\sqrt{P} + 20P)\,\tau_d$ communication time, where $T_d$ is the communication startup time and $\tau_d$ is the transmission rate. Our implementations show that, given 7K line segments extracted from a 1K x 1K image, the Line Grouping task can be performed in 1.115 seconds using a partition of CM5 having 256 processing nodes and in 0.382 seconds using a 16-node Cray T3D. Our code is written in C and MPI message passing standard and can be easily ported to other high performance computing platforms. 1 Introduction Many distributed memory machines are commercially available. These include IBM SP2, TMC CM5, Intel Paragon, Cray T3D, among others. The scalability of these machines as the machi...
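To see how the startup and transfer terms trade off as the machine grows, the communication bound can be evaluated numerically. The sketch below takes the bound as 20·sqrt(P)·T_d + 8·(log P)·T_d + (40n/sqrt(P) + 20P)·tau_d, which is our reading of the expression in the abstract. The base-2 logarithm and the parameter values are assumptions chosen for illustration only; they are not measurements from the paper, and the function name is hypothetical.

```python
import math

def comm_time(n, P, T_d, tau_d):
    """Evaluate the communication bound for an n x n image on P processors.

    T_d   -- per-message startup time (seconds)
    tau_d -- time to transmit one unit of data (seconds)
    """
    startup = (20 * math.sqrt(P) + 8 * math.log2(P)) * T_d  # log base 2 assumed
    transfer = (40 * n / math.sqrt(P) + 20 * P) * tau_d
    return startup + transfer

# Illustrative parameters only (not taken from the paper).
for P in (16, 64, 256):
    print(P, comm_time(n=1024, P=P, T_d=50e-6, tau_d=0.1e-6))
```

The startup term grows with P while the 40n/sqrt(P) part of the transfer term shrinks, so for a fixed image size the bound eventually becomes startup-dominated.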
Scalable Data Parallel Object Recognition using Geometric Hashing on CM5
Scalable High Performance Computing Conference, SHPCC, 1994
In this paper, we present scalable parallel algorithms for object recognition using geometric hashing. We define an abstract model of CM5. We develop a load-balancing technique that results in scalable processor-time optimal algorithms for performing a probe on the CM5 model. Given a model of CM5 with P PNs and a set S of feature points in a scene, a probe of the recognition phase can be performed in $O(|V(S)|/P)$ time, where $V(S)$ is the set of votes cast by feature points in S. This algorithm is scalable in the range $1 \le P \le \sqrt{|V(S)| / \log |V(S)|}$. These results do not assume any distributions of hash bin lengths or scene points. The implementations developed in this paper require number of processors independent of the size of the model database and are scalable with the machine size. 1 Introduction Object recognition is a key step in an integrated vision system. Most model-based recognition systems work by hypothesizing matches between scene features and model features, pred...
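For readers unfamiliar with the probe step mentioned above, here is a minimal serial sketch of geometric hashing. It is not the authors' CM5 algorithm: it assumes 2D point features, a two-point basis that makes coordinates invariant to translation, rotation and uniform scale, and a fixed quantization step `q`. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def invariant_key(p, b0, b1, q=0.25):
    """Quantized coordinates of p in the frame defined by basis pair (b0, b1)."""
    z = (complex(*p) - complex(*b0)) / (complex(*b1) - complex(*b0))
    return (round(z.real / q), round(z.imag / q))

def build_table(models, q=0.25):
    """Preprocessing: hash every model point under every ordered basis pair."""
    table = defaultdict(list)
    for name, pts in models.items():
        for i, j in permutations(range(len(pts)), 2):
            for k, p in enumerate(pts):
                if k not in (i, j):
                    key = invariant_key(p, pts[i], pts[j], q)
                    table[key].append((name, (i, j)))
    return table

def probe(table, scene, basis=(0, 1), q=0.25):
    """One probe: fix a scene basis pair and let every other point cast votes."""
    i, j = basis
    votes = Counter()
    for k, p in enumerate(scene):
        if k not in (i, j):
            key = invariant_key(p, scene[i], scene[j], q)
            for entry in table.get(key, ()):
                votes[entry] += 1
    return votes

models = {"L": [(0, 0), (1, 0), (0, 1), (2, 1)]}
# Scene = model rotated by 90 degrees, scaled by 2, translated by (5, 3).
scene = [(5, 3), (5, 5), (3, 3), (3, 7)]
votes = probe(build_table(models), scene)
print(votes.most_common(1))  # [(('L', (0, 1)), 2)]
```

Each vote is independent of the others, which is the property that parallel implementations exploit; how votes are balanced across processors is where the approaches differ.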
Parallel Algorithms for Linear Approximation on Distributed Memory Machines
In this paper, we summarize our results in parallelizing the linear approximation step on current distributed memory machines. We first analyze the features of current distributed memory machines and the problem characteristics to understand the overheads in parallel solutions to the problem. Based on these, we propose an asynchronous algorithm which enhances processor utilization and overlaps communication with computation by maintaining algorithmic threads in each processing node. Our implementation shows that, given a 512 x 512 image, the linear approximation task can be performed in 0.015 seconds on a SP2 having 64 processing nodes and in 0.032 seconds on a T3D having 32 processing nodes. A serial implementation takes 0.445 seconds on a single processing node of SP2 and 0.779 seconds on a single processing node of T3D. Experimental results on various sizes of images using 4, 8, 16, 32, and 64 processing nodes are also reported. 1 Introduction In distributed memory machines...
A Robust Neural Network Based Object Recognition System and its SIMD Implementation
Recognition of objects is a particularly demanding problem, if one considers that each image must be interpreted in milliseconds (usually 30 or 40 frames/second). The problem becomes more difficult if the objects are distorted and/or partially occluded. In this case a sequence of local features are to be extracted, combined in a global shape description and classified as belonging to predefined sets of known shapes (reference shapes). In this paper we propose a massively parallel object recognition system, which makes use of the multi polygonal approximation scheme for the extraction of rotation and translation invariant shape features, in connection with artificial neural networks for the parallel classification of the extracted features. The system has been successfully applied for recognizing aircraft shapes in different sizes, orientations, with the addition of noise distortion and occlusion. Timings on the Connection Machine 200 are also reported.