Computer Vision Algorithms on Reconfigurable Logic Arrays
IEEE Trans. on Parallel and Distributed Systems, 1999
Abstract

Cited by 19 (1 self)
Computer vision algorithms are natural candidates for high-performance computing due to their inherent parallelism and intense computational demands. For example, a simple 3 x 3 convolution on a 512 x 512 gray-scale image at 30 frames per second requires 67.5 million multiplications and 60 million additions to be performed in one second. Computer vision tasks can be classified into three categories based on their computational complexity and communication complexity: low-level, intermediate-level and high-level. Special-purpose hardware provides better performance than general-purpose hardware for all three levels of vision tasks. With recent advances in very large scale integration (VLSI) technology, an application-specific integrated circuit (ASIC) can provide the best performance in terms of total execution time. However, the long design cycle, high development cost and inflexibility of dedicated hardware deter the design of ASICs. In contrast, field-programmable gate arrays (FPGAs) support lower design verification time and easier design adaptability at a lower cost. Hence, FPGAs with an array of reconfigurable logic blocks can be very useful compute elements. FPGA-based custom computing machines are
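The operation counts quoted in the abstract above can be checked with a short calculation. The sketch below rests on two assumptions that the abstract does not state: every pixel receives a full 3 x 3 kernel (border handling is ignored), and "million" is read as 2^20, which is the reading under which the quoted figures come out exactly. The function name is illustrative only.

```python
def conv_ops_per_second(width, height, kernel, fps):
    """Count multiplications and additions per second for a dense 2D convolution.

    Assumes every pixel gets a full kernel x kernel window (borders ignored).
    """
    pixels = width * height
    mults = pixels * kernel * kernel * fps        # one multiply per kernel tap
    adds = pixels * (kernel * kernel - 1) * fps   # summing k*k products takes k*k - 1 adds
    return mults, adds


MEBI = 2 ** 20  # assumption: "million" in the abstract means 2^20

mults, adds = conv_ops_per_second(512, 512, 3, 30)
print(mults, mults / MEBI)  # 70778880 67.5
print(adds, adds / MEBI)    # 62914560 60.0
```

Read in decimal millions, the same counts are roughly 70.8 million multiplications and 62.9 million additions per second.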
Portable and scalable algorithms for irregular all-to-all communication
In 16th ICDCS, 1996
Abstract

Cited by 11 (3 self)
In irregular all-to-all communication, messages are exchanged between every pair of processors. The message sizes vary from processor to processor and are known only at run time. This is a fundamental communication primitive in parallelizing irregularly structured scientific computations. Our algorithm reduces the total number of message startups. It also reduces node contention by smoothing out the lengths of the messages communicated. Compared to earlier approaches, our algorithm provides deterministic performance and also reduces the buffer space needed at the nodes during message passing. The performance of the algorithm is characterised using a simple communication model of high-performance computing (HPC) platforms. We show the implementation on T3D and SP2 using C and the Message Passing Interface standard. These can be easily ported to other HPC platforms. The results show the effectiveness of the proposed technique as well as the interplay among the machine size, the variance in message length, and the network
Parallel object recognition on an FPGA-based configurable computing platform
International Workshop on Computer Architectures for Machine Perception, 1997
Portable and Scalable Algorithms for Irregular All-to-All Communication
Abstract
In this paper, we develop portable and scalable algorithms for performing irregular all-to-all communication in High Performance Computing (HPC) systems. To minimize the communication latency, the algorithm reduces the total number of messages transmitted, reduces the variance of the lengths of these messages, and overlaps the communication with computation. The performance of the algorithm is characterized using a simple model of HPC systems. Our implementations use the Message Passing Interface (MPI) standard and can be ported to various HPC platforms. The performance of our algorithms is evaluated on CM5, T3D and SP2. The results show the effectiveness of the techniques as well as the interplay between the architectural features, the machine size, and the variance of message lengths. The experience from our study can be applied to other HPC systems to optimize the performance of collective communication operations.
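To make the primitive itself concrete, the following is a minimal single-process simulation of an irregular all-to-all exchange. It is not the algorithm proposed in the paper: it uses a plain shift schedule (in step k, each processor sends to its k-th neighbour) as a baseline, and it uses no MPI calls. The function name, the `outboxes` layout, and the workload are hypothetical choices made for illustration.

```python
def irregular_all_to_all(outboxes):
    """Simulate one irregular all-to-all exchange among p processors.

    outboxes[src][dst] is the variable-length list that processor `src`
    wants delivered to processor `dst`. Returns (inboxes, startups), where
    inboxes[dst][src] holds what `dst` received from `src`.
    """
    p = len(outboxes)
    inboxes = [[[] for _ in range(p)] for _ in range(p)]
    startups = 0
    for src in range(p):
        # Data a processor keeps for itself is a local copy, not a message.
        inboxes[src][src] = list(outboxes[src][src])
    for step in range(1, p):
        # In step k every processor sends to its k-th neighbour (mod p),
        # so each processor sends one message and receives one per step.
        for src in range(p):
            dst = (src + step) % p
            inboxes[dst][src] = list(outboxes[src][dst])
            startups += 1
    return inboxes, startups


# Hypothetical workload: processor s sends (s + d) % 3 + 1 items to processor d.
P = 3
outboxes = [[[f"{s}->{d}"] * ((s + d) % 3 + 1) for d in range(P)] for s in range(P)]
inboxes, startups = irregular_all_to_all(outboxes)
print(startups)  # 6, i.e. P * (P - 1)
```

This baseline pays one startup per ordered pair of processors and sends messages of uneven length in the same step, which are the two costs the abstract says the proposed algorithms reduce.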
A Robust Neural Network Based Object Recognition System and its SIMD Implementation
Abstract
Recognition of objects is a particularly demanding problem if one considers that each image must be interpreted in milliseconds (usually at 30 or 40 frames/second). The problem becomes more difficult if the objects are distorted and/or partially occluded. In this case a sequence of local features must be extracted, combined into a global shape description, and classified as belonging to predefined sets of known shapes (reference shapes). In this paper we propose a massively parallel object recognition system, which uses the multi polygonal approximation scheme to extract rotation- and translation-invariant shape features, together with artificial neural networks for the parallel classification of the extracted features. The system has been successfully applied to recognizing aircraft shapes of different sizes and orientations, with added noise, distortion and occlusion. Timings on the Connection Machine 200 are also reported.
Parallelization of Perceptual Grouping on Distributed Memory Machines
Proc. of Computer Architectures for Machine Perception, 1995
Abstract
In this paper, we propose architecture-independent parallel algorithms for solving Perceptual Grouping tasks on distributed memory machines. Given an $n \times n$ image, using $P$ processors, we show that these tasks can be performed in $O(n^2/P)$ computation time and $20\sqrt{P}\,T_d + 8(\log P)\,T_d + (40n\sqrt{P} + 20P)\,\tau_d$ communication time, where $T_d$ is the communication startup time and $\tau_d$ is the transmission rate. Our implementations show that, given 7K line segments extracted from a 1K × 1K image, the Line Grouping task can be performed in 1.115 seconds using a partition of CM5 having 256 processing nodes and in 0.382 seconds using a 16-node Cray T3D. Our code is written in C with the MPI message-passing standard and can be easily ported to other high-performance computing platforms.
1 Introduction. Many distributed memory machines are commercially available. These include IBM SP2, TMC CM5, Intel Paragon, Cray T3D, among others. The scalability of these machines as the machi...
A Fast Asynchronous Algorithm for Linear Feature Extraction on IBM SP2
Proc. of the Computer Architectures for Machine Perception, 1995
Abstract
In this paper, we present a fast parallel implementation of linear feature extraction on IBM SP2. We first analyze the machine features and the problem characteristics to understand the overheads in parallel solutions to the problem. Based on these, we propose an asynchronous algorithm which enhances processor utilization and overlaps communication with computation by maintaining algorithmic threads in each processing node. Our implementation shows that, given a 512 × 512 image, the linear feature extraction task can be performed in 0.065 seconds on an SP2 having 64 processing nodes. A serial implementation takes 3.45 seconds on a single processing node of SP2. A previous implementation on CM5 takes 0.1 second on a partition of 512 processing nodes. Experimental results on various sizes of images using 4, 8, 16, 32, and 64 processing nodes are also reported.
1 Introduction. In distributed memory machines, the processing nodes are interconnected by an interconnection network an...
Parallel Implementations of Perceptual Grouping Tasks on Distributed Memory Machines
International Conference on Pattern Recognition, 1994
Abstract
In this paper, we propose parallel implementations for solving Perceptual Grouping tasks on distributed memory machines. Our implementations show that, given 7K line segments extracted from a 1K × 1K image, the Line Grouping task can be performed in 0.486 seconds using a partition of CM5 having 256 processing nodes and in 0.382 seconds using a 16-node Cray T3D. The serial implementation written in C takes 20.368 seconds and 4.181 seconds using 1-node CM5 and 1-node T3D respectively. Our code is written in C with the MPI message-passing standard and can be easily ported to other high-performance computing platforms.
1 Introduction. Many distributed memory machines are commercially available. These include IBM SP2, TMC CM5, Intel Paragon, Cray T3D, among others. The scalability of these machines as the machine size is varied and the flexibility of parallel program development using message-passing makes them suitable for solving computer vision problems efficiently [Wang, 1995]. Per...
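The timings quoted in this abstract are enough to work out speedup and parallel efficiency for each machine. The sketch below uses the standard definitions (speedup = serial time / parallel time, efficiency = speedup / node count); these derived figures are not reported in the abstract itself, and the function and variable names are illustrative.

```python
def speedup_and_efficiency(serial_s, parallel_s, nodes):
    """Return (speedup, efficiency) from serial and parallel wall-clock times."""
    speedup = serial_s / parallel_s
    return speedup, speedup / nodes


# (serial seconds on 1 node, parallel seconds, node count), as quoted in the abstract
runs = {
    "CM5": (20.368, 0.486, 256),
    "T3D": (4.181, 0.382, 16),
}

for name, (serial_s, parallel_s, nodes) in runs.items():
    s, e = speedup_and_efficiency(serial_s, parallel_s, nodes)
    print(f"{name}: speedup {s:.1f}x on {nodes} nodes, efficiency {e:.2f}")
```

Under these definitions the 256-node CM5 run is about 42 times faster than its own single node (efficiency near 0.16), while the 16-node T3D run is about 11 times faster (efficiency near 0.68). The two machines have very different single-node times, so the speedups are not directly comparable across machines.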