The Paprica Massively Parallel Processor
In Proceedings MPCS, IEEE International Conference on Massively Parallel Computing Systems, 1994
Cited by 15 (8 self)
This paper describes a complete 6-year project, from its theoretical basis through the hardware and software system implementation to the description of its future evolution. The main goal of the project is to develop a subsystem that operates as a processing unit attached to a standard workstation and, in perspective, as a low-cost, small-sized specialized embedded system devoted to low-level image analysis and cellular neural network emulation. The architecture has been extensively used for basic low-level image analysis tasks up to optical flow computation and feature tracking, showing encouraging performance even in the first prototype version. 1 Introduction The PAPRICA system [5, 11] (an acronym for PArallel PRocessor for Image Checking and Analysis) described in this paper and shown in fig. 1 has the main characteristics of a conventional mesh-connected SIMD array, but it has been specialized to the following objectives: to directly support a computational paradig...
A Novel Deterministic Sampling Scheme with Applications to Broadcast-Efficient Sorting on the Reconfigurable Mesh
Journal of Parallel and Distributed Computing, 1996
Cited by 14 (3 self)
The main contribution of this work is to present a simple deterministic sampling strategy that, when used for bucket sorting, yields buckets that are remarkably well balanced, making costly balancing unnecessary. To the best of our knowledge, this is the first instance of a deterministic sampling strategy featuring this performance. Although the strategy is perfectly general, we illustrate its power by devising a VLSI-optimal, O(1) time sorting algorithm for the reconfigurable mesh. As a byproduct of the inherent simplicity of our sampling and bucketing scheme, we show that our sorting algorithm can be implemented using only 35 broadcast operations, a substantial improvement over the previously best known algorithm that requires 59 broadcasts. Keywords: deterministic sampling, bucket sort, reconfigurable meshes, sorting, VLSI-optimal algorithms, constant-time algorithms 1 Introduction Sorting is, unquestionably, one of the fundamental operations in computer science. A natura...
An Algebraic Theory for Modeling Multistage Interconnection Networks
Journal of Information Science and Engineering, 1993
Cited by 14 (11 self)
We use an algebraic theory based on tensor products to model multistage interconnection networks. This algebraic theory has been used for designing and implementing block recursive numerical algorithms on shared-memory vector multiprocessors. In this paper, we focus on the modeling of multistage interconnection networks. The tensor product representations of the baseline network, the reverse baseline network, the indirect binary n-cube network, the generalized cube network, the omega network, and the flip network are given. We present the use of this theory for specifying and verifying network properties such as network partitioning and topological equivalence. Algorithm mapping using tensor product formulation is demonstrated by mapping the matrix transposition algorithm onto multistage interconnection networks. Keywords: Tensor product, parallel architecture, multistage interconnection network, partitionability, topological equivalence, algorithm mapping. 1 Introduction Tensor prod...
The Evaluation of Massively Parallel Array Architectures
1994
Cited by 13 (7 self)
Computer Science. To the memory of my mother. Acknowledgments: This dissertation would not have been possible without the help of many people. First, I would like to thank my committee for their many helpful comments and suggestions. Specifically, I thank Al Hanson, who taught me about computer vision; Wayne Burleson, who taught me about VLSI; and Don Towsley, who taught me about performance evaluation. Most especially, I’d like to thank my committee chair, who was my advisor and mentor for my entire graduate career, Chip Weems. Besides teaching me about architecture and writing, he suggested the final form of the topic and pulled me out of many blind alleys, and his vast store of knowledge was a constant help. Many other professors at UMass also contributed to my knowledge of computer science and so helped me with this dissertation. I would especially like to thank Arny Rosenberg, who not only taught me theory but, more importantly, how and where to apply it, and Ed Riseman, whose boundless energy and optimism serve as a model for all of us. The first level of discussion and comments is always with the fellow graduate students in one’s
Square Meshes Are Not Optimal For Convex Hull Computation
IEEE Transactions on Parallel and Distributed Systems
Cited by 12 (9 self)
Recently it has been noticed that, for semigroup computations and for selection, rectangular meshes with multiple broadcasting yield faster algorithms than their square counterparts. The contribution of this paper is to provide yet another example of a fundamental problem for which this phenomenon occurs. Specifically, we show that the problem of computing the convex hull of a set of n sorted points in the plane can be solved in $O(n^{1/8} \log^{3/4} n)$ time on a rectangular mesh with multiple broadcasting of size $n^{3/8} \log^{1/4} n \times n^{5/8} / \log^{1/4} n$. The fastest previously known algorithms on a square mesh of size $\sqrt{n} \times \sqrt{n}$ run in $O(n^{1/6})$ time in case the n points are pixels in a binary image, and in $O(n^{1/6} \log^{2/3} n)$ time for sorted points in the plane. Keywords: convex hulls, meshes with broadcasting, parallel algorithms, pattern recognition, image processing, computational geometry. 1 Introduction One of the fundamental heuristics in pat...
Exploiting symmetry on parallel architectures
1995
Cited by 10 (1 self)
This thesis describes techniques for the design of parallel programs that solve well-structured problems with inherent symmetry. Part I demonstrates the reduction of such problems to generalized matrix multiplication by a group-equivariant matrix. Fast techniques for this multiplication are described, including factorization, orbit decomposition, and Fourier transforms over finite groups. Our algorithms entail interaction between two symmetry groups: one arising at the software level from the problem's symmetry and the other arising at the hardware level from the processors' communication network. Part II illustrates the applicability of our symmetry-exploitation techniques by presenting a series of case studies of the design and implementation of parallel programs. First, a parallel program that solves chess endgames by factorization of an associated dihedral group-equivariant matrix is described. This code runs faster than previous serial programs, and discovered a number of results. Second, parallel algorithms for Fourier transforms for finite groups are developed, and preliminary parallel implementations for group transforms of dihedral and of symmetric groups are described. Applications in learning, vision, pattern recognition, and statistics are proposed. Third, parallel implementations solving several computational science problems are described, including the direct n-body problem, convolutions arising from molecular biology, and some communication primitives such as broadcast and reduce. Some of our implementations ran orders of magnitude faster than previous techniques, and were used in the investigation of various physical phenomena.
A CMOS general-purpose sampled-data analogue microprocessor
ISCAS 2000, 2000
Cited by 10 (2 self)
This paper presents a general-purpose sampled-data analogue processing element that essentially functions as an analogue microprocessor (AµP). The AµP executes software programs, in a way akin to a digital microprocessor, while nevertheless operating on analogue sampled data values. This enables the design of mixed-mode systems which retain the speed/area/power advantages of the analogue signal processing paradigm while being fully programmable, general-purpose systems. A proof-of-concept integrated circuit has been implemented in 0.8 µm CMOS technology, using switched-current techniques. Experimental results and examples of the application of the AµPs in image processing are presented.
Minimal Adaptive Routing on the Mesh with Bounded Queue Size
1994
Cited by 10 (4 self)
An adaptive routing algorithm is one in which the path a packet takes from its source to its destination may depend on other packets it encounters. Such algorithms potentially avoid network bottlenecks by routing packets around "hot spots." Minimal adaptive routing algorithms have the additional advantage that the path each packet takes is a shortest one. For a large class of minimal adaptive routing algorithms, we present an $\Omega(n^2/k^2)$ bound on the worst case time to route a static permutation of packets on an $n \times n$ mesh or torus with nodes that can hold up to $k \geq 1$ packets each. This is the first nontrivial lower bound on adaptive routing algorithms. The argument extends to more general routing problems, such as the h-h routing problem. It also extends to a large class of dimension order routing algorithms, yielding an $\Omega(n^2/k)$ time bound. To complement these lower bounds, we present two upper bounds. One is an $O(n^2/k)$ time dimension order routing...
Convexity Problems on Meshes with Multiple Broadcasting
Journal of Parallel and Distributed Computing, 1992
Cited by 9 (7 self)
Our contribution is twofold. First, we show that $\Omega(\log n)$ is a time lower bound on the CREW-PRAM and the mesh with multiple broadcasting for the tasks of computing the perimeter, the area, the diameter, the width, the modality, the smallest-area enclosing rectangle, and the largest-area inscribed triangle of a convex n-gon. We show that the same time lower bound holds for the tasks of detecting whether a convex n-gon lies inside another as well as for computing the maximum distance between two convex n-gons. We obtain our time lower bound results for the CREW-PRAM by using a novel technique involving geometric constructions. These constructions allow us to reduce the well-known OR problem to each of the geometric problems of interest. We then port these time lower bounds to the mesh with multiple broadcasting using simulation results. Our second contribution is to show that the $\Omega(\log n)$ time lower bound is tight by providing O(log n) time algorithms to solve these p...
Efficient Image Processing Algorithms on the Scan Line Array Processor
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995
Cited by 8 (0 self)
We develop efficient algorithms for low and intermediate level image processing on the scan line array processor, a SIMD machine consisting of a linear array of cells that processes images in a scan line fashion. For low level processing, we present algorithms for block DFT, block DCT, convolution, template matching, shrinking, and expanding which run in real-time. By real-time, we mean that, if the required processing is based on neighborhoods of size $m \times m$, then the output lines are generated at a rate of O(m) operations per line and a latency of O(m) scan lines, which is the best that can be achieved on this model. We also develop an algorithm for median filtering which runs in almost real-time at a cost of O(m log m) time per scan line and a latency of $\lfloor m/2 \rfloor$ scan lines. For intermediate level processing, we present optimal algorithms for translation, histogram computation, scaling, and rotation. We also develop efficient algorithms for labelling the connected components...