Results 1-10 of 15
Locality of Reference in LU Decomposition with Partial Pivoting
SIAM Journal on Matrix Analysis and Applications, 1997
Cited by 95 (9 self)
This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the right-looking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the right-looking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the right-looking algorithm. Experimental comparisons between the new algorithm and the right-looking algorithm show that an implementation of the new algorithm outperforms a similarly coded right-looking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrix-multiplication algorithm.
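The recursive partitioning described in this abstract can be illustrated with a small sketch. It is not the paper's implementation: the names `lu_recursive` and `_factor` are hypothetical, the matrix is assumed square and nonsingular, and whole rows are swapped eagerly at each pivot, which keeps the code short but ignores the memory-traffic concerns the paper analyzes.

```python
def _factor(a, perm, lo, hi):
    """Factor columns lo..hi-1 (rows lo..n-1) of `a` in place (hypothetical helper)."""
    n = len(a)
    if hi <= lo:
        return
    if hi - lo == 1:
        # Base case: a single column. Pick the largest entry as the pivot.
        p = max(range(lo, n), key=lambda i: abs(a[i][lo]))
        if a[p][lo] == 0.0:
            raise ValueError("matrix is singular")
        # Simplification: swap entire rows so every column sees the permutation.
        a[lo], a[p] = a[p], a[lo]
        perm[lo], perm[p] = perm[p], perm[lo]
        pivot = a[lo][lo]
        for i in range(lo + 1, n):
            a[i][lo] /= pivot
        return
    mid = (lo + hi) // 2
    # 1. Recursively factor the left half of the panel.
    _factor(a, perm, lo, mid)
    # 2. Triangular solve: U12 = L11^{-1} A12 (L11 is unit lower triangular).
    for j in range(mid, hi):
        for i in range(lo, mid):
            for k in range(lo, i):
                a[i][j] -= a[i][k] * a[k][j]
    # 3. Schur complement update: A22 -= L21 * U12.
    for i in range(mid, n):
        for j in range(mid, hi):
            a[i][j] -= sum(a[i][k] * a[k][j] for k in range(lo, mid))
    # 4. Recursively factor the right half of the panel.
    _factor(a, perm, mid, hi)


def lu_recursive(matrix):
    """Return (lu, perm): L and U packed in one matrix, and the row order.

    Row i of the permuted matrix is original row perm[i]; L has an
    implicit unit diagonal.
    """
    a = [[float(v) for v in row] for row in matrix]
    perm = list(range(len(a)))
    _factor(a, perm, 0, len(a))
    return a, perm
```

Calling `lu_recursive(A)` returns the packed factors and the row order, so that row `i` of the product L·U equals row `perm[i]` of `A` up to rounding. A tuned implementation would replace steps 2 and 3 with calls to optimized triangular-solve and matrix-multiply kernels.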
Real-Time Optical Flow
Minneapolis, Minnesota, 1995
Cited by 22 (4 self)
Currently two major limitations to applying vision in real tasks are robustness in real-world, uncontrolled environments, and the computational resources required for real-time operation. In particular, many current robotic visual motion detection algorithms (optical flow) are not suited for practical applications such as segmentation and structure-from-motion because they either require highly specialized hardware or up to several minutes on a scientific workstation. In addition, many such algorithms depend on the computation of first and in some cases higher numerical derivatives, which are notoriously sensitive to noise. In fact the current trend in optical flow research is to stress accuracy under ideal conditions and not to consider computational resource requirements or resistance to noise, which are essential for real-time robotics. As a result robotic vision researchers are frustrated by an inability to obtain reliable optical flow estimates in real-world conditions, and practica...
The Future Fast Fourier Transform
SIAM J. Sci. Computing, 1999
Cited by 19 (0 self)
It seems likely that improvements in arithmetic speed will continue to outpace advances in communications bandwidth. Furthermore, as more and more problems are working on huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does not have sufficient storage capacity. For these reasons, we propose that an inexact DFT such as an approximate matrix-vector approach based on singular values or a variation of the Dutt-Rokhlin fast-multipole-based algorithm [9] may outperform any exact parallel FFT. The speedup may be as large as a factor of three in situations where FFT run time is dominated by communication. For the multipole idea we further propose that a method of “virtual charges” may improve accuracy, and we provide an analysis of the singular values that are needed for the approximate matrix-vector approaches.
Accuracy and Speed-Up of Parallel Trace-Driven Architectural Simulation
In Proc. Int’l Parallel Processing Symp., IEEE Computer Soc, 1997
Cited by 13 (0 self)
Trace-driven simulation continues to be one of the main evaluation methods in the design of high performance processor-memory subsystems. In this paper, we examine the varying speedup opportunities available by processing a given trace in parallel on an IBM SP2 machine. We also develop a simple, yet effective method of correcting for cold-start cache miss errors, by the use of overlapped trace chunks. We then report selected experimental results to validate our expectations. We show that it is possible to achieve near-perfect speedup without loss of accuracy. Next, in order to achieve further reduction in simulation cost, we combine uniform sampling methods with parallel trace processing with a slight loss of accuracy for finite-cache timer runs. We then show that by using warm-start sequences from preceding trace chunks, it is possible to reduce the errors back to acceptable bounds. 1. Introduction The ever-increasing sizes of real workloads is making the use of trace...
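The idea of correcting cold-start misses with overlapped trace chunks can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the paper's method: the cache is a simple direct-mapped model, the names `count_misses` and `chunked_misses` are hypothetical, and the chunks are processed one after another instead of on separate nodes of a parallel machine.

```python
def count_misses(trace, num_sets, line_size, warmup=0):
    """Count misses of a direct-mapped cache on a list of addresses.

    The first `warmup` references only warm the cache; their misses
    are not counted (hypothetical helper, illustrative model only).
    """
    tags = [None] * num_sets          # one cached line number per set
    misses = 0
    for pos, addr in enumerate(trace):
        line = addr // line_size
        index = line % num_sets
        if tags[index] != line:       # miss: fetch the line
            tags[index] = line
            if pos >= warmup:
                misses += 1
    return misses


def chunked_misses(trace, num_chunks, overlap, num_sets, line_size):
    """Sum per-chunk miss counts, warming each chunk on the preceding
    `overlap` references. Each chunk is independent of the others, so the
    loop body could run in parallel; here it runs sequentially.
    """
    n = len(trace)
    if n == 0:
        return 0
    chunk = -(-n // num_chunks)       # ceiling division
    total = 0
    for start in range(0, n, chunk):
        warm_start = max(0, start - overlap)
        piece = trace[warm_start:start + chunk]
        total += count_misses(piece, num_sets, line_size,
                              warmup=start - warm_start)
    return total
```

With `overlap=0` every chunk starts with an empty cache and the summed miss count is inflated by cold-start misses. For this direct-mapped model, increasing `overlap` can only move the estimate toward the count from a single sequential run.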
Portable High Performance Programming via Architecture-Cognizant Divide-and-Conquer Algorithms
2000
Cited by 5 (0 self)
1 Introduction
  1. Divide-and-Conquer and the Memory Hierarchy
  2. Overview of Architecture-Cognizant Divide-and-Conquer
  3. Overview of Napoleon
  4. What You Can Expect
  5. Contributions
    1. Divide-and-Conquer Algorithms for Performance Programming
    2. The Importance of Architecture-Cognizance
    3. Complexity of Determining Variant-Policy
    4. A Framework and System for Divide-and-Conquer Implementations
    5. The Fastest Portable FFT Algorithm
  6. Outline of Thesis
...
Algorithms for Compiler-Assisted Design Space Exploration of Clustered VLIW ASIP Datapaths
2001
Cited by 3 (1 self)
Clustered Very Long Instruction Word Application-Specific Instruction Set Processors (VLIW ASIPs) combined with effective compilation techniques enable aggressive exploitation of the instruction level parallelism inherent in many embedded media applications, while unlocking a variety of possible performance/cost tradeoffs. In this dissertation we propose and validate an algorithm to support early design space exploration (DSE) over classes of datapaths, in the context of a specific target application, and carry out an empirical study for a set of representative benchmarks. We argue that at an early DSE phase one should use design space parameters that have a first-order impact on two key physical figures of merit: clock rate f and power dissipation P. We found these parameters to be: maximum cluster capacity (number of functional units in a cluster) NF, number of clusters NC, and the interconnect capacity NB. The experimental validation of our DSE algorithm shows that a thorough exploration of the complex design space can be performed very efficiently in this parameterized design space. Moreover, our case studies suggest that penalties of clustered versus non-clustered datapaths are often minimal and that clustering indeed unlocks a variety of valuable design alternatives. Our exploration methodology is enabled by an efficient algorithm for binding operations in a dataflow graph to the clusters of a datapath, so as to minimize latency and the number of data transfers. The algorithm utilizes effective cost and ranking functions that enable the exploration of complex tradeoffs between: (1) operation serialization, due to cluster overload; and (2) penalties incurred by data transfers, due to scattering operations with data dependencies over different clusters.
The core binding algorithm has shown robustness over a large set of datapaths and application kernels, and demonstrated up to 29% improvement in schedule latency, as compared to a state-of-the-art advanced binding algorithm.
A Uniform Internal Representation for High-Level and Instruction-Level Transformations
Eduard Ayguadé, Cristina Barrado, Jesús Labarta, David López, Susana Moreno, David Padua, and Mateo Valero
1995
Cited by 1 (0 self)
In this paper we describe a strategy that will make it possible, after applying a small number of changes, to represent low-level operations as part of the internal representation of a conventional source-to-source Fortran translator. Briefly, our strategy is to represent the low-level operations as Fortran statements. In this way, all the transformation and analysis routines available in the source-to-source restructurer can be applied to the low-level representation of the program. The source-to-source parallelizer could then be extended to include many traditional analysis and transformation steps, such as strength reduction and register allocation, not usually performed by this translator. The generation of machine instructions is done as a last step by a direct mapping from each Fortran statement onto one or more machine instructions. The source-to-source restructurer is therefore extended into a complete compiler as shown in Figure 1. All transformations, including high-level parallelization and the traditional scalar optimizations, can now be performed in a unified framework based on a single internal representation. One additional advantage of representing the low-level operations as Fortran statements is that the outcome of each transformation, both high and low level, can be trivially transformed into a Fortran program that could be executed to test the correctness of the transformation. Another approach that also uses a uniform representation for both high-level parallelization and scalar optimizations was the one followed in the IBM Fortran compiler [ScKo86]. The main difference with our approach is that this compiler evolved from a traditional back-end compiler which was extended to do some of the high-level transformations usually performed in other systems by...
Hardware Interval Multipliers
1996
Cited by 1 (1 self)
This paper presents serial and parallel hardware units for interval multiplication. Compared to software implementations, these units greatly increase the performance of interval multiplication by providing automatic interval endpoint selection and correct rounding of the results. Area and delay estimates show that compared to a conventional IEEE multiplier, the serial interval multiplier requires only 15 percent more area and has a worst case delay that is five percent longer. Depending on the input operands, it requires either two or five cycles to compute the interval product. The parallel interval multiplier uses dual multipliers to compute two interval endpoints simultaneously. It has the same worst case delay as the serial interval multiplier, yet requires only one or three cycles to compute the interval product. Compared to previous hardware interval multipliers, our designs are one to three times faster and require less area.
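For readers unfamiliar with the operation these units accelerate, interval multiplication in software can be sketched as below. This is an illustrative sketch, not the paper's hardware design: the name `interval_mul` and its `outward` flag are hypothetical, the endpoints are found by taking the minimum and maximum of all four endpoint products, and directed rounding is approximated by widening each endpoint by one floating-point step with `math.nextafter` (available from Python 3.9).

```python
import math


def interval_mul(x, y, outward=True):
    """Multiply intervals x = (a, b) and y = (c, d), returning (lo, hi).

    Hypothetical helper: endpoints are selected by brute force from the
    four products. With outward=True each endpoint is pushed one float
    outward, a conservative stand-in for directed rounding.
    """
    (a, b), (c, d) = x, y
    if a > b or c > d:
        raise ValueError("interval endpoints must satisfy lower <= upper")
    products = (a * c, a * d, b * c, b * d)
    lo, hi = min(products), max(products)
    if outward:
        lo = math.nextafter(lo, -math.inf)   # round lower endpoint down
        hi = math.nextafter(hi, math.inf)    # round upper endpoint up
    return lo, hi
```

For example, `interval_mul((-1.0, 2.0), (-3.0, 4.0), outward=False)` evaluates the products 3, -4, -6 and 8 and returns the enclosing interval. A common refinement is to inspect the signs of the operand endpoints so that fewer than four multiplications are needed in most cases.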
IV
1994
To my family and my friends
Acknowledgments
The author wishes to express his gratitude to Dr. Wood for his supervision and support of this work. He also wishes to thank Dr. Parhami and Ching YuHung for supplying the computer resources needed for his work. Special thanks to our workteam in computer