Results 1 - 9 of 9
Locality of Reference in LU Decomposition with Partial Pivoting
 SIAM JOURNAL ON MATRIX ANALYSIS AND APPLICATIONS
, 1997
Abstract

Cited by 96 (10 self)
This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the right-looking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the right-looking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the right-looking algorithm. Experimental comparisons between the new algorithm and the right-looking algorithm show that an implementation of the new algorithm outperforms a similarly coded right-looking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrix-multiplication algorithm.
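The recursive partitioning can be sketched in a few lines. The following is a minimal NumPy illustration of the scheme described above (function names and the single-column base case are ours, not the paper's implementation): factor the left block column, apply a triangular solve to obtain the right block of U, update the trailing submatrix, then recurse.

```python
import numpy as np

def recursive_lu(A):
    """Sketch of recursively partitioned LU with partial pivoting.
    Returns the row permutation p and the packed L\\U factors, so
    that A[p] == L @ U (L unit lower triangular, U upper)."""
    A = np.array(A, dtype=float)
    p = np.arange(A.shape[0])
    _factor(A, p, 0, A.shape[0])
    return p, A

def _factor(A, p, c0, c1):
    """Factor columns c0:c1 (over rows c0:m) in place."""
    if c1 - c0 == 1:                          # base case: one column
        k = c0 + np.argmax(np.abs(A[c0:, c0]))
        A[[c0, k]] = A[[k, c0]]               # pivot: swap full rows
        p[[c0, k]] = p[[k, c0]]
        if A[c0, c0] != 0.0:
            A[c0 + 1:, c0] /= A[c0, c0]       # scale multipliers
        return
    h = (c0 + c1) // 2
    _factor(A, p, c0, h)                      # factor left block column
    # U12 = L11^{-1} A12 (L11 is the unit lower triangle just computed)
    L11 = np.tril(A[c0:h, c0:h], -1) + np.eye(h - c0)
    A[c0:h, h:c1] = np.linalg.solve(L11, A[c0:h, h:c1])
    # Schur-complement update of the trailing submatrix
    A[h:, h:c1] -= A[h:, c0:h] @ A[c0:h, h:c1]
    _factor(A, p, h, c1)                      # recurse on trailing block
```

The two large operations per level (the triangular solve and the matrix-multiply update) are what give the recursion its favorable locality compared to a fixed-block right-looking sweep.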
Real-Time Optical Flow
 Minneapolis, Minnesota
, 1995
Abstract

Cited by 18 (4 self)
Currently two major limitations to applying vision in real tasks are robustness in real-world, uncontrolled environments, and the computational resources required for real-time operation. In particular, many current robotic visual motion detection algorithms (optical flow) are not suited for practical applications such as segmentation and structure-from-motion because they either require highly specialized hardware or up to several minutes on a scientific workstation. In addition, many such algorithms depend on the computation of first and in some cases higher numerical derivatives, which are notoriously sensitive to noise. In fact, the current trend in optical flow research is to stress accuracy under ideal conditions and not to consider computational resource requirements or resistance to noise, which are essential for real-time robotics. As a result, robotic vision researchers are frustrated by an inability to obtain reliable optical flow estimates in real-world conditions, and practica...
Accuracy and Speed-Up of Parallel Trace-Driven Architectural Simulation
 In Proc. Int’l Parallel Processing Symp., IEEE Computer Soc
, 1997
Abstract

Cited by 13 (0 self)
Trace-driven simulation continues to be one of the main evaluation methods in the design of high-performance processor-memory subsystems. In this paper, we examine the varying speedup opportunities available by processing a given trace in parallel on an IBM SP2 machine. We also develop a simple, yet effective method of correcting for cold-start cache miss errors by the use of overlapped trace chunks. We then report selected experimental results to validate our expectations. We show that it is possible to achieve near-perfect speedup without loss of accuracy. Next, in order to achieve a further reduction in simulation cost, we combine uniform sampling methods with parallel trace processing, with a slight loss of accuracy for finite-cache timer runs. We then show that by using warm-start sequences from preceding trace chunks, it is possible to reduce the errors back to acceptable bounds.
1. Introduction
The ever-increasing sizes of real workloads are making the use of trace...
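The overlapped-chunk idea can be sketched as follows. The function name and the fixed-length warm-up window are hypothetical simplifications of the paper's method: each parallel worker replays its warm-start prefix only to populate the cache, then collects statistics on the chunk proper, mitigating cold-start miss errors.

```python
def chunk_trace(trace, num_chunks, warmup):
    """Split an address trace into chunks for parallel simulation.
    Each chunk after the first is paired with a warm-start prefix:
    the last `warmup` references of the preceding chunk.  Returns a
    list of (warm_prefix, chunk) pairs."""
    n = len(trace)
    size = (n + num_chunks - 1) // num_chunks   # ceiling division
    chunks = []
    for i in range(num_chunks):
        start = i * size
        if start >= n:
            break
        end = min(start + size, n)
        prefix = trace[max(0, start - warmup):start]  # overlap region
        chunks.append((prefix, trace[start:end]))
    return chunks
```

A worker simulating `(prefix, chunk)` runs the cache model over `prefix + chunk` but only records hit/miss counts for references in `chunk`, so per-chunk statistics approximate those of a single sequential pass.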
The Future Fast Fourier Transform
 SIAM J. Sci. Computing
, 1999
Abstract

Cited by 13 (0 self)
It seems likely that improvements in arithmetic speed will continue to outpace advances in communications bandwidth. Furthermore, as more and more problems involve huge datasets, it is becoming increasingly likely that data will be distributed across many processors because one processor does not have sufficient storage capacity. For these reasons, we propose that an inexact DFT, such as an approximate matrix-vector approach based on singular values or a variation of the Dutt-Rokhlin fast-multipole-based algorithm [9], may outperform any exact parallel FFT. The speedup may be as large as a factor of three in situations where FFT run time is dominated by communication. For the multipole idea we further propose that a method of "virtual charges" may improve accuracy, and we provide an analysis of the singular values that are needed for the approximate matrix-vector approaches.
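The singular-value-based approximate matrix-vector idea can be illustrated generically. Note that the toy matrix below is an assumption for illustration only: the full DFT matrix has a flat singular spectrum, which is precisely why the paper applies such truncations blockwise (or via a multipole expansion) rather than globally.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
# Hypothetical test matrix with rapidly decaying singular values
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 2.0 ** -np.arange(n)        # singular values 1, 1/2, 1/4, ...
A = (Q1 * s) @ Q2.T

def approx_matvec(A, x, k):
    """Rank-k approximation of A @ x via a truncated SVD.
    The error is bounded by sigma_{k+1} * ||x||, so dropping the
    small singular values trades accuracy for a cheaper product."""
    U, sv, Vt = np.linalg.svd(A)
    return U[:, :k] @ (sv[:k] * (Vt[:k] @ x))

x = rng.standard_normal(n)
err = np.linalg.norm(approx_matvec(A, x, 10) - A @ x)
```

In a distributed setting the payoff is that only the rank-k factors (and k-vectors of intermediate data) cross the network, rather than the full matrix-vector traffic of an exact transform.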
Portable High Performance Programming via Architecture-Cognizant Divide-and-Conquer Algorithms
, 2000
Abstract

Cited by 5 (0 self)
Table of contents (front matter):
1 Introduction
  1. Divide-and-Conquer and the Memory Hierarchy
  2. Overview of Architecture-Cognizant Divide-and-Conquer
  3. Overview of Napoleon
  4. What You Can Expect
  5. Contributions
     1. Divide-and-Conquer Algorithms for Performance Programming
     2. The Importance of Architecture-Cognizance
     3. Complexity of Determining Variant-Policy
     4. A Framework and System for Divide-and-Conquer Implementations
     5. The Fastest Portable FFT Algorithm
  6. Outline of Thesis ...
Algorithms for Compiler-Assisted Design Space Exploration of Clustered VLIW ASIP Datapaths
, 2001
Abstract

Cited by 3 (1 self)
Clustered Very Long Instruction Word Application-Specific Instruction Set Processors (VLIW ASIPs) combined with effective compilation techniques enable aggressive exploitation of the instruction-level parallelism inherent in many embedded media applications, while unlocking a variety of possible performance/cost tradeoffs. In this dissertation we propose and validate an algorithm to support early design space exploration (DSE) over classes of datapaths, in the context of a specific target application, and carry out an empirical study for a set of representative benchmarks. We argue that at an early DSE phase one should use design space parameters that have a first-order impact on two key physical figures of merit: clock rate f and power dissipation P. We found these parameters to be: maximum cluster capacity (number of functional units in a cluster) NF, number of clusters NC, and the interconnect capacity NB. The experimental validation of our DSE algorithm shows that a thorough exploration of the complex design space can be performed very efficiently in this parameterized design space. Moreover, our case studies suggest that the penalties of clustered versus non-clustered datapaths are often minimal and that clustering indeed unlocks a variety of valuable design alternatives. Our exploration methodology is enabled by an efficient algorithm for binding operations in a dataflow graph to the clusters of a datapath, so as to minimize latency and the number of data transfers. The algorithm utilizes effective cost and ranking functions that enable the exploration of complex tradeoffs between: (1) operation serialization, due to cluster overload; and (2) penalties incurred by data transfers, due to scattering operations with data dependencies over different clusters.
The core binding algorithm has shown robustness over a large set of datapaths and application kernels, and demonstrated up to 29% improvement in schedule latency compared to a state-of-the-art advanced binding algorithm.
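A toy greedy binder conveys the flavor of the tradeoff between cluster overload and data transfers. This sketch is ours, not the dissertation's algorithm: the real method uses more sophisticated cost and ranking functions, and its capacity constraint is per-cycle functional-unit pressure rather than the crude per-cluster operation count used here.

```python
from graphlib import TopologicalSorter

def bind_ops(deps, num_clusters, capacity):
    """Greedily bind dataflow-graph operations to clusters.
    `deps` maps each op to the list of ops producing its inputs.
    Each op is placed on the non-full cluster that already holds the
    most of its predecessors, so inter-cluster transfers are few.
    Assumes num_clusters * capacity >= number of ops."""
    order = TopologicalSorter(deps).static_order()  # preds come first
    binding = {}
    load = [0] * num_clusters
    transfers = 0
    for op in order:
        best, best_local = None, -1
        for c in range(num_clusters):
            if load[c] >= capacity:          # cluster overloaded
                continue
            local = sum(1 for p in deps.get(op, ()) if binding[p] == c)
            if local > best_local:
                best, best_local = c, local
        binding[op] = best
        load[best] += 1
        # every predecessor on another cluster costs one data transfer
        transfers += sum(1 for p in deps.get(op, ()) if binding[p] != best)
    return binding, transfers
```

Even this crude version exhibits the phenomenon the abstract describes: shrinking `capacity` forces dependent operations apart and the transfer count rises, which is exactly the serialization-versus-transfer tradeoff the real cost functions navigate.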
A Uniform Internal Representation for High-Level and Instruction-Level Transformations (Eduard Ayguadé, Cristina Barrado, Jesús Labarta, David López, Susana Moreno, David Padua, and Mateo Valero)
, 1995
Abstract

Cited by 1 (0 self)
this paper we describe a strategy that will make it possible, after applying a small number of changes, to represent low-level operations as part of the internal representation of a conventional source-to-source Fortran translator. Briefly, our strategy is to represent the low-level operations as Fortran statements. In this way, all the transformation and analysis routines available in the source-to-source restructurer can be applied to the low-level representation of the program. The source-to-source parallelizer could then be extended to include many traditional analysis and transformation steps, such as strength reduction and register allocation, not usually performed by this translator. The generation of machine instructions is done as a last step by a direct mapping from each Fortran statement onto one or more machine instructions. The source-to-source restructurer is therefore extended into a complete compiler as shown in Figure 1. All transformations, including high-level parallelization and the traditional scalar optimizations, can now be performed in a unified framework based on a single internal representation. One additional advantage of representing the low-level operations as Fortran statements is that the outcome of each transformation, both high- and low-level, can be trivially transformed into a Fortran program that could be executed to test the correctness of the transformation. Another approach that also uses a uniform representation for both high-level parallelization and scalar optimizations was the one followed in the IBM Fortran compiler [ScKo86]. The main difference with our approach is that this compiler evolved from a traditional back-end compiler which was extended to do some of the high-level transformations usually performed in other systems by...
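The final direct-mapping step can be sketched as follows. The statement syntax and machine mnemonics are hypothetical, chosen only to illustrate the idea of mapping one Fortran-style statement onto one machine instruction; the paper's translator handles full Fortran, not this toy subset.

```python
import re

# Hypothetical three-address mnemonics for illustration
OPS = {"+": "ADD", "-": "SUB", "*": "MUL"}

def lower(stmt):
    """Map a Fortran-style assignment statement onto a machine
    instruction string: 'R1 = R2 + R3' -> 'ADD R1, R2, R3' and
    'R1 = R2' -> 'MOVE R1, R2'."""
    m = re.fullmatch(r"(\w+) = (\w+) ([+\-*]) (\w+)", stmt.strip())
    if m:
        dst, a, op, b = m.groups()
        return f"{OPS[op]} {dst}, {a}, {b}"
    m = re.fullmatch(r"(\w+) = (\w+)", stmt.strip())
    if m:
        return f"MOVE {m.group(1)}, {m.group(2)}"
    raise ValueError(f"statement not in lowered form: {stmt}")
```

Because every statement in the lowered program is still legal Fortran, the restructurer's existing analyses run on it unchanged, and code generation reduces to applying a mapping like this one statement at a time.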
unit: Architecture and implementation
Abstract
... extends the concept of the innovative multiply-add fused (MAF) ALU of the RISC System/6000® processor to provide a floating-point unit that sets new standards, not only for computation capability but for data throughput and processor flexibility. The POWER2 FPU achieves a performance (MFLOPS) rate never accomplished before by a personal workstation machine by 1) integrating dual generic MAF ALUs, 2) doubling the instruction bandwidth and quadrupling the data bandwidth over that of the POWER FPU, 3) adding support for additional functions, and 4) using dynamic instruction scheduling techniques to maximize instruction-level parallelism not only among its own internal units but with the rest of the CPU.
... point performance. Its innovative multiply-add fused (MAF) dataflow minimizes latency, rounding error, and chip busing [1]. The MAF unit performs a double-precision multiply in a single cycle and a double-precision add in the following cycle. A single round occurs in the final and bypassable stage of the pipeline. The FPU combines, in a single two-stage pipeline, capabilities which many other processors, such as the SuperSPARC Microprocessor [2], provide with two units, usually a separate multiplier and adder. The simultaneous use of multiple execution units requires additional data buses as well as control logic for detecting dependencies across units. The architecture supports the exploitation of the MAF capability through a set of multiply-add instructions. The RS/6000 processor support of these instructions allows execution of a dependent pair of operations with a combined latency of only two cycles. This feature is unique in the industry. The POWER2™ FPU design goal is to build upon these strong points to provide an FPU that sets new standards not only for computation capability but also for ...
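The benefit of the single final rounding in a fused multiply-add can be demonstrated numerically. This sketch is an illustration of the rounding behavior, not of the POWER2 hardware: the fused result is simulated with exact rational arithmetic, rounded to a double only once at the end.

```python
from fractions import Fraction

# Choose a, b so that a*b is exactly 1 - 2**-60 in real arithmetic.
# Rounding a*b to double gives 1.0 (the residual is below half an
# ulp), so separate multiply-then-add loses the answer entirely.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Two roundings: a*b rounds to 1.0, then 1.0 + c == 0.0
separate = a * b + c

# One rounding: compute a*b + c exactly, convert to double once,
# as a fused multiply-add would
fused = float(Fraction(a) * Fraction(b) + Fraction(c))
```

Here `separate` is 0.0 while `fused` recovers the true residual -2**-60, which is why a single-round MAF pipeline reduces rounding error as well as latency.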