Results 1 - 10
of
236
Self-Testing/Correcting with Applications to Numerical Problems
, 1990
"... Suppose someone gives us an extremely fast program P that we can call as a black box to compute a function f . Should we trust that P works correctly? A self-testing/correcting pair allows us to: (1) estimate the probability that P (x) 6= f(x) when x is randomly chosen; (2) on any input x, compute ..."
Abstract
-
Cited by 298 (25 self)
- Add to MetaCart
Suppose someone gives us an extremely fast program P that we can call as a black box to compute a function f . Should we trust that P works correctly? A self-testing/correcting pair allows us to: (1) estimate the probability that P (x) 6= f(x) when x is randomly chosen; (2) on any input x, compute f(x) correctly as long as P is not too faulty on average. Furthermore, both (1) and (2) take time only slightly more than Computer Science Division, U.C. Berkeley, Berkeley, California 94720, Supported by NSF Grant No. CCR 88-13632. y International Computer Science Institute, Berkeley, California 94704 z Computer Science Division, U.C. Berkeley, Berkeley, California 94720, Supported by an IBM Graduate Fellowship and NSF Grant No. CCR 88-13632. the original running time of P . We present general techniques for constructing simple to program selftesting /correcting pairs for a variety of numerical problems, including integer multiplication, modular multiplication, matrix multiplicatio...
Programming Parallel Algorithms
, 1996
"... In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a th ..."
Abstract
-
Cited by 163 (7 self)
- Add to MetaCart
In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding ofparallelism but in several cases has led to improvements in sequential algorithms. Unf:ortunately there has been less success in developing good languages f:or prograftlftling parallel algorithftls, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
Algebraic Attacks on Stream Ciphers with Linear Feedback
, 2003
"... A classical construction of stream ciphers is to combine several LFSRs and a highly non-linear Boolean function f . Their security is usually studied in terms of correlation attacks, that can be seen as solving a system of multivariate linear equations, true with some probability. At ICISC'02 thi ..."
Abstract
-
Cited by 157 (19 self)
- Add to MetaCart
A classical construction of stream ciphers is to combine several LFSRs and a highly non-linear Boolean function f . Their security is usually studied in terms of correlation attacks, that can be seen as solving a system of multivariate linear equations, true with some probability. At ICISC'02 this approach is extended to systems of higher-degree multivariate equations, and gives an attack in 2 for Toyocrypt, a Cryptrec submission.
Tensor Decompositions and Applications
- SIAM REVIEW
, 2009
"... This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or N -way array. Decompositions of higher-order tensors (i.e., N -way arrays with N ⥠3) have applications in psychometrics, chemometrics, signal proce ..."
Abstract
-
Cited by 95 (13 self)
- Add to MetaCart
This survey provides an overview of higher-order tensor decompositions, their applications, and available software. A tensor is a multidimensional or N -way array. Decompositions of higher-order tensors (i.e., N -way arrays with N ⥠3) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, etc. Two particular tensor decompositions can be considered to be higher-order extensions of the matrix singular value decompo-
sition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rank-one tensors, and the Tucker decomposition is a higher-order form of principal components analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The N-way Toolbox and Tensor Toolbox, both for MATLAB, and the Multilinear Engine are examples of software packages for working with tensors.
Locality Of Reference In Lu Decomposition With Partial Pivoting
- SIAM JOURNAL ON MATRIX ANALYSIS AND APPLICATIONS
, 1997
"... This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of refer ..."
Abstract
-
Cited by 84 (6 self)
- Add to MetaCart
This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the right-looking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the right-looking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the right-looking algorithm. Experimental comparisons between the new algorithm and the right-looking algorithm show that an implementation of the new algorithm outperforms a similarly coded right-looking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrix-multiplication algorithm.
GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark
- ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
, 1998
"... The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures the development of optimal level 3 BLAS code is costly and time consuming. Howev ..."
Abstract
-
Cited by 74 (8 self)
- Add to MetaCart
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures the development of optimal level 3 BLAS code is costly and time consuming. However, it is possible to develop a portable and high-performance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMM-based level 3 BLAS are structured to reduced effectively data traffic in a memory hierarchy. Second, the GEMM-based level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations.
Auto-Blocking Matrix-Multiplication or Tracking BLAS3 Performance from Source Code
- In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1997
"... An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. ``Proof of concept'' is demonstrated by racing the in-pla ..."
Abstract
-
Cited by 69 (6 self)
- Add to MetaCart
An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking hand-coded BLAS3 routines. ``Proof of concept'' is demonstrated by racing the in-place algorithm against manufacturer's hand-tuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent block-oriented processes, that each writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the superscalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
Information-flow and data-flow analysis of while-programs
- ACM Transactions on Programming Languages and Systems
, 1985
"... Until recently, information-flow analysis has been used primarily to verify that information trans-mission between program variables cannot violate security requirements. Here, the notion of infor-mation flow is explored as an aid to program development and validation. Information-flow relations are ..."
Abstract
-
Cited by 69 (0 self)
- Add to MetaCart
Until recently, information-flow analysis has been used primarily to verify that information trans-mission between program variables cannot violate security requirements. Here, the notion of infor-mation flow is explored as an aid to program development and validation. Information-flow relations are presented for while-programs, which identify those program statements whose execution may cause information to be transmitted from or to particular input, internal, or output values. It is shown with examples how these flow relations can be helpful in writing, testing, and updating programs; they also usefully extend the class of errors which can be detected automatically in the “static analysis ” of a program.
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. ..."
Abstract
-
Cited by 67 (4 self)
- Add to MetaCart
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, non-programmable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by re-ordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (2--5% of total running time) and high performance benefits (reducing execution time by factors of 1.1-2.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursion-based control structures may be needed to fully exploit their potential.
Fast Estimation of Diameter and Shortest Paths (without Matrix Multiplication)
, 1996
"... this paper is organized as follows. We begin by presenting some definitions and useful observations in Section 2. In Section 3, we describe the algorithms for distinguishing between graphs of diameter 2 and 4, and the extension to obtaining a ratio 2=3 approximation to the diameter. Then, in Section ..."
Abstract
-
Cited by 58 (2 self)
- Add to MetaCart
this paper is organized as follows. We begin by presenting some definitions and useful observations in Section 2. In Section 3, we describe the algorithms for distinguishing between graphs of diameter 2 and 4, and the extension to obtaining a ratio 2=3 approximation to the diameter. Then, in Section 4, we apply the ideas developed in estimating the diameter to obtain the promised algorithm for an additive approximation for APSP. Finally, in Section 5 we present an empirical study of the performance of our algorithm for all-pairs shortest paths.

