Results 1  10
of
312
SelfTesting/Correcting with Applications to Numerical Problems
, 1990
"... Suppose someone gives us an extremely fast program P that we can call as a black box to compute a function f . Should we trust that P works correctly? A selftesting/correcting pair allows us to: (1) estimate the probability that P (x) 6= f(x) when x is randomly chosen; (2) on any input x, compute ..."
Abstract

Cited by 340 (26 self)
 Add to MetaCart
Suppose someone gives us an extremely fast program P that we can call as a black box to compute a function f . Should we trust that P works correctly? A selftesting/correcting pair allows us to: (1) estimate the probability that P (x) 6= f(x) when x is randomly chosen; (2) on any input x, compute f(x) correctly as long as P is not too faulty on average. Furthermore, both (1) and (2) take time only slightly more than Computer Science Division, U.C. Berkeley, Berkeley, California 94720, Supported by NSF Grant No. CCR 8813632. y International Computer Science Institute, Berkeley, California 94704 z Computer Science Division, U.C. Berkeley, Berkeley, California 94720, Supported by an IBM Graduate Fellowship and NSF Grant No. CCR 8813632. the original running time of P . We present general techniques for constructing simple to program selftesting /correcting pairs for a variety of numerical problems, including integer multiplication, modular multiplication, matrix multiplicatio...
Tensor Decompositions and Applications
 SIAM REVIEW
, 2009
"... This survey provides an overview of higherorder tensor decompositions, their applications, and available software. A tensor is a multidimensional or N way array. Decompositions of higherorder tensors (i.e., N way arrays with N â¥ 3) have applications in psychometrics, chemometrics, signal proce ..."
Abstract

Cited by 228 (14 self)
 Add to MetaCart
This survey provides an overview of higherorder tensor decompositions, their applications, and available software. A tensor is a multidimensional or N way array. Decompositions of higherorder tensors (i.e., N way arrays with N â¥ 3) have applications in psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, etc. Two particular tensor decompositions can be considered to be higherorder extensions of the matrix singular value decompo
sition: CANDECOMP/PARAFAC (CP) decomposes a tensor as a sum of rankone tensors, and the Tucker decomposition is a higherorder form of principal components analysis. There are many other tensor decompositions, including INDSCAL, PARAFAC2, CANDELINC, DEDICOM, and PARATUCK2 as well as nonnegative variants of all of the above. The Nway Toolbox and Tensor Toolbox, both for MATLAB, and the Multilinear Engine are examples of software packages for working with tensors.
Algebraic Attacks on Stream Ciphers with Linear Feedback
, 2003
"... A classical construction of stream ciphers is to combine several LFSRs and a highly nonlinear Boolean function f . Their security is usually studied in terms of correlation attacks, that can be seen as solving a system of multivariate linear equations, true with some probability. At ICISC'02 thi ..."
Abstract

Cited by 203 (22 self)
 Add to MetaCart
A classical construction of stream ciphers is to combine several LFSRs and a highly nonlinear Boolean function f . Their security is usually studied in terms of correlation attacks, that can be seen as solving a system of multivariate linear equations, true with some probability. At ICISC'02 this approach is extended to systems of higherdegree multivariate equations, and gives an attack in 2 for Toyocrypt, a Cryptrec submission.
Programming Parallel Algorithms
, 1996
"... In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a th ..."
Abstract

Cited by 193 (9 self)
 Add to MetaCart
In the past 20 years there has been treftlendous progress in developing and analyzing parallel algorithftls. Researchers have developed efficient parallel algorithms to solve most problems for which efficient sequential solutions are known. Although some ofthese algorithms are efficient only in a theoretical framework, many are quite efficient in practice or have key ideas that have been used in efficient implementations. This research on parallel algorithms has not only improved our general understanding ofparallelism but in several cases has led to improvements in sequential algorithms. Unf:ortunately there has been less success in developing good languages f:or prograftlftling parallel algorithftls, particularly languages that are well suited for teaching and prototyping algorithms. There has been a large gap between languages
An analysis of dagconsistent distributed sharedmemory algorithms
 in Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA
, 1996
"... In this paper, we analyze the performance of parallel multithreaded algorithms that use dagconsistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a FP(C) workstealing thread scheduler and the BACKE ..."
Abstract

Cited by 105 (22 self)
 Add to MetaCart
In this paper, we analyze the performance of parallel multithreaded algorithms that use dagconsistent distributed shared memory. Specifically, we analyze execution time, page faults, and space requirements for multithreaded algorithms executed by a FP(C) workstealing thread scheduler and the BACKER algorithm for maintaining dag consistency. We prove that if the accesses to the backing store are random and independent (the BACKER algorithm actually uses hashing), the expected execution time TP(C)of a “fully strict” multithreaded computation on P processors, each with a LRU cache of C pages, is O(T1(C)=P+mCT∞), where T1(C)is the total work of the computation including page faults, T ∞ is its criticalpath length excluding page faults, and m is the minimum page transfer time. As
Locality Of Reference In Lu Decomposition With Partial Pivoting
 SIAM JOURNAL ON MATRIX ANALYSIS AND APPLICATIONS
, 1997
"... This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of refer ..."
Abstract

Cited by 96 (10 self)
 Add to MetaCart
This paper presents a new partitioned algorithm for LU decomposition with partial pivoting. The new algorithm, called the recursively partitioned algorithm, is based on a recursive partitioning of the matrix. The paper analyzes the locality of reference in the new algorithm and the locality of reference in a known and widely used partitioned algorithm for LU decomposition called the rightlooking algorithm. The analysis reveals that the new algorithm performs a factor of $\Theta(\sqrt{M/n})$ fewer I/O operations (or cache misses) than the rightlooking algorithm, where $n$ is the order of the matrix and $M$ is the size of primary memory. The analysis also determines the optimal block size for the rightlooking algorithm. Experimental comparisons between the new algorithm and the rightlooking algorithm show that an implementation of the new algorithm outperforms a similarly coded rightlooking algorithm on six different RISC architectures, that the new algorithm performs fewer cache misses than any other algorithm tested, and that it benefits more from Strassen's matrixmultiplication algorithm.
GEMMBased Level 3 BLAS: HighPerformance Model Implementations and Performance Evaluation Benchmark
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
, 1998
"... The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures the development of optimal level 3 BLAS code is costly and time consuming. Howev ..."
Abstract

Cited by 89 (8 self)
 Add to MetaCart
The level 3 Basic Linear Algebra Subprograms (BLAS) are designed to perform various matrix multiply and triangular system solving computations. Due to the complex hardware organization of advanced computer architectures the development of optimal level 3 BLAS code is costly and time consuming. However, it is possible to develop a portable and highperformance level 3 BLAS library mainly relying on a highly optimized GEMM, the routine for the general matrix multiply and add operation. With suitable partitioning, all the other level 3 BLAS can be defined in terms of GEMM and a small amount of level 1 and level 2 computations. Our contribution is twofold. First, the model implementations in Fortran 77 of the GEMMbased level 3 BLAS are structured to reduced effectively data traffic in a memory hierarchy. Second, the GEMMbased level 3 BLAS performance evaluation benchmark is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMMbased model implementations.
AutoBlocking MatrixMultiplication or Tracking BLAS3 Performance from Source Code
 In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, 1997
"... An elementary, machineindependent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking handcoded BLAS3 routines. ``Proof of concept'' is demonstrated by racing the inpla ..."
Abstract

Cited by 76 (6 self)
 Add to MetaCart
An elementary, machineindependent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimal code, tracking handcoded BLAS3 routines. ``Proof of concept'' is demonstrated by racing the inplace algorithm against manufacturer's handtuned BLAS3 routines; it can win. The recursive code bifurcates naturally at the top level into independent blockoriented processes, that each writes to a disjoint and contiguous region of memory. Experience has shown that the indexing vastly improves the patterns of memory access at all levels of the memory hierarchy, independently of the sizes of caches or pages and without ad hoc programming. It also exposed a weakness in SGI's C compilers that merrily unroll loops for the superscalar R8000 processor, but do not analogously unfold the base cases of the most elementary recursions. Such deficiencies might deter programmers from using this rich class of recursive algorithms.
Nonlinear Array Layouts for Hierarchical Memory Systems
, 1999
"... Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, nonprogrammable array attribute. ..."
Abstract

Cited by 72 (5 self)
 Add to MetaCart
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an invisible, nonprogrammable array attribute. In reality, modern memory systems are architecturally hierarchical rather than flat, with substantial differences in performance among different levels of the hierarchy. This mismatch between the model and the true architecture of memory systems can result in low locality of reference and poor performance. Some of this loss in performance can be recovered by reordering computations using transformations such as loop tiling. We explore nonlinear array layout functions as an additional means of improving locality of reference. For a benchmark suite composed of dense matrix kernels, we show by timing and simulation that two specific layouts (4D and Morton) have low implementation costs (25% of total running time) and high performance benefits (reducing execution time by factors of 1.12.5); that they have smooth performance curves, both across a wide range of problem sizes and over representative cache architectures; and that recursionbased control structures may be needed to fully exploit their potential.
Solving Large Sparse Linear Systems Over Finite Fields
, 1991
"... Many of the fast methods for factoring integers and computing discrete logarithms require the solution of large sparse linear systems of equations over finite fields. This paper presents the results of implementations of several linear algebra algorithms. It shows that very large sparse systems can ..."
Abstract

Cited by 72 (2 self)
 Add to MetaCart
Many of the fast methods for factoring integers and computing discrete logarithms require the solution of large sparse linear systems of equations over finite fields. This paper presents the results of implementations of several linear algebra algorithms. It shows that very large sparse systems can be solved efficiently by using combinations of structured Gaussian elimination and the conjugate gradient, Lanczos, and Wiedemann methods. 1. Introduction Factoring integers and computing discrete logarithms often requires solving large systems of linear equations over finite fields. General surveys of these areas are presented in [14, 17, 19]. So far there have been few implementations of discrete logarithm algorithms, but many of integer factoring methods. Some of the published results have involved solving systems of over 6 \Theta 10 4 equations in more than 6 \Theta 10 4 variables [12]. In factoring, equations have had to be solved over the field GF (2). In that situation, ordinary...