Results 1 - 10 of 53
Revisiting the Nyström method for improved large-scale machine learning
"... We reconsider randomized algorithms for the low-rank approximation of SPSD matrices such as Laplacian and kernel matrices that arise in data analysis and machine learning applications. Our main results consist of an empirical evaluation of the performance quality and running time of sampling and pro ..."
Abstract
-
Cited by 34 (5 self)
- Add to MetaCart
(Show Context)
We reconsider randomized algorithms for the low-rank approximation of SPSD matrices such as Laplacian and kernel matrices that arise in data analysis and machine learning applications. Our main results consist of an empirical evaluation of the performance quality and running time of sampling and projection methods on a diverse suite of SPSD matrices. Our results highlight complementary aspects of sampling versus projection methods, and they point to differences between uniform and nonuniform sampling methods based on leverage scores. We complement our empirical results with a suite of worst-case theoretical bounds for both random sampling and random projection methods. These bounds are qualitatively superior to existing bounds, e.g., improved additive-error bounds for spectral and Frobenius norm error and relative-error bounds for trace norm error.
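As a concrete reference point for the method the paper revisits, here is a minimal NumPy sketch of the plain Nyström approximation with uniform column sampling; the function name, the RBF test kernel, and the choice of k are illustrative assumptions, and the paper's leverage-score-based and projection-based variants are not shown.

```python
import numpy as np

def nystrom_approx(K, k, seed=0):
    """Nystrom approximation of an SPSD matrix K from k uniformly sampled
    columns: K is approximated by C @ pinv(W) @ C.T, where C holds the
    sampled columns and W is the k x k block at their intersection."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(K.shape[0], size=k, replace=False)   # uniform column sampling
    C = K[:, idx]                                         # n x k sampled columns
    W = K[np.ix_(idx, idx)]                               # k x k intersection block
    return C @ np.linalg.pinv(W) @ C.T

# Toy usage on a Gaussian (RBF) kernel matrix of random points.
X = np.random.default_rng(0).standard_normal((500, 5))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 10.0)
K_hat = nystrom_approx(K, k=50)
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))      # relative Frobenius error
```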
OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings
, 2012
"... An oblivious subspace embedding (OSE) given some parameters ε, d is a distribution D over matrices Π ∈ R m×n such that for any linear subspace W ⊆ R n with dim(W) = d it holds that PΠ∼D(∀x ∈ W ‖Πx‖2 ∈ (1 ± ε)‖x‖2)> 2/3. We show an OSE exists with m = O(d 2 /ε 2) and where every Π in the support ..."
Abstract
-
Cited by 32 (7 self)
- Add to MetaCart
(Show Context)
An oblivious subspace embedding (OSE) given some parameters ε, d is a distribution D over matrices Π ∈ R^{m×n} such that for any linear subspace W ⊆ R^n with dim(W) = d it holds that P_{Π∼D}(∀x ∈ W: ‖Πx‖_2 ∈ (1 ± ε)‖x‖_2) > 2/3. We show an OSE exists with m = O(d^2/ε^2) and where every Π in the support of D has exactly s = 1 non-zero entry per column. This improves the previously best known bound in [Clarkson–Woodruff, arXiv abs/1207.6365]. Our quadratic dependence on d is optimal for any OSE with s = 1 [Nelson–Nguyễn, 2012]. We also give two OSEs, which we call Oblivious Sparse Norm-Approximating Projections (OSNAPs), that both allow the parameter settings m = Õ(d/ε^2) and s = polylog(d)/ε, or m = O(d^{1+γ}/ε^2) and s = O(1/ε) for any constant γ > 0. This m is nearly optimal since m ≥ d is required simply to ensure no non-zero vector of W lands in the kernel of Π. These are the first constructions with m = o(d^2) to have s = o(d). In fact, our OSNAPs are nothing more than the sparse Johnson-Lindenstrauss matrices of [Kane–Nelson, SODA 2012]. Our analyses all yield OSEs that are sampled using either O(1)-wise or O(log d)-wise independent hash functions.
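A minimal sketch of the object the abstract analyzes: a sparse embedding Π with s = 1 nonzero (a random ±1) per column, applied to A in O(nnz(A)) time. The function name is made up, the hash and signs are drawn fully independently for simplicity (the paper's point is that limited independence suffices), and m is chosen ad hoc rather than by the stated bounds.

```python
import numpy as np

def sparse_embed(A, m, seed=0):
    """Compute Pi @ A where Pi (m x n) has exactly one nonzero entry, a
    random +/-1, in each column; this runs in time proportional to nnz(A)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    rows = rng.integers(0, m, size=n)            # hash each coordinate to a row of Pi
    signs = rng.choice((-1.0, 1.0), size=n)      # random sign per coordinate
    PiA = np.zeros((m, d))
    np.add.at(PiA, rows, signs[:, None] * A)     # scatter-add the signed rows of A
    return PiA

# Usage: norms of vectors in the column space of A are roughly preserved.
rng = np.random.default_rng(1)
A = rng.standard_normal((10_000, 20))
PiA = sparse_embed(A, m=2_000)
x = rng.standard_normal(20)
print(np.linalg.norm(PiA @ x) / np.linalg.norm(A @ x))   # should be close to 1
```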
Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression
, 2012
"... Low-distortion embeddings are critical building blocks for developing random sampling and random projection algo-rithms for common linear algebra problems. We show that, given a matrix A ∈ Rn×d with n d and a p ∈ [1, 2), with a constant probability, we can construct a low-distortion em-bedding matr ..."
Abstract
-
Cited by 26 (4 self)
- Add to MetaCart
Low-distortion embeddings are critical building blocks for developing random sampling and random projection algorithms for common linear algebra problems. We show that, given a matrix A ∈ R^{n×d} with n ≫ d and a p ∈ [1, 2), with a constant probability, we can construct a low-distortion embedding matrix Π ∈ R^{O(poly(d))×n} that embeds A_p, the ℓ_p subspace spanned by A's columns, into (R^{O(poly(d))}, ‖·‖_p); the distortion of our embeddings is only O(poly(d)), and we can compute ΠA in O(nnz(A)) time, i.e., input-sparsity time. Our result generalizes the input-sparsity time ℓ_2 subspace embedding by Clarkson and Woodruff [STOC'13]; and for completeness, we present a simpler and improved analysis of their construction for ℓ_2. These input-sparsity time ℓ_p embeddings are optimal, up to constants, in terms of their running time; and the improved running time propagates to applications such as (1 ± ε)-distortion ℓ_p subspace embedding and relative-error ℓ_p regression. For ℓ_2, we show that a (1 + ε)-approximate solution to the ℓ_2 regression problem specified by the matrix A and a vector b ∈ R^n can be computed in O(nnz(A) + d^3 log(d/ε)/ε^2) time; and for ℓ_p, via a subspace-preserving sampling procedure, we show that a (1 ± ε)-distortion embedding of A_p into R^{O(poly(d))} can be computed in O(nnz(A) · log n) time, and we also show that a (1 + ε)-approximate solution to the ℓ_p regression problem min_{x∈R^d} ‖Ax − b‖_p can be computed in O(nnz(A) · log n + poly(d) log(1/ε)/ε^2) time. Moreover, we can also improve the embedding dimension, or equivalently the sample size, to O(d^{3+p/2} log(1/ε)/ε^2) without increasing the complexity.
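For the ℓ_2 case discussed above, the sketch-and-solve pattern looks roughly as follows: apply an input-sparsity-time embedding to [A, b] and solve the much smaller least-squares problem. This is a hedged illustration, not the paper's construction; the embedding is a CountSketch-style Π with one nonzero per column, the sketch size m is ad hoc, and the ℓ_p (p < 2) results, which need different tools, are not shown.

```python
import numpy as np

def countsketch(M, m, rng):
    """Pi @ M with one random +/-1 per column of Pi; O(nnz(M)) time."""
    rows = rng.integers(0, m, size=M.shape[0])
    signs = rng.choice((-1.0, 1.0), size=M.shape[0])
    S = np.zeros((m, M.shape[1]))
    np.add.at(S, rows, signs[:, None] * M)
    return S

def sketched_lstsq(A, b, m, seed=0):
    """Approximate argmin_x ||Ax - b||_2 by solving the sketched problem
    argmin_x ||(Pi A) x - Pi b||_2, which has only m rows."""
    rng = np.random.default_rng(seed)
    S = countsketch(np.column_stack([A, b]), m, rng)   # sketch A and b together
    x, *_ = np.linalg.lstsq(S[:, :-1], S[:, -1], rcond=None)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((20_000, 10))
b = A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(20_000)
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
x_skt = sketched_lstsq(A, b, m=2_000)
# Residual of the sketched solution is close to the optimal residual.
print(np.linalg.norm(A @ x_skt - b) / np.linalg.norm(A @ x_full - b))
```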
Improved matrix algorithms via the subsampled randomized Hadamard transform
- SIAM J. Matrix Analysis Applications
"... Abstract. Several recent randomized linear algebra algorithms rely upon fast dimension reduc-tion methods. A popular choice is the subsampled randomized Hadamard transform (SRHT). In this article, we address the efficacy, in the Frobenius and spectral norms, of an SRHT-based low-rank matrix approxim ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
Several recent randomized linear algebra algorithms rely upon fast dimension reduction methods. A popular choice is the subsampled randomized Hadamard transform (SRHT). In this article, we address the efficacy, in the Frobenius and spectral norms, of an SRHT-based low-rank matrix approximation technique introduced by Woolfe, Liberty, Rokhlin, and Tygert. We establish a slightly better Frobenius norm error bound than is currently available, and a much sharper spectral norm error bound (in the presence of reasonable decay of the singular values). Along the way, we produce several results on matrix operations with SRHTs (such as approximate matrix multiplication) that may be of independent interest. Our approach builds upon Tropp's in "Improved Analysis of the Subsampled Randomized Hadamard Transform."
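For reference, a small sketch of the SRHT map itself: sign-flip the rows, apply a normalized Walsh-Hadamard transform, and uniformly subsample rows. The helper names are made up, the row count is assumed to be a power of two (no padding is handled), and the transform below is a simple O(n log n) loop rather than an optimized implementation.

```python
import numpy as np

def fwht(X):
    """Fast Walsh-Hadamard transform along axis 0; X.shape[0] must be a power of two."""
    X = X.copy()
    h, n = 1, X.shape[0]
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = X[i:i + h].copy(), X[i + h:i + 2 * h].copy()
            X[i:i + h], X[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return X

def srht(A, m, seed=0):
    """Subsampled randomized Hadamard transform: sample m rows of
    sqrt(n/m) * H D A, with D a random diagonal sign matrix and H the
    normalized Walsh-Hadamard matrix."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    signs = rng.choice((-1.0, 1.0), size=n)
    HDA = fwht(signs[:, None] * A) / np.sqrt(n)      # orthonormal transform of D @ A
    idx = rng.choice(n, size=m, replace=False)       # uniform row subsampling
    return np.sqrt(n / m) * HDA[idx]

A = np.random.default_rng(3).standard_normal((1024, 8))
SA = srht(A, m=128)
# Singular values of the sketch approximate those of A.
print(np.linalg.svd(A, compute_uv=False)[:3])
print(np.linalg.svd(SA, compute_uv=False)[:3])
```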
Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling
"... The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number ..."
Abstract
-
Cited by 17 (4 self)
- Add to MetaCart
(Show Context)
The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices.
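A rough sketch of the two ingredients discussed above: residual-based adaptive column sampling and the CUR reconstruction A ≈ C U R with U = C⁺ A R⁺. This is only an illustration under ad hoc sample sizes; it is not the paper's algorithm and does not reproduce its relative-error guarantees or its RAM-friendly implementation.

```python
import numpy as np

def adaptive_columns(A, c1, c2, rng):
    """Pick c1 columns uniformly, then c2 more with probability proportional
    to the squared column norms of the residual A - C C^+ A."""
    n = A.shape[1]
    idx = rng.choice(n, size=c1, replace=False)
    C = A[:, idx]
    resid = A - C @ (np.linalg.pinv(C) @ A)
    probs = (resid ** 2).sum(axis=0)
    probs[idx] = 0.0                       # do not re-pick already chosen columns
    probs /= probs.sum()
    extra = rng.choice(n, size=c2, replace=False, p=probs)
    return np.concatenate([idx, extra])

def cur(A, col_idx, row_idx):
    """CUR reconstruction A ~ C @ U @ R with U = pinv(C) @ A @ pinv(R)."""
    C, R = A[:, col_idx], A[row_idx, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(4)
A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 200))
A += 0.01 * rng.standard_normal(A.shape)          # approximately rank-40 matrix
cols = adaptive_columns(A, 30, 30, rng)
rows = adaptive_columns(A.T, 30, 30, rng)         # sample rows by working on A.T
C, U, R = cur(A, cols, rows)
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))   # small relative error
```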
Sharp analysis of low-rank kernel matrix approximations
- JMLR: Workshop and Conference Proceedings, Vol. 30 (2013), 1–25
, 2013
"... We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of c ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of computing the kernel matrix, which most frequently leads to algorithms with running time at least quadratic in the number of observations n, i.e., O(n^2). Low-rank approximations of the kernel matrix are often considered as they allow the reduction of running time complexities to O(p^2 n), where p is the rank of the approximation. The practicality of such methods thus depends on the required rank p. In this paper, we show that in the context of kernel ridge regression, for approximations based on a random subset of columns of the original kernel matrix, the rank p may be chosen to be linear in the degrees of freedom associated with the problem, a quantity which is classically used in the statistical analysis of such methods, and is often seen as the implicit number of parameters of non-parametric estimators. This result enables simple algorithms that have sub-quadratic running time complexity, but provably exhibit the same predictive performance as existing algorithms, for any given problem instance, and not only for worst-case situations.
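The algorithm the analysis applies to can be sketched as follows: restrict kernel ridge regression to p uniformly sampled "anchor" columns of the kernel matrix and solve a p-dimensional system, which costs O(p^2 n). Kernel choice, parameter values, and the tiny jitter added for numerical stability are illustrative assumptions; the paper's contribution, the bound relating p to the degrees of freedom, is not computed by this code.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))

def nystrom_krr_fit(X, y, p, lam, gamma=0.5, seed=0):
    """Kernel ridge regression restricted to p random columns of the kernel
    matrix: solve (K_In K_nI + n*lam*K_II) beta = K_In y in O(p^2 n) time."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    I = rng.choice(n, size=p, replace=False)
    K_nI = rbf_kernel(X, X[I], gamma)                         # n x p
    K_II = K_nI[I]                                            # p x p
    lhs = K_nI.T @ K_nI + n * lam * K_II + 1e-10 * np.eye(p)  # jitter for stability
    beta = np.linalg.solve(lhs, K_nI.T @ y)
    return X[I], beta

def nystrom_krr_predict(X_test, anchors, beta, gamma=0.5):
    return rbf_kernel(X_test, anchors, gamma) @ beta

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(2_000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2_000)
anchors, beta = nystrom_krr_fit(X, y, p=100, lam=1e-4)
y_hat = nystrom_krr_predict(X, anchors, beta)
print(np.mean((y_hat - np.sin(X[:, 0])) ** 2))                # small prediction error
```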
The Fast Cauchy Transform and Faster Robust Linear Regression
"... We provide fast algorithms for overconstrained ℓp regression and related problems: for an n × d input matrix A and vector b ∈ Rn, in O(nd log n) time we reduce the problem minx∈Rd ‖Ax − b‖p to the same problem with input matrix à of dimension s×d and corresponding ˜b of dimension s × 1. Here, à and ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
We provide fast algorithms for overconstrained ℓ_p regression and related problems: for an n × d input matrix A and vector b ∈ R^n, in O(nd log n) time we reduce the problem min_{x∈R^d} ‖Ax − b‖_p to the same problem with input matrix Ã of dimension s × d and corresponding vector b̃ of dimension s × 1. Here, Ã and b̃ are a coreset for the problem, consisting of sampled and rescaled rows of A and b; and s is independent of n and polynomial in d. Our results improve on the best previous algorithms when n ≫ d, for all p ∈ [1, ∞) except p = 2; in particular, they improve the O(nd^{1.376+}) running time of Sohler and Woodruff (STOC, 2011) for p = 1, which uses asymptotically fast matrix multiplication, and the ...
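The building block behind Cauchy-based ℓ_1 sketches is the 1-stability of the Cauchy distribution: every coordinate of Π(Ax) is Cauchy-distributed with scale ‖Ax‖_1, so ℓ_1 norms in the subspace can be read off a small sketch. The snippet below uses a dense Cauchy Π purely to illustrate this property; it costs O(mnd) to apply and is not the paper's Fast Cauchy Transform, whose point is to make such a projection applicable in O(nd log n) time and to use it to build the sampled-and-rescaled coreset.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, m = 10_000, 10, 400

A = rng.standard_normal((n, d))
x = rng.standard_normal(d)

# By 1-stability, each entry of Pi @ (A @ x) is Cauchy with scale ||A x||_1,
# so the median of the absolute values of the sketch estimates the l1 norm.
Pi = rng.standard_cauchy((m, n))
estimate = np.median(np.abs(Pi @ (A @ x)))
print(estimate, np.abs(A @ x).sum())   # the two values should be close
```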
A statistical perspective on algorithmic leveraging
, 2013
"... One popular method for dealing with large-scale data sets is sampling. Using the empirical statis-tical leverage scores as an importance sampling distribution, the method of algorithmic leverag-ing samples and rescales data matrices to reduce the data size before performing computations on the subpr ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
(Show Context)
One popular method for dealing with large-scale data sets is sampling. Using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales data matrices to reduce the data size before performing computations on the subproblem. Existing work has focused on algorithmic issues, but none of it addresses statistical aspects of this method. Here, we provide an effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model. In particular, for several versions of leverage-based sampling, we derive results for the bias and variance. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other. This result is particularly striking, given the well-known result that, from the algorithmic perspective of worst-case analysis, leverage-based sampling provides uniformly superior worst-case algorithmic results when compared with uniform sampling. Based on these theoretical results, we propose and analyze two new leveraging algorithms: one constructs a smaller least-squares problem with "shrinked" leverage scores (SLEV), and the other solves a smaller and unweighted (or biased) least-squares problem (LEVUNW). The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance.
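A small sketch of the sampling schemes compared above, i.e. uniform (UNIF), leverage-based (LEV), and "shrinked" leverage (SLEV, a convex mix of leverage and uniform probabilities): sample and rescale rows, then solve the smaller least-squares problem. Leverage scores are computed exactly via a QR factorization here, the 0.9/0.1 mixing weight and sample size are arbitrary choices, and the paper's bias/variance analysis is not reproduced.

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores: squared row norms of an orthonormal basis of col(A)."""
    Q, _ = np.linalg.qr(A)
    return (Q ** 2).sum(axis=1)

def subsampled_lstsq(A, b, r, probs, rng):
    """Sample r rows with probabilities `probs`, rescale by 1/sqrt(r * p_i),
    and solve the resulting smaller least-squares problem."""
    idx = rng.choice(A.shape[0], size=r, replace=True, p=probs)
    w = 1.0 / np.sqrt(r * probs[idx])
    x, *_ = np.linalg.lstsq(w[:, None] * A[idx], w * b[idx], rcond=None)
    return x

rng = np.random.default_rng(7)
n, d, r = 50_000, 10, 1_000
A = rng.standard_t(df=2, size=(n, d))            # heavy tails -> nonuniform leverage
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

lev = leverage_scores(A)
p_lev = lev / lev.sum()                          # LEV: leverage-score sampling
p_slev = 0.9 * p_lev + 0.1 / n                   # SLEV: shrunk toward uniform
p_unif = np.full(n, 1.0 / n)                     # UNIF: uniform sampling

x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
for name, p in (("LEV", p_lev), ("SLEV", p_slev), ("UNIF", p_unif)):
    x = subsampled_lstsq(A, b, r, p, rng)
    print(name, np.linalg.norm(x - x_full))      # distance to the full-data solution
```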
Iterative row sampling
- In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS)
, 2013
"... There has been significant interest and progress recently in algorithms that solve regression problems involving tall and thin matrices in input sparsity time. These algorithms find shorter equivalent of a n × d matrix where n d, which allows one to solve a poly(d) sized problem instead. In practic ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
(Show Context)
There has been significant interest and progress recently in algorithms that solve regression problems involving tall and thin matrices in input sparsity time. These algorithms find a shorter equivalent of an n × d matrix where n ≫ d, which allows one to solve a poly(d)-sized problem instead. In practice, the best performances are often obtained by invoking these routines in an iterative fashion. We show these iterative methods can be adapted to give theoretical guarantees comparable to and better than the current state of the art. Our approaches are based on computing the importances of the rows, known as leverage scores, in an iterative manner. We show that alternating between computing a short matrix estimate and finding more accurate approximate leverage scores leads to a series of geometrically smaller instances. This gives an algorithm that runs in O(nnz(A) + d^{ω+θ}ε^{−2}) time for any θ > 0, where the d^{ω+θ} term is comparable to the cost of solving a regression problem on the small approximation. Our results are built upon the close connection between randomized matrix algorithms, iterative methods, and graph sparsification.
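One round of the basic primitive referred to above, i.e. approximating leverage scores without forming an orthonormal basis of A: sketch A, take R from a QR of the sketch so that A R^{-1} is nearly orthonormal, and estimate its row norms with a small Gaussian projection. Sketch sizes and the CountSketch/Gaussian choices are illustrative assumptions; the paper's contribution, iterating this on geometrically smaller instances with the ω-dependent running time, is not shown.

```python
import numpy as np

def approx_leverage_scores(A, k_jl=50, seed=0):
    """Approximate leverage scores of a tall n x d matrix A:
    (1) sketch A with a CountSketch-style Pi and take R from a QR of Pi A,
        so that A @ inv(R) is close to orthonormal;
    (2) estimate the row norms of A @ inv(R) with a small Gaussian JL matrix."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = 4 * d * d
    rows = rng.integers(0, m, size=n)
    signs = rng.choice((-1.0, 1.0), size=n)
    PiA = np.zeros((m, d))
    np.add.at(PiA, rows, signs[:, None] * A)      # Pi @ A in O(nnz(A)) time
    R = np.linalg.qr(PiA, mode="r")               # d x d upper-triangular factor
    G = rng.standard_normal((d, k_jl)) / np.sqrt(k_jl)
    Z = A @ np.linalg.solve(R, G)                 # (A R^-1) G, an n x k_jl matrix
    return (Z ** 2).sum(axis=1)

rng = np.random.default_rng(8)
A = rng.standard_t(df=2, size=(20_000, 15))
Q, _ = np.linalg.qr(A)
exact = (Q ** 2).sum(axis=1)
approx = approx_leverage_scores(A)
print(np.corrcoef(exact, approx)[0, 1])           # correlation close to 1
```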