Results 1  10
of
95
Random features for largescale kernel machines
 In Neural Infomration Processing Systems
, 2007
"... To accelerate the training of kernel machines, we propose to map the input data to a randomized lowdimensional feature space and then apply existing fast linear methods. Our randomized features are designed so that the inner products of the transformed data are approximately equal to those in the f ..."
Abstract

Cited by 115 (3 self)
 Add to MetaCart
To accelerate the training of kernel machines, we propose to map the input data to a randomized lowdimensional feature space and then apply existing fast linear methods. Our randomized features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shiftinvariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in largescale classification and regression tasks linear machine learning algorithms that use these features outperform stateoftheart largescale kernel machines. 1
Protovalue functions: A laplacian framework for learning representation and control in markov decision processes
 Journal of Machine Learning Research
, 2006
"... This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by d ..."
Abstract

Cited by 66 (10 self)
 Add to MetaCart
This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called protovalue functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A threephased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using leastsquares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for outofsample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end.
FINDING STRUCTURE WITH RANDOMNESS: PROBABILISTIC ALGORITHMS FOR CONSTRUCTING APPROXIMATE MATRIX DECOMPOSITIONS
"... Abstract. Lowrank matrix approximations, such as the truncated singular value decomposition and the rankrevealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful t ..."
Abstract

Cited by 40 (0 self)
 Add to MetaCart
Abstract. Lowrank matrix approximations, such as the truncated singular value decomposition and the rankrevealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing lowrank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed—either explicitly or implicitly—to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired lowrank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition
RELATIVEERROR CUR MATRIX DECOMPOSITIONS
 SIAM J. MATRIX ANAL. APPL
, 2008
"... Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of “components.” Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the ..."
Abstract

Cited by 39 (9 self)
 Add to MetaCart
Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of “components.” Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the input data. In this paper, we propose and study matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the data matrix, and thereby more amenable to interpretation in terms of the original data. Our main algorithmic results are two randomized algorithms which take as input an m × n matrix A and a rank parameter k. In our first algorithm, C is chosen, and we let A ′ = CC + A, where C + is the Moore–Penrose generalized inverse of C. In our second algorithm C, U, R are chosen, and we let A ′ = CUR. (C and R are matrices that consist of actual columns and rows, respectively, of A, and U is a generalized inverse of their intersection.) For each algorithm, we show that with probability at least 1 − δ, ‖A − A ′ ‖F ≤ (1 + ɛ) ‖A − Ak‖F, where Ak is the “best ” rankk approximation provided by truncating the SVD of A, and where ‖X‖F is the Frobenius norm of the matrix X. The number of columns of C and rows of R is a lowdegree polynomial in k, 1/ɛ, and log(1/δ). Both the Numerical Linear Algebra community and the Theoretical Computer Science community have studied variants
Fast Approximate Spectral Clustering
, 2009
"... Spectral clustering refers to a flexible class of clustering procedures that can produce highquality clusterings on small data sets but which has limited applicability to largescale problems due to its computational complexity of O(n 3), with n the number of data points. We extend the range of spe ..."
Abstract

Cited by 36 (1 self)
 Add to MetaCart
Spectral clustering refers to a flexible class of clustering procedures that can produce highquality clusterings on small data sets but which has limited applicability to largescale problems due to its computational complexity of O(n 3), with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortionminimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the misclustering rate. We develop two concrete instances of our general framework, one based on local kmeans clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform kmeans by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes. 1
LargeScale Manifold Learning
"... This paper examines the problem of extracting lowdimensional manifold structure given millions of highdimensional face images. Specifically, we address the computational challenges of nonlinear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nod ..."
Abstract

Cited by 28 (7 self)
 Add to MetaCart
This paper examines the problem of extracting lowdimensional manifold structure given millions of highdimensional face images. Specifically, we address the computational challenges of nonlinear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. Since most manifold learning techniques rely on spectral decomposition, we first analyze two approximate spectral decomposition techniques for large dense matrices (Nyström and Columnsampling), providing the first direct theoretical and empirical comparison between these techniques. We next show extensive experiments on learning lowdimensional embeddings for two large face datasets: CMUPIE (35 thousand faces) and a web dataset (18 million faces). Our comparisons show that the Nyström approximation is superior to the Columnsampling method. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMUPIE dataset. 1.
TENSORCUR DECOMPOSITIONS FOR TENSORBASED DATA
 SIAM J. MATRIX ANAL. APPL.
, 2008
"... Motivated by numerous applications in which the data may be modeled by a variable subscripted by three or more indices, we develop a tensorbased extension of the matrix CUR decomposition. The tensorCUR decomposition is most relevant as a data analysis tool when the data consist of one mode that i ..."
Abstract

Cited by 21 (6 self)
 Add to MetaCart
Motivated by numerous applications in which the data may be modeled by a variable subscripted by three or more indices, we develop a tensorbased extension of the matrix CUR decomposition. The tensorCUR decomposition is most relevant as a data analysis tool when the data consist of one mode that is qualitatively different from the others. In this case, the tensorCUR decomposition approximately expresses the original data tensor in terms of a basis consisting of underlying subtensors that are actual data elements and thus that have a natural interpretation in terms of the processes generating the data. Assume the data may be modeled as a (2+1)tensor, i.e., an m×n×p tensor A in which the first two modes are similar and the third is qualitatively different. We refer to each of the p different m × n matrices as “slabs ” and each of the mn different pvectors as “fibers.” In this case, the tensorCUR algorithm computes an approximation to the data tensor A that is of the form CUR, where C is an m×n×c tensor consisting of a small number c of the slabs, R is an r × p matrix consisting of a small number r of the fibers, and U is an appropriately defined and easily computed c × r encoding matrix. Both C and R may be chosen by randomly sampling either slabs or fibers according to a judiciously chosen and datadependent probability distribution, and both c and r depend on a rank parameter k, an error parameter ɛ, and a failure probability δ. Under
Modeling Transfer Relationships Between Learning Tasks for Improved Inductive Transfer
"... Abstract. In this paper, we propose a novel graphbased method for knowledge transfer. We model the transfer relationships between source tasks by embedding the set of learned source models in a graph using transferability as the metric. Transfer to a new problem proceeds by mapping the problem into ..."
Abstract

Cited by 20 (8 self)
 Add to MetaCart
Abstract. In this paper, we propose a novel graphbased method for knowledge transfer. We model the transfer relationships between source tasks by embedding the set of learned source models in a graph using transferability as the metric. Transfer to a new problem proceeds by mapping the problem into the graph, then learning a function on this graph that automatically determines the parameters to transfer to the new learning task. This method is analogous to inductive transfer along a manifold that captures the transfer relationships between the tasks. We demonstrate improved transfer performance using this method against existing approaches in several realworld domains. 1
Subspace sampling and relativeerror matrix approximation: Columnbased methods
 In Proc. of the 10th RANDOM
, 2006
"... Abstract. Given an m×n matrix A and an integer k less than the rank of A, the “best ” rank k approximation to A that minimizes the error with respect to the Frobenius norm is Ak, which is obtained by projecting A on the top k left singular vectors of A. While Ak is routinely used in data analysis, i ..."
Abstract

Cited by 18 (5 self)
 Add to MetaCart
Abstract. Given an m×n matrix A and an integer k less than the rank of A, the “best ” rank k approximation to A that minimizes the error with respect to the Frobenius norm is Ak, which is obtained by projecting A on the top k left singular vectors of A. While Ak is routinely used in data analysis, it is difficult to interpret and understand it in terms of the original data, namely the columns and rows of A. For example, these columns and rows often come from some application domain, whereas the singular vectors are linear combinations of (up to all) the columns or rows of A. We address the problem of obtaining lowrank approximations that are directly interpretable in terms of the original columns or rows of A. Our main results are two polynomial time randomized algorithms that take as input a matrix A and return as output a matrix C, consisting of a “small ” (i.e., a lowdegree polynomial in k,1/ɛ, andlog(1/δ)) number of actual columns of A such that � A − CC + A � �F ≤ (1 + ɛ) �A − Ak � F with probability at least 1−δ. Our algorithms are simple, and they take time of the order of the time needed to compute the top k right singular vectors of A. In addition, they sample the columns of A via the method of “subspace sampling, ” sonamed since the sampling probabilities depend on the lengths of the rows of the top singular vectors and since they ensure that we capture entirely a certain subspace of interest.
Conditional random sampling: A sketchbased sampling technique for sparse data
 In NIPS
, 2006
"... We 1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In largescale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stag ..."
Abstract

Cited by 14 (8 self)
 Add to MetaCart
We 1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In largescale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using realworld data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations. 1