## Online learning for matrix factorization and sparse coding

Citations: 124 (22 self)

### BibTeX

@MISC{Mairal_onlinelearning,

author = {Julien Mairal and Francis Bach and Jean Ponce and Guillermo Sapiro},

title = {Online learning for matrix factorization and sparse coding},

year = {}

}

### Abstract

Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set, adapting it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large datasets.
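The method sketched in the abstract alternates two steps per sample: sparse-code the sample on the current dictionary, fold the result into running sufficient statistics, and refresh the dictionary by block-coordinate descent. Below is a minimal NumPy sketch of that scheme; it is not the authors' implementation, the simple ISTA sub-solver stands in for an exact Lasso solver, and all function names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam, n_iter=100):
    """Approximately solve min_a 0.5||x - D a||_2^2 + lam||a||_1 by ISTA
    (a stand-in for the exact Lasso solver used in the paper)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - x) / L, lam / L)
    return a

def dictionary_update(D, A, B):
    """Block-coordinate descent on the columns of D, each projected
    onto the unit Euclidean ball."""
    for j in range(D.shape[1]):
        if A[j, j] > 1e-12:                # skip atoms that were never used
            u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D

def online_dictionary_learning(X, k, lam, seed=0):
    """Process the columns of X one at a time, maintaining the running
    sufficient statistics A (sum of a a^T) and B (sum of x a^T)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    D = rng.standard_normal((m, k))
    D /= np.linalg.norm(D, axis=0)         # unit-norm initial atoms
    A = np.zeros((k, k))
    B = np.zeros((m, k))
    for x in X.T:
        a = sparse_code(x, D, lam)
        A += np.outer(a, a)
        B += np.outer(x, a)
        D = dictionary_update(D, A, B)
    return D
```

The matrices A and B summarize all past samples, so memory use does not grow with the number of samples seen; this is what makes the approach online.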

### Citations

2120 | Matrix computations - Golub, Van Loan - 1996 |

2105 | Regression Shrinkage and Selection via the Lasso
- Tibshirani
- 1996
Citation Context ...the ℓ1-sparse coding problem: ℓ(x, D) ≜ min_{α∈R^k} (1/2)||x − Dα||₂² + λ||α||₁, (2) where λ is a regularization parameter. This problem is also known as basis pursuit (Chen et al., 1999), or the Lasso (Tibshirani, 1996). It is well known that the ℓ1 regularization yields a sparse solution for α, but there is no direct analytic link between the value of λ and the corresponding effective sparsity ||α||₀. To prevent... |

833 | Least Angle Regression - Efron, Hastie, et al. - 2004 |

831 | Algorithms for non-negative matrix factorization
- Lee, Seung
- 2001
Citation Context ...ferent matrix factorization problems are formulated in order to obtain a few interpretable basis elements from a set of data vectors. This includes non-negative matrix factorization and its variants (Lee and Seung, 2001; Hoyer, 2002, 2004), and sparse principal component analysis (Zou et al., 2006; d’Aspremont et al., 2007, 2008; Witten et al., 2009; Zass and Shashua, 2007). As shown in this paper, these problems ha... |

811 | Nonlinear Programming. Athena Scientific
- Bertsekas
- 1999
Citation Context ...s not require an arbitrary stopping criterion. 3.3 Dictionary Update Our algorithm for updating the dictionary uses block-coordinate descent with warm restarts (Bertsekas, 1999). One of its main advantages is that it is parameter free and does not require any learning rate tuning. Moreover, the procedure does not require to store all the vectors xi and αi, but only the matr... |

811 | A view of the EM algorithm that justifies incremental, sparse, and other variants, ser - Neal, Hinton - 1998 |

663 | Sparse coding with an overcomplete basis set: A strategy employed by V1
- Olshausen, Field
- 1997
Citation Context ...ments may sometimes “look like” wavelets (or Gabor filters), they are tuned to the input images or signals, leading to much better results in practice. Most recent algorithms for dictionary learning (Olshausen and Field, 1997; Engan et al., 1999; Lewicki and Sejnowski, 2000; Aharon et al., 2006; Lee et al., 2007) are iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost... |

571 | Model selection and estimation in regression with grouped variables - Yuan, Lin - 2006 |

486 | From few to many: Illumination cone models for face recognition under variable lighting and pose - Georghiades, Belhumeur, et al. |

449 | K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation
- Aharon, Elad, et al.
- 2006
Citation Context ...to the input images or signals, leading to much better results in practice. Most recent algorithms for dictionary learning (Olshausen and Field, 1997; Engan et al., 1999; Lewicki and Sejnowski, 2000; Aharon et al., 2006; Lee et al., 2007) are iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints, and cannot efficiently deal with ver... |

426 | Regularization and variable selection via the elastic net - Zou, Hastie - 2005 |

365 | Matrix Differential Calculus, with applications in statistics and Econometrics - Magnus, Neudecker - 1988 |

358 | Relations between two sets of variates
- Hotelling
- 1936
Citation Context ...on measurements and a matrix Y in R^{n×q} of CGH measurements. In order to analyze the correlation between these two sets of data, recent works have suggested the use of canonical correlation analysis (Hotelling, 1936), which solves min_{u∈R^p, v∈R^q} cov(Xu, Yv) s.t. ||Xu||₂ ≤ 1 and ||Yv||₂ ≤ 1. (57) When X and Y are centered and normalized, it has been further shown that with ... |
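For centered data, the extremal canonical pair in a criterion like Eq. (57) can be computed from the SVD of the whitened cross-covariance. A minimal NumPy sketch, assuming small dimensions and dense matrices; `cca_first_pair` is an illustrative helper, not code from the paper.

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """First canonical pair for centered data matrices X (n x p) and Y (n x q),
    found via the leading singular vectors of the whitened cross-covariance.
    The outputs are rescaled so that ||X u||_2 = ||Y v||_2 = 1."""
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])   # small ridge for stability
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is SPD here)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    u = Wx @ U[:, 0]
    v = Wy @ Vt[0]
    u /= np.linalg.norm(X @ u)                     # enforce the norm constraints
    v /= np.linalg.norm(Y @ v)
    return u, v, s[0]                              # s[0] is the canonical correlation
```

The whitening step reduces CCA to an ordinary SVD; for large p or q one would use iterative solvers instead of dense eigendecompositions.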

330 | Non-negative matrix factorization with sparseness constraints - Hoyer - 2004 |

324 | Perturbation Analysis of Optimization Problems
- Bonnans, Shapiro
- 2000
Citation Context ...n in gradient descent convergence proofs (Bertsekas, 1999). Lemma 1 [Asymptotic variations of Dt]. Assume (A)–(C). Then, D_{t+1} − D_t = O(1/t) a.s. (20) Proof. This proof is inspired by Prop 4.32 of (Bonnans and Shapiro, 2000) on the Lipschitz regularity of solutions of optimization problems. Using assumption (B), for all t, the surrogate f̂t is strictly convex with a Hessian lower-bounded by κ1. Then, a short calculatio... |

321 | Image denoising via sparse and redundant representations over learned dictionaries
- Elad, Aharon
- 2006
Citation Context ...dictionary instead of a predefined one—based on wavelets (Mallat, 1999) for example—has recently led to state-of-the-art results in numerous low-level signal processing tasks such as image denoising (Elad and Aharon, 2006; Mairal et al., 2008b), texture synthesis (Peyré, 2009) and audio processing (Grosse et al., 2007; Zibulevsky and Pearlmutter, 2001), as well as higher-level tasks such as image classification (Raina... |

290 | Learning overcomplete representations - Lewicki, Sejnowski - 2000 |

236 | Algorithms for simultaneous sparse approximation
- Tropp
- 2006
Citation Context ...e literature under various names such as group sparsity or grouped variable selection (Cotter et al., 2005; Turlach et al., 2005; Yuan and Lin, 2006; Obozinski et al., 2009, 2008; Zhang et al., 2008; Tropp et al., 2006; Tropp, 2006). Let X = [x1,...,xq] ∈ R^{m×q} be a set of signals. Suppose one wants to obtain sparse decompositions of the signals on the dictionary D that share the same active set (non-zero coefficie... |
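The shared active set described here is typically induced by an ℓ1,2 penalty on the rows of the coefficient matrix; its proximal operator shrinks each row's ℓ2 norm, zeroing whole rows at once. A minimal sketch of that operator (an illustrative helper, not the paper's solver):

```python
import numpy as np

def prox_l12(A, t):
    """Proximal operator of t * sum_i ||A[i, :]||_2 (group soft-thresholding).
    Each row is shrunk toward zero and vanishes entirely when its l2 norm
    is <= t, so all signals share the same active set of dictionary elements."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return A * scale
```

Applying this operator inside a proximal-gradient loop yields a simultaneous sparse coding solver: rows of the coefficient matrix (one per dictionary atom, columns indexing signals) are selected or discarded jointly.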

230 | Stochastic Approximation and Recursive Algorithms and Applications, volume 35 of Applications of Mathematics
- Kushner, Yin
- 2003
Citation Context ... of these patches (roughly one per pixel and per frame). In this setting, online techniques based on stochastic approximations are an attractive alternative to batch methods (see, e.g., Bottou, 1998; Kushner and Yin, 2003; Shalev-Shwartz et al., 2009). For example, first-order stochastic gradient descent with projections on the constraint set (Kushner and Yin, 2003) is sometimes used for dictionary learning (see Aharo... |

223 | Efficient sparse coding algorithms
- Lee, Battle, et al.
- 2007
Citation Context ...r signals, leading to much better results in practice. Most recent algorithms for dictionary learning (Olshausen and Field, 1997; Engan et al., 1999; Lewicki and Sejnowski, 2000; Aharon et al., 2006; Lee et al., 2007) are iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints, and cannot efficiently deal with very large training s... |

223 | Matrix factorization techniques for recommender systems - Koren, Bell, et al. - 2009 |

215 | Gradient methods for minimizing composite objective function. Core discussion paper - Nesterov - 2007 |

211 | Blind source separation by sparse decomposition in a signal dictionary - Zibulevsky, Pearlmutter - 2001 |

209 | Self-taught learning: Transfer learning from unlabeled data
- Raina, Battle, et al.
Citation Context ... 2006; Mairal et al., 2008b), texture synthesis (Peyré, 2009) and audio processing (Grosse et al., 2007; Zibulevsky and Pearlmutter, 2001), as well as higher-level tasks such as image classification (Raina et al., 2007; Mairal et al., 2008a, 2009b; Bradley and Bagnell, 2009), showing that sparse learned models are well adapted to natural signals. Un... |

197 | Linear spatial pyramid matching using sparse coding for image classification
- Yang, Yu, et al.
Citation Context ...al., 2007; Févotte et al., 2009; Zibulevsky and Pearlmutter, 2001), as well as higher-level tasks such as image classification (Raina et al., 2007; Mairal et al., 2008a, 2009b; Bradley and Bagnell, 2009; Yang et al., 2009), showing that sparse learned models are well adapted to natural signals. Unlike decompositions based on principal component analysis and its variants, these... |

190 | Simultaneous analysis of lasso and Dantzig selector - Bickel, Ritov, et al. - 2009 |

181 | Pathwise coordinate optimization - Friedman, Hastie, et al. |

178 | A direct formulation for sparse PCA using semidefinite programming - d’Aspremont, El Ghaoui, Jordan, et al. - 2007 |

175 | A new approach to variable selection in least squares problems - Osborne, Presnell, et al. - 2000 |

173 | Asymptotic Statistics - van der Vaart - 2000 |

163 | Consistency of the group lasso and multiple kernel learning - Bach |

156 | Projected gradient methods for Nonnegative Matrix Factorization - Lin - 2007 |

154 | Sparsity and smoothness via the fused Lasso
- Tibshirani, Saunders, et al.
- 2005
Citation Context ...gh. The combination of ℓ1 and ℓ2 constraints has also been proposed recently for the problem of matrix factorization by Witten et al. (2009), but in a slightly different setting. • The “fused lasso” (Tibshirani et al., 2005) constraints. When one is looking for a dictionary whose columns are sparse and piecewise-constant, a fused lasso regularization can be used. For a vector u in R^m, we consider the ℓ1-norm of the co... |

154 | Sparse solutions to linear inverse problems with multiple measrement vectors
- Rao, Kreutz-Delgado
- 1998
Citation Context ...from each other), and one may want to address the problem of simultaneous sparse coding, which appears also in the literature under various names such as group sparsity or grouped variable selection (Cotter et al., 2005; Turlach et al., 2005; Yuan and Lin, 2006; Obozinski et al., 2008; Zhang et al., 2008; Tropp et al., 2006; Tropp, 2006). Let X = [x1,...,xn] ∈ R^{m×n} be a set of signals. Suppose one wants to obtain ... |

152 | Acquiring linear subspaces for face recognition under variable lighting
- Lee, Ho, et al.
- 2005
Citation Context ... pixels from the MIT-CBCL Face Database #1 (Sung, 1996). • Dataset E is composed of n = 2,414 face images of size m = 192 × 168 pixels from the Extended Yale B Database (Georghiades et al., 2001; Lee et al., 2005). • Dataset F is composed of n = 100,000 natural image patches of size m = 16 × 16 pixels from the Pascal VOC’06 image database (Everingham et al., 2006). We have used the Matlab implementations of N... |

149 | Convex Analysis and Nonlinear Optimization: Theory and Examples
- Borwein, Lewis
- 2000
Citation Context ...2) Since this inequality is true for all U, ∇f̂∞(D∞) = ∇f(D∞). A first-order necessary optimality condition for D∞ being an optimum of f̂∞ is that −∇f̂∞ is in the normal cone of the set C at D∞ (Borwein and Lewis, 2006). Therefore, this first-order necessary condition is verified for f at D∞ as well. Since At, Bt are asymptotically close to their accumulation points, −∇f(Dt) is asymptotically close to the normal cone ... |

149 | The tradeoffs of large scale learning
- Bottou, Bousquet
- 2007
Citation Context ... iterative batch procedures, accessing the whole training set at each iteration in order to minimize a cost function under some constraints, and cannot efficiently deal with very large training sets (Bottou and Bousquet, 2008), or dynamic training data changing over time, such as video sequences. To address these issues, we propose an online approach that processes the signals, one at a time, or in mini-batches. This is p... |

145 | Sparse Principal Component Analysis - Zou, Hastie, et al. - 2006 |

121 | Sparse representation for color image restoration
- Mairal, Elad, et al.
- 2008
Citation Context ...a predefined one—based on wavelets (Mallat, 1999) for example—has recently led to state-of-the-art results in numerous low-level signal processing tasks such as image denoising (Elad and Aharon, 2006; Mairal et al., 2008b), texture synthesis (Peyré, 2009) and audio processing (Grosse et al., 2007; Zibulevsky and Pearlmutter, 2001), as well as higher-level tasks such as image classification (Raina et al., 2007; Mairal... |

120 | Nonnegative sparse coding
- Hoyer
- 2002
Citation Context ...zation problems are formulated in order to obtain a few interpretable basis elements from a set of data vectors. This includes non-negative matrix factorization and its variants (Lee and Seung, 2001; Hoyer, 2002, 2004), and sparse principal component analysis (Zou et al., 2006; d’Aspremont et al., 2007, 2008; Witten et al., 2009; Zass and Shashua, 2007). As shown in this paper, these problems have strong sim... |

119 | Penalized regression: the bridge versus the lasso
- Fu
- 1998
Citation Context ... Eq. (2) with fixed dictionary is an ℓ1-regularized linear least-squares problem. A number of recent methods for solving this type of problem are based on coordinate descent with soft thresholding (Fu, 1998; Friedman et al., 2007; Wu and Lange, 2008). When the columns of the dictionary have low correlation, we have observed that these simple methods are very efficient. However, the columns of learned di... |
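The coordinate-descent scheme with soft thresholding mentioned in this excerpt can be sketched as follows; this is a generic textbook implementation, not the paper's code, and `lasso_cd` is an illustrative name.

```python
import numpy as np

def soft(x, t):
    """Scalar/elementwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(x, D, lam, n_sweeps=200):
    """Coordinate descent with soft thresholding (Fu, 1998; Friedman et al., 2007)
    for min_a 0.5 ||x - D a||_2^2 + lam ||a||_1."""
    k = D.shape[1]
    a = np.zeros(k)
    r = x.astype(float).copy()            # residual x - D a, kept up to date
    col_sq = np.sum(D * D, axis=0)        # squared column norms
    for _ in range(n_sweeps):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue
            rho = D[:, j] @ r + col_sq[j] * a[j]   # partial correlation for coord j
            new = soft(rho, lam) / col_sq[j]
            r += D[:, j] * (a[j] - new)            # incremental residual update
            a[j] = new
    return a
```

For an orthonormal dictionary each coordinate update is exact, so a single sweep recovers the closed-form solution, namely soft-thresholding of the least-squares coefficients; with correlated columns, convergence takes multiple sweeps, which is the regime the excerpt warns about.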

118 | Group Lasso with overlaps and graph Lasso
- Jacob, Obozinski, et al.
- 2009
Citation Context ...onvergence results in these cases. Note also that convex smooth approximation of sparse regularizers (Bradley and Bagnell, 2009), or structured sparsity-inducing regularizers (Jenatton et al., 2009a; Jacob et al., 2009) could be used as well even though we have not tested them. 5.2 Using Different Constraint Sets for D In the previous subsection, we have claimed that our algorithm could be used with different regul... |

117 | Non-local sparse models for image restoration - Mairal, Bach, et al. - 2010 |

108 | Online dictionary learning for sparse coding - Mairal, Bach, et al. - 2009 |

106 | Supervised dictionary learning - Mairal, Ponce, et al. - 2009 |

101 | Structured Variable Selection with Sparsity-Inducing Norms - JENATTON, AUDIBERT, et al. |

95 | Discriminative learned dictionaries for local image analysis - Mairal, Bach, et al. |

91 | A Modified Principal Component Technique Based on the LASSO - Jolliffe, Trendafilov, et al. - 2003 |

90 | A wavelet tour of signal processing, second edition
- Mallat
- 1999
Citation Context ...ochastic optimization, non-negative matrix factorization. 1. Introduction The linear decomposition of a signal using a few atoms of a learned dictionary instead of a predefined one—based on wavelets (Mallat, 1999) for example—has recently led to state-of-the-art results in numerous low-level signal processing tasks such as image denoising (Elad and Aharon, 2006; Mairal et al., 2008b), texture synthesis (Peyré,... |

87 | Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing - Obozinski, Taskar, et al. - 2009 |