## Convex and network flow optimization for structured sparsity

Venue: JMLR

Citations: 16 (6 self)

### BibTeX

@article{Mairal2011convex,

author = {Julien Mairal and Rodolphe Jenatton and Guillaume Obozinski and Francis Bach},

title = {Convex and network flow optimization for structured sparsity},

journal = {Journal of Machine Learning Research},

volume = {12},

year = {2011}

}

### Abstract

We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2- or ℓ∞-norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞-norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches.
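The accelerated proximal gradient scheme the abstract refers to can be sketched generically. The loop below is standard FISTA (Beck and Teboulle, 2009); as a stand-in prox it uses block soft-thresholding for disjoint ℓ2-groups, whereas the paper's actual contribution is the prox for overlapping ℓ∞-groups computed via a quadratic min-cost flow problem, which this sketch does not implement. Function names and the toy problem are illustrative only.

```python
import math

def prox_group_l2(v, groups, t):
    """Prox of t * sum_g ||v_g||_2 for DISJOINT groups: block
    soft-thresholding. (For the overlapping l_inf-groups treated in the
    paper, this step is instead computed by a min-cost flow solver.)"""
    w = list(v)
    for g in groups:
        norm = math.sqrt(sum(v[i] ** 2 for i in g))
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        for i in g:
            w[i] = scale * v[i]
    return w

def fista(grad_f, prox, x0, step, n_iter=100):
    """Generic accelerated proximal gradient loop (FISTA): minimizes
    f(x) + Omega(x) given grad_f and the prox of step * Omega."""
    x, y, t = list(x0), list(x0), 1.0
    for _ in range(n_iter):
        g = grad_f(y)
        x_new = prox([yi - step * gi for yi, gi in zip(y, g)], step)
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [xn + ((t - 1.0) / t_new) * (xn - xo)
             for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x

# Toy problem: f(x) = 0.5 * ||x - b||^2 (gradient x - b, Lipschitz
# constant 1), two disjoint groups, regularization weight lambda = 1.
b = [3.0, 4.0, 0.1, -0.1]
groups = [[0, 1], [2, 3]]
sol = fista(lambda x: [xi - bi for xi, bi in zip(x, b)],
            lambda v, t: prox_group_l2(v, groups, 1.0 * t),
            x0=[0.0] * 4, step=1.0)
# The weak group [2, 3] is zeroed out entirely; the strong one shrinks.
```

The group structure only enters through the prox call, which is exactly why the paper can swap in a network-flow computation without touching the outer loop.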

### Citations

4676 | Matrix Analysis - Horn, Johnson - 1986

Citation Context: ...be respectively a subset of c columns and r rows of the original matrix X. The third matrix U ∈ R^{c×r} is then given by C⁺XR⁺, where A⁺ denotes a Moore-Penrose generalized inverse of the matrix A (Horn and Johnson, 1990). Such a matrix factorization is particularly appealing when the interpretability of the results matters (Mahoney and Drineas, 2009). For instance, when studying gene-expression datasets, it is easie...

3667 | Convex Optimization - Boyd, Vandenberghe - 2004

Citation Context: ...his paper. 2.3 Convex Optimization Methods Proposed in the Literature. Generic approaches to solve Eq. (2) mostly rely on subgradient descent schemes (see Bertsekas, 1999) and interior-point methods (Boyd and Vandenberghe, 2004). These generic tools do not scale well to large problems and/or do not naturally handle sparsity (the solutions they return may have small values but no “true” zeros). These two points prompt the ne...

1832 | Regression shrinkage and selection via the lasso - Tibshirani - 1996

Citation Context: ...f variables are selected to describe the data. Regularization by the ℓ1-norm has emerged as a powerful tool for addressing this variable selection problem, relying on both a well-developed theory (see Tibshirani, 1996; Chen et al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Needell and Tro...

1652 | Atomic decomposition by basis pursuit - Chen, Donoho, et al. - 2001

Citation Context: ...lected to describe the data. Regularization by the ℓ1-norm has emerged as a powerful tool for addressing this variable selection problem, relying on both a well-developed theory (see Tibshirani, 1996; Chen et al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Needell and Tropp, 2009; Combettes...

1493 | Topographic independent component analysis - Hyvarinen, Hoyer, et al. |

928 | Emergence of simple-cell receptive field properties by learning a sparse code for natural images - Olshausen, Field - 1996

Citation Context: ...is the ℓ1-norm, we obtain a classical formulation, which is known to produce dictionary elements that are reminiscent of Gabor-like functions when the columns of Y are whitened natural image patches (Olshausen and Field, 1996). Another line of research tries to put a structure on decomposition coefficients instead of considering them as independent. Jenatton et al. (2010a, 2011) have for instance embedded dictionary eleme...

797 | An experimental comparison of min-cut/max-flow algorithms for energy minimization - Boykov, Kolmogorov - 2004 |

787 | Kernel Methods for Pattern Analysis - Shawe-Taylor, Cristianini - 2004

Citation Context: ...ssumptions: • f is differentiable with Lipschitz-continuous gradient. For machine learning problems, this hypothesis holds when f is, for example, the square, logistic or multi-class logistic loss (see Shawe-Taylor and Cristianini, 2004). • Ω is a sum of ℓ∞-norms. Even though the ℓ2-norm is sometimes used in the literature (Jenatton et al., 2009), and is in fact used later in Section 4, the ℓ∞-norm is piecewise linear, and we take adv...

752 | Least angle regression - Efron, Hastie, et al. |

741 | Nonlinear Programming, Athena Scientific - Bertsekas - 1999

Citation Context: ...therefore relevant, but out of the scope of this paper. 2.3 Convex Optimization Methods Proposed in the Literature. Generic approaches to solve Eq. (2) mostly rely on subgradient descent schemes (see Bertsekas, 1999) and interior-point methods (Boyd and Vandenberghe, 2004). These generic tools do not scale well to large problems and/or do not naturally handle sparsity (the solutions they return may have small v...

679 | Adapting to unknown smoothness via wavelet shrinkage - Donoho, Johnstone - 1995 |

533 | Network Flows - Ahuja, Magnanti, et al. - 1993

Citation Context: ...where V is a set of vertices, E ⊆ V×V a set of arcs, s a source, and t a sink. For all arcs in E, we define a non-negative capacity constant, and as done classically in the network flow literature (Ahuja et al., 1993; Bertsekas, 1998), we define a flow as a non-negative function on arcs that satisfies capacity constraints on all arcs (the value of the flow on an arc is less than or equal to the arc capacity) and ...
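The flow definition quoted in this snippet (non-negativity, capacity constraints, conservation at internal vertices) can be checked mechanically. The sketch below is my own illustration, not code from the paper; the dict-based arc representation is a hypothetical choice.

```python
def is_feasible_flow(flow, capacity, source, sink):
    """Check the textbook flow conditions: 0 <= flow(u,v) <= capacity(u,v)
    on every arc, and conservation (inflow == outflow) at every vertex
    other than the source and the sink. `flow` and `capacity` are dicts
    keyed by arcs (u, v)."""
    if any(f < 0 or f > capacity[arc] for arc, f in flow.items()):
        return False
    balance = {}
    for (u, v), f in flow.items():
        balance[u] = balance.get(u, 0) - f  # f units leave u
        balance[v] = balance.get(v, 0) + f  # f units enter v
    return all(b == 0 for node, b in balance.items()
               if node not in (source, sink))

# Toy graph: two s-t paths carrying 2 and 1 units respectively.
cap = {('s', 'a'): 3, ('a', 't'): 2, ('s', 'b'): 1, ('b', 't'): 2}
flow = {('s', 'a'): 2, ('a', 't'): 2, ('s', 'b'): 1, ('b', 't'): 1}
ok = is_feasible_flow(flow, cap, 's', 't')
```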

509 | A new approach to the maximum flow problem - Goldberg, Tarjan - 1986

Citation Context: ...construct a feasible flow (ξ, ξ), satisfying additional capacity constraints equal to γ_j on arc (j, t), and whose cost matches this lower bound; this latter problem can be cast as a max-flow problem (Goldberg and Tarjan, 1986). If such a flow exists, the algorithm returns ξ = γ, the cost of the flow reaches the lower bound, and is therefore optimal. If such a flow does not exist, we have ξ ≠ γ, the lower bound is not ach...

504 | Model selection and estimation in regression with grouped variables - Yuan, Lin |

365 | Fast iterative shrinkage-thresholding algorithm for linear inverse problems - Beck, Teboulle - 2009 |

346 | CoSaMP: iterative signal recovery from incomplete and inaccurate samples - Needell, Tropp - 2008 |

318 | Robust face recognition via sparse representation - Wright, Yang, et al. - 2009 |

304 | Compressive sensing - Baraniuk - 2007 |

268 | Image denoising via sparse and redundant representations over learned dictionaries - Elad, Aharon - 2006

Citation Context: ...000 natural image patches of size m = 12×12 pixels, for dictionaries of size p = 400. Adapting the dictionary to specific data has proven to be useful in many applications, including image restoration (Elad and Aharon, 2006; Mairal et al., 2009) and learning image features in computer vision (Kavukcuoglu et al., 2009). The resulting optimization problem we are interested in can be written min_{X∈C, W∈R^{p×n}} ∑_{i=1}^n (1/2)‖y_i − Xw_i...

252 | Smooth minimization of non-smooth functions - Nesterov |

191 | Gradient methods for minimizing composite objective function, CORE Discussion Paper 2007/96 - Nesterov - 2007

Citation Context: ...ng on both a well-developed theory (see Tibshirani, 1996; Chen et al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Needell and Tropp, 2009; Combettes and Pesquet, 2010).

189 | Distributed optimization and statistical learning via the alternating direction method of multipliers - Boyd, Parikh, et al. - 2011 |

185 | Simultaneous analysis of lasso and dantzig selector - Bickel, Ritov, et al. |

168 | Sparse reconstruction by separable approximation - Wright, Nowak, et al. |

150 | On implementing the push-relabel method for the maximum flow problem - Cherkassky, Goldberg - 1995

Citation Context: ...t up in practice; see Goldberg and Tarjan (1986) and Cherkassky and Goldberg (1997). We use the so-called “highest-active vertex selection rule, global and gap heuristics” (Goldberg and Tarjan, 1986; Cherkassky and Goldberg, 1997), which has a worst-case complexity of O(|V|²|E|^{1/2}) for a graph (V, E, s, t). This algorithm leverages the concept of pre-flow that relaxes the definition of flow and allows vertices to have a posi...

150 | Submodular functions and optimization - Fujishige - 2005 |

141 | Convex multi-task feature learning - Argyriou, Evgeniou, et al. - 2008 |

136 | Convex Analysis and Nonlinear Optimization: Theory and Examples - Borwein, Lewis - 2000

Citation Context: ...ever significantly different than their theoretical worst-case complexities (see Boykov and Kolmogorov, 2004). We now denote by f* the Fenchel conjugate of f (Borwein and Lewis, 2006), defined by f*(κ) ≜ sup_z [zᵀκ − f(z)]. The duality gap for problem (2) can be derived from standard Fenchel duality arguments (Borwein and Lewis, 2006) and it is equal to f(w) + λΩ(w) + f*(−κ) fo...
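As a worked instance of the Fenchel machinery in this snippet (my illustration, not text from the paper): for the squared loss the conjugate has a closed form, so the duality gap can be evaluated explicitly.

```latex
% Fenchel conjugate: f^*(\kappa) \triangleq \sup_z \,[z^\top\kappa - f(z)].
% For the squared loss f(z) = \tfrac{1}{2}\|y - z\|_2^2 the supremum is
% attained at z = y + \kappa, giving
f^*(\kappa) = y^\top\kappa + \tfrac{1}{2}\|\kappa\|_2^2 ,
% so for a dual-feasible \kappa the duality gap of problem (2) reads
f(w) + \lambda\Omega(w) + f^*(-\kappa)
  = \tfrac{1}{2}\|y - w\|_2^2 + \lambda\Omega(w)
    - y^\top\kappa + \tfrac{1}{2}\|\kappa\|_2^2 .
```

The gap is non-negative for any dual-feasible κ and vanishes only at an optimum, which is what makes it usable as the stopping criterion mentioned in the Negahban et al. context below.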

129 | Network Optimization: Continuous and Discrete Models, Athena Scientific - Bertsekas - 1998

Citation Context: ...vertices, E ⊆ V×V a set of arcs, s a source, and t a sink. For all arcs in E, we define a non-negative capacity constant, and as done classically in the network flow literature (Ahuja et al., 1993; Bertsekas, 1998), we define a flow as a non-negative function on arcs that satisfies capacity constraints on all arcs (the value of the flow on an arc is less than or equal to the arc capacity) and conservation cons...

125 | A fast parametric maximum flow algorithm and applications - Gallo, Grigoriadis, et al. - 1989

Citation Context: ...groups is more difficult. Hochbaum and Hong (1995) have shown that quadratic min-cost flow problems can be reduced to a specific parametric max-flow problem, for which an efficient algorithm exists (Gallo et al., 1989). While this generic approach could be used to solve Eq. (6), we propose to use Algorithm 1 that also exploits the fact that our graphs have non-zero costs only on edges leading to the sink. As sho...

113 | Convergence of a block coordinate descent method for nondifferentiable minimization - Tseng |

109 | Group Lasso with overlap and graph Lasso - Jacob, Obozinski, et al. - 2009

Citation Context: ...cently been devoted to designing sparsity-inducing regularizations capable of encoding higher-order information about the patterns of non-zero coefficients (Cehver et al., 2008; Jenatton et al., 2009; Jacob et al., 2009; Zhao et al., 2009; He and Carin, 2009; Huang et al., 2009; Baraniuk et al., 2010; Micchelli et al., 2010), with successful applications in bioinformatics (Jacob et al., 2009; Kim and Xing, 2010), to...

105 | Non-local sparse models for image restoration - Mairal, Bach, et al. - 2009

Citation Context: ...es of size m = 12×12 pixels, for dictionaries of size p = 400. Adapting the dictionary to specific data has proven to be useful in many applications, including image restoration (Elad and Aharon, 2006; Mairal et al., 2009) and learning image features in computer vision (Kavukcuoglu et al., 2009). The resulting optimization problem we are interested in can be written min_{X∈C, W∈R^{p×n}} ∑_{i=1}^n [(1/2)‖y_i − Xw_i‖₂² + λΩ(w_i)], (14)...

104 | Maximal flow through a network, Canadian Journal of Mathematics - Ford, Fulkerson - 1956

Citation Context: ...he cost of the flow reaches the lower bound, and is therefore optimal. If such a flow does not exist, we have ξ ≠ γ, the lower bound is not achievable, and we build a minimum (s,t)-cut of the graph (Ford and Fulkerson, 1956) defining two disjoint sets of nodes V⁺ and V⁻; V⁺ is the part of the graph which is reachable from the source (for every node j in V⁺, there exists a non-saturated path from s to j), whereas a...

98 | Adaptive wavelet estimation: a block thresholding and oracle inequality approach, The Annals of Statistics - Cai - 1999

Citation Context: ...avelet domain. This regularization encourages neighboring wavelet coefficients to be set to zero together, which was also exploited in the past in block-thresholding approaches for wavelet denoising (Cai, 1999). We call this norm Ωgrid. We consider Daubechies3 wavelets (see Mallat, 1999) for the matrix X, use 12 classical standard test images, and generate noisy versions of them corrupted by a white Gau...

98 | Online learning for matrix factorization and sparse coding - Mairal, Bach, et al. |

86 | Proximal splitting methods in signal processing, in Fixed-Point Algorithms for Inverse Problems in Science - Combettes, Pesquet - 2010

Citation Context: ...al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Needell and Tropp, 2009; Combettes and Pesquet, 2010).

81 | Joint covariate selection and joint subspace selection for multiple classification problems - Obozinski, Taskar, et al. - 2010 |

72 | A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers - Negahban, Ravikumar, et al.

Citation Context: ...ny respects. For instance, dual norms are central in working-set algorithms (Jenatton et al., 2009; Bach et al., 2011), and arise as well when proving theoretical estimation or prediction guarantees (Negahban et al., 2009). In our context, we use it to monitor the convergence of the proximal method through a duality gap, hence defining a proper optimality criterion for problem (2). As a brief reminder, the duality gap...

72 | Simultaneous variable selection - Turlach, Venables, et al. - 2005

Citation Context: ...indeed possible to encode additional knowledge in the regularization other than just sparsity. For instance, one may want the non-zero patterns to be structured in the form of non-overlapping groups (Turlach et al., 2005; Yuan and Lin, 2006; Stojnic et al., 2009; Obozinski et al., 2010), in a tree (Zhao et al., 2009; Bach, 2009; Jenatton et al., 2010a, 2011), or in overlapping groups (Jenatton et al., 2009; Jacob et ...

71 | Proximal methods for sparse hierarchical dictionary learning - Jenatton, Mairal, et al. - 2010 |

68 | Learning invariant features through topographic filter maps - Kavukcuoglu, Ranzato, et al. - 2009

Citation Context: ...sparsity. ...denoising with a structured sparse prior, and topographic dictionary learning of natural image patches (Hyvärinen et al., 2001; Kavukcuoglu et al., 2009; Garrigues and Olshausen, 2010). Note that this paper extends a shorter version published in Advances in Neural Information Processing Systems (Mairal et al., 2010b), by adding new experiments (CUR m...

68 | Fonctions convexes duales et points proximaux dans un espace hilbertien - Moreau - 1962 |

66 | Efficient projections onto the ℓ1-ball for learning in high dimensions - Duchi, Shalev-Shwartz, et al. - 2008

Citation Context: ...ong, 1995). One of the simplest cases, where G contains a single group as in Figure 1a, is solved by an orthogonal projection on the ℓ1-ball of radius λη_g. It has been shown, both in machine learning (Duchi et al., 2008) and operations research (Hochbaum and Hong, 1995; Brucker, 1984), that such a projection can be computed in O(p) operations. When the group structure is a tree as in Figure 1d, strategies developed ...
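The ℓ1-ball projection mentioned in this snippet has a well-known sort-and-threshold form. The sketch below is my own illustration of that classical algorithm: it runs in O(p log p) because of the sort, whereas the O(p) expected-time algorithms cited in the text replace the sort with randomized pivoting.

```python
import math

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1-ball {w : ||w||_1 <= radius}
    via the classical sort-and-threshold rule."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)  # already inside the ball
    u = sorted((abs(x) for x in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - radius) / j
        if uj > t:       # the condition holds exactly for a prefix of j
            theta = t
    # Soft-threshold by theta, keeping the original signs.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]
```

For example, `project_l1_ball([3.0, 1.0], 2.0)` soft-thresholds by θ = 1 and returns `[2.0, 0.0]`, which lies on the boundary of the radius-2 ball.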

66 | A Wavelet Tour of Signal Processing, second edition - Mallat - 1999

Citation Context: ...the data. Regularization by the ℓ1-norm has emerged as a powerful tool for addressing this variable selection problem, relying on both a well-developed theory (see Tibshirani, 1996; Chen et al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Needell and Tropp, 2009; Combettes and Pesquet, ...

62 | The benefit of group sparsity - Huang, Zhang |

61 | On the reconstruction of block-sparse signals with an optimal number of measurements - Stojnic, Parvaresh, et al. - 2009 |

59 | Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming - Wainwright - 2006

Citation Context: ...orm has emerged as a powerful tool for addressing this variable selection problem, relying on both a well-developed theory (see Tibshirani, 1996; Chen et al., 1999; Mallat, 1999; Bickel et al., 2009; Wainwright, 2009, and references therein) and efficient algorithms (Efron et al., 2004; Nesterov, 2007; Beck and Teboulle, 2009; Needell and Tropp, 2009; Combettes and Pesquet, 2010).

58 | Learning with structured sparsity - Huang, Zhang, et al. - 2009

Citation Context: ...zations capable of encoding higher-order information about the patterns of non-zero coefficients (Cehver et al., 2008; Jenatton et al., 2009; Jacob et al., 2009; Zhao et al., 2009; He and Carin, 2009; Huang et al., 2009; Baraniuk et al., 2010; Micchelli et al., 2010), with successful applications in bioinformatics (Jacob et al., 2009; Kim and Xing, 2010), topic modeling (Jenatton et al., 2010a, 2011) and computer vi...

44 | Convex optimization with sparsity-inducing norms, in Optimization for Machine Learning - Bach, Jenatton, et al. - 2011

Citation Context: ...subset of variables known as the working set. As long as some predefined optimality conditions are not satisfied, the working set is augmented with selected inactive variables (for more details, see Bach et al., 2011). The last approach we would like to mention is that of Chen et al. (2010), who used a smoothing technique introduced by Nesterov (2005). A smooth approximation Ωµ of Ω is used, when Ω is a sum of ℓ2...