## On learning discrete graphical models using greedy methods (2011)

Venue: | Neural Information Processing Systems (NIPS) (currently under review) |

Citations: | 9 (4 self) |

### BibTeX

```bibtex
@INPROCEEDINGS{Jalali11onlearning,
  author    = {Ali Jalali and Christopher C. Johnson and Pradeep Ravikumar},
  title     = {On learning discrete graphical models using greedy methods},
  booktitle = {Neural Information Processing Systems (NIPS)},
  year      = {2011}
}
```


### Abstract

In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum node degree d, and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d^2 log p), in contrast to existing convex-optimization-based algorithms that require a sample complexity of Ω(d^3 log p). Further, the greedy algorithm only requires a restricted strong convexity condition, which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
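The forward-backward greedy scheme the abstract describes can be sketched for the simplest setting, least-squares feature selection. This is a minimal illustration, not the paper's exact procedure: the function name, the forward stopping threshold `eps`, and the backward tolerance `nu` are all illustrative choices.

```python
import numpy as np

def forward_backward_greedy(X, y, eps=1e-2, nu=0.5, max_steps=20):
    """Sketch of a forward-backward greedy feature selector for least
    squares; eps (minimum forward gain) and nu (backward tolerance,
    0 < nu < 1) are illustrative, not the paper's exact rule."""
    n, p = X.shape

    def refit(support):
        # Least-squares refit restricted to the current support.
        th = np.zeros(p)
        idx = sorted(support)
        if idx:
            th[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
        return th

    def loss(th):
        r = y - X @ th
        return (r @ r) / (2 * n)

    S, theta = set(), np.zeros(p)
    for _ in range(max_steps):
        # Forward step: add the coordinate whose refit most decreases the loss.
        base = loss(theta)
        gains = {j: base - loss(refit(S | {j})) for j in range(p) if j not in S}
        if not gains:
            break
        j_best = max(gains, key=gains.get)
        if gains[j_best] < eps:
            break  # forward step fails: stop
        S.add(j_best)
        theta = refit(S)
        fwd_gain = gains[j_best]
        # Backward step: drop any coordinate whose removal costs less
        # than nu times the last forward gain.
        while len(S) > 1:
            cur = loss(theta)
            costs = {j: loss(refit(S - {j})) - cur for j in S}
            j_min = min(costs, key=costs.get)
            if costs[j_min] >= nu * fwd_gain:
                break
            S.remove(j_min)
            theta = refit(S)
    return sorted(S), theta
```

The backward step is what distinguishes this from pure forward selection: a coordinate added early on the strength of a spurious correlation can be deleted cheaply once the true support is in place, which is the mechanism behind the weaker conditions discussed in the abstract.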

### Citations

1675 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 1998
Citation Context ... performs just the forward step has appeared in various guises in multiple communities: in machine learning as boosting [13], in function approximation [24], and in signal processing as basis pursuit [6]. In the context of statistical model estimation, Zhang [28] analyzed the forward greedy algorithm for the case of sparse linear regression; and showed that the forward greedy algorithm is sparsistent... |

1225 | Additive logistic regression: a statistical view of boosting
- Friedman, Hastie, et al.
Citation Context ...estimate after a finite number of greedy steps. The forward greedy variant which performs just the forward step has appeared in various guises in multiple communities: in machine learning as boosting [13], in function approximation [24], and in signal processing as basis pursuit [6]. In the context of statistical model estimation, Zhang [28] analyzed the forward greedy algorithm for the case of sparse... |

637 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context ...y Local Search. Methods for estimating such graph structure include those based on constraint and hypothesis testing [22], and those that estimate restricted classes of graph structures such as trees [8], polytrees [11], and hypertrees [23]. A recent class of successful approaches for graphical model structure learning are based on estimating the local neighborhood of each node. One subclass of these... |

497 | Causation, Prediction and Search
- Spirtes, Glymour, et al.
- 2000
Citation Context ... range of applications of MRFs. Existing approaches: Neighborhood Estimation, Greedy Local Search. Methods for estimating such graph structure include those based on constraint and hypothesis testing [22], and those that estimate restricted classes of graph structures such as trees [8], polytrees [11], and hypertrees [23]. A recent class of successful approaches for graphical model structure learning ... |

426 | The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics 35:2313–2351
- Candès, Tao
- 2007
Citation Context ...sed on the model space. Of relevance to graphical model structure learning is the structure of sparsity, where a sparse set of non-zero parameters entail a sparse set of edges. A surge of recent work [5, 12] has shown that ℓ1-regularization for learning such sparse models can lead to practical algorithms with strong theoretical guarantees. A line of recent work (cf. paragraph above) has thus leveraged th... |

384 | High-dimensional graphs and variable selection with the Lasso
- Meinshausen, Bühlmann
Citation Context ...grows at least as quickly as O(p^d), where d is the maximum neighborhood size in the graphical model [1, 4, 9]. Another subclass use convex programs to learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional likelihood; [15, 10] use ℓ1/ℓ2-regularized conditional likelihood. Even these methods, however need to ... |

155 | Learning Bayesian networks is np-complete
- Chickering
- 1996
Citation Context ...blems. Another popular class of approaches are based on using a score metric and searching for the best scoring structure from a candidate set of graph structures. Exact search is typically NP-hard [7]; indeed for general discrete MRFs, not only is the search space intractably large, but calculation of typical score metrics itself is computationally intractable since they involve computing the part... |

102 | Efficient structure learning of Markov networks using l1-regularization
- Lee, Ganapathi, et al.
- 2007
Citation Context ...grows at least as quickly as O(p^d), where d is the maximum neighborhood size in the graphical model [1, 4, 9]. Another subclass use convex programs to learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional likelihood; [15, 10] use ℓ1/ℓ2-regularized conditional likelihood. Even these methods, however need to ... |

78 | Sparse permutation invariant covariance estimation
- Rothman, Bickel, et al.
Citation Context ...ain lemmas: Lemmas 1 and 2 which are simple consequences of the forward and backward steps failing when the greedy algorithm stops, and Lemma 3 which uses these two lemmas and extends techniques from [21] and [19] to obtain an ℓ2 error bound on the error. Provided these lemmas hold, we then show below that the greedy algorithm is sparsistent. However, these lemmas require a priori that the RSC and RSS c... |

72 | Supplement to “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers”
- Negahban, Ravikumar, et al.
- 2012
Citation Context ...in a finite number of steps. We state the assumptions on the loss function such that sparsistency is guaranteed. Let us first recall the definition of restricted strong convexity from Negahban et al. [18]. Specifically, for a given set S, the loss function is said to satisfy restricted strong convexity (RSC) with parameter κ_l with respect to the set S if L(θ+Δ; Z_1^n) − L(θ; Z_1^n) − ⟨∇L(θ; Z_1^n), Δ⟩ ≥ (κ_l/2)‖... |
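The quoted snippet is cut off mid-inequality. In the standard form from Negahban et al., restricted strong convexity with parameter κ_l with respect to a set S asks that for all perturbations Δ in that restricted set,

```latex
\mathcal{L}(\theta + \Delta; Z_1^n)
  - \mathcal{L}(\theta; Z_1^n)
  - \langle \nabla \mathcal{L}(\theta; Z_1^n), \Delta \rangle
  \;\ge\; \frac{\kappa_l}{2}\,\lVert \Delta \rVert_2^2,
  \qquad \forall\, \Delta \in S,
```

i.e. the loss is lower-bounded by a quadratic along the restricted directions, even when it is not strongly convex over the whole parameter space.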

56 | High-dimensional generalized linear models and the lasso
- Geer
Citation Context ...he third order terms in the Taylor series expansion of the log-likelihood, that in turn requires the estimate to be well-behaved. Such extensions in the case of ℓ1-regularization occur for instance in [20, 25, 3]. Our Contributions. In this paper, we address both questions above. In the first part, we analyze the forward backward greedy algorithm [28] for general statistical models. We note that even though w... |

51 | Adaptive forward-backward greedy algorithm for sparse learning with linear models
- Zhang
- 2008
Citation Context ...nd showed that the forward greedy algorithm is sparsistent (consistent for model selection recovery) under the same “irrepresentable” condition as that required for “sparsistency” of the Lasso. Zhang [27] analyzes a more general greedy algorithm for sparse linear regression that performs forward and backward steps, and showed that it is sparsistent under a weaker restricted eigenvalue condition. Here ... |

45 | Learning factor graphs in polynomial time and sample complexity
- Abbeel, Koller, et al.
Citation Context ...bounded degree graphs involve the use of exhaustive search so that their computational complexity grows at least as quickly as O(p^d), where d is the maximum neighborhood size in the graphical model [1, 4, 9]. Another subclass use convex programs to learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional li... |

41 | Maximum Likelihood Bounded Tree-width Markov Networks
- Srebro
- 2001
Citation Context ...ng such graph structure include those based on constraint and hypothesis testing [22], and those that estimate restricted classes of graph structures such as trees [8], polytrees [11], and hypertrees [23]. A recent class of successful approaches for graphical model structure learning are based on estimating the local neighborhood of each node. One subclass of these for the special case of bounded degr... |

38 | High-dimensional Ising model selection using ℓ1-regularized logistic regression
- Ravikumar, Wainwright, et al.
Citation Context ... Remarks. The sufficient condition on the parameters imposed by the greedy algorithm is a restricted strong convexity condition [19], which is weaker than the irrepresentable condition required by [20]. Further, the number of samples required for sparsistent graph recovery scales as O(d^2 log p), where d is the maximum node degree, in contrast to O(d^3 log p) for the ℓ1-regularized counterpart. We co... |
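The ℓ1-regularized neighborhood-estimation baseline this context contrasts with (the approach of [20]) can be sketched in a few lines: regress each node on the rest with an ℓ1-penalized logistic loss and read the neighbors off the nonzero coefficients. This is an illustrative NumPy implementation assuming ±1-valued data; the helper names, the penalty weight `lam`, and the ISTA step size and iteration count are all hypothetical choices.

```python
import numpy as np

def l1_logistic(A, y, lam, lr=0.1, iters=3000):
    """ISTA (proximal gradient) for l1-penalized logistic regression
    with labels y in {-1,+1}; lam, lr and iters are illustrative."""
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(iters):
        # Gradient of the average logistic loss (1/n) sum log(1+exp(-y_i a_i.w)).
        s = 1.0 / (1.0 + np.exp(y * (A @ w)))
        w = w - lr * (-(A.T @ (y * s)) / n)
        # Soft-thresholding: the prox step of the l1 penalty (gives exact zeros).
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

def neighborhoods(X, lam=0.1):
    """Node-wise neighborhood selection: regress each variable x_r on
    the remaining variables with an l1 penalty; the nonzero coordinates
    are the estimated neighbors of r."""
    n, p = X.shape
    nbrs = {}
    for r in range(p):
        others = [j for j in range(p) if j != r]
        w = l1_logistic(X[:, others], X[:, r], lam)
        nbrs[r] = [others[k] for k in np.flatnonzero(np.abs(w) > 1e-8)]
    return nbrs
```

Each node costs one regularized convex solve over p − 1 coefficients, which is the polynomial (but non-trivial) cost the quoted context refers to; the greedy alternative touches only the active support at each step.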

33 | Learning polytrees
- Dasgupta
- 1999
Citation Context ... Methods for estimating such graph structure include those based on constraint and hypothesis testing [22], and those that estimate restricted classes of graph structures such as trees [8], polytrees [11], and hypertrees [23]. A recent class of successful approaches for graphical model structure learning are based on estimating the local neighborhood of each node. One subclass of these for the special... |

31 | Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms
- Bresler, Mossel, et al.
- 2008
Citation Context ...bounded degree graphs involve the use of exhaustive search so that their computational complexity grows at least as quickly as O(p^d), where d is the maximum neighborhood size in the graphical model [1, 4, 9]. Another subclass use convex programs to learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional li... |

29 | On the consistency of feature selection using greedy least squares regression
- Zhang
Citation Context ...ses in multiple communities: in machine learning as boosting [13], in function approximation [24], and in signal processing as basis pursuit [6]. In the context of statistical model estimation, Zhang [28] analyzed the forward greedy algorithm for the case of sparse linear regression; and showed that the forward greedy algorithm is sparsistent (consistent for model selection recovery) under the same “i... |

28 | Maximal sparsity representation via ℓ1 minimization
- Donoho, Elad
Citation Context ...sed on the model space. Of relevance to graphical model structure learning is the structure of sparsity, where a sparse set of non-zero parameters entail a sparse set of edges. A surge of recent work [5, 12] has shown that ℓ1-regularization for learning such sparse models can lead to practical algorithms with strong theoretical guarantees. A line of recent work (cf. paragraph above) has thus leveraged th... |

22 | Greedy Approximation
- Temlyakov
- 2011
Citation Context ...f greedy steps. The forward greedy variant which performs just the forward step has appeared in various guises in multiple communities: in machine learning as boosting [13], in function approximation [24], and in signal processing as basis pursuit [6]. In the context of statistical model estimation, Zhang [28] analyzed the forward greedy algorithm for the case of sparse linear regression; and showed t... |

20 | Self-concordant analysis for logistic regression
- Bach
Citation Context ...he third order terms in the Taylor series expansion of the log-likelihood, that in turn requires the estimate to be well-behaved. Such extensions in the case of ℓ1-regularization occur for instance in [20, 25, 3]. Our Contributions. In this paper, we address both questions above. In the first part, we analyze the forward backward greedy algorithm [28] for general statistical models. We note that even though w... |

7 | Consistent estimation of the basic neighborhood structure of Markov random fields
- Csiszár, Talata
- 2006
Citation Context ...bounded degree graphs involve the use of exhaustive search so that their computational complexity grows at least as quickly as O(p^d), where d is the maximum neighborhood size in the graphical model [1, 4, 9]. Another subclass use convex programs to learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional li... |

7 | Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik 31
- Ising
- 1925
Citation Context ... P(x) ∝ exp{ Σ_{r∈V} θ_r(x_r) + Σ_{(r,t)∈E} θ_rt(x_r, x_t) }. (1) In this paper, we largely focus on the case where the variables are binary with X = {−1,+1}, where we can rewrite (1) to the Ising model form [14] for some set of parameters {θ_r} and {θ_rt} as P(x) ∝ exp{ Σ_{r∈V} θ_r x_r + Σ_{(r,t)∈E} θ_rt x_r x_t }. (2) 2.2 Graphical Model Selection. Let D := {x^(1), ..., x^(n)} denote the set of n samples, where each p-dime... |
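The Ising form (2) quoted in this context is easy to make concrete on a toy graph. The sketch below brute-forces the partition function of a hypothetical 3-node chain (all parameter values are illustrative), which is only feasible here because the 2^3 configurations can be enumerated; computing this normalizer is exactly what becomes intractable for general discrete MRFs.

```python
import numpy as np
from itertools import product

def ising_unnorm(theta_node, theta_edge, x):
    """Unnormalized probability of configuration x under the Ising
    form (2): exp(sum_r theta_r x_r + sum_{(r,t) in E} theta_rt x_r x_t).
    theta_edge is stored symmetrically, so the 0.5 factor counts each
    edge once in the quadratic form."""
    x = np.asarray(x, dtype=float)
    return np.exp(theta_node @ x + 0.5 * x @ theta_edge @ x)

# Hypothetical 3-node chain 0 - 1 - 2 with illustrative parameters.
theta_node = np.array([0.1, 0.0, -0.2])
theta_edge = np.zeros((3, 3))
theta_edge[0, 1] = theta_edge[1, 0] = 0.8   # edge (0, 1)
theta_edge[1, 2] = theta_edge[2, 1] = 0.8   # edge (1, 2)

# Brute-force partition function over all 2^3 binary configurations.
states = list(product([-1, 1], repeat=3))
Z = sum(ising_unnorm(theta_node, theta_edge, s) for s in states)
probs = {s: ising_unnorm(theta_node, theta_edge, s) / Z for s in states}
```

With positive edge weights, configurations whose neighboring spins agree (e.g. (1, 1, 1)) receive more mass than misaligned ones (e.g. (1, −1, 1)), which is the ferromagnetic behavior the Ising citation refers to.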

6 | On learning discrete graphical models using group-sparse regularization
- Jalali, Ravikumar, et al.
- 2011
Citation Context ...subclass use convex programs to learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional likelihood; [15, 10] use ℓ1/ℓ2-regularized conditional likelihood. Even these methods, however need to solve regularized convex programs with typically polynomial computational cost of O(p^4) or O(p^6), are still expen... |

3 | Convergence rates of gradient methods for high-dimensional statistical recovery
- Agarwal, Negahban, et al.
- 2010
Citation Context ...C, RSS. We first note that the conditional log-likelihood loss function in (4) corresponds to a logistic likelihood. Moreover, the covariates are all binary, and bounded, and hence also sub-Gaussian. [19, 2] analyze the RSC and RSS properties of generalized linear models, of which logistic models are an instance, and show that the following result holds if the covariates are sub-Gaussian. Let ∂L(Δ; θ∗) =... |

3 | Decomposition and model selection for large contingency tables
- Dahinden, Kalisch, et al.
- 2010
Citation Context ...subclass use convex programs to learn the neighborhood structure: for instance [20, 17, 16] estimate the neighborhood set for each vertex r ∈ V by optimizing its ℓ1-regularized conditional likelihood; [15, 10] use ℓ1/ℓ2-regularized conditional likelihood. Even these methods, however need to solve regularized convex programs with typically polynomial computational cost of O(p^4) or O(p^6), are still expen... |