
## Efficient structure learning of Markov networks using L1-regularization (2006)


### Download Links

- [www.seas.upenn.edu]
- [ai.stanford.edu]
- [www.stanford.edu]
- [robotics.stanford.edu]
- [books.nips.cc]
- [engr.case.edu]
- [machinelearning.wustl.edu]
- [varoon.stanford.edu]

Venue: NIPS

Citations: 143 (3 self)

### Citations

8774 | Probabilistic reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988

Citation Context: ... to simply initialize the model to include all possible features: exact inference in such a model is almost invariably intractable, and approximate inference methods such as loopy belief propagation [17] are likely to give highly inaccurate estimates of the gradient, leading to poorly learned models. Thus, we propose an algorithm schema that gradually introduces features into the model, and lets the ...

4061 | Regression shrinkage and selection via the LASSO.
- Tibshirani
- 1996

Citation Context: ... a technique that has become increasingly popular in the context of supervised learning, in problems that involve a large number of features, many of which may be irrelevant. It has long been known [21] that using L1-regularization over the model parameters — optimizing a joint objective that trades off fit to data with the sum of the absolute values of the parameters — tends to lead to sparse models ...
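The sparsity effect this context describes is easy to demonstrate. Below is a minimal, hypothetical sketch (not code from the paper): L1-regularized least squares solved by iterative soft-thresholding (ISTA), on synthetic data where only the first of five features matters. The L1 penalty drives the four irrelevant coefficients to exactly zero.

```python
import numpy as np

# Hypothetical illustration of L1-induced sparsity (not from the paper):
# minimize 0.5 * ||Xw - y||^2 + lam * ||w||_1 via iterative soft-thresholding.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2.0 * X[:, 0] + 0.01 * rng.normal(size=50)  # only feature 0 is relevant

lam = 5.0                                # L1 penalty weight
eta = 1.0 / np.linalg.norm(X, 2) ** 2    # step size 1/L (L from spectral norm)

w = np.zeros(5)
for _ in range(500):
    w = w - eta * (X.T @ (X @ w - y))                        # gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)  # soft-threshold

# w[0] lands near 2 (slightly shrunk); the irrelevant weights are exactly 0
print(w)
```

With an L2 penalty in place of the soft-threshold step, the irrelevant weights would be small but nonzero; the exact zeros are what make L1 useful for feature (and structure) selection.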

1475 | Gradient-based learning applied to document recognition.
- LeCun, Bottou, et al.
- 1998

Citation Context: ... reasonable feature introduction method. We test our method on synthetic data generated from known MRFs and on two real-world tasks: modeling the joint distribution of pixel values in the MNIST data [12], and modeling the joint distribution of genetic sequence variations — single-nucleotide polymorphisms (SNPs) — in the human HapMap data [3]. Our results show that L1-regularization out-performs other ...

766 | Probabilistic Networks and Expert Systems
- Cowell, Dawid, et al.
- 1999

Citation Context: ... is calibrated, in that all of the belief potentials must agree [23]. Thus, we can view the subtree as a calibrated clique tree, and use standard dynamic programming methods over the tree (see, e.g., [4]) to extract an approximate joint distribution over Xk. We note that this computation is exact for tree-structured cluster graphs, but approximate otherwise, and that the choice of tree is not obvious ...

665 | Loopy belief propagation for approximate inference: An empirical study.
- Murphy, Weiss, et al.
- 1999

Citation Context: ... defining a feature which is an indicator function for every assignment of variables xc to a clique Xc. The mapping to Markov networks is useful, as most inference algorithms, such as belief propagation [17, 16], operate on the graph structure of the Markov network. The standard learning problem for MRFs is formulated as follows. We are given a set of IID training instances D = {x[1], . . . , x[M]}, each ...
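For reference, the standard log-linear parameterization this context alludes to, with features $f_i$ and weights $\theta_i$, is:

```latex
% Log-linear Markov network over features f_i with weights theta_i
P_\theta(x) = \frac{1}{Z(\theta)} \exp\Big(\sum_i \theta_i f_i(x)\Big),
\qquad
Z(\theta) = \sum_{x'} \exp\Big(\sum_i \theta_i f_i(x')\Big)

% Log-likelihood of the training set D = {x[1], ..., x[M]}
\ell(\theta; D) = \sum_{m=1}^{M} \sum_i \theta_i f_i(x[m]) - M \log Z(\theta)
```

The partition function $Z(\theta)$ sums over all joint assignments, which is why learning requires inference over the model.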

659 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997

Citation Context: ... is considerably harder. The dominant type of solution to this problem uses greedy local heuristic search, which incrementally modifies the model by adding and possibly deleting features. One approach [6, 14] adds features so as to greedily improve the model likelihood; once a feature is added, it is never removed. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unnecessary ...

578 | Uncertainty principles and ideal atomic decomposition
- Donoho, Huo
- 2001

Citation Context: ... limit, redundant features are eventually eliminated, so that the learning eventually converges to a minimal structure consistent with the underlying distribution. Similar results were shown by Donoho [8], and can perhaps be adapted to this case. A key limiting factor in MRF learning, and in our approach, is the fact that it requires inference over the model. While our experiments suggest that approximate ...

469 | Generalized belief propagation.
- Yedidia, Freeman, et al.
- 2001

Citation Context: ... which results in errors in the gradient. While many approximate inference methods have been proposed, one of the most commonly used is the general class of loopy belief propagation (BP) algorithms [17, 16, 24]. The use of an approximate inference algorithm such as BP raises several important points. One important issue relates to the computation of the gradient or the gain for features that are ...

443 | Multiple kernel learning, conic duality, and the SMO algorithm.
- Bach, Lanckriet, et al.
- 2004

Citation Context: ... of the induced Markov network. We can extend our approach to introduce such a bias by using a variant of L1 regularization that penalizes blocks of parameters together, such as the block-L1-norm of [2]. From a theoretical perspective, it would be interesting to show that, at the large sample limit, redundant features are eventually eliminated, so that the learning eventually converges to a minimal ...

228 | Efficiently inducing features of conditional random fields.
- McCallum
- 2003

Citation Context: ... is considerably harder. The dominant type of solution to this problem uses greedy local heuristic search, which incrementally modifies the model by adding and possibly deleting features. One approach [6, 14] adds features so as to greedily improve the model likelihood; once a feature is added, it is never removed. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unnecessary ...

197 | Feature selection, L1 vs. L2 regularization, and rotational invariance
- Ng
- 2004

185 | Large-scale Bayesian logistic regression for text categorization
- Genkin, Lewis, et al.
- 2007

Citation Context: ... this approach is useful for selecting the relevant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the purpose ...

172 | 1-norm support vector machines.
- Zhu, Rosset, et al.
- 2003

Citation Context: ... relevant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the purpose of structure learning in general ...

106 | Grafting: Fast, incremental feature selection by gradient descent in function space
- Perkins, Lacker, et al.
- 2003

Citation Context: ... this approach is useful for selecting the relevant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the purpose ...

104 | Loopy belief propagation: Convergence and effects of message errors
- Ihler, Fisher, et al.
- 2005

Citation Context: ... they return are typically much less accurate. Moreover, non-convergence of the inference is more common when the network parameters are allowed to take larger, more extreme values; see, for example, [20, 11, 13] for some theoretical results supporting this empirical phenomenon. Thus, it is important to keep the model amenable to approximate inference, and thereby continue to improve, for as long as possible. ...

81 | Performance guarantees for regularized maximum entropy density estimation
- Dudik, Phillips, et al.
- 2004

73 | Exponential priors for maximum entropy models
- Goodman

Citation Context: ... this approach is useful for selecting the relevant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the purpose ...

62 | Thin junction trees.
- Bach, Jordan
- 2001

Citation Context: ... removed. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unnecessary features, and thereby to overly complex structures and overfitting. An alternative approach [1, 7] explicitly searches over the space of low-treewidth models, but the utility of such models in practice is unclear; indeed, hand-designed models for real-world problems generally do not have low tree-width ...

59 | Algorithms for maximum-likelihood logistic regression
- Minka
- 2001

Citation Context: ... The key difficulty is that the maximum likelihood (ML) parameters of these networks have no analytic closed form; finding these parameters requires an iterative procedure (such as conjugate gradient [15] or BFGS [5]), where each iteration runs inference over the current model. This type of procedure is computationally expensive even for models where inference is tractable. The problem of structure learning ...
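The expense described here comes from the gradient of the log-likelihood, whose form is the empirical feature expectation minus the model feature expectation; the second term requires inference. A hypothetical toy sketch (the model and names are illustrative, not the paper's): one gradient evaluation for a three-node binary chain MRF, with the model expectation computed by brute-force enumeration. This is the step that an iterative optimizer such as conjugate gradient or BFGS must repeat at every iteration.

```python
import itertools
import numpy as np

# Hypothetical toy model (not from the paper): 3-node binary chain MRF with
# two pairwise agreement features. The model expectation E_theta[f] requires
# summing over all joint states -- the "inference" cost of each gradient step.

def features(x):
    # f1 = 1[x0 == x1], f2 = 1[x1 == x2]
    return np.array([x[0] == x[1], x[1] == x[2]], dtype=float)

def log_likelihood_grad(theta, data):
    states = list(itertools.product([0, 1], repeat=3))
    scores = np.array([theta @ features(x) for x in states])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                  # normalize: 1/Z
    model_exp = sum(p * features(x) for p, x in zip(probs, states))
    data_exp = np.mean([features(x) for x in data], axis=0)
    return data_exp - model_exp                           # E_D[f] - E_theta[f]

data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
g = log_likelihood_grad(np.zeros(2), data)
print(g)  # at theta = 0 every state is equiprobable, so E_theta[f] = [0.5, 0.5]
```

On real models the enumeration over states is replaced by (possibly approximate) inference such as loopy BP, which is exactly where the computational cost and the accuracy concerns discussed in the paper arise.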

33 | Efficient stepwise selection in decomposable models
- Deshpande, Garofalakis, et al.
- 2001

Citation Context: ... removed. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unnecessary features, and thereby to overly complex structures and overfitting. An alternative approach [1, 7] explicitly searches over the space of low-treewidth models, but the utility of such models in practice is unclear; indeed, hand-designed models for real-world problems generally do not have low tree-width ...

27 | Tree-based modeling and estimation of Gaussian processes on graphs with cycles
- Wainwright, Sudderth, et al.
- 2000

Citation Context: ... the calibrated loopy graph that contains all of the variables in Xk. At convergence of the BP algorithm, every subtree of the loopy graph is calibrated, in that all of the belief potentials must agree [23]. Thus, we can view the subtree as a calibrated clique tree, and use standard dynamic programming methods over the tree (see, e.g., [4]) to extract an approximate joint distribution over Xk. We note ...

17 | Notes on CG and LM-BFGS optimization of logistic regression
- Daumé

Citation Context: ... The key difficulty is that the maximum likelihood (ML) parameters of these networks have no analytic closed form; finding these parameters requires an iterative procedure (such as conjugate gradient [15] or BFGS [5]), where each iteration runs inference over the current model. This type of procedure is computationally expensive even for models where inference is tractable. The problem of structure learning is ...

11 | Incremental feature selection and L1 regularization for relaxed maximum-entropy modeling
- Riezler, Vasserman

Citation Context: ... effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the purpose of structure learning in general Markov networks. In this paper, we explore this approach, and discuss issues that ...

10 | Loopy belief propagation and Gibbs measures
- Tatikonda, Jordan
- 2002

Citation Context: ... they return are typically much less accurate. Moreover, non-convergence of the inference is more common when the network parameters are allowed to take larger, more extreme values; see, for example, [20, 11, 13] for some theoretical results supporting this empirical phenomenon. Thus, it is important to keep the model amenable to approximate inference, and thereby continue to improve, for as long as possible. ...

5 | Inferring graphical model structure using ℓ1-regularized pseudo-likelihood
- Wainwright, Ravikumar, et al.
- 2006

Citation Context: ... logistic regression classifier for each variable given all of the others, and then use only the edges that are used in these individual classifiers. This approach is similar to the work of Wainwright et al. [22] (done in parallel with our work), who proposed the use of L1-regularized pseudo-likelihood for asymptotically learning a Markov network structure. ...
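The pseudo-likelihood objective that Wainwright et al. substitute for the likelihood replaces the joint distribution with per-variable conditionals, each of which normalizes over a single variable and so needs no global partition function:

```latex
% Pseudo-likelihood over instances x[1], ..., x[M] and variables X_1, ..., X_n
\ell_{PL}(\theta; D) = \sum_{m=1}^{M} \sum_{i=1}^{n}
    \log P_\theta\big(x_i[m] \mid x_{-i}[m]\big)
```

Each conditional $P_\theta(x_i \mid x_{-i})$ depends only on the features touching $X_i$, which is what makes the per-variable L1-regularized regressions described in the context above tractable.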

3 | General lower bounds based on computer generated higher order expansions
- Leisink, Kappen
- 2002

Citation Context: ... they return are typically much less accurate. Moreover, non-convergence of the inference is more common when the network parameters are allowed to take larger, more extreme values; see, for example, [20, 11, 13] for some theoretical results supporting this empirical phenomenon. Thus, it is important to keep the model amenable to approximate inference, and thereby continue to improve, for as long as possible. ...