## Efficient structure learning of Markov networks using L1-regularization (2006)

Venue: NIPS

Citations: 106 (2 self)

### BibTeX

@INPROCEEDINGS{Lee06efficientstructure,
  author    = {Su-In Lee and Varun Ganapathi and Daphne Koller},
  title     = {Efficient structure learning of Markov networks using L1-regularization},
  booktitle = {NIPS},
  year      = {2006}
}

### Abstract

Markov networks are used in a wide variety of applications, ranging from computer vision to natural language to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally effective method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real-world data sets: modeling the joint distribution of pixel values in the MNIST data, and modeling the joint distribution of genetic sequence variations in the human HapMap data. We show that our L1-based method achieves considerably higher generalization performance than the more standard L2-based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
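The formulation in the abstract, L1-penalized maximum likelihood over log-linear weights solved by gradient methods, can be illustrated on a toy model. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a tiny Ising-style pairwise MRF small enough for exact inference by state enumeration, and a plain proximal-gradient (ISTA) loop; the soft-thresholding step is what drives many pair weights exactly to zero. All names are illustrative.

```python
import itertools
import numpy as np

def pair_features(states, pairs):
    """f_ij(x) = x_i * x_j for spin-valued (+/-1) states."""
    return np.stack([states[:, i] * states[:, j] for i, j in pairs], axis=1)

def learn_l1_mrf(data, pairs, n, lam=0.1, lr=0.2, steps=400):
    """Proximal gradient (ISTA) on the exact negative log-likelihood of a
    pairwise MRF, with an L1 penalty on the pair weights. Inference is done
    by enumerating all 2^n states, so this only works at toy sizes."""
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    feats = pair_features(states, pairs)
    emp = pair_features(data, pairs).mean(axis=0)         # E_data[f]
    w = np.zeros(len(pairs))
    for _ in range(steps):
        log_p = feats @ w
        probs = np.exp(log_p - np.logaddexp.reduce(log_p))
        grad = probs @ feats - emp                        # E_model[f] - E_data[f]
        w -= lr * grad                                    # smooth gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # L1 prox
    return w

# Sample from a sparse ground-truth model: only pairs (0,1) and (2,3) interact.
rng = np.random.default_rng(0)
n = 4
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
w_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
all_states = np.array(list(itertools.product([-1, 1], repeat=n)))
p = np.exp(pair_features(all_states, pairs) @ w_true)
p /= p.sum()
data = all_states[rng.choice(len(all_states), size=2000, p=p)]

w_hat = learn_l1_mrf(data, pairs, n)  # cross-pair weights land at exactly zero
```

The nonzero pattern of `w_hat` is the learned structure: the two true edges survive with shrunken weights, while the four spurious pairs are thresholded to exact zeros.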

### Citations

7440 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context: ...l to simply initialize the model to include all possible features: exact inference in such a model is almost invariably intractable, and approximate inference methods such as loopy belief propagation [17] are likely to give highly inaccurate estimates of the gradient, leading to poorly learned models. Thus, we propose an algorithm schema that gradually introduces features into the model, and lets the ...

2038 | Regression shrinkage and selection via the LASSO
- Tibshirani
- 1996
Citation Context: ...se a technique that has become increasingly popular in the context of supervised learning, in problems that involve a large number of features, many of which may be irrelevant. It has been long known [21] that using L1-regularization over the model parameters — optimizing a joint objective that trades off fit to data with the sum of the absolute values of the parameters — tends to lead to sparse model...
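The LASSO cited here, L1-penalized least squares, shows the sparsity effect in a few lines. The coordinate-descent solver below is a standard textbook sketch, not code from the cited paper; the data and names are illustrative. The soft-thresholding update is what produces exact zeros.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values inside [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(sweeps):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]       # residual excluding feature j
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Sparse ground truth: only the first two of ten features matter.
rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_l1 = lasso_cd(X, y, lam=0.2)  # irrelevant coefficients are exactly zero
```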

804 | Gradient-based learning applied to document recognition
- LeCun, Bottou, et al.
- 1998
Citation Context: ...y reasonable feature introduction method. We test our method on synthetic data generated from known MRFs and on two real-world tasks: modeling the joint distribution of pixel values in the MNIST data [12], and modeling the joint distribution of genetic sequence variations — single-nucleotide polymorphisms (SNPs) — in the human HapMap data [3]. Our results show that L1-regularization outperforms other...

665 | Probabilistic Networks and Expert Systems
- Cowell, Lauritzen, et al.
- 1999
Citation Context: ... is calibrated, in that all of the belief potentials must agree [23]. Thus, we can view the subtree as a calibrated clique tree, and use standard dynamic programming methods over the tree (see, e.g., [4]) to extract an approximate joint distribution over Xk. We note that this computation is exact for tree-structured cluster graphs, but approximate otherwise, and that the choice of tree is not obviou...

571 | Inducing features of random fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation Context: ...is considerably harder. The dominant type of solution to this problem uses greedy local heuristic search, which incrementally modifies the model by adding and possibly deleting features. One approach [6, 14] adds features so as to greedily improve the model likelihood; once a feature is added, it is never removed. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unn...

493 | Loopy belief propagation for approximate inference: an empirical study
- Murphy, Weiss, et al.
- 1999
Citation Context: ...fining a feature which is an indicator function for every assignment of variables xc to a clique Xc. The mapping to Markov networks is useful, as most inference algorithms, such as belief propagation [17, 16], operate on the graph structure of the Markov network. The standard learning problem for MRFs is formulated as follows. We are given a set of IID training instances D = {x[1], . . . , x[M]}, each con...

410 | Generalized belief propagation
- Yedidia, Freeman, et al.
- 2000
Citation Context: ...e, which results in errors in the gradient. While many approximate inference methods have been proposed, one of the most commonly used is the general class of loopy belief propagation (BP) algorithms [17, 16, 24]. The use of an approximate inference algorithm such as BP raises several important points. One important issue relates to the computation of the gradient or the gain for features that are cu...
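Loopy BP, the approximate-inference workhorse referred to in this context, is easy to sketch for binary pairwise models. The code below is a generic parallel sum-product implementation with illustrative names, not code from any of the cited papers; on a graph with a cycle its beliefs only approximate the true marginals, which is exactly the source of gradient error discussed above.

```python
import itertools
import numpy as np

def loopy_bp(n, edges, psi, iters=50):
    """Parallel sum-product loopy BP on a pairwise MRF over binary variables.
    psi[(i, j)] is a 2x2 potential indexed [x_i, x_j]; messages are normalized."""
    msgs = {(i, j): np.ones(2) / 2 for (a, b) in edges for (i, j) in [(a, b), (b, a)]}
    nbrs = {v: [] for v in range(n)}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            pot = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            prod = np.ones(2)
            for k in nbrs[i]:
                if k != j:
                    prod *= msgs[(k, i)]          # incoming messages except from j
            m = pot.T @ prod                      # sum over x_i of psi * product
            new[(i, j)] = m / m.sum()
        msgs = new
    beliefs = {}
    for v in range(n):
        b = np.ones(2)
        for k in nbrs[v]:
            b *= msgs[(k, v)]
        beliefs[v] = b / b.sum()
    return beliefs

# Triangle (one loop) with random weak potentials.
rng = np.random.default_rng(2)
n, edges = 3, [(0, 1), (1, 2), (0, 2)]
psi = {e: np.exp(0.5 * rng.standard_normal((2, 2))) for e in edges}

# Exact marginal of variable 0 by brute-force enumeration, for comparison.
states = np.array(list(itertools.product([0, 1], repeat=n)))
wts = np.array([np.prod([psi[(a, b)][x[a], x[b]] for (a, b) in edges]) for x in states])
wts /= wts.sum()
exact0 = np.array([wts[states[:, 0] == v].sum() for v in (0, 1)])

beliefs = loopy_bp(n, edges, psi)  # close to, but not exactly, the true marginals
```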

381 | Uncertainty principles and ideal atomic decomposition
- Donoho, Huo
Citation Context: ...limit, redundant features are eventually eliminated, so that the learning eventually converges to a minimal structure consistent with the underlying distribution. Similar results were shown by Donoho [8], and can perhaps be adapted to this case. A key limiting factor in MRF learning, and in our approach, is the fact that it requires inference over the model. While our experiments suggest that approxi...

294 | Multiple kernel learning, conic duality
- Bach, Lanckriet, et al.
- 2004
Citation Context: ...e of the induced Markov network. We can extend our approach to introduce such a bias by using a variant of L1 regularization that penalizes blocks of parameters together, such as the block-L1-norm of [2]. From a theoretical perspective, it would be interesting to show that, at the large sample limit, redundant features are eventually eliminated, so that the learning eventually converges to a minimal ...
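The block-L1 (group-lasso) penalty mentioned in this context differs from plain L1 in its proximal operator: instead of shrinking coordinates one at a time, it shrinks each block's L2 norm, so a whole block of parameters (say, all entries of one edge potential) vanishes together. A small illustrative sketch; the names and numbers are mine, not from [2].

```python
import numpy as np

def group_soft_threshold(w, groups, t):
    """Proximal operator of t * sum_g ||w_g||_2: each block is shrunk toward
    zero as a unit, and a block whose norm is below t is zeroed entirely."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= t else w[g] * (1 - t / norm)
    return out

w = np.array([0.3, -0.2, 2.0, 1.5])
groups = [[0, 1], [2, 3]]            # e.g. the parameter blocks of two edges
w_new = group_soft_threshold(w, groups, t=0.5)
# first block: ||(0.3, -0.2)|| ~ 0.36 <= 0.5, so both entries become exactly 0;
# second block: norm 2.5, so it is scaled by (1 - 0.5/2.5) = 0.8
```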

191 | Efficiently inducing features of conditional random fields
- McCallum
- 2003
Citation Context: ...is considerably harder. The dominant type of solution to this problem uses greedy local heuristic search, which incrementally modifies the model by adding and possibly deleting features. One approach [6, 14] adds features so as to greedily improve the model likelihood; once a feature is added, it is never removed. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unn...

142 | Feature selection, L1 vs. L2 regularization, and rotational invariance - Ng - 2004

123 | Large-Scale Bayesian Logistic Regression for Text Categorization
- Genkin, Lewis, et al.
- 2006
Citation Context: ...this approach is useful for selecting the relevant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the pu...

121 | 1-norm Support Vector Machines
- Zhu, Rosset, et al.
- 2003
Citation Context: ...evant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the purpose of structure learning in general...

80 | Grafting: Fast, incremental feature selection by gradient descent in function space
- Perkins, Lacker, et al.
Citation Context: ...this approach is useful for selecting the relevant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the pu...

69 | Loopy belief propagation: Convergence and effects of message errors
- Ihler, Fisher, et al.
- 2005
Citation Context: ...s they return are typically much less accurate. Moreover, non-convergence of the inference is more common when the network parameters are allowed to take larger, more extreme values; see, for example, [20, 11, 13] for some theoretical results supporting this empirical phenomenon. Thus, it is important to keep the model amenable to approximate inference, and thereby continue to improve, for as long as possible...

58 | Exponential priors for maximum entropy models
- Goodman
- 2004
Citation Context: ...this approach is useful for selecting the relevant features from a large number of irrelevant ones. Other recent work proposes effective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the pu...

55 | Performance guarantees for regularized maximum entropy density estimation - Dudik, Phillips, et al. - 2004

53 | Algorithms for maximum-likelihood logistic regression
- Minka
- 2001
Citation Context: ... The key difficulty is that the maximum likelihood (ML) parameters of these networks have no analytic closed form; finding these parameters requires an iterative procedure (such as conjugate gradient [15] or BFGS [5]), where each iteration runs inference over the current model. This type of procedure is computationally expensive even for models where inference is tractable. The problem of structure le...

45 | Thin junction trees
- Bach, Jordan
- 2001
Citation Context: ...emoved. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unnecessary features, and thereby to overly complex structures and overfitting. An alternative approach [1, 7] explicitly searches over the space of low-treewidth models, but the utility of such models in practice is unclear; indeed, hand-designed models for real-world problems generally do not have low tree-...

29 | Efficient stepwise selection in decomposable models
- Deshpande, Garofalakis, et al.
- 2001
Citation Context: ...emoved. As the feature addition step is heuristic and greedy, this can lead to the inclusion of unnecessary features, and thereby to overly complex structures and overfitting. An alternative approach [1, 7] explicitly searches over the space of low-treewidth models, but the utility of such models in practice is unclear; indeed, hand-designed models for real-world problems generally do not have low tree-...

23 | Tree-based modeling and estimation of Gaussian processes on graphs with cycles
- Wainwright, Sudderth, et al.
- 2001
Citation Context: ...he calibrated loopy graph that contains all of the variables in Xk. At convergence of the BP algorithm, every subtree of the loopy graph is calibrated, in that all of the belief potentials must agree [23]. Thus, we can view the subtree as a calibrated clique tree, and use standard dynamic programming methods over the tree (see, e.g., [4]) to extract an approximate joint distribution over Xk. We note ...

14 | Notes on CG and LM-BFGS optimization of logistic regression
- Daumé
- 2004
Citation Context: ...iculty is that the maximum likelihood (ML) parameters of these networks have no analytic closed form; finding these parameters requires an iterative procedure (such as conjugate gradient [15] or BFGS [5]), where each iteration runs inference over the current model. This type of procedure is computationally expensive even for models where inference is tractable. The problem of structure learning is co...

9 | Incremental feature selection and l1 regularization for relaxed maximum-entropy modeling
- Riezler, Vasserman
- 2004
Citation Context: ...ective algorithms for L1-regularized generalized linear models (e.g., [18, 10, 9]), support vector machines (e.g., [25]), and feature selection in log-linear models encoding natural language grammars [19]. Surprisingly, the use of L1-regularization has not been proposed for the purpose of structure learning in general Markov networks. In this paper, we explore this approach, and discuss issues that ar...

9 | Loopy belief propagation and Gibbs measures
- Tatikonda, Jordan
- 2002
Citation Context: ...s they return are typically much less accurate. Moreover, non-convergence of the inference is more common when the network parameters are allowed to take larger, more extreme values; see, for example, [20, 11, 13] for some theoretical results supporting this empirical phenomenon. Thus, it is important to keep the model amenable to approximate inference, and thereby continue to improve, for as long as possible...

6 | Inferring graphical model structure using ℓ1-regularized pseudolikelihood
- Wainwright, Ravikumar, et al.
- 2006
Citation Context: ...c regression classifier for each variable given all of the others, and then use only the edges that are used in these individual classifiers. This approach is similar to the work of Wainwright et al. [22] (done in parallel with our work), who proposed the use of L1-regularized pseudo-likelihood for asymptotically learning a Markov network structure. 5 The Use of Approximate Inference All of the steps ...
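The neighborhood-selection recipe described in this context, fit an L1-regularized logistic regression of each variable on all the others and read edges off the nonzero weights, can be sketched directly. This is an illustrative reconstruction of the general idea, not Wainwright et al.'s code: the solver is plain ISTA, the data are spin-valued, and all names are hypothetical.

```python
import itertools
import numpy as np

def l1_logreg(X, y, lam=0.05, lr=0.1, steps=2000):
    """ISTA for L1-regularized logistic regression (no intercept):
    minimize mean(log(1 + exp(-y * X @ w))) + lam * ||w||_1, with y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        m = y * (X @ w)
        g = -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0)  # smooth gradient
        w -= lr * g
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # L1 prox
    return w

def neighborhoods(data, lam=0.05):
    """Regress each variable on all the others; nonzero weights mark edges."""
    n_vars = data.shape[1]
    edges = set()
    for i in range(n_vars):
        X = np.delete(data, i, axis=1)
        w = l1_logreg(X, data[:, i], lam)
        others = [j for j in range(n_vars) if j != i]
        for j, wj in zip(others, w):
            if wj != 0:
                edges.add((min(i, j), max(i, j)))
    return edges

# Sample from a 4-spin chain MRF with edges (0,1), (1,2), (2,3).
rng = np.random.default_rng(3)
n = 4
states = np.array(list(itertools.product([-1, 1], repeat=n)))
true_edges = [(0, 1), (1, 2), (2, 3)]
score = sum(states[:, a] * states[:, b] for a, b in true_edges)
p = np.exp(1.0 * score)
p /= p.sum()
data = states[rng.choice(len(states), size=3000, p=p)]

edges = neighborhoods(data)  # recovers the chain; chord pairs stay out
```

Note the contrast with the likelihood-based approach in the rest of the paper: each regression here needs no global inference at all, which is what makes the pseudo-likelihood route attractive.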

3 | General lower bounds based on computer generated higher order expansions
- R, Kappen
- 2002
Citation Context: ...s they return are typically much less accurate. Moreover, non-convergence of the inference is more common when the network parameters are allowed to take larger, more extreme values; see, for example, [20, 11, 13] for some theoretical results supporting this empirical phenomenon. Thus, it is important to keep the model amenable to approximate inference, and thereby continue to improve, for as long as possible...