## Efficient weight learning for Markov logic networks (2007)

Venue: Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases

Citations: 59 (7 self)

### BibTeX

```bibtex
@inproceedings{Lowd07efficientweight,
  author    = {Daniel Lowd and Pedro Domingos},
  title     = {Efficient weight learning for Markov logic networks},
  booktitle = {Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases},
  year      = {2007},
  pages     = {200--211}
}
```

### Abstract

Markov logic networks (MLNs) combine Markov networks and first-order logic, and are a powerful and increasingly popular representation for statistical relational learning. The state-of-the-art method for discriminative learning of MLN weights is the voted perceptron algorithm, which is essentially gradient descent with an MPE approximation to the expected sufficient statistics (true clause counts). Unfortunately, these can vary widely between clauses, causing the learning problem to be highly ill-conditioned, and making gradient descent very slow. In this paper, we explore several alternatives, from per-weight learning rates to second-order methods. In particular, we focus on two approaches that avoid computing the partition function: diagonal Newton and scaled conjugate gradient. In experiments on standard SRL datasets, we obtain order-of-magnitude speedups, or more accurate models given comparable learning times.

### Citations

1914 | Numerical Optimization
- Nocedal, Wright
- 1999
Citation Context: ...nce in some weights is too small for fast convergence in others. This is an instance of the well-known problem of ill-conditioning in numerical optimization, and many candidate solutions for it exist [13]. However, the most common ones are not easily applicable to MLNs because of the nature of the function being optimized. As in Markov random fields, computing the likelihood in MLNs requires computing...

926 | On the statistical analysis of dirty pictures
- Besag
- 1986
Citation Context: ...ta likelihood. In this section, we describe a number of alternative algorithms for this purpose. Richardson and Domingos [16] originally proposed learning weights generatively using pseudo-likelihood [2]. Pseudo-likelihood is the product of the conditional likelihood of each variable given the values of its neighbors in the data. While efficient for learning, it can give poor results when long chains...
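The pseudo-likelihood described in this fragment can be sketched for a toy log-linear model. This is an illustrative sketch, not Alchemy's API: the `features`/`weights` names and the two-variable agreement "clause" are assumptions.

```python
import math

def pseudo_log_likelihood(x, features, weights):
    """Pseudo-log-likelihood of a binary assignment x under a log-linear
    model P(x) proportional to exp(sum_k w_k f_k(x)): the sum over variables
    of log P(x_i | rest), which never touches the partition function."""
    def score(assignment):
        return sum(w * f(assignment) for w, f in zip(weights, features))
    pll = 0.0
    for i in range(len(x)):
        flipped = list(x)
        flipped[i] = 1 - x[i]                 # the only other value of x_i
        s_true, s_flip = score(x), score(flipped)
        # log P(x_i | x_{-i}) = s_true - log(exp(s_true) + exp(s_flip)),
        # computed in a numerically stable way
        m = max(s_true, s_flip)
        pll += s_true - (m + math.log(math.exp(s_true - m) + math.exp(s_flip - m)))
    return pll

# Hypothetical model: one weighted feature rewarding agreement of two atoms.
feats = [lambda a: 1.0 if a[0] == a[1] else 0.0]
print(pseudo_log_likelihood([1, 1], feats, [2.0]))
```

An agreeing assignment scores higher than a disagreeing one, as expected for a positive agreement weight.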

568 | Markov logic networks
- Richardson, Domingos
- 2006
Citation Context: ...gramming, and interest in it has grown rapidly in recent years [6]. One of the most powerful representations for SRL is Markov logic, which generalizes both Markov random fields and first-order logic [16]. Representing a problem as a Markov logic network (MLN) involves simply writing down a list of first-order formulas and learning weights for those formulas from data. The first step is the task of th...

518 | Training products of experts by minimizing contrastive divergence
- Hinton
Citation Context: ...bly more accurate and stable, since it converges to the true expectations in the limit. While running an MCMC algorithm to convergence at each iteration of gradient descent is infeasibly slow, Hinton [8] has shown that a few iterations of MCMC yield enough information to choose a good direction for gradient descent. Hinton named this method contrastive divergence, because it can be interpreted as opt...
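The idea in this fragment, a few MCMC steps instead of running the chain to convergence, can be sketched as a CD-1 gradient estimate for a generic log-linear model over binary variables. The `features`/`weights` interface is an assumption for illustration, not Hinton's or Alchemy's code.

```python
import math
import random

def cd1_gradient(x_data, features, weights, rng=None):
    """One contrastive-divergence (CD-1) gradient estimate for a log-linear
    model P(x) proportional to exp(sum_k w_k f_k(x)) over binary variables:
    feature counts on the data minus counts after a single Gibbs sweep
    started from the data."""
    rng = rng or random.Random(0)
    def score(a):
        return sum(w * f(a) for w, f in zip(weights, features))
    x = list(x_data)
    for i in range(len(x)):                     # a single Gibbs sweep
        x[i] = 1
        s1 = score(x)
        x[i] = 0
        s0 = score(x)
        p1 = 1.0 / (1.0 + math.exp(s0 - s1))    # P(x_i = 1 | rest)
        x[i] = 1 if rng.random() < p1 else 0
    # Data counts minus (approximate) model counts, per feature.
    return [f(x_data) - f(x) for f in features]
```

Running more Gibbs sweeps (CD-k) trades computation for a less biased estimate of the model expectation.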

489 | Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms
- Collins
- 2002
Citation Context: ...owledge engineer; the second is the focus of this paper. Currently, the best-performing algorithm for learning MLN weights is Singla and Domingos’ voted perceptron [19], based on Collins’ earlier one [3] for hidden Markov models. Voted perceptron uses gradient descent to approximately optimize the conditional likelihood of the query atoms given the evidence. Weight learning in Markov logic is a conve...
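The voted-perceptron update this fragment refers to can be sketched schematically: step each weight by the difference between the true clause counts and the counts in the MPE state, and average the weights over iterations. This is a simplified reading of the update, not Singla and Domingos' implementation; `mpe_counts_fn` stands in for MPE inference (e.g., MaxWalkSAT).

```python
def voted_perceptron(true_counts, mpe_counts_fn, n_iters=100, lr=0.01):
    """Schematic voted-perceptron weight learning: gradient steps of
    (true clause count) - (clause count in the MPE state under the current
    weights), returning the weights averaged over all iterations, which
    reduces overfitting to any single iteration's weights."""
    w = [0.0] * len(true_counts)
    sum_w = [0.0] * len(true_counts)
    for _ in range(n_iters):
        mpe = mpe_counts_fn(w)                  # stand-in for MPE inference
        w = [wi + lr * (t - m) for wi, t, m in zip(w, true_counts, mpe)]
        sum_w = [s + wi for s, wi in zip(sum_w, w)]
    return [s / n_iters for s in sum_w]
```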

443 | Shallow parsing with conditional random fields
- Sha, Pereira
- 2003
Citation Context: ...size the same way as in diagonal Newton. Conjugate gradient is usually more effective with a preconditioner, a linear transformation that attempts to reduce the condition number of the problem (e.g., [17]). Good preconditioners approximate the inverse Hessian. We use the inverse diagonal Hessian as our preconditioner. We refer to SCG with the preconditioner as PSCG. 4 Experiments 4.1 Datasets Our expe...
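The inverse-diagonal preconditioning mentioned here is easiest to see on a quadratic, where conjugate gradient is exact. A minimal sketch, assuming a small symmetric positive-definite system rather than the paper's PSCG optimizer:

```python
def jacobi_pcg(A, b, tol=1e-10, max_iter=100):
    """Conjugate gradient for A x = b with a Jacobi (inverse-diagonal)
    preconditioner -- the same idea as PSCG's inverse diagonal Hessian.
    A is a list of rows of a symmetric positive-definite matrix."""
    n = len(b)
    x = [0.0] * n
    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    r = list(b)                                  # residual b - A x (x = 0)
    z = [r[i] / A[i][i] for i in range(n)]       # apply the preconditioner
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) < tol:
            break
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        beta = rz_new / rz
        rz = rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x
```

The Jacobi preconditioner costs only a division per coordinate, which is why the diagonal of the Hessian is such a cheap and popular choice.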

300 | A scaled conjugate gradient algorithm for fast supervised learning
- Møller
- 1993
Citation Context: ...ches, which involve computing the function as well as its gradient. These include most conjugate gradient and quasi-Newton methods (e.g., L-BFGS). Two exceptions to this are scaled conjugate gradient [12] and Newton’s method with a diagonalized Hessian [1]. In this paper we show how they can be applied to MLN learning, and verify empirically that they greatly speed up convergence. We also obtain good ...

213 | Introduction to Statistical Relational Learning
- Getoor, Taskar
- 2007
Citation Context: ...data points are not i.i.d. (independent and identically distributed). It combines ideas from statistical learning and inductive logic programming, and interest in it has grown rapidly in recent years [6]. One of the most powerful representations for SRL is Markov logic, which generalizes both Markov random fields and first-order logic [16]. Representing a problem as a Markov logic network (MLN) invol...

102 | The Alchemy system for statistical relational AI
- Kok, Sumner, et al.
- 2008
Citation Context: ...each split, we chose the learning rates that worked best on the corresponding validation set for each evaluation metric. We used the implementation of voted perceptron for MLNs in the Alchemy package [11], and implemented the other algorithms as extensions of Alchemy. For DN, SCG, and PSCG, we started with λ = 1 and let the algorithm adjust it automatically. For algorithms based on MC-SAT, we used 5 s...

97 | Improving the convergence of back-propagation learning with second-order methods
- Becker, LeCun
- 1988
Citation Context: ... its gradient. These include most conjugate gradient and quasi-Newton methods (e.g., L-BFGS). Two exceptions to this are scaled conjugate gradient [12] and Newton’s method with a diagonalized Hessian [1]. In this paper we show how they can be applied to MLN learning, and verify empirically that they greatly speed up convergence. We also obtain good results with a simpler method: per-weight learning r...

95 | Sound and efficient inference with probabilistic and deterministic dependencies
- Poon, Domingos
Citation Context: ... MPE approximation is no longer sufficient. We address both of these problems by instead computing expected counts using MC-SAT, a very fast Markov chain Monte Carlo (MCMC) algorithm for Markov logic [15]. The remainder of this paper is organized as follows. In Section 2 we briefly review Markov logic. In Section 3 we present several algorithms for MLN weight learning. We compare these algorithms empi...

95 | Accelerated training of conditional random fields with stochastic gradient methods
- Vishwanathan, Schraudolph, et al.
- 2006
Citation Context: ...y were in the tuning scenario. This extreme sensitivity to learning rate makes learning good models with VP and CD much more difficult. We also experimented with the stochastic meta-descent algorithm [21], which automatically adjusts learning rates in each dimension, but found it to be too unstable for these domains. In sum, the MLN weight learning methods we have introduced in this paper greatly outp...

87 | Learning the structure of Markov logic networks
- Kok, Domingos
Citation Context: ...probabilities of all worlds to sum to one. The formulas in an MLN are typically specified by an expert, or they can be obtained (or refined) by inductive logic programming or MLN structure learning [10]. Many complex models, and in particular many non-i.i.d. ones, can be very compactly specified using MLNs. Exact inference in MLNs is intractable. Instead, we can perform approximate inference using M...

77 | Discriminative training of Markov logic networks
- Singla, Domingos
- 2005
Citation Context: ...The first step is the task of the knowledge engineer; the second is the focus of this paper. Currently, the best-performing algorithm for learning MLN weights is Singla and Domingos’ voted perceptron [19], based on Collins’ earlier one [3] for hidden Markov models. Voted perceptron uses gradient descent to approximately optimize the conditional likelihood of the query atoms given the evidence. Weight ...

76 | Entity resolution with Markov logic
- Singla, Domingos
- 2006
Citation Context: ...rent computer science papers, drawn from the Cora Computer Science Research Paper Engine. This dataset was originally labeled by Andrew McCallum. We used a cleaned version from Singla and Domingos [20], with five splits for cross-validation. The task on Cora is to predict which citations refer to the same paper, given the words in their author, title, and venue fields. The labeled data also specifi...

71 | Fast exact multiplication by the Hessian
- Pearlmutter
- 1994
Citation Context: ...nside a quadratic form, as above, the value of this form can be computed simply as: d^T H d = (E_w[Σ_i d_i n_i])^2 − E_w[(Σ_i d_i n_i)^2]. The product of the Hessian by a vector can also be computed compactly [14]. Note that α is computed using the full Hessian matrix, but the step direction is computed from the diagonalized approximation which is easier to invert. Our per-weight learning rates can actually be...
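The identity in this fragment, evaluating the quadratic form d^T H d from expectations of the dot product d·n of the step with the clause-count vector, can be sketched from a batch of samples. The sample-based estimator here is an illustration, not the paper's implementation:

```python
def quadratic_form(d, samples):
    """Estimate d^T H d where H = E[n]E[n]^T - E[n n^T] is minus the
    covariance of the clause counts n (the Hessian of the conditional
    log-likelihood). Only the scalars d . n are needed:
    d^T H d = (E[d . n])^2 - E[(d . n)^2]."""
    dots = [sum(di * ni for di, ni in zip(d, n)) for n in samples]
    mean = sum(dots) / len(dots)
    mean_sq = sum(t * t for t in dots) / len(dots)
    return mean * mean - mean_sq   # minus a variance, so always <= 0
```

Because the result is minus a variance, it is never positive, reflecting the concavity of the conditional log-likelihood.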

68 | Relational learning with statistical predicate invention: Better models for hypertext
- Craven, Slattery
Citation Context: ... exceeded 3 million. The WebKB dataset consists of labeled web pages from the computer science departments of four universities. We used the relational version of the dataset from Craven and Slattery [4], which features 4165 web pages and 10,935 web links, along with the words on the webpages, anchors of the links, and neighborhoods around each link. Each web page is marked with some subset of the ca...

46 | A general stochastic approach to solving problems with hard and soft constraints (in The Satisfiability Problem: Theory and Applications)
- Kautz, Selman, et al.
- 1997
Citation Context: ...ime using the Viterbi algorithm. In MLNs, MPE inference is intractable but can be reduced to solving a weighted maximum satisfiability problem, for which efficient algorithms exist such as MaxWalkSAT [9]. Singla and Domingos [19] use this approach and discuss how the resulting algorithm can be viewed as approximately optimizing log-likelihood. However, the use of voted perceptron in MLNs is potential...
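The MaxWalkSAT procedure named here can be sketched in its basic form: with some probability flip a random atom of a random unsatisfied clause, otherwise flip the atom that most reduces the unsatisfied weight. This is a simplified sketch (no restarts, a hypothetical clause encoding), not Kautz et al.'s implementation:

```python
import random

def maxwalksat(clauses, n_vars, p=0.5, max_flips=1000, seed=0):
    """Weighted MAX-SAT local search. Each clause is (weight, literals),
    a literal being +v or -v for 1-indexed variable v. Returns the best
    assignment found and its unsatisfied weight."""
    rng = random.Random(seed)
    def sat(clause, a):
        return any(a[abs(l)] == (l > 0) for l in clause[1])
    def cost(a):                     # total weight of unsatisfied clauses
        return sum(w for w, lits in clauses if not sat((w, lits), a))
    a = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
    best, best_cost = dict(a), cost(a)
    for _ in range(max_flips):
        unsat = [c for c in clauses if not sat(c, a)]
        if not unsat:
            return a, 0.0            # everything satisfied: optimal
        clause = rng.choice(unsat)
        atoms = [abs(l) for l in clause[1]]
        if rng.random() < p:         # random walk step
            v = rng.choice(atoms)
        else:                        # greedy step
            def cost_if_flipped(v):
                a[v] = not a[v]
                c = cost(a)
                a[v] = not a[v]
                return c
            v = min(atoms, key=cost_if_flipped)
        a[v] = not a[v]
        if cost(a) < best_cost:
            best, best_cost = dict(a), cost(a)
    return best, best_cost
```

The random-walk step is what lets the search escape local minima that pure greedy flipping would get stuck in.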

41 | An introduction to the conjugate gradient method without the agonizing pain
- Shewchuk
- 1994
- 1994
Citation Context: ...tion to this is to impose at each step the condition that the gradient along previous directions remain zero. The directions chosen in this way are called conjugate, and the method conjugate gradient [18]. We used the Polak-Ribiere method for choosing conjugate gradients since it has generally been found to be the best-performing one. Conjugate gradient methods are some of the most efficient available,...
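The Polak-Ribiere choice mentioned in this fragment amounts to one formula for the mixing coefficient β. A minimal sketch for an ascent direction (the clipping at zero, a common automatic restart, is an assumption, as is the function interface):

```python
def polak_ribiere_direction(grad_new, grad_old, dir_old):
    """New conjugate search direction using the Polak-Ribiere formula:
    beta = g_t . (g_t - g_{t-1}) / (g_{t-1} . g_{t-1}),
    clipped at 0 so a negative beta restarts with pure gradient ascent."""
    num = sum(gn * (gn - go) for gn, go in zip(grad_new, grad_old))
    den = sum(go * go for go in grad_old)
    beta = max(0.0, num / den)
    # Ascent direction: current gradient plus beta times the old direction.
    return [gn + beta * d for gn, d in zip(grad_new, dir_old)]
```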

23 | Practical Methods of Optimization (Wiley-Interscience)
- Fletcher
- 2001
Citation Context: ...Let ∆_pred be the predicted change in the function value from the previous gradient and Hessian and our last step, d_{t−1}: ∆_pred = d_{t−1}^T (g_{t−1} + H_{t−1} d_{t−1}) / 2. A standard method for adjusting λ is as follows [5]: if ∆_actual/∆_pred > 0.75 then λ_{t+1} = λ_t/2; if ∆_actual/∆_pred < 0.25 then λ_{t+1} = 4λ_t. Since we cannot efficiently compute the actual change in log-likelihood, we approximate it as the product of the ...
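The λ adjustment rule quoted in this fragment translates directly into a few lines. A minimal sketch of that Levenberg-Marquardt-style update (the function name and interface are illustrative):

```python
def update_lambda(lam, delta_actual, delta_pred):
    """Fletcher's damping update: shrink lambda (trust the quadratic model
    more) when it predicted the actual gain well, grow it (take smaller,
    more conservative steps) when it did not."""
    ratio = delta_actual / delta_pred
    if ratio > 0.75:
        return lam / 2.0
    if ratio < 0.25:
        return 4.0 * lam
    return lam
```

The asymmetric factors (halve on success, quadruple on failure) back off quickly when the quadratic model is unreliable.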