## Structure learning of Markov logic networks through iterated local search (2008)

### Download Links

- [www.di.uniba.it]
- [www.comp.leeds.ac.uk]
- DBLP

### Other Repositories/Bibliography

Venue: Proc. ECAI’08

Citations: 17 (2 self)

### BibTeX

@INPROCEEDINGS{Biba08structurelearning,
  author = {Marenglen Biba and Stefano Ferilli and Floriana Esposito},
  title = {Structure learning of Markov logic networks through iterated local search},
  booktitle = {Proc. ECAI’08},
  year = {2008}
}


### Abstract

Many real-world applications of AI require both probability and first-order logic to deal with uncertainty and structural complexity. Logical AI has focused mainly on handling complexity, and statistical AI on handling uncertainty. Markov Logic Networks (MLNs) are a powerful representation that combines Markov Networks (MNs) and first-order logic by attaching weights to first-order formulas and viewing these as templates for features of MNs. State-of-the-art structure learning algorithms for MLNs maximize the likelihood of a relational database by performing a greedy search in the space of candidates. This can lead to suboptimal results because these approaches are unable to escape local optima. Moreover, due to the combinatorially explosive space of potential candidates, these methods are computationally prohibitive. We propose a novel algorithm for learning MLN structure, based on the Iterated Local Search (ILS) metaheuristic, that explores the space of structures through a biased sampling of the set of local optima. The algorithm focuses the search not on the full space of solutions but on a smaller subspace defined by the solutions that are locally optimal for the optimization engine. We show through experiments in two real-world domains that the proposed approach improves accuracy and learning time over the existing state-of-the-art algorithms.
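The control flow the abstract describes — alternate a local-search descent with a perturbation ("kick"), so that the walk effectively moves only among local optima — can be sketched generically. The function names and the toy objective below are illustrative assumptions, not the paper's actual code:

```python
import random

def iterated_local_search(initial, local_search, perturb, score, accept, max_steps=100):
    """Generic ILS skeleton: walk in the space of local optima by
    alternating a perturbation step with a full local-search descent."""
    current = local_search(initial)                  # reach a first local optimum
    best = current
    for _ in range(max_steps):
        candidate = local_search(perturb(current))   # kick, then re-optimize
        if accept(current, candidate):               # acceptance criterion
            current = candidate
        if score(candidate) > score(best):           # track the incumbent
            best = candidate
    return best

# Toy maximization of a unimodal integer objective (illustrative only).
score = lambda x: -(x - 3) ** 2

def hill_climb(x):
    while score(x + 1) > score(x):
        x += 1
    while score(x - 1) > score(x):
        x -= 1
    return x

best = iterated_local_search(
    0, hill_climb,
    perturb=lambda x: x + random.randint(-5, 5),
    score=score,
    accept=lambda cur, cand: score(cand) >= score(cur),
)
```

In the paper's setting, `local_search` would correspond to greedy clause revision guided by the evaluation measure (WPLL) and `perturb` to a random change of the current candidate structure; the sketch only shows the control flow, not those components.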

### Citations

863 | Learning logical definitions from relations
- Quinlan
- 1990
Citation Context: ...um accuracy and coverage criterion. In the learning from entailment setting, the system searches for clauses that entail all positive examples of some relation and no negative ones. For example, FOIL [21] learns each definite clause by starting with the target relation as the head and greedily adding literals to the body. MN weights have traditionally been learned using iterative scaling [5]. However,...

591 | Markov logic networks
- Richardson, Domingos
Citation Context: ...tional dependency networks [18], and others. All these approaches combine probabilistic graphical models with subsets of first-order logic (e.g., Horn clauses). In this paper we focus on Markov logic [22], a powerful representation that has finite first-order logic and probabilistic graphical models...

563 | Inducing features of random fields
- Della Pietra, et al.
- 1997
Citation Context: ...6 and conclude in Section 7. 2 Markov Networks and Markov Logic Networks A MN (also known as a Markov random field) is a model for the joint distribution of a set of variables X = (X1, X2, . . . , Xn) ∈ χ [5]. It is composed of an undirected graph G and a set of potential functions. The graph has a node for each variable, and the model has a potential function φk for each clique in the graph. A potential ...
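The definition quoted in this context — a joint distribution given by a normalized product of clique potentials — can be made concrete with a tiny two-variable example. The potential table is hypothetical, chosen only to illustrate the mechanics:

```python
from itertools import product

# Tiny Markov network over two binary variables with a single clique {X1, X2}.
# This potential table is made up for illustration; it favors agreement.
phi = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

# P(x) = (1/Z) * prod_k phi_k(x_{k}); here there is one clique, hence one factor.
Z = sum(phi[s] for s in product((0, 1), repeat=2))      # partition function
P = {s: phi[s] / Z for s in product((0, 1), repeat=2)}  # normalized joint
```

With these numbers Z = 8, so the agreeing states each get probability 3/8 and the disagreeing ones 1/8; computing Z exactly is what becomes intractable as the number of cliques grows, which motivates the pseudo-likelihood approximations discussed below.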

527 | Learning probabilistic relational models - Friedman, Getoor, et al. - 1999

463 | Shallow parsing with conditional random fields
- Sha, Pereira
- 2003
Citation Context: ... change as the result of introducing or modifying a clause, in practice this is very rare. Second-order, quadratic-convergence methods like L-BFGS are known to be very fast if started near the optimum [23]. This is what happened in [11]: L-BFGS typically converges in just a few iterations, sometimes one. We use the same approach for setting the parameters that optimize the WPLL. 4 Iterated Local Search...

425 | Probabilistic logic
- Nilsson
- 1986
Citation Context: ...hasize handling uncertainty. However, intelligent agents must be able to handle both for real-world applications. The first attempts to integrate logic and probability in AI date back to the works in [1, 8, 19]. Later, several authors began using logic programs to compactly specify Bayesian networks, an approach known as knowledge-based model construction [26]. Recently, in the burgeoning field of statistic...

276 | Statistical analysis of non-lattice data
- Besag
- 1975
Citation Context: ...lauses) and weight learning (setting the weight of each clause). In [22] structure learning was performed through ILP methods [13] followed by a weight learning phase in which maximum pseudo-likelihood [2] weights were learned for each learned clause. State-of-the-art algorithms for structure learning are those in [11, 16] where learning of MLNs is performed in a single step using weighted pseudo-likel...
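Pseudo-likelihood, referenced in this context, replaces the joint likelihood (which requires the partition function) with a product of per-variable conditionals given the rest. A minimal sketch for a hypothetical two-variable binary model with a single weight w on the feature "x1 = x2" — an assumption for illustration, not the paper's implementation:

```python
import math

# Unnormalized measure of a toy pairwise model: exp(w * [x1 == x2]).
def unnorm(x, w):
    x1, x2 = x
    return math.exp(w * (x1 == x2))

def pseudo_log_likelihood(x, w):
    """Sum over variables i of log P(x_i | x_-i). Each conditional needs only
    a two-way normalization (flip x_i), never the full partition function."""
    pll = 0.0
    for i in range(len(x)):
        flipped = list(x)
        flipped[i] = 1 - flipped[i]
        num = unnorm(x, w)
        pll += math.log(num / (num + unnorm(tuple(flipped), w)))
    return pll
```

At w = 0 every conditional is uniform, and increasing w raises the pseudo-log-likelihood of agreeing states; the weighted variant (WPLL) used for MLN scoring additionally weights each predicate's contribution.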

274 | An analysis of first-order logics of probability
- Halpern
- 1990
Citation Context: ...hasize handling uncertainty. However, intelligent agents must be able to handle both for real-world applications. The first attempts to integrate logic and probability in AI date back to the works in [1, 8, 19]. Later, several authors began using logic programs to compactly specify Bayesian networks, an approach known as knowledge-based model construction [26]. Recently, in the burgeoning field of statistic...

252 | Representing and Reasoning with Probabilistic Knowledge
- Bacchus
- 1990
Citation Context: ...hasize handling uncertainty. However, intelligent agents must be able to handle both for real-world applications. The first attempts to integrate logic and probability in AI date back to the works in [1, 8, 19]. Later, several authors began using logic programs to compactly specify Bayesian networks, an approach known as knowledge-based model construction [26]. Recently, in the burgeoning field of statistic...

247 | Stochastic Local Search: Foundations and Applications
- Hoos, Stützle
- 2005
Citation Context: ...lgorithms make use of randomized choice in generating or selecting candidate solutions for a given combinatorial problem instance. These algorithms are called stochastic local search (SLS) algorithms [13] and represent one of the most successful and widely used approaches for solving hard combinatorial problems. Many “simple” SLS methods come from other search methods by just randomizing the selection ...

228 | Logical Foundations of Artificial Intelligence
- Genesereth, Nilsson
- 1988
Citation Context: ...ply by replacing its variables with constants in all possible ways. 3 Structure and Parameter Learning of MLNs A first-order knowledge base (KB) is a set of sentences or formulas in first-order logic [6]. Formulas are constructed using four types of symbols: constants, variables, functions, and predicates. Constant symbols represent objects in the domain of interest. Variable symbols range over the o...

222 | Introduction to Statistical Relational Learning
- Getoor, Taskar
- 2007
Citation Context: ...omplex relational structure. Statistical learning focuses on the former, and relational learning on the latter. Probabilistic Inductive Logic Programming (PILP) [7] or Statistical Relational Learning [10] aim at combining the power of both. PILP and SRL can be viewed as combining ILP principles (such as refinement operators) with statistical learning. One of the representation formalisms in this area ...

214 | The relationship between precision-recall and ROC curves
- Davis, Goadrich
- 2006
Citation Context: ...tputs for every grounding of the query predicate on the test fold. We used these values to compute the average CLL over all the groundings and the relative AUC (for AUC we used the method proposed in [3]). For ILS we report the best performance in terms of CLL among ten parallel independent walks. Both CLL and AUC results (Table 1) are averaged over all predicates of the domain. Learning times are re...

197 | On bias, variance, 0/1 loss, and the curse-of-dimensionality
- Friedman
- 1997
Citation Context: ... consistent with the fact that, for the same representation, discriminative training has lower bias and higher variance than generative training, and the variance term dominates at small sample sizes [8, 9]. For the dataset sizes typically found in practice, however, the results in [12, 22, 11] all support the choice of discriminative training. An experimental comparison of discriminative and generative...

191 | Efficiently inducing features of conditional random fields. UAI
- McCallum
- 2003
Citation Context: ...of atomic features (the original variables), conjoining each current feature with each atomic feature, adding to the network the conjunction that most increases likelihood, and repeating. The work in [15] extends this to the case of conditional random fields, which are MNs trained to maximize the conditional likelihood of a set of outputs given a set of inputs. The first attempt to learn MLNs was that...

126 | Inductive Logic Programming: Techniques and Applications
- Lavrac, Dzeroski
- 1994
Citation Context: ...ns. Learning an MLN consists in structure learning (learning the logical clauses) and weight learning (setting the weight of each clause). In [22] structure learning was performed through ILP methods [13] followed by a weight learning phase in which maximum pseudo-likelihood [2] weights were learned for each learned clause. State-of-the-art algorithms for structure learning are those in [11, 16] where l...

125 | Iterated local search
- Lourenco, Martin, et al.
- 2002
Citation Context: ...fundamental issue of escaping local optima is to use two types of SLS steps: one for reaching local optima as efficiently as possible, and the other for effectively escaping local optima. ILS methods [9, 14] exploit this key idea, and essentially use the two types of search steps alternately to perform a walk in the space of local optima w.r.t. the given evaluation function. The algorithm works as follows:...

106 | The Alchemy system for statistical relational AI
- Kok, Singla, et al.
- 2005
Citation Context: ...e number of possible equivalences is very large, we used the canopies found in [24] to make this problem tractable. 6.2 Systems and Methodology We implemented Algorithm 1 (ILS) in the Alchemy package [12]. We used the implementation of L-BFGS in Alchemy to learn maximum WPLL weights. We compared our algorithm performance with the state-of-the-art algorithms for generative structure learning of MLNs: B...

99 | Sound and efficient inference with probabilistic and deterministic dependencies
- Poon, Domingos
- 2006
Citation Context: ...e of the algorithm to a separate CPU on a cluster of Intel Core2 Duo 2.13 GHz CPUs. 6.3 Results After learning the structure, we performed inference on the test fold for both datasets by using MC-SAT [20] with number of steps = 10000 and simulated annealing temperature = 0.5. For each experiment, all the groundings of the query predicates on the test fold were commented out. MC-SAT produces probability ou...

92 | Learning the structure of Markov logic networks
- Kok, Domingos
- 2005
Citation Context: ...ILP methods [13] followed by a weight learning phase in which maximum pseudo-likelihood [2] weights were learned for each learned clause. State-of-the-art algorithms for structure learning are those in [11, 16] where learning of MLNs is performed in a single step using weighted pseudo-likelihood as the evaluation measure during structure search. However, these algorithms follow systematic search strategies ...

80 | Discriminative training of Markov logic networks - Singla, Domingos - 2005

78 | Entity resolution with Markov logic
- Singla, Domingos
Citation Context: ...Both represent standard relational datasets and are used for two important relational tasks: Cora for entity resolution and UW-CSE for social network analysis. For Cora we used a cleaned version from [24], with five splits for cross-validation. The published UW-CSE dataset consists of 15 predicates divided into 10 types. Types include: publication, person, course, etc. Predicates include: Student(perso...

77 | Towards Combining Inductive Logic Programming with Bayesian Networks
- Kersting, De Raedt
Citation Context: ...urgeoning field of statistical relational learning [7], several approaches for combining logic and probability have been proposed such as probabilistic relational models [17], Bayesian logic programs [10], relational dependency networks [18], and others. All these approaches combine probabilistic graphical models with subsets of first-order logic (e.g., Horn clauses). In this paper we focus on Markov ...

66 | Dependency Networks for Relational Data
- Neville, Jensen
Citation Context: ...onal learning [7], several approaches for combining logic and probability have been proposed such as probabilistic relational models [17], Bayesian logic programs [10], relational dependency networks [18], and others. All these approaches combine probabilistic graphical models with subsets of first-order logic (e.g., Horn clauses). In this paper we focus on Markov logic [22], a powerful representation...

62 | Structural extension to logistic regression: discriminative parameter learning of belief net classifiers
- Greiner, Su, et al.
- 2005
Citation Context: ...g has lower bias and higher variance than generative training, and the variance term dominates at small sample sizes [8, 9]. For the dataset sizes typically found in practice, however, the results in [12, 22, 11] all support the choice of discriminative training. An experimental comparison of discriminative and generative parameter training on both discriminatively and generatively structured Bayesian Network...

62 | Efficient weight learning for Markov logic networks
- Lowd, Domingos
Citation Context: ...orithm for discriminative weight learning of MLNs was shown to greatly outperform maximum-likelihood and pseudo-likelihood approaches for two real-world prediction problems. Recently, the algorithm in [21], outperforming the voted perceptron, became the state-of-the-art method for discriminative weight learning of MLNs. However, both discriminative approaches to MLNs learn weights for a fixed structure, g...

47 | Bottom-up learning of Markov logic network structure
- Mihalkova, Mooney
- 2007
Citation Context: ...ILP methods [13] followed by a weight learning phase in which maximum pseudo-likelihood [2] weights were learned for each learned clause. State-of-the-art algorithms for structure learning are those in [11, 16] where learning of MLNs is performed in a single step using weighted pseudo-likelihood as the evaluation measure during structure search. However, these algorithms follow systematic search strategies ...

39 | Statistical relational learning for document mining
- Popescul, Ungar, et al.
- 2003
Citation Context: ...s. Our discriminative method falls among those approaches that tightly integrate ILP and statistical learning in a single step for structure learning. The earlier works in this direction are those in [4, 26] that employ statistical models such as maximum entropy modeling in [4] and logistic regression in [26]. These approaches can be computationally very expensive. A simpler approach that integrates FOIL...

37 | Maximum entropy modeling with clausal constraints
- Dehaspe
- 1997
Citation Context: ...s. Our discriminative method falls among those approaches that tightly integrate ILP and statistical learning in a single step for structure learning. The earlier works in this direction are those in [4, 26] that employ statistical models such as maximum entropy modeling in [4] and logistic regression in [26]. These approaches can be computationally very expensive. A simpler approach that integrates FOIL...

33 | kFOIL: Learning simple relational kernels
- Landwehr, Passerini, et al.
- 2006
Citation Context: ...hm incrementally builds a Bayes net during rule learning and each candidate rule is introduced in the network and scored by whether it improves the performance of the classifier. In a recent approach [19], the kFOIL system integrates ILP and support vector learning. kFOIL constructs the feature space by leveraging FOIL search for a set of relevant clauses. The search is driven by the performance obtai...

32 | Memory-efficient inference in relational domains
- Singla, Domingos
- 2006
Citation Context: ...avoid this problem, we used a lazy version of MC-SAT, Lazy-MC-SAT [28], which reduces memory and time by orders of magnitude compared to MC-SAT. Before Lazy-MC-SAT was introduced, the LazySat algorithm [33] was shown to greatly reduce memory requirements by exploiting the sparseness of relational domains (i.e., only a small fraction of ground atoms are true, and most clauses are trivially satisfied). Th...

31 | An integrated approach to learning Bayesian Networks of rules
- Davis, Burnside, et al.
- 2005
Citation Context: ...xpensive. A simpler approach that integrates FOIL [29] and Naïve Bayes is nFOIL, proposed in [17]. This approach interleaves the steps of generating rules and scoring them through CLL. In another work [2] these steps are coupled by scoring the clauses through the improvement in classification accuracy. This algorithm incrementally builds a Bayes net during rule learning and each candidate rule is intr...

31 | A general method for reducing the complexity of relational inference and its application to MCMC
- Poon, Domingos, et al.
- 2008
Citation Context: ...g parameters by maximum likelihood can produce better results in terms of predictive accuracy. Structures are scored through a very fast inference algorithm, MC-SAT [27], whose lazy version Lazy-MC-SAT [28] greatly reduces memory requirements, while parameters are learned through a quasi-Newton optimization method like L-BFGS that has been found to be much faster [34] than iterative scaling initially us...

24 | nFOIL: Integrating Naïve Bayes and FOIL
- Landwehr, Kersting, De Raedt
- 2005
Citation Context: ...imum entropy modeling in [4] and logistic regression in [26]. These approaches can be computationally very expensive. A simpler approach that integrates FOIL [29] and Naïve Bayes is nFOIL, proposed in [17]. This approach interleaves the steps of generating rules and scoring them through CLL. In another work [2] these steps are coupled by scoring the clauses through the improvement in classification acc...

19 | Markov logic in infinite domains
- Singla, Domingos
- 2007
Citation Context: ...ds first-order logic by attaching weights to formulas, providing the full expressiveness of graphical models and first-order logic in finite domains and remaining well defined in many infinite domains [22, 25]. Weighted formulas are viewed as templates for constructing MNs, and in the infinite-weight limit Markov logic reduces to standard first-order logic. Markov logic avoids the assumption of i...

19 | Discriminative versus generative parameter and structure learning of Bayesian network classifiers - Pernkopf, Bilmes - 2005

17 | On the optimality of the simple Bayesian classifier under zero-one loss
- Domingos, Pazzani
- 1997
Citation Context: ... consistent with the fact that, for the same representation, discriminative training has lower bias and higher variance than generative training, and the variance term dominates at small sample sizes [8, 9]. For the dataset sizes typically found in practice, however, the results in [12, 22, 11] all support the choice of discriminative training. An experimental comparison of discriminative and generative...

16 | From Knowledge Bases to Decision Models (The Knowledge Engineering Review)
- Wellman, Breese, et al.
- 1992
Citation Context: ...obability in AI date back to the works in [1, 8, 19]. Later, several authors began using logic programs to compactly specify Bayesian networks, an approach known as knowledge-based model construction [26]. Recently, in the burgeoning field of statistical relational learning [7], several approaches for combining logic and probability have been proposed such as probabilistic relational models [17], baye...

11 | Integrating Naive Bayes and FOIL
- Landwehr, Kersting, et al.
- 2007
Citation Context: ... of relevant clauses. The search is driven by the performance obtained by a support vector machine based on the resulting kernel. The authors showed that kFOIL improves over nFOIL. Recently, in TFOIL [18], Tree Augmented Naïve Bayes, a generalization of Naïve Bayes, was integrated with FOIL and it was shown that TFOIL outperforms nFOIL. The most closely related approach to the DSL algorithm is nFOIL (a...

5 | Probabilistic inductive logic programming: Theory and applications
- De Raedt, Frasconi, et al.
- 2008
Citation Context: ...characterized by both uncertainty and complex relational structure. Statistical learning focuses on the former, and relational learning on the latter. Probabilistic Inductive Logic Programming (PILP) [7] or Statistical Relational Learning [10] aim at combining the power of both. PILP and SRL can be viewed as combining ILP principles (such as refinement operators) with statistical learning. One of the...

5 | The Alchemy system for statistical relational AI (Tech. Rep.)
- Kok, Singla, et al.
- 2005
Citation Context: ...able. The dataset contains a total of 70367 tuples (true and false ground atoms, with the remainder assumed false). 6.2 Systems and Methodology We implemented the DSL algorithm in the Alchemy package [15]. We used the implementation of L-BFGS and Lazy-MC-SAT in Alchemy to learn maximum WPLL weights and compute CLL during clause search. Regarding parameter learning, we compared our algorithm performanc...

3 | Learning logical definitions from relations
- Quinlan
- 1990
Citation Context: ...that employ statistical models such as maximum entropy modeling in [4] and logistic regression in [26]. These approaches can be computationally very expensive. A simpler approach that integrates FOIL [29] and Naïve Bayes is nFOIL, proposed in [17]. This approach interleaves the steps of generating rules and scoring them through CLL. In another work [2] these steps are coupled by scoring the clauses thr...

2 | Learning Bayesian network classifiers by maximizing conditional likelihood
- Grossman, Domingos
- 2004
Citation Context: ...g has lower bias and higher variance than generative training, and the variance term dominates at small sample sizes [8, 9]. For the dataset sizes typically found in practice, however, the results in [12, 22, 11] all support the choice of discriminative training. An experimental comparison of discriminative and generative parameter training on both discriminatively and generatively structured Bayesian Network...

2 | On discriminative vs. generative: A comparison of logistic regression and naive Bayes
- Ng, Jordan
Citation Context: ... parameters taken from [24] in terms of number of variables and literals per clause, while for DSL we did not optimize any parameter. Regarding question (Q5), the goal was whether previous results of [22] carry over to MLNs, that on small datasets generative approaches can perform better than discriminative ones. The UW-CSE dataset with a total of 2673 tuples can be considered of much smaller size compa...