## Parameter learning of logic programs for symbolic-statistical modeling (2001)

### Download Links

- [www.cs.cmu.edu]
- [www.cs.umd.edu]
- [www.jair.org]
- [sato-www.cs.titech.ac.jp]
- [www.cs.washington.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Artificial Intelligence Research

Citations: 91 (19 self)

### BibTeX

@ARTICLE{Sato01parameterlearning,
  author  = {Taisuke Sato and Yoshitaka Kameya},
  title   = {Parameter learning of logic programs for symbolic-statistical modeling},
  journal = {Journal of Artificial Intelligence Research},
  volume  = {15},
  year    = {2001},
  pages   = {391--454}
}

### Abstract

We propose a logical/mathematical framework for statistical parameter learning of parameterized logic programs, i.e. definite clause programs containing probabilistic facts with a parameterized distribution. It extends the traditional least Herbrand model semantics in logic programming to distribution semantics, a possible world semantics with a probability distribution which is unconditionally applicable to arbitrary logic programs, including ones for HMMs, PCFGs and Bayesian networks. We also propose a new EM algorithm, the graphical EM algorithm, that runs for a class of parameterized logic programs representing sequential decision processes where each decision is exclusive and independent. It runs on a new data structure called support graphs describing the logical relationship between observations and their explanations, and learns parameters by computing inside and outside probability generalized for logic programs. The complexity analysis shows that when combined with OLDT search for all explanations for observations, the graphical EM algorithm, despite its generality, has the same time complexity as existing EM algorithms, i.e. the Baum-Welch algorithm for HMMs, the Inside-Outside algorithm for PCFGs, and the one for singly connected Bayesian networks, which have been developed independently in each research field. Learning experiments with PCFGs using two corpora of moderate size indicate that the graphical EM algorithm can significantly outperform the Inside-Outside algorithm.
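To make the abstract's EM terminology concrete, here is a minimal, self-contained sketch of the generic E-step/M-step cycle on a toy two-coin mixture model (each trial secretly picks one of two biased coins and reports the head count of 10 flips). This illustrates only the EM principle that the graphical EM algorithm specializes; the coin model, function names and parameter values are illustrative, not from the paper.

```python
import random

def em_two_coins(counts, n_flips, iters=200):
    """Estimate two hidden coin biases from per-trial head counts via EM."""
    theta = [0.4, 0.6]  # initial guesses for the two coin biases
    for _ in range(iters):
        # E-step: posterior responsibility of each coin for each trial
        exp_heads = [0.0, 0.0]
        exp_flips = [0.0, 0.0]
        for h in counts:
            lik = [theta[k] ** h * (1 - theta[k]) ** (n_flips - h)
                   for k in range(2)]
            z = sum(lik)
            for k in range(2):
                r = lik[k] / z
                exp_heads[k] += r * h
                exp_flips[k] += r * n_flips
        # M-step: re-estimate each bias from expected counts
        theta = [exp_heads[k] / exp_flips[k] for k in range(2)]
    return sorted(theta)

random.seed(0)
data = [sum(random.random() < p for _ in range(10))
        for p in [0.2] * 30 + [0.8] * 30]
print(em_two_coins(data, 10))  # biases near 0.2 and 0.8
```

The "exclusive and independent" condition in the abstract corresponds here to each trial being explained by exactly one hidden coin choice, which is what makes the expected counts well-defined.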

### Citations

8530 citations
Introduction to algorithms
- Cormen, Leiserson, et al.
- 2001

Citation Context: ... able to use balanced trees, giving O(log n) access time where n is the number of data in the solution table, or we may be able to use hashing, giving average O(1) access time under a certain condition (Cormen, Leiserson, & Rivest, 1990). 39. We treat here only "state-emission HMMs" which emit a symbol depending on the state. Another type, "arc-emission HMMs", in which the emitted symbol depends on the transition arc, is treated simi...

8090 citations
Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977

Citation Context: ...daptation of the EM algorithm to our semantics. The uniqueness condition guarantees that there exists a (many-to-one) mapping from explanations to observations so that the EM algorithm is applicable (Dempster et al., 1977). It is possible, however, to relax the uniqueness condition while justifying the application of the EM algorithm. We assume the MAR (missing at random) condition introduced by Rubin (1976) which is ...

4273 citations
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
- Rabiner
- 1989

Citation Context: ...ts play the role of abducibles, i.e. primitive hypotheses. 4 Statistical abduction is powerful in that it not only subsumes diverse symbolic-statistical frameworks such as HMMs (hidden Markov models, Rabiner, 1989), PCFGs (probabilistic context free grammars, Wetherell, 1980; Manning & Schutze, 1999) and (discrete) Bayesian networks (Pearl, 1988; Castillo, Gutierrez, & Hadi, 1997) but gives us freedom of using...

1855 citations
Foundations of Logic Programming
- Lloyd
- 1987

Citation Context: ...A holds. In case of n = 0, the clause is called a unit clause. A general clause is one whose body may contain negated atoms. A program including general clauses is sometimes called a general program (Lloyd, 1984; Doets, 1994). 2. Throughout this paper, for familiarity and readability, we will somewhat loosely use "distribution" as a synonym for "probability measure". 3. In logic programming, the adjective "g...

1489 citations
Fundamentals of Speech Recognition
- Rabiner, Juang
- 1993

Citation Context: ...2, f3, h1, h2 and h3 are temporary marks, not part of the program. 32. An HMM defines a probability distribution over strings in the given set of alphabets, and works as a stochastic string generator (Rabiner & Juang, 1993) such that an output string is a sample from the defined distribution. ... and also an alphabet from {a, b} to emit. Note that to specify a fact set Fh and the associated distribution c...

1055 citations
Inductive logic programming
- Muggleton
- 1991

Citation Context: ...problem, the possibility of assigning conflicting probabilities to logically equivalent formulas. In SLPs, P(A) and P(A ∧ A) do not necessarily coincide because A and A ∧ A may have different refutations (Muggleton, 1996; Cussens, 1999, 2001). Consequently in SLPs, we would be in trouble if we naively interpret P(A) as the probability of A's being true. Also assigning probabilities to arbitrary quantified formulas se...

960 citations
Negation as failure
- Clark
- 1978

Citation Context: ... → x = y | f is a function symbol} ∪ {f(x) ≠ g(y) | f and g are different function symbols} ∪ {t ≠ x | t is a term properly containing x}. comp(R) =def iff(R) ∪ Eq. Eq, Clark's equational theory (Clark, 1978), deductively simulates unification. Likewise comp(R) is a first-order theory which deductively simulates SLD refutation with the help of Eq by replacing a clause head atom with the clause body (Llo...

942 citations
The EM Algorithm and Extensions
- McLachlan, Krishnan
- 1996

Citation Context: ... Haddawy's (1997) framework but they assume domains are finite and function symbols seem prohibited. 6. "EM algorithm" stands for a class of iterative algorithms for ML estimation with incomplete data (McLachlan & Krishnan, 1997). ... and learn the parameters of the distribution associated with the program. Redundancy in the second phase is removed by the...

620 citations
The Art of Prolog
- Sterling, Shapiro
- 1986

Citation Context: ...nition of a support set differs from the one used by Sato (1995) and Kameya and Sato (2000). 19. When we implicitly emphasize the procedural reading of logic programs, Prolog conventions are employed (Sterling & Shapiro, 1986). Thus, ';' stands for "or", ',' for "and", and ':-' for "implied by", respectively. Strings beginning with a capital letter are (universally quantified) variables, but quoted ones such as 'A' are constants. The undersco...

537 citations
The role of abduction in logic programming
- Kakas, Kowalski, et al.
- 1998

417 citations
Probabilistic logic
- Nilsson
- 1986

Citation Context: ... have more or less similar limitations and potential problems. Descriptive power confined to finite domains is the most common limitation, which is due to the use of the linear programming technique (Nilsson, 1986), or due to the syntactic restrictions not allowing for infinitely many constant, function or predicate symbols (Ng & Subrahmanian, 1992; Lakshmanan & Sadri, 1994). Bayesian networks have the same li...

373 citations
The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language
- Lari, Young
- 1990

Citation Context: ...4.6 Inside Probability and Outside Probability for Logic Programs. In this subsection, we generalize the notion of inside probability and outside probability (Baker, 1979; Lari & Young, 1990) to logic programs. Major computations in learn-naive(DB,G) are those of two terms in Line 6, P_DB(G_t | θ) and Σ_{S ∈ ψ_DB(G_t)} P_msw(S | θ) σ_{i,v}(S). Computational redundancy lurks in the naive com...
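The snippet above refers to inside probability. For readers unfamiliar with it, here is a minimal sketch of the classical inside (beta) dynamic program for the toy PCFG S → S S (prob q) | 'a' (prob p); this is the textbook chart computation that the paper generalizes to logic programs, and the grammar, names and numbers are illustrative, not from the paper.

```python
def inside_prob(s, p):
    """P(S derives s) under S -> S S (prob q) | 'a' (prob p)."""
    q = 1.0 - p
    n = len(s)
    beta = {}  # beta[(i, j)] = P(S derives s[i:j])
    for i in range(n):  # base case: single terminals
        beta[(i, i + 1)] = p if s[i] == "a" else 0.0
    for span in range(2, n + 1):  # longer spans from shorter ones
        for i in range(n - span + 1):
            j = i + span
            beta[(i, j)] = q * sum(beta[(i, k)] * beta[(k, j)]
                                   for k in range(i + 1, j))
    return beta[(0, n)]

print(inside_prob("aaa", 0.7))  # 2 * q**2 * p**3 ~= 0.06174
```

The key property, and the source of the efficiency, is that each span probability is computed once and reused by every larger span that contains it.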

342 citations
Definite clause grammars for language analysis - a survey of the formalism and a comparison with augmented transition networks
- Pereira, Warren
- 1980

Citation Context: ... We now consider a non-propositional parsing program DBg′ = Fg′ ∪ Rg′ in Figure 11 whose ground instances constitute the propositional program DBg. DBg′ is a probabilistic variant of a DCG program (Pereira & Warren, 1980) in which q'/1, q'/6 and between/3 are declared as table predicates. Semantically, DBg′ specifies a probability distribution over the atoms of the form {q'(l) | l is a list of terminals}. Fg′ = {msw(...

294 citations
Probabilistic Horn abduction and Bayesian networks
- Poole
- 1993

272 citations
Inside-Outside Reestimation from Partially Bracketed Corpora
- Pereira, Schabes
- 1992

Citation Context: ...e uniqueness condition and validates the use of the graphical EM algorithm even when the complete data does not uniquely determine the observed data, just like the case of "partially bracketed corpora" (Pereira & Schabes, 1992), we feel the need to do more research on this topic. Also investigating the role of the acyclicity condition seems theoretically interesting as the acyclicity is often related to the learning of log...

268 citations
Trainable grammars for speech recognition
- Baker
- 1979

Citation Context: ...en combined with OLDT search for all explanations, the same time complexity as the specialized ones, e.g. the Baum-Welch algorithm for HMMs (Rabiner, 1989) and the Inside-Outside algorithm for PCFGs (Baker, 1979), despite its generality. What is surprising is that, when we conducted learning experiments with PCFGs using real corpora, it outperformed the Inside-Outside algorithm by orders of magnitude in ter...

260 citations
Inference and missing data
- Rubin
- 1976

218 citations
OLD resolution with tabulation
- Tamaki, Sato
- 1986

Citation Context: ...rized logic program and observations, the first phase searches for all explanations for the observations. Redundancy in the first phase is eliminated by tabulating partial explanations using OLDT search (Tamaki & Sato, 1986; Warren, 1992; Sagonas, Swift, & Warren, 1994; Ramakrishnan, Rao, Sagonas, Swift, & Warren, 1995; Shen, Yuan, You, & Zhou, 2001). It returns a support graph which is a compact representation of the disco...
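The tabling idea referred to above (solving each subgoal once and reusing the stored result) can be sketched without any Prolog machinery. Below, a memoized recursive recognizer counts all derivations of a string under the toy grammar S → 'a' | S S; the memo table plays the role of OLDT's solution table, turning an exponential all-solutions search into a polynomial one. The grammar and function names are illustrative, and this is not the OLDT procedure itself.

```python
from functools import lru_cache

def count_derivations(s):
    """Count parses of s under the toy grammar S -> 'a' | S S."""
    @lru_cache(maxsize=None)  # the "solution table": each subgoal solved once
    def table(i, j):
        # number of ways S derives s[i:j]
        n = 0
        if j - i == 1 and s[i] == "a":
            n += 1
        for k in range(i + 1, j):  # split points for S -> S S
            n += table(i, k) * table(k, j)
        return n
    return table(0, len(s))

print(count_derivations("aaaa"))  # Catalan(3) = 5
```

Without the cache the recursion re-derives the same subspans exponentially often; with it, there are only O(n²) distinct subgoals, each combined over O(n) split points.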

213 citations
Unfold/fold transformations of logic programs
- Tamaki, Sato
- 1984

Citation Context: ...nd Carroll (1994). sc-BN is a shorthand for a singly connected Bayesian network (Pearl, 1988). 8. We do not deal with general logic programs in this paper. ... next goal and unfolds it (Tamaki & Sato, 1984) into subgoals using a nondeterministically chosen clause. The computed result by the SLD refutation, i.e. a solution, is an answer substitution (variable binding) θ such that DB ⊢ Gθ. 9 Usually there is...

211 citations
XSB as an efficient deductive database engine
- Sagonas, Swift, et al.
- 1994

Citation Context: ...onential search time. One answer to avoid this problem is to store computed results and reuse them whenever necessary. OLDT is such an instance of a memoizing scheme (Tamaki & Sato, 1986; Warren, 1992; Sagonas et al., 1994; Ramakrishnan et al., 1995; Shen et al., 2001). Reuse of proved subgoals in OLDT search often drastically reduces search time for all solutions, especially when refutations of the top goal include ma...

187 citations
An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities
- Stolcke
- 1995

Citation Context: ...ture for the output). Finally we remark that the use of parsing as a preprocess for EM learning of PCFGs is not unique to the graphical EM algorithm (Fujisaki, Jelinek, Cocke, Black, & Nishino, 1989; Stolcke, 1995). These approaches however still seem to contain redundancies compared with the graphical EM algorithm. For instance Stolcke (1995) uses an Earley chart to compute inside and outside probability, but...

174 citations
Expert Systems and Probabilistic Network Models
- Castillo, Gutierrez, et al.
- 1997

144 citations
Stochastic attribute-value grammars
- Abney
- 1997

Citation Context: ...s based on the "programs as distributions" scheme, stochastic natural language processing which exploits semantic information seems promising. For instance, unification-based grammars such as HPSGs (Abney, 1997) may be a good target beyond PCFGs because they use feature structures that are logically describable, and the ambiguity of feature values seems to be expressible by a probability distribution. Also building ...

133 citations
Probabilistic logic programming
- Ng, Subrahmanian
- 1992

Citation Context: ...ation, which is due to the use of the linear programming technique (Nilsson, 1986), or due to the syntactic restrictions not allowing for infinitely many constant, function or predicate symbols (Ng & Subrahmanian, 1992; Lakshmanan & Sadri, 1994). Bayesian networks have the same limitation as well (only a finite number of random variables are representable). 68 Also there are various semantic/syntactic restrictions ...

125 citations
Memoing for logic programs
- Warren
- 1992

Citation Context: ...bservations, the first phase searches for all explanations for the observations. Redundancy in the first phase is eliminated by tabulating partial explanations using OLDT search (Tamaki & Sato, 1986; Warren, 1992; Sagonas, Swift, & Warren, 1994; Ramakrishnan, Rao, Sagonas, Swift, & Warren, 1995; Shen, Yuan, You, & Zhou, 2001). It returns a support graph which is a compact representation of the discovered explana...

112 citations
From statistical knowledge bases to degrees of belief
- Bacchus, Grove, et al.
- 1996

107 citations
Valence induction with a head-lexicalized PCFG
- Carroll, Rooth
- 1998

Citation Context: ...ithm has been recognized as a standard EM algorithm for training PCFGs, it is notoriously slow. Although there is not much literature explicitly stating the time required by the Inside-Outside algorithm (Carroll & Rooth, 1998; Beil, Carroll, Prescher, Riezler, & Rooth, 1999), Beil et al. (1999) reported for example that when they trained a PCFG with 5,508 rules for a corpus of 450,526 German subordinate clauses whose aver...

95 citations
A statistical learning method for logic programs with distribution semantics
- Sato
- 1995

Citation Context: ... on. The resulting PF is a probability measure over the infinite product of independent binary outcomes. It might look too simple but expressive enough for Bayesian networks, Markov chains and HMMs (Sato, 1995; Sato & Kameya, 1997). 3.2 Extending PF to PDB. In this subsection, we extend PF to a probability measure PDB over the possible worlds for L, i.e. the set of all possible truth assignments to ground...

94 citations
Answering queries from context-sensitive probabilistic knowledge bases
- Ngo, Haddawy
- 1997

84 citations
Construction of Belief and Decision Networks
- Breese
- 1991

81 citations
From Logic to Logic Programming
- Doets
- 1994

Citation Context: ...ase of n = 0, the clause is called a unit clause. A general clause is one whose body may contain negated atoms. A program including general clauses is sometimes called a general program (Lloyd, 1984; Doets, 1994). 2. Throughout this paper, for familiarity and readability, we will somewhat loosely use "distribution" as a synonym for "probability measure". 3. In logic programming, the adjective "ground" means ...

70 citations
Hybrid Probabilistic Programs
- Dekhtyar, Subrahmanian
- 2000

69 citations
Parameter estimation in stochastic logic programs
- Cussens
- 2001

60 citations
Probabilistic Languages: A Review and Some Open Questions
- Wetherell
- 1980

Citation Context: ... Statistical abduction is powerful in that it not only subsumes diverse symbolic-statistical frameworks such as HMMs (hidden Markov models, Rabiner, 1989), PCFGs (probabilistic context free grammars, Wetherell, 1980; Manning & Schutze, 1999) and (discrete) Bayesian networks (Pearl, 1988; Castillo, Gutierrez, & Hadi, 1997) but gives us freedom of using arbitrarily complex logic programs for modeling. 5 The semant...

57 citations
A probabilistic parsing method for sentence disambiguation
- Fujisaki, Jelinek, et al.
- 1991

Citation Context: ...ar (and the introduction of appropriate data structure for the output). Finally we remark that the use of parsing as a preprocess for EM learning of PCFGs is not unique to the graphical EM algorithm (Fujisaki, Jelinek, Cocke, Black, & Nishino, 1989; Stolcke, 1995). These approaches however still seem to contain redundancies compared with the graphical EM algorithm. For instance Stolcke (1995) uses an Earley chart to compute inside and outside p...

57 citations
Probabilistic deductive databases
- Lakshmanan, Sadri
- 1994

Citation Context: ... to the use of the linear programming technique (Nilsson, 1986), or due to the syntactic restrictions not allowing for infinitely many constant, function or predicate symbols (Ng & Subrahmanian, 1992; Lakshmanan & Sadri, 1994). Bayesian networks have the same limitation as well (only a finite number of random variables are representable). 68 Also there are various semantic/syntactic restrictions on logic programs. For insta...

54 citations
Estimation of Probabilistic Context-Free Grammars
- Chi, Geman
- 1998

Citation Context: ...G, S is rewritten either to a with probability p or to SS with probability q. The probability of the occurrence of an infinite derivation is calculated as max{0, 1 − (p/q)}, which is non-zero when q > p (Chi & Geman, 1998). ... listed below. For a predicate p, we introduce iff(p), the iff definition of p, by iff(p) =def ∀x (p(x) ↔ ∃y1 (x = t1 ∧ W1) ∨ ··· ∨ ∃yn (x = tn ∧ Wn)). Here x is a vector of new variables of...
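The claim in the snippet above (for S rewritten to a with probability p or to SS with probability q, the probability of an infinite derivation is non-zero exactly when q > p) can be checked numerically: the termination probability is the least fixed point of e = p + q·e², reachable by iterating from 0. The sketch below is an illustrative check, not code from the paper.

```python
def nontermination_prob(p, iters=10000):
    """P(derivation of S never terminates) under S -> a (p) | S S (q)."""
    q = 1.0 - p
    e = 0.0
    for _ in range(iters):  # fixed-point iteration for termination prob.
        e = p + q * e * e
    return 1.0 - e

for p in (0.3, 0.6):
    q = 1.0 - p
    # numeric estimate vs. the closed form max{0, 1 - p/q}
    print(round(nontermination_prob(p), 4), round(max(0.0, 1 - p / q), 4))
```

Iterating from 0 converges to the least fixed point, which is p/q when q > p and 1 otherwise, matching the closed form quoted in the citation context.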

46 citations
Effective Bayesian inference for stochastic programs
- Koller, McAllester, et al.
- 1997

44 citations
Context-Sensitive Statistics For Improved Grammatical Language Models
- Charniak, Carroll
- 1994

44 citations
Learning probabilities for noisy first-order rules
- Koller, Pfeffer

44 citations
Probabilistic deduction with conditional constraints over basic events
- Lukasiewicz
- 1999

43 citations
PRISM: a language for symbolic-statistical modeling
- Sato, Kameya
- 1997

Citation Context: ...esulting PF is a probability measure over the infinite product of independent binary outcomes. It might look too simple but expressive enough for Bayesian networks, Markov chains and HMMs (Sato, 1995; Sato & Kameya, 1997). 3.2 Extending PF to PDB. In this subsection, we extend PF to a probability measure PDB over the possible worlds for L, i.e. the set of all possible truth assignments to ground atoms in L through the...

41 citations
Efficient inference in Bayes networks as a combinatorial optimization problem
- Li, D'Ambrosio
- 1994

41 citations
Efficient tabling mechanisms for logic programs
- Ramakrishnan, Rao, et al.
- 1995

Citation Context: ...One answer to avoid this problem is to store computed results and reuse them whenever necessary. OLDT is such an instance of a memoizing scheme (Tamaki & Sato, 1986; Warren, 1992; Sagonas et al., 1994; Ramakrishnan et al., 1995; Shen et al., 2001). Reuse of proved subgoals in OLDT search often drastically reduces search time for all solutions, especially when refutations of the top goal include many common subrefutations. T...

37 citations
The EDR Electronic Dictionary Technical Guide
- EDR
- 1995

35 citations
Loglinear models for first-order probabilistic reasoning
- Cussens
- 1999

Citation Context: ...ibility of assigning conflicting probabilities to logically equivalent formulas. In SLPs, P(A) and P(A ∧ A) do not necessarily coincide because A and A ∧ A may have different refutations (Muggleton, 1996; Cussens, 1999, 2001). Consequently in SLPs, we would be in trouble if we naively interpret P(A) as the probability of A's being true. Also assigning probabilities to arbitrary quantified formulas seems out of scop...

33 citations
Generalized Queries on Probabilistic Context-Free Grammars
- Pynadath, Wellman
- 1998

32 citations
Probabilistic constraint logic programming
- Riezler
- 1997

31 citations
Inference in Bayesian networks
- D'Ambrosio
- 1999

Citation Context: ...eterized logic program that carries out the given summations in the given order for an arbitrary Bayesian network; in particular, we are able to simulate VE (variable elimination, Zhang & Poole, 1996; D'Ambrosio, 1999) in our approach. Efficient computation of marginal distributions is not always possible, but there is a well-known class of Bayesian networks, singly connected Bayesian networks, for which there exis...

30 citations
Semantics and inference for recursive probability models
- Pfeffer, Koller
- 2000