## Active Learning of Dynamic Bayesian Networks in Markov Decision Processes

Citations: 8 (2 self)

### BibTeX

```bibtex
@MISC{Jonsson_activelearning,
  author = {Anders Jonsson and Andrew Barto},
  title = {Active Learning of Dynamic Bayesian Networks in Markov Decision Processes},
  year = {}
}
```

### Abstract

Several recent techniques for solving Markov decision processes use dynamic Bayesian networks to compactly represent tasks. The dynamic Bayesian network representation may not be given, in which case it must be learned before these techniques can be applied. We develop an algorithm for learning dynamic Bayesian network representations of Markov decision processes using data collected through exploration in the environment. To accelerate data collection we develop a novel scheme for active learning of the networks. We assume that the process cannot be sampled in arbitrary states, only along trajectories, which prevents us from applying existing active learning techniques. Our active learning scheme selects actions that maximize the total entropy of the distributions used to evaluate potential refinements of the networks.
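The entropy-maximizing action selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `dist_counts` layout (one list of count vectors per action, one vector per evaluation distribution) is an assumption.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def select_action(actions, dist_counts):
    """Pick the action whose evaluation distributions have maximal total entropy.

    dist_counts maps each action to a list of count vectors, one per
    distribution used to evaluate potential refinements (assumed layout).
    """
    return max(actions, key=lambda a: sum(entropy(c) for c in dist_counts[a]))
```

Under this sketch, an action whose associated distributions are still near-uniform (high entropy, little evidence gathered) is preferred over one whose outcomes are already well determined.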

### Citations

3773 | Reinforcement Learning: An Introduction
- Sutton, Barto
- 1998
Citation Context: ...multiply the penalty term by a parameter λ. Results show that varying λ can increase the accuracy of the learned DBN model. Our work is related to the problem of exploration in reinforcement learning [15]. Existing exploration techniques do not learn DBN models of MDPs. Since there exist several efficient algorithms that use DBNs to solve factored MDPs, there is a benefit to learning this representati... |

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context: ...otential refinements are evaluated as soon as the threshold is exceeded. The algorithm uses the Bayesian Information Criterion (BIC) [13] and the likelihood-equivalent Bayesian Dirichlet metric (BDe) [9] to evaluate potential refinements. We assume that no data is available to begin with and develop a technique for active learning of DBNs to accelerate data collection. The time to collect data is min... |

457 | A model for reasoning about persistence and causation
- Dean, Kanazawa
- 1989
Citation Context: ...rrent situation as a single observation. Factored MDPs use a set of state variables to represent the state in a way that is more appropriate for tasks of this type. Dynamic Bayesian networks, or DBNs [1], are particularly well suited for exploiting structure in factored MDPs by capturing conditional independence between state variables as a result of executing actions. Several researchers have develo... |

370 | Hierarchical reinforcement learning with the MAXQ value function decomposition
- Dietterich
Citation Context: ...areas of the state space, although it is unclear how the system would know how to get to these areas. 5 Results We ran experiments with our DBN learning approach in the coffee task [2], the Taxi task [17], and a simplified autonomous guided vehicle (AGV) task [18]. In each task, we compared our active learning scheme with passive learning, i.e., random action selection. In both cases, we used our appr... |

323 | Estimating the dimension of a model
- Schwarz
- 1978
Citation Context: ...able. The minimum number is defined by a threshold parameter, and potential refinements are evaluated as soon as the threshold is exceeded. The algorithm uses the Bayesian Information Criterion (BIC) [13] and the likelihood-equivalent Bayesian Dirichlet metric (BDe) [9] to evaluate potential refinements. We assume that no data is available to begin with and develop a technique for active learning of D... |
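The BIC score referenced in the snippet above combines a log-likelihood term with a complexity penalty of (|θ|/2) log |D|. A minimal sketch, assuming maximum-likelihood multinomial estimates per parent configuration (the nested-list layout of the counts is an assumption for illustration):

```python
import math

def multinomial_loglik(counts_by_parent):
    """Maximum-likelihood log-likelihood: sum of N_ijk * log(N_ijk / N_ij)
    over parent configurations j and values k for one variable."""
    ll = 0.0
    for counts in counts_by_parent:
        n = sum(counts)
        ll += sum(c * math.log(c / n) for c in counts if c > 0)
    return ll

def bic_score(counts_by_parent, num_params):
    """BIC = L(D | B) - (|theta| / 2) * log |D|, where |D| is the sample count."""
    num_samples = sum(sum(c) for c in counts_by_parent)
    return multinomial_loglik(counts_by_parent) - (num_params / 2) * math.log(num_samples)
```

The penalty grows with the number of free parameters, so refinements that add parents must buy a sufficient likelihood gain to be accepted.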

277 | Regularization algorithms for learning that are equivalent to multi-layer networks
- Poggio, Girosi
- 1990
Citation Context: ...uing to refine the trees. This is a serious issue since algorithms that take advantage of DBNs to solve factored MDPs depend on an accurate DBN model. We address this issue by applying regularization [14] to the BIC score. The BIC score is composed of a log-likelihood term and a penalty term. This quantity fits nicely into the regularization framework if we multiply the penalty term by a parameter λ. ... |

224 | Exploiting structure in policy construction
- Dearden, et al.
- 1995
Citation Context: ...)}, where γ ∈ (0, 1] is a discount factor, by selecting action a_k with probability π(s_k, a_k) in each state s_k. A factored MDP is described by a set of state variables S. We use the coffee task [2], in which a robot has to deliver coffee to its user, as an example of a factored MDP. The coffee task is described by six binary state variables: SL, the robot’s location (office or coffee shop); SU,... |

73 | Learning Bayesian networks: Search methods and experimental results
- Chickering, Geiger, et al.
- 1995
Citation Context: ...$\prod_i \prod_j \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_k \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}$, (2) where $N'_{ijk}$ are hyperparameters of a Dirichlet prior and $\Gamma(x)$ is the Gamma function. Finding the BN with highest BIC or BDe score is NP-complete [16]. However, both scores decompose into a sum of terms for each variable $X_i$ and each value j and k (we need to take the logarithm of BDe first). The score only changes locally when we add or remove edge... |
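The log-BDe contribution of a single variable, decomposed over parent configurations j and values k as in the snippet above, can be sketched as follows; the nested-list layout of counts and Dirichlet hyperparameters is an assumption for illustration:

```python
import math

def log_bde_family(counts, prior):
    """Log BDe score contribution for one variable X_i.

    counts[j][k]: observed count N_ijk for parent configuration j, value k.
    prior[j][k]:  Dirichlet hyperparameter N'_ijk (assumed layout).
    """
    score = 0.0
    for n_j, np_j in zip(counts, prior):
        n_ij, np_ij = sum(n_j), sum(np_j)
        # log Gamma(N'_ij) - log Gamma(N'_ij + N_ij)
        score += math.lgamma(np_ij) - math.lgamma(np_ij + n_ij)
        for n, np_ in zip(n_j, np_j):
            # log Gamma(N'_ijk + N_ijk) - log Gamma(N'_ijk)
            score += math.lgamma(np_ + n) - math.lgamma(np_)
    return score
```

Because the full score is a sum of such per-family terms, adding or removing a single edge only requires rescoring the affected child variable, which is what makes greedy local search practical.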

68 | Efficient Reinforcement Learning in Factored MDPs - Kearns, Koller - 1999 |

67 | Max-norm projections for factored MDPs - Guestrin, Koller, et al. - 2001 |

55 | Active learning for structure in Bayesian networks - Tong, Koller - 2001 |

37 | Active learning of causal Bayes net structure - Murphy - 2001 |

23 | Causal Graph Based Decomposition of Factored MDPs - Jonsson, Barto - 2006 |

18 | Continuous-time hierarchical reinforcement learning
- Mahadevan
- 2002
Citation Context: ...tem would know how to get to these areas. 5 Results We ran experiments with our DBN learning approach in the coffee task [2], the Taxi task [17], and a simplified autonomous guided vehicle (AGV) task [18]. In each task, we compared our active learning scheme with passive learning, i.e., random action selection. In both cases, we used our approach for growing the conditional probability trees to implic... |

16 | Symbolic generalization for on-line planning
- Feng, Hansen, et al.
- 2003
Citation Context: ...by a parameter λ. We can multiply the penalty term of the BIC score by a parameter λ to put it in the form of a fidelity term and a stabilizer term: $\log[P(D \mid B)P(B)] \approx L(D \mid B) - \lambda \frac{|\theta|}{2} \log |D|$, (3) such that λ controls the magnitude of the penalty for having many parameters. 4.1 Active learning Efficient data collection should gather sufficient data as quickly as possible. Since our algorithm r... |
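Equation (3) in the snippet above is straightforward to compute once the log-likelihood and parameter count are known; a minimal sketch of the λ-scaled score:

```python
import math

def regularized_bic(log_likelihood, num_params, num_samples, lam=1.0):
    """Equation (3): fidelity term L(D | B) minus the stabilizer
    lambda * (|theta| / 2) * log |D|; lam = 1 recovers plain BIC."""
    return log_likelihood - lam * (num_params / 2) * math.log(num_samples)
```

Setting λ below 1 weakens the complexity penalty and admits finer refinements; λ above 1 biases the search toward smaller models.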

13 | Unsupervised active learning in large domains - Steck, Jaakkola - 2002 |

1 | Theory refinement on Bayesian networks
- Buntine
- 1991
Citation Context: ...ove edges between variables in G. Researchers have developed hill-climbing algorithms that perform greedy search to find high-scoring BNs by repeatedly adding or removing edges between variables in G [7, 9]. These algorithms have been extended to DBNs [8]. As an example, consider a BN with two binary variables X1 and X2. Assume that we have collected three data points (0, 0), (0, 1), and (1, 1). Also as...
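The greedy edge search described in the snippet above can be sketched as follows. `score_family(child, parents)` is an assumed decomposable scoring callback (e.g., BIC or log-BDe for one variable), and acyclicity checks are omitted for brevity:

```python
def hill_climb(variables, score_family, max_parents=2):
    """Greedy structure search: repeatedly apply the single edge addition or
    removal that most improves the decomposable score. Illustrative sketch;
    score_family(child, parents) is an assumed scoring callback."""
    parents = {v: set() for v in variables}
    improved = True
    while improved:
        improved = False
        best_delta, best_move = 0.0, None
        for child in variables:
            cur = score_family(child, frozenset(parents[child]))
            for other in variables:
                if other == child:
                    continue
                if other in parents[child]:
                    cand = parents[child] - {other}      # try removing the edge
                elif len(parents[child]) < max_parents:
                    cand = parents[child] | {other}      # try adding the edge
                else:
                    continue
                delta = score_family(child, frozenset(cand)) - cur
                if delta > best_delta:
                    best_delta, best_move = delta, (child, cand)
        if best_move is not None:
            child, cand = best_move
            parents[child] = cand
            improved = True
    return parents
```

Since the score decomposes per family, each candidate move only requires rescoring the one child whose parent set changes.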