Results 11–20 of 129
A Robust Geometric Approach to Multi-Criterion Reinforcement Learning
Journal of Machine Learning Research, 2004
Abstract

Cited by 18 (1 self)
We consider the problem of reinforcement learning in a dynamic environment, where the learning objective is defined in terms of multiple reward functions of the average-reward type. The environment is initially unknown and, furthermore, may be affected by the actions of other agents, which are observed but cannot be predicted in advance. We model this situation through a stochastic (Markov) game between the learning agent and an arbitrary player, with vector-valued rewards. State recurrence conditions are imposed throughout. The objective of the learning agent is to have its long-term average reward vector belong to a desired target set. Starting with a given target set, we devise learning algorithms to achieve this task. These algorithms rely on learning algorithms for appropriately defined scalar rewards, together with the geometric insight of the theory of approachability for stochastic games. We then address the more general problem where the target set itself may depend on the model parameters and hence is not known in advance to the learning agent. A particular case that falls into this framework is that of stochastic games with average-reward constraints. Further specialization yields a reinforcement learning algorithm for constrained Markov decision processes. Some basic examples are provided to illustrate these results.
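The geometric steering idea behind this abstract can be sketched in a few lines: scalarize the vector-valued reward along the direction from the current long-term average toward the target set, then improve that scalar objective and repeat. This is a toy illustration, not the paper's algorithm; the orthant target set and all function names are assumptions.

```python
import numpy as np

def steering_direction(avg_reward, project_to_target):
    """Unit vector from the running average toward its projection on the target set."""
    d = project_to_target(avg_reward) - avg_reward
    n = np.linalg.norm(d)
    return d / n if n > 1e-12 else np.zeros_like(d)

def project_to_orthant(x):
    """Euclidean projection onto the toy target set {x : x >= 0}."""
    return np.maximum(x, 0.0)

avg = np.array([-0.5, 0.3])
direction = steering_direction(avg, project_to_orthant)
# The agent would now (approximately) maximize the scalar reward
# <direction, r> until the average moves, then recompute the direction.
```

If the running average already lies inside the target set, the direction is zero and no steering is needed.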
Efficient Bayes-adaptive reinforcement learning using sample-based search
In Neural Information Processing Systems, 2012
Abstract

Cited by 17 (2 self)
Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty. In this setting, a Bayes-optimal policy captures the ideal trade-off between exploration and exploitation. Unfortunately, finding Bayes-optimal policies is notoriously taxing due to the enormous search space in the augmented belief-state MDP. In this paper we exploit recent advances in sample-based planning, based on Monte-Carlo tree search, to introduce a tractable method for approximate Bayes-optimal planning. Unlike prior work in this area, we avoid expensive applications of Bayes' rule within the search tree by lazily sampling models from the current beliefs. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems.
Error Propagation for Approximate Policy and Value Iteration
Abstract

Cited by 15 (0 self)
We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulting policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum, as opposed to what has been suggested by previous results. Our results also indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and that the effect of an error term in the earlier iterations decays exponentially fast.
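The qualitative point about error timing can be made concrete with a toy weighted sum: in a K-iteration run with per-iteration errors eps_k, the weight of iteration k in the final loss shrinks like gamma**(K - 1 - k), so early errors matter exponentially less. The exact constants and norms of the paper's bound are omitted; this illustrates only the decay pattern.

```python
def weighted_error_bound(errors, gamma):
    """Toy bound: sum_k gamma**(K - 1 - k) * eps_k for a K-iteration run."""
    K = len(errors)
    return sum(gamma ** (K - 1 - k) * e for k, e in enumerate(errors))

# Same total error, placed early vs. late in a 4-iteration run:
early = weighted_error_bound([1.0, 0.0, 0.0, 0.0], gamma=0.9)  # weight 0.9**3
late = weighted_error_bound([0.0, 0.0, 0.0, 1.0], gamma=0.9)   # weight 0.9**0
```

With gamma = 0.9, the same error placed at the last iteration contributes about 1/0.729 ≈ 1.37 times more than at the first.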
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning
2013
Inverse Reinforcement Learning through Structured Classification
Abstract

Cited by 14 (12 self)
This paper addresses the inverse reinforcement learning (IRL) problem, that is, inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Unlike most existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator.
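The classification view can be sketched with a structured perceptron: train a linear multiclass classifier whose score is theta . mu(s, a), where mu is a given feature-expectation estimate, so that the expert's action is the argmax; theta then plays the role of the reward weights. The perceptron update and one-hot toy data are illustrative assumptions, not the paper's exact training procedure.

```python
import numpy as np

def scirl_perceptron(mu, demos, n_actions, epochs=20):
    """mu: (state, action) -> feature vector; demos: list of (state, expert_action)."""
    dim = len(mu(demos[0][0], demos[0][1]))
    theta = np.zeros(dim)
    for _ in range(epochs):
        for s, a_star in demos:
            scores = [theta @ mu(s, a) for a in range(n_actions)]
            a_hat = int(np.argmax(scores))
            if a_hat != a_star:  # structured-perceptron step toward the expert action
                theta += mu(s, a_star) - mu(s, a_hat)
    return theta

# Toy problem: 2 states, 2 actions, one-hot features per (state, action) pair.
def mu(s, a):
    v = np.zeros(4)
    v[2 * s + a] = 1.0
    return v

demos = [(0, 1), (1, 0)]  # expert picks action 1 in state 0, action 0 in state 1
theta = scirl_perceptron(mu, demos, n_actions=2)
```

After training, the greedy policy of the scored classifier reproduces the expert's choices on the demonstrated states.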
Optimistic planning for sparsely stochastic systems
In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011
Abstract

Cited by 12 (4 self)
We propose an online planning algorithm for finite-action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, which allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.
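The optimistic expansion loop can be sketched as a priority queue over leaves: repeatedly expand the leaf with the highest upper bound on its return, and on each expansion add all possible successors (cheap because transitions are sparse). The bound value + gamma**depth / (1 - gamma) assumes rewards in [0, 1]; actions are omitted and the successor function is a toy stand-in, so this is only the shape of the idea, not the paper's algorithm.

```python
import heapq

def optimistic_expand(root, successors, budget, gamma=0.9):
    """successors(state) -> list of (prob, reward, next_state); returns best leaf value found."""
    # Max-heap keyed by the optimistic upper bound (negated for heapq's min-heap).
    heap = [(-(1.0 / (1 - gamma)), 0.0, 0, root)]
    best = 0.0
    for _ in range(budget):
        if not heap:
            break
        _, value, depth, state = heapq.heappop(heap)
        best = max(best, value)
        for prob, reward, nxt in successors(state):  # sparsity: few successors
            v = value + prob * reward * gamma ** depth
            bound = v + gamma ** (depth + 1) / (1 - gamma)  # optimistic completion
            heapq.heappush(heap, (-bound, v, depth + 1, nxt))
    return best
```

Because the bound is optimistic, the leaf popped first is always the most promising one under the current tree.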
Linear fitted-Q iteration with multiple reward functions
 Journal of Machine Learning Research
Abstract

Cited by 9 (1 self)
We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the three-reward-function case using triangulation primitives from computational geometry, and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.
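One idea from this abstract, global domination, can be illustrated directly: with multiple reward signals the Q-function is vector-valued, and an action is (approximately) globally dominated at a state if, for every preference weight w on the simplex, some other action has higher scalarized value w . Q(s, a'). A grid over the 2-objective simplex stands in here for the paper's exact geometric test; all names are assumptions.

```python
import numpy as np

def dominated_actions(q_vectors, n_grid=101):
    """q_vectors: (n_actions, 2) array of two-objective Q-values at one state.
    Returns the set of actions that are never the argmax for any sampled weight."""
    dominated = set(range(len(q_vectors)))
    for w0 in np.linspace(0.0, 1.0, n_grid):
        w = np.array([w0, 1.0 - w0])
        dominated.discard(int(np.argmax(q_vectors @ w)))
    return dominated

q = np.array([[1.0, 0.0],   # best when the weight favors objective 0
              [0.0, 1.0],   # best when the weight favors objective 1
              [0.4, 0.4]])  # never the argmax for any weight
```

Dominated actions can be pruned once and ignored for every preference a decision-maker might later choose.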
Computing Optimal Stationary Policies for Multi-Objective Markov Decision Processes
Abstract

Cited by 9 (0 self)
This paper describes a novel algorithm called CON-MODP for computing Pareto optimal policies for deterministic multi-objective sequential decision problems. CON-MODP is a value-iteration-based multi-objective dynamic programming algorithm that only computes stationary policies. We observe that, to guarantee convergence to the unique Pareto optimal set of deterministic stationary policies, the algorithm needs to perform a policy evaluation step on particular policies that are inconsistent in a single state that is being expanded. We prove that the algorithm converges to the Pareto optimal set of value functions and policies for deterministic infinite-horizon discounted multi-objective Markov decision processes. Experiments show that CON-MODP is much faster than previous multi-objective value iteration algorithms.
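The core set operation in multi-objective value iteration can be sketched in isolation: after each backup, candidate value vectors are pruned to their Pareto-optimal subset. The dominance test below is the standard one; the consistency step that distinguishes CON-MODP is not reproduced here.

```python
def dominates(u, v):
    """u Pareto-dominates v: at least as good everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_prune(vectors):
    """Keep only the vectors not dominated by any other candidate."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]

front = pareto_prune([(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.4, 0.4)])
```

Here (0.4, 0.4) is dropped because (0.5, 0.5) dominates it, while the three incomparable vectors survive.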
On the response of EMTbased control to interacting targets and models
In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-06), 2006
Abstract

Cited by 8 (5 self)
A novel control mechanism based on Extended Markov Tracking (EMT) was recently introduced [9, 10]. In this paper, we present a study of its response to multiple interacting control goals. We show a simple extension that can be integrated into EMT-based control and that provides it with the ability to handle several behavioral targets. Experimental support for the validity of this extension is provided. We also describe an experiment with a simulated robot, in which EMT-based controllers interact and interfere indirectly via the environment. Experiments support the resilience of multi-agent EMT-based team control to potential conflicts that may appear within a team.