Results 1–10 of 15
Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms
in: Proceedings of the 13th International Conference on Machine Learning (ICML '96)
, 1996
Abstract
Cited by 24 (6 self)
The problem of maximizing the expected total discounted reward in a completely observable Markovian environment, i.e., a Markov decision process (MDP), models a particular class of sequential decision problems. Algorithms have been developed for making optimal decisions in MDPs given either an MDP specification or the opportunity to interact with the MDP over time. Recently, other sequential decision-making problems have been studied, prompting the development of new algorithms and analyses. We describe a new generalized model that subsumes MDPs as well as many of the recent variations. We prove some basic results concerning this model and develop generalizations of value iteration, policy iteration, model-based reinforcement learning, and Q-learning that can be used to make optimal decisions in the generalized model under various assumptions. Applications of the theory to particular models are described, including risk-averse MDPs, exploration-sensitive MDPs, SARSA, Q-learning with spreading, two-player games, and approximate max picking via sampling. Central to the results are the contraction property of the value operator and a stochastic-approximation theorem that reduces asynchronous convergence to synchronous convergence.
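The contraction property of the value operator referred to above can be illustrated with ordinary value iteration. A minimal sketch, assuming an invented 2-state, 2-action MDP; none of these numbers come from the paper:

```python
import numpy as np

# Invented 2-state, 2-action MDP; all numbers are illustrative.
P = np.array([                    # P[a, s, s'] = transition probability
    [[0.9, 0.1], [0.2, 0.8]],     # action 0
    [[0.5, 0.5], [0.7, 0.3]],     # action 1
])
R = np.array([[1.0, 0.0],         # R[a, s] = expected one-step reward
              [0.5, 2.0]])
gamma = 0.9                       # discount factor

def bellman_backup(V):
    """Optimal Bellman operator T: (TV)(s) = max_a [R(a,s) + gamma * sum_s' P V]."""
    return np.max(R + gamma * (P @ V), axis=0)

# T is a gamma-contraction in the sup norm, so repeated application
# from any starting point converges to the unique fixed point V*.
V = np.zeros(2)
for _ in range(500):
    V = bellman_backup(V)
```

Since sup-norm distances shrink by a factor gamma per application, the iterates converge geometrically; the generalized models in the abstract replace the expectation and the max by other operators while keeping exactly this contraction structure.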
Multicriteria Reinforcement Learning
, 1998
Abstract
Cited by 19 (0 self)
We consider multicriteria sequential decision making problems where the vector-valued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given. The analysis requires special care, as the topology introduced by pointwise convergence and the order topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. It is observed that in the medium term multicriteria RL often converges to better solutions (measured by the first criterion) than its single-criterion counterparts. These types of multicriteria problems are most useful when there are several optimal solutions to a problem and one wants to choose the one among these which is optimal according to another fixed criterion. Example applications include alternating games, when in addition...
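One concrete instance of a fixed total ordering on vector-valued evaluations is the lexicographic order suggested by the abstract's motivation: optimize the first criterion, then break ties by the second. A small illustrative helper, not taken from the paper:

```python
# Hypothetical helpers for comparing vector-valued evaluations
# lexicographically; an illustration only, not the paper's algorithm.

def lex_better(u, v, eps=1e-9):
    """True if vector u is lexicographically greater than v
    (componentwise comparison with tolerance eps for ties)."""
    for a, b in zip(u, v):
        if a > b + eps:
            return True
        if a < b - eps:
            return False
    return False

def lex_argmax(values):
    """Index of the lexicographically largest value vector."""
    best = 0
    for i in range(1, len(values)):
        if lex_better(values[i], values[best]):
            best = i
    return best

# Two actions tie on the first criterion; the second breaks the tie.
print(lex_argmax([(1.0, 0.2), (1.0, 0.7), (0.5, 9.9)]))  # -> 1
```

The incompatibility noted in the abstract shows up here: `lex_better` is not continuous in its arguments, so limits taken pointwise need not respect the order.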
Non-Markovian Policies in Sequential Decision Problems
, 1997
Abstract
Cited by 4 (2 self)
In this article we prove the validity of the Bellman Optimality Equation and related results for sequential decision problems with a general recursive structure. The characteristic feature of our approach is that non-Markovian policies are also taken into account. The theory is motivated by some experiments with a learning robot. 1 Introduction The theory of sequential decision problems is an important mathematical tool for studying some problems of cybernetics, e.g. control of robots. Consider for example the robot shown in Figure 1. This robot, called Khepera, is equipped with eight infrared sensors, six in the front and two at the back, the infrared sensors measuring the proximity of objects in the range 0–5 cm. The robot has two wheels driven by two independent DC motors and a gripper that has two degrees of freedom and is equipped with a resistivity sensor and an object-presence sensor. The robot has a vision turret mounted on its top. The vision turret has an image se...
Some Basic Facts Concerning Minimax Sequential Decision Processes
, 1996
Abstract
Cited by 3 (2 self)
this report. The interested reader may find the proofs (in a more general form) in [3]. Definition 6.1 Let T : R
Certainty Equivalent Policies Are Self-Optimizing Under Minimax Optimality
, 1996
Abstract
Cited by 2 (1 self)
We show that adaptive real-time dynamic programming, extended with the action selection strategy which chooses the best action according to the latest estimate of the value function, yields asymptotically optimal policies under the minimax optimality criterion, within finite time with probability one. From this it follows that learning and exploitation do not conflict under this special optimality criterion. We relate this result to learning optimal strategies in repeated two-player zero-sum deterministic games. Keywords: non-Bayesian adaptive control, self-optimizing systems, repeated games, minimax control 1 Introduction We shall be concerned with the problem of finding an asymptotically optimal policy when the dynamics of the system under control is given in the form of controlled Markov chains with unknown parameters. The control policies we consider are adaptive: they "design" policies online based on information about the control problem that accumulates over time as the control...
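Under the minimax criterion the dynamic-programming backup replaces the expected successor value with the worst case over the states the system can reach. A hedged sketch on an invented two-state model (costs are minimized, so the worst case is the most expensive reachable successor); the model is illustrative, not from the paper:

```python
# Invented two-state minimax model: succ[s][a] lists the possible
# successor states of action a in state s (no probabilities needed),
# cost[s][a] is the one-step cost. Illustration only.
succ = {0: {0: [0, 1], 1: [1]},
        1: {0: [0],    1: [0, 1]}}
cost = {0: {0: 1.0, 1: 2.0},
        1: {0: 0.5, 1: 0.0}}
gamma = 0.9

def minimax_backup(V):
    """Minimax Bellman operator: minimize over actions the cost plus
    the discounted WORST-CASE (max) successor value."""
    return {s: min(cost[s][a] + gamma * max(V[t] for t in succ[s][a])
                   for a in succ[s])
            for s in V}

V = {0: 0.0, 1: 0.0}
for _ in range(300):
    V = minimax_backup(V)
```

The operator is still a gamma-contraction in the sup norm, which is why the real-time dynamic programming arguments cited above carry over to the minimax setting.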
Stability and Continuity of Nonlinear Model Predictive Control
, 1994
Abstract
Cited by 1 (0 self)
... This work also demonstrates that analysis methods from dynamic programming can be used to analyze the model predictive control algorithm and subsume many standard results into a more general and comprehensive theory. This connection has not been explicitly stated in the literature to date and remains a rich topic available for future research. Two results concerning stochastic or perturbed systems are presented. The first provides conditions under which an asymptotically stable control method can retain its stabilizing ability in the presence of perturbations arising from an exponentially stable state observer. The second examines the performance and demonstrates the suboptimality of model predictive control when applied to certain stochastic systems.
Distributed Asynchronous Policy Iteration in Dynamic Programming
Abstract
Cited by 1 (1 self)
We consider the distributed solution of dynamic programming (DP) problems by policy iteration. We envision a network of processors, each asynchronously updating a local policy and a local cost function, defined on a portion of the state space. The computed values are communicated asynchronously between processors and are used to perform the local policy and cost updates. The natural algorithm of this type can fail even under favorable circumstances, as shown by Williams and Baird [WiB93]. We propose an alternative and almost as simple algorithm, which converges to the optimum under the most general conditions, including asynchronous updating by multiple processors using outdated local cost functions of other processors.
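For contrast with the policy-iteration scheme discussed above, totally asynchronous value updates do converge for contractive models, provided every state keeps being updated. A tiny sketch on an invented 3-state deterministic chain, updating one arbitrarily chosen state at a time with possibly stale values of the others (not the paper's algorithm):

```python
import random

# Invented deterministic 3-state chain: P[s] is the successor of s,
# c[s] the one-step cost. Illustration only.
P = {0: 1, 1: 2, 2: 2}
c = {0: 1.0, 1: 1.0, 2: 0.0}
gamma = 0.5
V = {s: 0.0 for s in P}

random.seed(0)
for _ in range(2000):
    s = random.choice(list(P))        # arbitrary asynchronous order
    V[s] = c[s] + gamma * V[P[s]]     # local backup using stale values
```

Each local backup contracts toward the same fixed point regardless of order, which is the asynchronous-convergence property that the distributed policy-iteration algorithm of the abstract has to recover by other means.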
Weighted Sup-Norm Contractions in Dynamic Programming: A Review and Some New Applications
Abstract
We consider a class of generalized dynamic programming models based on weighted sup-norm contractions. We provide an analysis that parallels the one available for discounted MDPs and for generalized models based on unweighted sup-norm contractions. In particular, we discuss the main properties and associated algorithms of these models, including value iteration, policy iteration, and their optimistic and approximate variants. The analysis relies on several earlier works that use more specialized assumptions. In particular, we review and extend the classical results of Denardo [Den67] for unweighted sup-norm contraction models, as well as more recent results relating to approximation methods for discounted MDPs. We also apply the analysis to stochastic shortest path problems where all policies are assumed proper. For these problems we extend three results that are known for discounted MDPs. The first relates to the convergence of optimistic policy iteration and extends a result of Rothblum [Rot79], the second relates to error bounds for approximate policy iteration and extends a result of Bertsekas and Tsitsiklis [BeT96], and the third relates to error bounds for approximate optimistic policy iteration and extends a result of Thiery and Scherrer [ThS10b].
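The weighted sup-norm in question is ||V||_w = max_s |V(s)| / w(s) for a positive weight vector w; the unweighted sup norm is the special case w = 1. A minimal sketch with invented numbers:

```python
import numpy as np

def weighted_sup_norm(V, w):
    """Weighted sup norm ||V||_w = max_s |V(s)| / w(s), w(s) > 0."""
    return np.max(np.abs(V) / w)

# Invented weights and value vector, for illustration only.
w = np.array([1.0, 2.0, 4.0])
V = np.array([0.5, -3.0, 2.0])
print(weighted_sup_norm(V, w))  # max(0.5/1, 3.0/2, 2.0/4) = 1.5
```

Choosing w well is the point of the construction: an operator that fails to contract in the plain sup norm (as in stochastic shortest path problems) can still be a contraction in a suitably weighted one.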
Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms
, 1997