Results 1 - 5 of 5
Multicriteria Reinforcement Learning
, 1998
Abstract

Cited by 19 (0 self)
We consider multicriteria sequential decision making problems where the vector-valued evaluations are compared by a given, fixed total ordering. Conditions for the optimality of stationary policies and the Bellman optimality equation are given. The analysis requires special care as the topology introduced by pointwise convergence and the order topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. It is observed that in the medium term multicriteria RL often converges to better solutions (measured by the first criterion) than its single-criterion counterparts. These types of multicriteria problems are most useful when there are several optimal solutions to a problem and one wants to choose the one among these that is optimal according to another, fixed criterion. Example applications include alternating games, when in addition...
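The fixed total ordering the abstract refers to is not specified here; assuming a lexicographic order over value vectors (a common choice for this tie-breaking use case), a minimal sketch of the comparison step such an RL algorithm would use:

```python
def lex_greater(u, v):
    """True if vector u is lexicographically greater than v:
    compare on the first criterion, break ties with later ones."""
    for a, b in zip(u, v):
        if a != b:
            return a > b
    return False

def lex_max(vectors):
    """Pick the lexicographically best value vector among candidates,
    e.g. the vector-valued action values available in a state."""
    best = vectors[0]
    for v in vectors[1:]:
        if lex_greater(v, best):
            best = v
    return best

# Among solutions equally good on the first criterion, the second
# criterion breaks the tie:
print(lex_max([(1.0, 0.0), (1.0, 2.0), (0.0, 9.0)]))  # (1.0, 2.0)
```

This matches the use case the abstract describes: several policies optimal under the first criterion, with a second criterion selecting among them.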
Constrained Discounted Dynamic Programming
 MATH. OF OPERATIONS RESEARCH
, 1996
Abstract

Cited by 18 (8 self)
This paper deals with constrained optimization of Markov Decision Processes with a countable state space, compact action sets, continuous transition probabilities, and upper semicontinuous reward functions. The objective is to maximize the expected total discounted reward for one reward function, under several inequality constraints on similar criteria with other reward functions.
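One standard route to such constrained discounted problems (not necessarily the construction used in this paper) is Lagrangian scalarization: fold the constraint reward into the objective with a multiplier and solve the resulting unconstrained discounted MDP, e.g. by value iteration. A sketch for a finite MDP, with all names and the setup assumed for illustration:

```python
def value_iteration(P, r, gamma=0.9, tol=1e-9):
    """P[s][a][s2]: transition probability, r[s][a]: one-step reward.
    Returns the optimal discounted value function."""
    n = len(P)
    V = [0.0] * n
    while True:
        V_new = [max(r[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                           for s2 in range(n))
                     for a in range(len(P[s])))
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new

def lagrangian_reward(r0, r1, lam):
    """Scalarize: optimize r0 while a multiplier lam prices the
    constraint reward r1; lam is tuned until the constraint holds."""
    return [[r0[s][a] + lam * r1[s][a] for a in range(len(r0[s]))]
            for s in range(len(r0))]
```

For example, a single-state MDP with reward 1 and discount 0.5 has value 1/(1-0.5) = 2, which `value_iteration` recovers.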
Multiple objective nonatomic Markov decision processes with total reward criteria
, 2000
Abstract

Cited by 3 (1 self)
We consider a Markov decision process with an uncountable state space and multiple rewards. For each policy, its performance is evaluated by a vector of total expected rewards. Under the standard continuity assumptions and the additional assumption that all initial and transition probabilities are nonatomic, we prove that the set of performance vectors for all policies is equal to the set of performance vectors for (nonrandomized) Markov policies. This result implies the existence of optimal (nonrandomized) Markov policies for nonatomic constrained Markov decision processes with total rewards. We provide two examples of applications of our results to constrained multiple objective problems in inventory control and finance.
Markov Decision Processes with Constrained Stopping Times
Abstract
The optimization problem for a stopped Markov decision process is considered over stopping times τ constrained so that E[τ] ≤ α for some fixed α > 0. We introduce the concept of a randomized stationary stopping time, which is a mixed extension of the entry time of a stopping region, and prove the existence of an optimal constrained pair of stationary policy and stopping time by utilizing a Lagrange multiplier approach. Also, applying the idea of the one-step look-ahead (OLA) policy, the optimal constrained pair is sought concretely. As an example, a constrained Markov deteriorating system is explained. Key words: Markov decision process, constrained stopping time, Lagrange multiplier, OLA policy. 1 Introduction. The constrained optimal stopping problem originated with Nachman [15] and Kennedy [13], in which a Lagrangian approach was used to reduce the problem to an unconstrained stopping problem of a conventional type, and the constrained optimal stopping time is characterized. Als...
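The Lagrangian reduction this abstract describes can be sketched for a finite-state, finite-horizon stopping problem: charge each continuation step a multiplier lam (so the constraint E[τ] ≤ α enters the objective through lam) and run backward induction, where each update is exactly a one-step look-ahead comparison of stopping now versus continuing. The names and setup below are illustrative assumptions, not the paper's construction:

```python
def constrained_stopping_values(g, P, horizon, lam):
    """Backward induction for optimal stopping of a Markov chain.
    g[s]: reward when stopping in state s; P[s][s2]: transition probs;
    lam: Lagrange multiplier pricing each step of delay."""
    n = len(g)
    V = list(g)  # at the horizon, stopping is forced
    for _ in range(horizon):
        V = [max(g[s],  # one-step look-ahead: stop now vs continue
                 -lam + sum(P[s][s2] * V[s2] for s2 in range(n)))
             for s in range(n)]
    return V
```

Sweeping lam and checking the induced expected stopping time against α then recovers a constrained pair, in the spirit of the Lagrangian approach the abstract credits to Nachman and Kennedy.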