Results 1–10 of 13
Dynamic Preferences in Multi-Criteria Reinforcement Learning
In Proceedings of ICML-05, 2005
Cited by 12 (3 self)
The current framework of reinforcement learning is based on maximizing the expected returns based on scalar rewards. But in many real-world situations, tradeoffs must be made among multiple objectives. Moreover, the agent’s preferences between different objectives may vary with time. In this paper, we consider the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance. We propose a method that allows us to store a finite number of policies, choose an appropriate policy for any weight vector and improve upon it. The idea is that although there are infinitely many weight vectors, they may be well-covered by a small number of optimal policies. We show this empirically in two domains: a version of the Buridan’s ass problem and network routing.
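The policy-selection step this abstract describes (pick, from a finite store of policies, the one that scores best under the current weight vector) can be sketched as follows. The policy names and `value` vectors are hypothetical, and the paper's policy-improvement step is omitted:

```python
def best_policy(policies, w):
    """Return the stored policy whose expected-return vector V maximizes w . V."""
    score = lambda p: sum(wi * vi for wi, vi in zip(w, p["value"]))
    return max(policies, key=score)

# hypothetical stored policies, each holding a vector of expected returns per objective
policies = [
    {"name": "fast", "value": (10.0, 2.0)},  # good on objective 1
    {"name": "safe", "value": (4.0, 9.0)},   # good on objective 2
]

print(best_policy(policies, (0.8, 0.2))["name"])  # fast
print(best_policy(policies, (0.2, 0.8))["name"])  # safe
```

As the weight vector drifts over time, the same small store keeps being reused; only the arg-max changes.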
Learning All Optimal Policies with Multiple Criteria
Cited by 11 (0 self)
We describe an algorithm for learning in the presence of multiple criteria. Our technique generalizes previous approaches in that it can learn optimal policies for all linear preference assignments over the multiple reward criteria at once. The algorithm can be viewed as an extension of standard reinforcement learning for MDPs where, instead of repeatedly backing up maximal expected rewards, we back up the set of expected rewards that are maximal for some set of linear preferences (given by a weight vector w). We present the algorithm along with a proof of correctness showing that our solution gives the optimal policy for any linear preference function. The solution reduces to the standard value iteration algorithm for a specific weight vector w.
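The backup this abstract mentions, keeping only the expected-reward vectors that are maximal for some linear preference, can be pictured for two criteria as an upper concave envelope computation. This is a geometric sketch on hypothetical value vectors only; the paper's backup runs inside value iteration:

```python
def hull_backup(vectors):
    """Keep only the 2-D value vectors that are optimal for some nonnegative
    weight vector w: the upper-right concave envelope of the point set."""
    # drop Pareto-dominated points first
    pts = [p for p in vectors
           if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in vectors)]
    pts.sort()  # x ascending implies y descending on the Pareto set
    hull = []
    for p in pts:
        # pop points that fall below the chord to the incoming point
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (p[0] - x1) * (y2 - y1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# (0.4, 0.4) is dominated outright; (0.7, 0.7) survives because it is best
# for intermediate weights such as w = (0.5, 0.5)
print(hull_backup([(1, 0), (0, 1), (0.4, 0.4), (0.7, 0.7)]))
```

Backing up such sets instead of scalars is what lets a single run serve every weight vector at once.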
Online Learning with Sample Path Constraints
Cited by 7 (6 self)
We study online learning when the objective of the decision maker is to maximize her long-term average reward subject to certain sample-path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature’s choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem.
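For intuition, the reward-in-hindsight on a toy instance can be computed directly: with Nature's choices known, it is a small linear program over mixed actions, and with a single linear constraint an optimal mixture uses at most two actions. A brute-force sketch with hypothetical per-action rewards and costs:

```python
def reward_in_hindsight(rewards, costs, budget, steps=1000):
    """Best achievable average reward over mixtures of actions whose average
    cost respects the budget.  With one linear constraint, an optimal mixture
    uses at most two actions, so a pairwise search is enough."""
    best = float("-inf")
    n = len(rewards)
    for i in range(n):
        for j in range(n):
            for k in range(steps + 1):
                p = k / steps  # probability of playing action i
                if p * costs[i] + (1 - p) * costs[j] <= budget:
                    best = max(best, p * rewards[i] + (1 - p) * rewards[j])
    return best

# two actions: high reward at full cost vs. low reward at zero cost
print(reward_in_hindsight([1.0, 0.2], [1.0, 0.0], budget=0.5))  # 0.6
```

The hard part the paper addresses is attaining (the convex hull of) this benchmark online, without knowing Nature's choices in advance.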
Online Learning with Constraints
In Proceedings of the 19th Annual Conference on Learning Theory, 2006
Cited by 4 (2 self)
We study online learning where the objective of the decision maker is to maximize her average long-term reward given that some average constraints are satisfied along the sample path. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature’s choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint the convex hull turns out to be the highest attainable function. We further provide an explicit strategy that attains this convex hull using a calibrated forecasting rule.
Reinforcement Learning Without Rewards
2010
Cited by 3 (0 self)
Machine learning can be broadly defined as the study and design of algorithms that improve with experience. Reinforcement learning is a variety of machine learning that makes minimal assumptions about the information available for learning, and, in a sense, defines the problem of learning in the broadest possible terms. Reinforcement learning algorithms are usually applied to “interactive” problems, such as learning to drive a car, operate a robotic arm, or play a game. In reinforcement learning, an autonomous agent must learn how to behave in an unknown, uncertain, and possibly hostile environment, using only the sensory feedback that it receives from the environment. As the agent moves from one state of the environment to another, it receives only a reward signal — there is no human “in the loop” to tell the algorithm exactly what to do. The goal in reinforcement learning is to learn an optimal behavior that maximizes the total reward that the agent collects. Despite its generality, the reinforcement learning framework does make one strong assumption: that the reward signal can always be directly and unambiguously observed. In other words, the feedback a reinforcement learning algorithm receives is ...
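The interaction loop sketched here (act, observe a scalar reward, update, no human in the loop) is exactly what tabular Q-learning implements. A minimal sketch on a hypothetical 5-state corridor where reward arrives only at the goal:

```python
import random

def q_learning_corridor(n_states=5, episodes=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a corridor: actions 0/1 move left/right, and the
    only reward is +1 for entering the rightmost (terminal) state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy, with random tie-breaking so early exploration works
            if rng.random() < eps or Q[s][0] == Q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning_corridor()
print([0 if Q[s][0] > Q[s][1] else 1 for s in range(4)])  # learned greedy policy
```

The agent never sees the environment's dynamics, only the scalar reward, which is precisely the assumption the thesis goes on to question.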
Computing Optimal Stationary Policies for Multi-Objective Markov Decision Processes
Cited by 2 (0 self)
This paper describes a novel algorithm called CON-MODP for computing Pareto optimal policies for deterministic multi-objective sequential decision problems. CON-MODP is a value-iteration-based multi-objective dynamic programming algorithm that computes only stationary policies. We observe that to guarantee convergence to the unique Pareto optimal set of deterministic stationary policies, the algorithm needs to perform a policy evaluation step on particular policies that are inconsistent in a single state being expanded. We prove that the algorithm converges to the Pareto optimal set of value functions and policies for deterministic infinite-horizon discounted multi-objective Markov decision processes. Experiments show that CON-MODP is much faster than previous multi-objective value iteration algorithms.
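The Pareto-optimal set at the heart of this kind of multi-objective dynamic programming can be illustrated with a plain dominance filter. This is a sketch on hypothetical value vectors; the algorithm itself interleaves such filtering with value iteration and a policy-evaluation repair step:

```python
def pareto_front(vectors):
    """Keep the vectors that are not dominated: v is dominated if some u is at
    least as good in every objective and strictly better in at least one."""
    def dominated(v):
        return any(all(ui >= vi for ui, vi in zip(u, v)) and u != v
                   for u in vectors)
    return [v for v in vectors if not dominated(v)]

# (2, 2) is Pareto optimal even though no linear weighting strictly prefers it
# to both (3, 1) and (1, 3); (1, 1) is dominated and dropped
print(pareto_front([(3, 1), (1, 3), (2, 2), (1, 1)]))  # [(3, 1), (1, 3), (2, 2)]
```

Note the difference from linear scalarization: Pareto filtering retains points like (2, 2) that no single weight vector would select outright.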
Linear fitted-Q iteration with multiple reward functions
 Journal of Machine Learning Research
Cited by 2 (0 self)
We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward-function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.
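The "globally dominated action" test mentioned here can be approximated by sweeping sampled preference weights: an action is flagged when its vector-valued Q is maximal for no weight. A two-reward sketch with hypothetical Q-vectors; the paper does this exactly with computational-geometry primitives rather than sampling:

```python
def globally_dominated(q_vectors, n_weights=201):
    """Return the actions whose Q-vector is never the argmax of w . Q over a
    grid of two-dimensional preference weights w = (t, 1 - t)."""
    winners = set()
    for k in range(n_weights):
        t = k / (n_weights - 1)
        scores = [t * q[0] + (1 - t) * q[1] for q in q_vectors]
        winners.add(scores.index(max(scores)))
    return [a for a in range(len(q_vectors)) if a not in winners]

# three actions scored on two reward signals, e.g. symptom reduction vs. weight gain
print(globally_dominated([(4, 1), (1, 4), (2, 2)]))  # action 2 never wins
```

Pruning such actions is safe for every linear preference a clinician might later express, which is what makes the decision aid weight-agnostic.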
Stochastic Bandits with Pathwise Constraints
We consider the problem of stochastic bandits, with the goal of maximizing a reward while satisfying pathwise constraints. The motivation for this problem comes from cognitive radio networks, in which agents need to choose between different transmission profiles to maximize throughput under certain operational constraints such as limited average power. Stochastic bandits serve as a natural model for an unknown, stationary environment. We propose an algorithm, based on a steering approach, and analyze its regret with respect to the optimal stationary policy that knows the statistics of the different arms.
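One way to picture the steering idea is a toy heuristic that tracks empirical reward and cost means per arm and steers pulls so the running average cost stays within a budget. The arms and thresholds below are hypothetical; this illustrates the constraint mechanics, not the paper's algorithm or its regret guarantee:

```python
import random

def constrained_bandit(arms, budget, pulls, seed=0):
    """Pull the highest-mean-reward arm whose projected average cost stays
    within the budget; otherwise steer to the cheapest arm."""
    rng = random.Random(seed)
    n = len(arms)
    counts, r_sum, c_sum = [0] * n, [0.0] * n, [0.0] * n
    total_r = total_c = 0.0
    for t in range(1, pulls + 1):
        if t <= n:
            a = t - 1  # pull each arm once to initialise the estimates
        else:
            mean_r = [r_sum[i] / counts[i] for i in range(n)]
            mean_c = [c_sum[i] / counts[i] for i in range(n)]
            feasible = [i for i in range(n) if (total_c + mean_c[i]) / t <= budget]
            if feasible:
                a = max(feasible, key=lambda i: mean_r[i])
            else:
                a = min(range(n), key=lambda i: mean_c[i])
        r, c = arms[a](rng)
        counts[a] += 1; r_sum[a] += r; c_sum[a] += c
        total_r += r; total_c += c
    return total_r / pulls, total_c / pulls

# two transmission profiles: high throughput at full power vs. low at zero power
arms = [lambda rng: (1.0, 1.0), lambda rng: (0.2, 0.0)]
avg_r, avg_c = constrained_bandit(arms, budget=0.5, pulls=1000)
print(round(avg_r, 3), round(avg_c, 3))
```

On this deterministic toy the heuristic alternates profiles, holding average power at the budget while mixing the two throughputs.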
Towards Preference-Based Reinforcement Learning
This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a preference-based approach to reinforcement learning is the observation that in many real-world domains, numerical feedback signals are not readily available, or are defined arbitrarily in order to satisfy the needs of conventional RL algorithms. Instead, we propose an alternative framework for reinforcement learning, in which qualitative reward signals can be directly used by the learner. The framework may be viewed as a generalization of the conventional RL framework in which only a partial order between policies is required, instead of the total order induced by their respective expected long-term rewards. Therefore, building on novel methods for preference learning, our general goal is to equip the RL agent with qualitative policy models, such as ranking functions that allow for sorting its available actions from most to least promising, as well as algorithms for learning such models from qualitative feedback. As a proof of concept, we realize a first simple instantiation of this framework that defines preferences based on utilities observed for trajectories. To that end, we build on an existing method for approximate policy iteration based on rollouts. While this approach is based on the use of classification methods for generalization and policy learning, we make use of a specific type of preference learning method called label ranking. Advantages of preference-based policy iteration are illustrated by means of two case studies.
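A minimal instantiation of trajectory-based preferences: compare rollouts pairwise and rank actions by how often their trajectories win, never using the utility magnitudes themselves. The rollout model below is hypothetical; the paper's label-ranking machinery generalizes this kind of pairwise comparison:

```python
import random

def preferred_action(rollout_utility, actions, state, n=50, seed=0):
    """Rank actions by pairwise trajectory comparisons: a scores a win over b
    whenever a rollout starting with a beats a rollout starting with b."""
    rng = random.Random(seed)
    wins = {a: 0 for a in actions}
    for a in actions:
        for b in actions:
            if a == b:
                continue
            for _ in range(n):
                if rollout_utility(state, a, rng) > rollout_utility(state, b, rng):
                    wins[a] += 1
    return max(actions, key=lambda a: wins[a])

# hypothetical noisy rollout returns: "steady" is better on average than "erratic"
def rollout_utility(state, action, rng):
    return rng.gauss(0.7, 0.1) if action == "steady" else rng.gauss(0.5, 0.4)

print(preferred_action(rollout_utility, ["steady", "erratic"], state=None))
```

Only the outcome of each comparison matters, so any monotone rescaling of the utilities, or a purely qualitative judge, yields the same ranking.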
Multi-objective Reinforcement Learning Using Adaptive Dynamic Programming and Reservoir Computing
This paper introduces a multi-objective reinforcement learning approach that is suitable for large state and action spaces. The approach is based on actor-critic design and reservoir computing. A single reservoir estimates several utilities simultaneously and provides the gradients required by the actor, enabling an agent to adapt its behavior in the presence of several sources of rewards. We describe the approach in theoretical terms, supported by simulation results.