Results 1–10 of 5,613
Finite-time analysis of the multiarmed bandit problem
Machine Learning, 2002
"... Abstract. Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy’s success in addressing ..."
Abstract

Cited by 817 (15 self)
Abstract. Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy’s success in addressing this dilemma is the regret, that is, the loss due to the fact that the globally optimal policy is not followed all the time. One of the simplest examples of the exploration/exploitation dilemma is the multiarmed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support. Keywords: bandit problems, adaptive allocation rules, finite horizon regret
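The uniform logarithmic-regret guarantee above is achieved by index policies such as UCB1 from this paper. A minimal sketch follows; the arm means, horizon, and Bernoulli reward model are illustrative assumptions, not taken from the abstract:

```python
import math
import random

def ucb1(pulls, rewards, t):
    """Pick the arm maximizing empirical mean + sqrt(2 ln t / n_a) (the UCB1 index)."""
    for a in range(len(pulls)):
        if pulls[a] == 0:          # play every arm once before using the index
            return a
    return max(range(len(pulls)),
               key=lambda a: rewards[a] / pulls[a]
                             + math.sqrt(2 * math.log(t) / pulls[a]))

def run(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given arm means; return pull counts."""
    rng = random.Random(seed)
    pulls = [0] * len(means)
    rewards = [0.0] * len(means)
    for t in range(1, horizon + 1):
        a = ucb1(pulls, rewards, t)
        r = 1.0 if rng.random() < means[a] else 0.0   # Bernoulli reward in [0, 1]
        pulls[a] += 1
        rewards[a] += r
    return pulls
```

Running `run([0.2, 0.8], 2000)` concentrates most pulls on the better arm while the worse arm is sampled only at a logarithmic rate, matching the regret bound discussed above.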
Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning
, 1999
"... Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We exte ..."
Abstract

Cited by 569 (38 self)
Learning, planning, and representing knowledge at multiple levels of temporal abstraction are key, longstanding challenges for AI. In this paper we consider how these challenges can be addressed within the mathematical framework of reinforcement learning and Markov decision processes (MDPs). We extend the usual notion of action in this framework to include options: closed-loop policies for taking action over a period of time. Examples of options include picking up an object, going to lunch, and traveling to a distant city, as well as primitive actions such as muscle twitches and joint torques. Overall, we show that options enable temporally abstract knowledge and action to be included in the reinforcement learning framework in a natural and general way. In particular, we show that options may be used interchangeably with primitive actions in planning methods such as dynamic programming and in learning methods such as Q-learning. Formally, a set of options defined
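The option construct described here has three parts: an initiation set, a closed-loop internal policy, and a termination condition. A minimal sketch, executing an option as a temporally extended action in a hypothetical six-state corridor (all names and the environment are illustrative):

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    init_set: Set[int]               # states in which the option may be invoked
    policy: Callable[[int], int]     # closed-loop internal policy: state -> action
    beta: Callable[[int], bool]      # termination condition: state -> done?

def step(s, a):
    # hypothetical corridor MDP: states 0..5, action is +1/-1, clipped to bounds
    return max(0, min(5, s + a))

def run_option(s, option):
    """Execute an option like a primitive action; return (resulting state, duration)."""
    assert s in option.init_set
    k = 0
    while not option.beta(s):
        s = step(s, option.policy(s))
        k += 1
    return s, k

# an option that walks right until it reaches the end of the corridor
go_right = Option(init_set={0, 1, 2, 3, 4},
                  policy=lambda s: +1,
                  beta=lambda s: s == 5)
```

From state 0, `run_option(0, go_right)` terminates in state 5 after five primitive steps, which is what lets options and primitive actions be used interchangeably in planning and Q-learning, as the abstract notes.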
Decision-Theoretic Planning: Structural Assumptions and Computational Leverage
JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH, 1999
"... Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives ..."
Abstract

Cited by 515 (4 self)
Planning under uncertainty is a central problem in the study of automated sequential decision making, and has been addressed by researchers in many different fields, including AI planning, decision analysis, operations research, control theory and economics. While the assumptions and perspectives adopted in these areas often differ in substantial ways, many planning problems of interest to researchers in these fields can be modeled as Markov decision processes (MDPs) and analyzed using the techniques of decision theory. This paper presents an overview and synthesis of MDP-related methods, showing how they provide a unifying framework for modeling many classes of planning problems studied in AI. It also describes structural properties of MDPs that, when exhibited by particular classes of problems, can be exploited in the construction of optimal or approximately optimal policies or plans. Planning problems commonly possess structure in the reward and value functions used to de...
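One of the classical MDP solution methods covered by such surveys is value iteration. A minimal sketch over an explicit transition table; the tiny two-state MDP in the usage note below is invented for illustration:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute optimal state values by iterating the Bellman optimality backup.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    expected immediate reward for taking action a in state s.
    """
    n = len(P)
    V = [0.0] * n
    while True:
        # Bellman optimality backup: V(s) = max_a [ R(s,a) + gamma * E[V(s')] ]
        Vn = [max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                  for a in range(len(P[s])))
              for s in range(n)]
        if max(abs(Vn[s] - V[s]) for s in range(n)) < tol:
            return Vn
        V = Vn
```

For a two-state example where state 0 can either stay (reward 0) or move to an absorbing state 1 (reward 1), value iteration converges to V(0) = 1 and V(1) = 0 under gamma = 0.9.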
Ant algorithms for discrete optimization
ARTIFICIAL LIFE, 1999
"... This article presents an overview of recent work on ant algorithms, that is, algorithms for discrete optimization that took inspiration from the observation of ant colonies’ foraging behavior, and introduces the ant colony optimization (ACO) metaheuristic. In the first part of the article the basic ..."
Abstract

Cited by 489 (42 self)
This article presents an overview of recent work on ant algorithms, that is, algorithms for discrete optimization that took inspiration from the observation of ant colonies’ foraging behavior, and introduces the ant colony optimization (ACO) metaheuristic. In the first part of the article the basic biological findings on real ants are reviewed and their artificial counterparts as well as the ACO metaheuristic are defined. In the second part of the article a number of applications of ACO algorithms to combinatorial optimization and routing in communications networks are described. We conclude with a discussion of related work and of some of the most important aspects of the ACO metaheuristic.
Least-Squares Policy Iteration
JOURNAL OF MACHINE LEARNING RESEARCH, 2003
"... We propose a new approach to reinforcement learning for control problems which combines valuefunction approximation with linear architectures and approximate policy iteration. This new approach ..."
Abstract

Cited by 462 (12 self)
We propose a new approach to reinforcement learning for control problems which combines value-function approximation with linear architectures and approximate policy iteration. This new approach
Policy gradient methods for reinforcement learning with function approximation.
In NIPS, 1999
"... Abstract Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly repres ..."
Abstract

Cited by 439 (20 self)
Abstract. Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Large applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e.g., as the policy that selects in each state the action with highest estimated value). The value-function approach has worked well in many applications, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities.

In this paper we explore an alternative approach to function approximation in RL. Rather than approximating a value function and using that to compute a deterministic policy, we approximate a stochastic policy directly using an independent function approximator with its own parameters. For example, the policy might be represented by a neural network whose input is a representation of the state, whose output is action selection probabilities, and whose weights are the policy parameters. Let θ denote the vector of policy parameters and ρ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately proportional to the gradient:

Δθ ≈ α ∂ρ/∂θ, (1)

where α is a positive-definite step size. If the above can be achieved, then θ can usually be assured to converge to a locally optimal policy in the performance measure ρ. Unlike the value-function approach, here small changes in θ can cause only small changes in the policy and in the state-visitation distribution. In this paper we prove that an unbiased estimate of the gradient (1) can be obtained from experience using an approximate value function satisfying certain properties. Our result also suggests a way of proving the convergence of a wide variety of algorithms based on "actor-critic" or policy-iteration architectures.

Policy Gradient Theorem. We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP). The state, action, and reward at each time t ∈ {0, 1, 2, ...} are denoted s_t ∈ S, a_t ∈ A, and r_t ∈ ℝ respectively. The environment's dynamics are characterized by state transition probabilities, P^a_{ss′} = Pr{s_{t+1} = s′ | s_t = s, a_t = a}, and expected rewards R^a_s = E{r_{t+1} | s_t = s, a_t = a}, ∀s, s′ ∈ S, a ∈ A. The agent's decision-making procedure at each time is characterized by a policy, π(s, a, θ) = Pr{a_t = a | s_t = s, θ}, ∀s ∈ S, a ∈ A, where θ ∈ ℝ^l, for l ≪ |S|, is a parameter vector. We assume that π is differentiable with respect to its parameter, i.e., that ∂π(s,a)/∂θ exists. We also usually write just π(s, a) for π(s, a, θ).

With function approximation, two ways of formulating the agent's objective are useful. One is the average-reward formulation, in which policies are ranked according to their long-term expected reward per step, ρ(π) = Σ_s d^π(s) Σ_a π(s, a) R^a_s, where d^π(s) = lim_{t→∞} Pr{s_t = s | s_0, π} is the stationary distribution of states under π, which we assume exists and is independent of s_0 for all policies. In the average-reward formulation, the value of a state–action pair given a policy is defined as Q^π(s, a) = Σ_{t=1}^∞ E{r_t − ρ(π) | s_0 = s, a_0 = a, π}. The second formulation we cover is that in which there is a designated start state s_0, and we care only about the long-term reward obtained from it. We will give our results only once, but they will apply to this formulation as well under the definitions ρ(π) = E{Σ_{t=1}^∞ γ^{t−1} r_t | s_0, π} and Q^π(s, a) = E{Σ_{k=1}^∞ γ^{k−1} r_{t+k} | s_t = s, a_t = a, π}, where γ ∈ [0, 1] is a discount rate (γ = 1 is allowed only in episodic tasks). In this formulation, we define d^π(s) as a discounted weighting of states encountered starting at s_0 and then following π: d^π(s) = Σ_{t=0}^∞ γ^t Pr{s_t = s | s_0, π}.

Our first result concerns the gradient of the performance metric with respect to the policy parameter:

Theorem 1 (Policy Gradient). For any MDP, in either the average-reward or start-state formulations,

∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s, a). (2)

Proof: See the appendix.

Marbach and Tsitsiklis (1998) describe a related but different expression for the gradient in terms of the state-value function. The key aspect of expression (2) is that there are no terms of the form ∂d^π(s)/∂θ: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if s was sampled from the distribution obtained by following π, then Σ_a ∂π(s,a)/∂θ Q^π(s, a) would be an unbiased estimate of ∂ρ/∂θ. Of course, Q^π(s, a) is also not normally known and must be estimated. One approach is to use the actual returns, reweighted by a factor that corrects for the oversampling of actions preferred by π, which is known to follow ∂ρ/∂θ in expected value.

Policy Gradient with Approximation. Now consider the case in which Q^π is approximated by a learned function approximator. If the approximation is sufficiently good, we might hope to use it in place of Q^π in (2) and still point roughly in the direction of the gradient. Let f_w : S × A → ℝ be our approximation to Q^π, with parameter w. It is natural to learn f_w by following π and updating w by a rule such as Δw_t ∝
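The gradient-following update described above can be illustrated with Williams's REINFORCE on a toy two-armed Bernoulli bandit; the softmax parameterization, step size, and arm means here are deliberately minimal illustrative choices, not the paper's setting:

```python
import math
import random

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reinforce_bandit(means, alpha=0.1, steps=5000, seed=0):
    """REINFORCE on a 2-armed Bernoulli bandit.

    Update: theta_b += alpha * r * d/d theta_b [log pi(a)],
    where for a softmax policy d log pi(a) / d theta_b = 1[b == a] - pi(b).
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1            # sample action from pi
        r = 1.0 if rng.random() < means[a] else 0.0     # Bernoulli reward
        for b in range(2):
            grad = (1.0 if b == a else 0.0) - pi[b]     # score-function gradient
            theta[b] += alpha * r * grad
    return softmax(theta)
```

Because each sampled update follows the gradient of expected reward in expectation, the policy drifts toward the arm with the higher payoff probability.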
Searching in metric spaces
, 2001
"... The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather gen ..."
Abstract

Cited by 436 (38 self)
The problem of searching the elements of a set that are close to a given query element under some similarity criterion has a vast number of applications in many branches of computer science, from pattern recognition to textual and multimedia information retrieval. We are interested in the rather general case where the similarity criterion defines a metric space, instead of the more restricted case of a vector space. Many solutions have been proposed in different areas, in many cases without cross-knowledge. Because of this, the same ideas have been reconceived several times, and very different presentations have been given for the same approaches. We present some basic results that explain the intrinsic difficulty of the search problem. This includes a quantitative definition of the elusive concept of “intrinsic dimensionality.” We also present a unified
The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity
PSYCHOLOGICAL REVIEW 109:679–709, 2002
"... The authors present a unified account of 2 neural systems concerned with the development and expression of adaptive behaviors: a mesencephalic dopamine system for reinforcement learning and a “generic ” errorprocessing system associated with the anterior cingulate cortex. The existence of the error ..."
Abstract

Cited by 430 (20 self)
 Add to MetaCart
(Show Context)
The authors present a unified account of 2 neural systems concerned with the development and expression of adaptive behaviors: a mesencephalic dopamine system for reinforcement learning and a “generic” error-processing system associated with the anterior cingulate cortex. The existence of the error-processing system has been inferred from the error-related negativity (ERN), a component of the event-related brain potential elicited when human participants commit errors in reaction-time tasks. The authors propose that the ERN is generated when a negative reinforcement learning signal is conveyed to the anterior cingulate cortex via the mesencephalic dopamine system and that this signal is used by the anterior cingulate cortex to modify performance on the task at hand. They provide support for this proposal using both computational modeling and psychophysiological experimentation. Human beings learn from the consequences of their actions. Thorndike (1911/1970) originally described this phenomenon with his law of effect, which made explicit the commonsense notion that actions that are followed by feelings of satisfaction are more likely to be generated again in the future, whereas actions that are followed by negative outcomes are less likely to reoccur. This
Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm
, 1998
"... In this paper, we adopt generalsum stochastic games as a framework for multiagent reinforcement learning. Our work extends previous work by Littman on zerosum stochastic games to a broader framework. We design a multiagent Qlearning method under this framework, and prove that it converges to a Na ..."
Abstract

Cited by 331 (4 self)
In this paper, we adopt general-sum stochastic games as a framework for multiagent reinforcement learning. Our work extends previous work by Littman on zero-sum stochastic games to a broader framework. We design a multiagent Q-learning method under this framework, and prove that it converges to a Nash equilibrium under specified conditions. This algorithm is useful for finding the optimal strategy when there exists a unique Nash equilibrium in the game. When there exist multiple Nash equilibria in the game, this algorithm should be combined with other learning techniques to find optimal strategies.
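The Q-learning update in such multiagent methods requires solving each stage game for a Nash equilibrium. As a simplified sketch, here is pure-strategy equilibrium enumeration in a bimatrix game; the paper itself handles mixed equilibria, which this deliberately does not:

```python
def pure_nash(A, B):
    """Enumerate pure-strategy Nash equilibria of a bimatrix game.

    A[i][j] is the row player's payoff and B[i][j] the column player's payoff.
    A cell (i, j) is an equilibrium when neither player can gain by deviating
    unilaterally: A[i][j] is maximal in column j and B[i][j] maximal in row i.
    """
    n, m = len(A), len(A[0])
    eqs = []
    for i in range(n):
        for j in range(m):
            row_best = A[i][j] >= max(A[k][j] for k in range(n))
            col_best = B[i][j] >= max(B[i][k] for k in range(m))
            if row_best and col_best:
                eqs.append((i, j))
    return eqs
```

In the Prisoner's Dilemma with payoffs A = [[-1, -3], [0, -2]] and B = [[-1, 0], [-3, -2]], the only equilibrium found is mutual defection, (1, 1); a multiagent Q-learner of the kind described above would use such an equilibrium value to back up its Q estimates.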