Results 1-10 of 47
Reinforcement Learning I: Introduction
, 1998
Abstract

Cited by 5500 (120 self)
In which we try to give a basic intuitive sense of what reinforcement learning is and how it differs from and relates to other fields, e.g., supervised learning and neural networks, genetic algorithms and artificial life, and control theory. Intuitively, RL is trial and error (variation and selection, search) plus learning (association, memory). We argue that RL is the only field that seriously addresses the special features of the problem of learning from interaction to achieve long-term goals.
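The "trial and error plus learning" intuition can be made concrete with a minimal tabular Q-learning loop. The chain environment, reward scheme, and hyperparameters below are my own illustration, not from the paper:

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a toy chain: actions 0 (left) / 1 (right),
    reward 1 for reaching the rightmost state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # trial and error: random action with probability eps, else greedy
            a = rng.randrange(2) if rng.random() < eps else (1 if Q[s][1] >= Q[s][0] else 0)
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # learning: associate state-action pairs with long-term return
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

The epsilon-greedy draw is the "variation and selection" half; the temporal-difference update is the "association, memory" half.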
Active Learning with Statistical Models
, 1995
Abstract

Cited by 677 (12 self)
For many types of learners one can compute the statistically "optimal" way to select data. We review how these techniques have been used with feedforward neural networks [MacKay, 1992; Cohn, 1994]. We then show how the same principles may be used to select data for two alternative, statistically based learning architectures: mixtures of Gaussians and locally weighted regression. While the techniques for neural networks are expensive and approximate, the techniques for mixtures of Gaussians and locally weighted regression are both efficient and accurate.
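The core idea of statistically based data selection can be sketched with a crude proxy: query where a locally weighted regressor is least certain. The inverse-total-kernel-weight uncertainty measure below is a simplification of my own, not the exact variance criterion of Cohn et al.:

```python
import math

def lwr_uncertainty(X, x, h=0.5):
    """Proxy for the predictive variance of a locally weighted regressor at x:
    inverse of the total Gaussian kernel weight (little nearby data => high uncertainty)."""
    return 1.0 / sum(math.exp(-((xi - x) ** 2) / (2 * h * h)) for xi in X)

def select_query(X, candidates, h=0.5):
    """Active data selection: query the candidate input where the model is least certain."""
    return max(candidates, key=lambda x: lwr_uncertainty(X, x, h))
```

With inputs clustered near zero, the selector prefers a far-away candidate, where no training data constrains the local fit.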
Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty
, 1998
Abstract

Cited by 52 (1 self)
This paper presents an action selection technique for reinforcement learning in stationary Markovian environments. This technique may be used in direct algorithms such as Q-learning, or in indirect algorithms such as adaptive dynamic programming. It is based on two principles. The first is to define a local measure of the uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks. In particular, a direct application of it leads to algorithms of low quality that can be easily misled by particular configurations of the environment. The second basic principle was introduced to eliminate this drawback. It consists of treating the local measures of uncertainty as rewards, and back-propagating them with dynamic programming or temporal difference mechanisms. This allows global-scale reasoning about the uncertainty to be reproduced using only local measures of it. Numerical simulations clearly show the efficiency of these propositions. Keywords: ...
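The second principle can be sketched as follows: treat a count-based uncertainty bonus as a reward and learn "exploration values" with Q-learning-style sweeps, so uncertainty propagates beyond the state where it is measured. The 1/sqrt(1+count) bonus and all parameters are illustrative assumptions, not the paper's exact measure:

```python
import math

def exploration_values(transitions, n_states, n_actions, gamma=0.9, alpha=0.2, sweeps=50):
    """Back-propagation of uncertainty: the local bonus 1/sqrt(1+count) acts as a
    reward, and temporal-difference-style updates spread it through the state space,
    so actions leading toward poorly explored regions also look attractive."""
    count = {}
    for (s, a, s2) in transitions:
        count[(s, a)] = count.get((s, a), 0) + 1
    E = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for (s, a, s2) in transitions:
            bonus = 1.0 / math.sqrt(1 + count[(s, a)])
            E[s][a] += alpha * (bonus + gamma * max(E[s2]) - E[s][a])
    return E
```

In the test below, both actions at state 0 have identical local bonuses, but the one leading toward an under-explored transition ends up with the higher exploration value, which is exactly the global-scale effect the paper aims for.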
Exploration bonuses and dual control
 MACHINE LEARNING
, 1996
Abstract

Cited by 40 (2 self)
Finding the Bayesian balance between exploration and exploitation in adaptive optimal control is in general intractable. This paper shows how to compute suboptimal estimates based on a certainty equivalence approximation (Cozzolino, Gonzalez-Zubieta & Miller, 1965) arising from a form of dual control. This systematizes and extends existing uses of exploration bonuses in reinforcement learning (Sutton, 1990). The approach has two components: a statistical model of uncertainty in the world and a way of turning this into exploratory behavior. This general approach is applied to two-dimensional mazes with movable barriers and its performance is compared with Sutton's DYNA system.
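Sutton's (1990) exploration bonus, which this paper systematizes, rewards actions in proportion to the square root of the time since they were last tried. A minimal sketch of that action rule, with hypothetical data structures of my own:

```python
import math

def dyna_plus_action(Q, last_visit, s, t, kappa=0.1):
    """Pick argmax_a Q[s][a] + kappa * sqrt(time since (s, a) was last tried).
    The bonus grows while an action goes untried, drawing the agent back to
    neglected parts of the environment (useful when barriers can move)."""
    return max(Q[s], key=lambda a: Q[s][a] + kappa * math.sqrt(t - last_visit[(s, a)]))
```

With kappa = 0 the rule is plain greedy; a positive kappa lets a slightly worse but long-untried action win.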
Control of Exploitation-Exploration Meta-Parameter in Reinforcement Learning
 Neural Networks
, 2002
Abstract

Cited by 31 (1 self)
In reinforcement learning, the duality between exploitation and exploration has long been an important issue. This paper presents a new method that controls the balance between exploitation and exploration. Our learning scheme is based on model-based reinforcement learning, in which Bayes inference with a forgetting effect estimates the state-transition probability of the environment. The balance parameter, which corresponds to the randomness in action selection, is controlled based on variation of action results and perception of environmental change. When applied to maze tasks, our method successfully obtains good controls by adapting to environmental changes. Recently, Usher et al. [60] have suggested that noradrenergic neurons in the locus coeruleus may control the exploitation-exploration balance in a real brain and that the balance may correspond to the level of an animal's selective attention. In line with this scenario, we also discuss a possible implementation in the brain.
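The balance parameter here is the inverse temperature of Boltzmann (softmax) action selection. Below is a sketch of that selection rule plus a hypothetical controller in the spirit described (cool toward greediness when results are stable, reheat on detected change); the `update_beta` rule is my own illustration, not the paper's actual control law:

```python
import math, random

def softmax_action(q_values, beta, rng=random):
    """Boltzmann action selection: beta is the exploitation-exploration
    meta-parameter; large beta => near-greedy, small beta => near-random."""
    m = max(q_values)
    ps = [math.exp(beta * (q - m)) for q in q_values]  # shift by max for stability
    r, acc = rng.random() * sum(ps), 0.0
    for a, p in enumerate(ps):
        acc += p
        if r <= acc:
            return a
    return len(ps) - 1

def update_beta(beta, reward_var, change_detected, k=1.0):
    """Hypothetical meta-parameter control: drop beta (explore more) when an
    environmental change is perceived; otherwise raise it faster the more
    stable recent action results are."""
    return 0.5 * beta if change_detected else beta + k / (1.0 + reward_var)
```

At beta = 50 the softmax is effectively greedy; halving beta after a change injects fresh randomness into action selection.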
Recursive Lazy Learning for Modeling and Control
 IN: MACHINE LEARNING: ECML-98 (10TH EUROPEAN CONFERENCE ON MACHINE LEARNING)
, 1998
Abstract

Cited by 21 (2 self)
This paper presents a local method for modeling and control of nonlinear dynamical systems when only a limited amount of input-output data is available. The proposed methodology couples local model identification, inspired by the lazy learning technique, with a control strategy based on linear optimal control theory. The local modeling procedure uses a query-based approach to select the best model configuration by assessing and comparing different alternatives. A new recursive technique for local model identification and validation is presented, together with an enhanced statistical method for model selection. The control method combines the linearization provided by the local learning techniques with optimal linear control theory to control nonlinear systems in far-from-equilibrium configurations. Simulation results of the identification of a nonlinear benchmark model and of the control of a complex nonlinear system (the bioreactor) are presented.
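Recursive identification of a local linear model is typically done with recursive least squares. A textbook RLS step (not the paper's specific enhanced variant) looks like this, with `lam < 1` acting as a forgetting factor:

```python
def rls_update(theta, P, x, y, lam=1.0):
    """One recursive least-squares step for a linear model y ~ theta . x.
    P approximates the parameter covariance; lam < 1 discounts old data."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(xi * pi for xi, pi in zip(x, Px))
    k = [pi / denom for pi in Px]                      # gain vector
    err = y - sum(ti * xi for ti, xi in zip(theta, x)) # prediction error
    theta = [ti + ki * err for ti, ki in zip(theta, k)]
    P = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(n)] for i in range(n)]
    return theta, P
```

Feeding noiseless samples of y = 2x + 1 (features [x, 1]) recovers the coefficients after a handful of updates, which is what makes the recursive form attractive when data arrives point by point.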
Local Bandit Approximation for Optimal Learning Problems
 Advances in Neural Information Processing Systems 9
, 1997
Abstract

Cited by 15 (0 self)
In general, procedures for determining Bayes-optimal adaptive controls for Markov decision processes (MDPs) require a prohibitive amount of computation: the optimal learning problem is intractable. This paper proposes an approximate approach in which bandit processes are used to model, in a certain "local" sense, a given MDP. Bandit processes constitute an important subclass of MDPs, and have optimal learning strategies (defined in terms of Gittins indices) that can be computed relatively efficiently. Thus, one scheme for achieving approximately optimal learning for general MDPs proceeds by taking actions suggested by strategies that are optimal with respect to local bandit models.
1 INTRODUCTION
Watkins [1989] has defined optimal learning as: "... the process of collecting and using information during learning in an optimal manner, so that the learner makes the best possible decisions at all stages of learning: learning itself is regarded as a multistage decision process, and lea...
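The overall scheme can be sketched as: treat each state's actions as bandit arms and take the highest-index action. Computing true Gittins indices is itself nontrivial, so the index below is a deliberately crude stand-in of my own (posterior mean plus an uncertainty premium that shrinks with the observation count and grows with the discount factor), not the paper's actual approximation:

```python
import math

def approximate_index(mean, n, gamma=0.9, c=1.0):
    """Crude stand-in for a Gittins index: estimated mean reward plus an
    uncertainty premium; larger gamma (more future in which to exploit what
    is learned) and fewer observations n both raise the premium."""
    return mean + c * math.sqrt(-math.log(1 - gamma) / max(n, 1))

def pick_action(stats, gamma=0.9):
    """Local bandit model of one state: stats maps action -> (mean, count);
    act by taking the highest-index arm."""
    return max(stats, key=lambda a: approximate_index(stats[a][0], stats[a][1], gamma))
```

A rarely tried action with a slightly lower mean can still win the index comparison, which is the value-of-information behavior the bandit framing is meant to capture.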
Lazy Learning for Local Modeling and Control Design
 INTERNATIONAL JOURNAL OF CONTROL. ACCEPTED
, 1997
Abstract

Cited by 15 (4 self)
This paper presents local methods for modeling and control of discrete-time unknown nonlinear dynamical systems when only a limited amount of input-output data is available. We propose the adoption of lazy learning, a memory-based technique for local modeling. The modeling procedure uses a query-based approach to select the best model configuration by assessing and comparing different alternatives. A new recursive technique for local model identification and validation is presented, together with an enhanced statistical method for model selection. Also, three methods to design controllers based on the local linearization provided by the lazy learning algorithm are described. In the first method, the lazy technique returns the forward and inverse models of the system, which are used to compute the control action to take. The second is an indirect method inspired by self-tuning regulators, where recursive least squares estimation is replaced by a local approximator. The third method combin...
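"Lazy" (memory-based) learning defers all modeling work to query time: the stored input-output pairs are the model, and a local fit is built around each query. A minimal one-dimensional sketch, not the paper's full local-linear procedure:

```python
def lazy_predict(X, y, xq, k=3):
    """Memory-based prediction: no global model is fit; at query time, take a
    distance-weighted average of the k stored examples nearest to xq."""
    neighbors = sorted(zip(X, y), key=lambda p: abs(p[0] - xq))[:k]
    w = [1.0 / (1e-9 + abs(xi - xq)) for xi, _ in neighbors]  # inverse-distance weights
    return sum(wi * yi for wi, (_, yi) in zip(w, neighbors)) / sum(w)
```

The query-based model selection described in the abstract would go one step further, comparing several such local configurations (different k, different weightings) at each query and keeping the best-validated one.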
Pricing Information Bundles in a Dynamic Environment
 Third ACM Conference on Electronic Commerce
, 2001
Abstract

Cited by 14 (2 self)
Markets for digital information goods provide the possibility of exploring new and more complex pricing schemes, due to information goods' flexibility and negligible marginal cost. In this paper we compare the dynamic performance of price schedules of varying complexity under two different specifications of consumer demand shifts. A monopolist producer employs a simple direct-search method that seeks to maximize profits using various price schedules. We find that the complexity of the price schedule affects both the amount of exploration necessary and the aggregate profit received by a producer. The size of the bundle offered, the rate of population change, and the number of iterations a producer can expect to interact with a population in total all affect the choice of schedule. If the number of iterations is small, a producer is best off randomly choosing a high-dimensional schedule, particularly when the bundle size is large. As the number of interactions between the producer and a given consumer population increases, two-parameter schedules begin to perform best, as their learnability allows the producer to find near-optimal prices quickly. Our results have implications for automated learning and strategic pricing in nonstationary environments arising from changes in the consumer population, in individuals' preferences, or in the strategies of competing firms.
Pricing Information Bundles in a Dynamic Environment
, 2001
Abstract

Cited by 13 (2 self)
We explore a scenario in which a monopolist producer of information goods seeks to maximize its profits in a market where consumer demand shifts frequently and unpredictably. The producer may set an arbitrarily complex price schedule: a function that maps the set of purchased items to a price. However, lacking direct knowledge of consumer demand, it cannot compute the optimal schedule. Instead, it attempts to optimize profits via trial and error. By means of a simple model of consumer demand and a modified version of a simple nonlinear optimization routine, we study a variety of parametrizations of the price schedule and quantify some of the relationships among learnability, complexity, and profitability. In particular, we show that fixed pricing or simple two-parameter dynamic pricing schedules are preferred when demand shifts frequently, but that dynamic pricing based on more complex schedules tends to be most profitable when demand shifts very infrequently.
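Trial-and-error profit optimization over a two-parameter schedule can be sketched with derivative-free direct search. Both the consumer model (buy the bundle of size q iff valuation >= p0 + p1*q, marginal cost zero) and the compass-search routine below are hypothetical illustrations, not the paper's actual demand model or modified optimizer:

```python
def profit(params, consumers):
    """Hypothetical demand model: each consumer (bundle size q, valuation v)
    buys iff 0 < p0 + p1*q <= v; information goods have ~zero marginal cost,
    so revenue equals profit."""
    p0, p1 = params
    total = 0.0
    for q, v in consumers:
        price = p0 + p1 * q
        if 0 < price <= v:
            total += price
    return total

def pattern_search(f, start, step=1.0, shrink=0.5, iters=60):
    """Simple direct search (compass search): try +-step moves on each
    coordinate, keep any improvement, and shrink the step when stuck."""
    best, fb = list(start), f(start)
    for _ in range(iters):
        improved = False
        for i in range(len(best)):
            for d in (step, -step):
                cand = best[:]
                cand[i] += d
                fc = f(cand)
                if fc > fb:
                    best, fb, improved = cand, fc, True
        if not improved:
            step *= shrink
    return best, fb
```

Because the search only ever accepts improvements, the final profit is guaranteed to be at least that of the starting schedule, mirroring the producer's incremental learning in the paper.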