Results 1-10 of 304
An application of reinforcement learning to aerobatic helicopter flight
 In Advances in Neural Information Processing Systems 19
, 2007
Cited by 126 (10 self)
Autonomous helicopter flight is widely regarded to be a highly challenging control problem. This paper presents the first successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward flip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results significantly extend the state of the art in autonomous helicopter flight. We used the following approach: First, we had a pilot fly the helicopter to help us find a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to find a controller that is optimized for the resulting model and reward function. More specifically, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR).
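The LQR machinery named in the abstract can be illustrated in miniature. Below is a minimal finite-horizon LQR sketch, the linear base case that DDP extends by iteratively linearizing nonlinear dynamics; the double-integrator dynamics and cost weights here are illustrative assumptions, not the paper's helicopter model.

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, T):
    """Finite-horizon discrete-time LQR via backward Riccati recursion.

    Returns feedback gains K_t such that u_t = -K_t x_t minimizes
    sum_t (x_t' Q x_t + u_t' R u_t) for dynamics x_{t+1} = A x_t + B u_t.
    """
    P = Q.copy()
    gains = []
    for _ in range(T):
        # K solves (R + B'PB) K = B'PA at each backward step
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # reorder so gains run t = 0 .. T-1

# Toy double-integrator example (hypothetical parameters)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])
Ks = lqr_finite_horizon(A, B, Q, R, T=50)

# Roll the closed loop forward from an offset initial state
x = np.array([[1.0], [0.0]])
for K in Ks:
    x = A @ x - B @ (K @ x)
```

DDP applies the same backward recursion around a nominal trajectory of a nonlinear model, re-linearizing at each iteration.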
Autonomous Helicopter Control using Reinforcement Learning Policy Search Methods
 In International Conference on Robotics and Automation
, 2001
Cited by 116 (1 self)
Many control problems in the robotics field can be cast as Partially Observed Markovian Decision Problems (POMDPs), an optimal control formalism. Finding optimal solutions to such problems in general, however, is known to be intractable. It has often been observed that in practice, simple structured controllers suffice for good suboptimal control, and recent research in the artificial intelligence community has focused on policy search methods as techniques for finding suboptimal controllers when such structured controllers do exist. Traditional model-based reinforcement learning algorithms make a certainty equivalence assumption on their learned models and calculate optimal policies for a maximum-likelihood Markovian model. In this work, we consider algorithms that evaluate and synthesize controllers under distributions of Markovian models. Previous work has demonstrated that algorithms that maximize mean reward with respect to model uncertainty lead to safer and more robust controllers.
Exploration and apprenticeship learning in reinforcement learning
 In ICML
, 2005
Cited by 102 (3 self)
We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E3 (Kearns and Singh, 2002) learn nearoptimal policies by using “exploration policies ” to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain nearoptimal performance (compared to the teacher) simply by repeatedly executing “exploitation policies ” that try to maximize rewards. In finitestate MDPs, our algorithm scales polynomially in the number of states; in continuousstate linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses. 1.
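The "repeatedly executing exploitation policies" loop described in the abstract has a simple shape. The sketch below is only a schematic of that loop; `fit_model`, `plan`, and `rollout` are hypothetical callables standing in for the paper's components, not its actual API.

```python
def apprenticeship_rl(demo_trajectory, fit_model, plan, rollout, n_iters=10):
    """Exploitation-only apprenticeship loop (schematic, illustrative names):
    fit a model to all data seen so far, compute the (approximately)
    optimal policy for that model, execute it on the real system, and
    fold the new trajectory back into the data set. No explicit
    exploration policy is ever run."""
    data = [demo_trajectory]
    policy = None
    for _ in range(n_iters):
        model = fit_model(data)       # e.g. maximum-likelihood dynamics fit
        policy = plan(model)          # exploitation policy for current model
        data.append(rollout(policy))  # execute on the real system
    return policy

# Trivial stub usage: the loop runs and returns the last planned policy.
final = apprenticeship_rl("demo", lambda d: len(d), lambda m: m, lambda p: p)
```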
The jackknife: a review
 Biometrika
, 1974
Cited by 101 (0 self)
Interleukin (IL)-33 is a new member of the IL-1 superfamily of cytokines that is expressed mainly by stromal cells, such as epithelial and endothelial cells, and its expression is upregulated following proinflammatory stimulation. IL-33 can function both as a traditional cytokine and as a nuclear factor regulating gene transcription. It is thought to function as an ‘alarmin’ released following cell necrosis to alert the immune system to tissue damage or stress. It mediates its biological effects via interaction with the receptors ST2 (IL-1RL1) and IL-1 receptor accessory protein (IL-1RAcP), both of which are widely expressed, particularly by innate immune cells and T helper 2 (Th2) cells. IL-33 strongly induces Th2 cytokine production from these cells and can promote the pathogenesis of Th2-related diseases such as asthma, atopic dermatitis and anaphylaxis. However, IL-33 has shown various protective effects in cardiovascular diseases such as atherosclerosis, obesity, type 2 diabetes and cardiac remodeling. Thus, the effects of IL-33 are either pro- or anti-inflammatory depending on the disease and the model. In this review the role of IL-33 in the inflammation of several disease pathologies will be discussed, with particular emphasis on recent advances.
Near-optimal Regret Bounds for Reinforcement Learning
Cited by 95 (11 self)
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm.
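The diameter D is easy to compute in the special case of deterministic transitions, where the expected travel time between two states is just a shortest-path length; the sketch below makes that concrete. (The paper's general definition uses expected hitting times under the best policy for stochastic MDPs, which this deterministic simplification sidesteps.)

```python
from collections import deque

def diameter_deterministic(next_state, n_states, n_actions):
    """Diameter of a deterministic MDP: the maximum over ordered state
    pairs (s, s') of the minimum number of steps to reach s' from s.
    Computed by breadth-first search from every state.
    next_state[s][a] gives the successor of state s under action a."""
    D = 0
    for s0 in range(n_states):
        dist = {s0: 0}
        q = deque([s0])
        while q:
            s = q.popleft()
            for a in range(n_actions):
                t = next_state[s][a]
                if t not in dist:
                    dist[t] = dist[s] + 1
                    q.append(t)
        if len(dist) < n_states:
            return float("inf")  # some state is unreachable from s0
        D = max(D, max(dist.values()))
    return D

# 4-state ring: action 0 steps forward, action 1 stays put.
# Reaching the state "behind" you takes 3 forward steps, so D = 3.
ring = [[(s + 1) % 4, s] for s in range(4)]
```

A finite diameter is exactly what the regret bound requires: if some state is unreachable, D is infinite and the bound is vacuous.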
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and . . .
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2006
Cited by 82 (5 self)
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1−δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
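A toy version of the elimination idea in the bandit setting, with Hoeffding intervals standing in for the paper's confidence bounds; the constants, batch schedule, and stopping rule here are illustrative simplifications, not the paper's exact algorithm.

```python
import math
import random

def eliminate_arms(pull, n_arms, delta=0.05, batch=200, rounds=20):
    """Successive elimination with Hoeffding confidence intervals (sketch).
    pull(i) returns a reward in [0, 1] for arm i. An arm is dropped as
    soon as its upper confidence bound falls below the best arm's lower
    confidence bound; surviving arms are pulled further."""
    active = list(range(n_arms))
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for _ in range(rounds):
        for i in active:
            for _ in range(batch):
                sums[i] += pull(i)
                counts[i] += 1

        def radius(i):
            # Hoeffding half-width with a crude union-bound correction
            return math.sqrt(math.log(2 * n_arms * counts[i] / delta)
                             / (2 * counts[i]))

        means = {i: sums[i] / counts[i] for i in active}
        best_lb = max(means[i] - radius(i) for i in active)
        active = [i for i in active if means[i] + radius(i) >= best_lb]
        if len(active) == 1:
            break
    return max(active, key=lambda i: sums[i] / counts[i])

# Bernoulli arms with clearly separated means (illustrative data)
random.seed(0)
arm_means = [0.2, 0.5, 0.8]
best = eliminate_arms(
    lambda i: 1.0 if random.random() < arm_means[i] else 0.0, 3)
```

The same "drop what the confidence intervals rule out" principle is what the paper lifts to Q-functions in the reinforcement learning setting.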
A theoretical analysis of model-based interval estimation
 Proceedings of the Twenty-Second International Conference on Machine Learning (ICML-05)
, 2005
Cited by 81 (10 self)
Several algorithms for learning near-optimal policies in Markov Decision Processes have been analyzed and proven efficient. Empirical results have suggested that Model-based Interval Estimation (MBIE) learns efficiently in practice, effectively balancing exploration and exploitation. This paper presents the first theoretical analysis of MBIE, proving its efficiency even under worst-case conditions. The paper also introduces a new performance metric, average loss, and relates it to its less “online” cousins from the literature.
Using relative novelty to identify useful temporal abstractions in reinforcement learning
 In Proceedings of the Twenty-First International Conference on Machine Learning
, 2004
Cited by 79 (12 self)
We present a new method for automatically creating useful temporal abstractions in reinforcement learning. We argue that states that allow the agent to transition to a different region of the state space are useful subgoals, and propose a method for identifying them using the concept of relative novelty. When such a state is identified, a temporally extended activity (e.g., an option) is generated that takes the agent efficiently to this state. We illustrate the utility of the method in a number of tasks.