Results 1-10 of 31
Linear Hamilton-Jacobi-Bellman equations in high dimensions
 in Conference on Decision and Control (CDC), 2014, arXiv preprint arXiv:1404.1089
Abstract

Cited by 7 (3 self)
provides the globally optimal solution to large classes of control problems. Unfortunately, this generality comes at a price: the calculation of such solutions is typically intractable for systems with more than moderate state space size due to the curse of dimensionality. This work combines recent results on the structure of the HJB, and its reduction to a linear Partial Differential Equation (PDE), with methods based on low-rank tensor representations, known as separated representations, to address the curse of dimensionality. The result is an algorithm to solve optimal control problems which scales linearly with the number of states in a system, and is applicable to systems that are nonlinear with stochastic forcing in finite-horizon, average-cost, and first-exit settings. The method is demonstrated on inverted pendulum, VTOL aircraft, and quadcopter models, with system dimensions two, six, and twelve respectively.
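The "reduction to a linear PDE" in this abstract is, in the standard linearly solvable control setting, obtained by an exponential transform of the value function. A sketch of that construction under the usual noise-control compatibility assumption (the paper's exact formulation may differ):

```latex
% Dynamics and cost rate (first-exit setting):
%   dx = (f(x) + G(x)u)\,dt + B(x)\,d\omega, \qquad \ell(x,u) = q(x) + \tfrac12 u^\top R u
% HJB equation after minimizing over u (with u^* = -R^{-1} G^\top \nabla V):
0 = q + f^\top \nabla V - \tfrac12 \nabla V^\top G R^{-1} G^\top \nabla V
      + \tfrac12 \operatorname{tr}\!\left(B B^\top \nabla^2 V\right)
% Assume the compatibility condition \lambda\, G R^{-1} G^\top = B B^\top
% and substitute the desirability function \Psi = \exp(-V/\lambda).
% The quadratic terms cancel, leaving a linear PDE in \Psi:
\frac{q}{\lambda}\,\Psi = f^\top \nabla \Psi
      + \tfrac12 \operatorname{tr}\!\left(B B^\top \nabla^2 \Psi\right)
```

It is this linear operator equation that a separated (low-rank tensor) representation of the desirability can attack dimension by dimension, which is what makes linear scaling in the state count plausible.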
Learning bilingual word representations by marginalizing alignments
 In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL)
, 2014
Abstract

Cited by 7 (0 self)
We present a probabilistic model that simultaneously learns alignments and distributed representations for bilingual data. By marginalizing over word alignments the model captures a larger semantic context than prior work relying on hard alignments. The advantage of this approach is demonstrated in a cross-lingual classification task, where we outperform the prior published state of the art.
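The marginalization referred to above replaces a single hard alignment with a sum over all alignments. In the classic IBM Model 1 form (a plausible reading; the paper's exact parameterization may differ), the likelihood of a target sentence t given a source sentence s of length I factorizes so that the sum over alignments a can be pushed inside the product:

```latex
p(t \mid s) \;=\; \sum_{a} p(a) \prod_{j=1}^{J} p\!\left(t_j \mid s_{a_j}\right)
\;=\; \prod_{j=1}^{J} \frac{1}{I+1} \sum_{i=0}^{I} p\!\left(t_j \mid s_i\right)
```

The second equality holds under a uniform, independent alignment prior; each target word then receives a softly weighted signal from every source word rather than from one hard-aligned word.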
Trust region policy optimization
 In ICML
, 2015
Abstract

Cited by 5 (1 self)
In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
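The practical update summarized above is, in TRPO's well-known published form, a constrained surrogate-objective maximization (symbols follow common usage):

```latex
\max_{\theta} \;\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}
\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\; A_{\theta_{\text{old}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\!\left[ D_{\mathrm{KL}}\!\left(
\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s)\right) \right] \le \delta
```

The hard KL trust region (rather than a fixed penalty coefficient) is what yields the near-monotonic improvement with little hyperparameter tuning that the abstract reports.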
Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning
Abstract

Cited by 4 (2 self)
The combination of modern Reinforcement Learning and Deep Learning approaches holds the promise of making significant progress on challenging applications requiring both rich perception and policy selection. The Arcade Learning Environment (ALE) provides a set of Atari games that represent a useful benchmark set of such applications. A recent breakthrough in combining model-free reinforcement learning with deep learning, called DQN, achieves the best real-time agents thus far. Planning-based approaches achieve far higher scores than the best model-free approaches, but they exploit information that is not available to human players, and they are orders of magnitude slower than needed for real-time play. Our main goal in this work is to build a better real-time Atari game-playing agent than DQN. The central idea is to use the slow planning-based agents to provide training data for a deep-learning architecture capable of real-time play. We propose new agents based on this idea and show that they outperform DQN.
Learning simple algorithms from examples.
 In Proceedings of the International Conference on Machine Learning
, 2016
Abstract

Cited by 3 (0 self)
We present an approach for learning simple algorithms such as copying, multi-digit addition and single-digit multiplication directly from examples. Our framework consists of a set of interfaces, accessed by a controller. Typical interfaces are 1-D tapes or 2-D grids that hold the input and output data. For the controller, we explore a range of neural network-based models which vary in their ability to abstract the underlying algorithm from training instances and generalize to test examples with many thousands of digits. The controller is trained using Q-learning with several enhancements and we show that the bottleneck is in the capabilities of the controller rather than in the search incurred by Q-learning.
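The interface/controller split described above can be sketched concretely. The class and function names below (`Tape`, `OutputTape`, `copy_controller`) are illustrative, not from the paper, and the hand-coded copy controller merely stands in for the policy that the paper learns with Q-learning:

```python
# Minimal sketch of the interface/controller separation: read-only input
# tape, write-only output tape, and a controller that drives both.

class Tape:
    """A read-only 1-D input tape with a movable head."""
    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.pos = 0

    def read(self):
        # Return the symbol under the head, or None past either end.
        if 0 <= self.pos < len(self.symbols):
            return self.symbols[self.pos]
        return None

    def move(self, direction):
        # direction is -1 (left) or +1 (right).
        self.pos += direction


class OutputTape:
    """A write-only output tape; the head advances after each write."""
    def __init__(self):
        self.written = []

    def write(self, symbol):
        self.written.append(symbol)


def copy_controller(inp, out):
    # Hand-coded controller for the copy task; in the paper this policy
    # is learned with Q-learning rather than written by hand.
    while (s := inp.read()) is not None:
        out.write(s)
        inp.move(+1)


inp, out = Tape("31415"), OutputTape()
copy_controller(inp, out)
print("".join(out.written))  # prints "31415"
```

The point of the abstraction is that the same controller action space (read, move, write) covers copying, addition on a 2-D grid, and so on, so only the learned policy changes between tasks.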
ActionConditional Video Prediction using Deep Networks in Atari Games
Abstract

Cited by 3 (1 self)
Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Arcade Learning Environment (ALE), we consider spatio-temporal prediction problems where future image frames depend on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.
Self-Modeling Agents and Reward Generator Corruption
 In AAAI-15 Workshop on AI and Ethics
, 2015
Abstract

Cited by 3 (1 self)
Hutter's universal artificial intelligence (AI) showed how to define future AI systems by mathematical equations. Here we adapt those equations to define a self-modeling framework, where AI systems learn models of their own calculations of future values. Hutter discussed the possibility that AI agents may maximize rewards by corrupting the source of rewards in the environment. Here we propose a way to avoid such corruption in the self-modeling framework. This paper fits in the context of my book Ethical Artificial Intelligence. A draft of the book is available at: arxiv.org/abs/1411.1373.

Self-Modeling Agents

Russell and Norvig defined a framework for AI agents interacting with an environment (Russell and Norvig 2010). Hutter adapted Solomonoff's theory of sequence prediction to this framework to produce mathematical equations that define behaviors of future AI systems (Hutter 2005). Assume that an agent interacts with its environment in a discrete, finite series of time steps t ∈ {0, 1, 2, ..., T}. The agent sends an action a_t ∈ A to the environment and receives an observation o_t ∈ O from the environment, where A and O are finite sets. We use h = (a_1, o_1, ..., a_t, o_t) to denote an interaction history where the environment produces observation o_i in response to action a_i for 1 ≤ i ≤ t. Let H be the set of all finite histories so that h ∈ H, and define |h| = t as the length of the history h. An agent's predictions of its observations are uncertain, so the agent's environment model takes the form of a probability distribution over interaction histories:

(1)  ρ : H → [0, 1].
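The interaction-history bookkeeping in the excerpt above can be sketched in a few lines. The action/observation sets and the toy stochastic environment below are illustrative stand-ins; only the history structure h = (a_1, o_1, ..., a_t, o_t) mirrors the framework:

```python
# Minimal sketch of the agent-environment loop: at each step the agent
# sends an action, the environment returns an observation, and both are
# appended to the interaction history.
import random

A = ["left", "right"]   # finite action set (illustrative)
O = ["low", "high"]     # finite observation set (illustrative)

def environment(history, action):
    # Toy stand-in for the unknown true environment: a stochastic response
    # that depends deterministically-in-distribution on (history, action).
    rng = random.Random(str((history, action)))
    return rng.choice(O)

def run(policy, T):
    """Build an interaction history h = (a_1, o_1, ..., a_T, o_T)."""
    h = ()
    for _ in range(T):
        a = policy(h)              # agent acts on the history so far
        o = environment(h, a)      # environment responds
        h += (a, o)
    return h

# An arbitrary policy that alternates actions based on |h| = len(h) // 2:
h = run(lambda h: A[(len(h) // 2) % len(A)], T=3)
print(h[0::2])  # actions taken: ('left', 'right', 'left')
```

An environment model ρ as in equation (1) would assign a probability to each such history h; the self-modeling extension additionally has the agent model its own value calculations over these histories.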
Weakly-supervised disentangling with recurrent transformations for 3D view synthesis
 In NIPS
, 2015
Abstract

Cited by 2 (2 self)
An important problem for both graphics and vision is to synthesize novel views of a 3D object from a single image. This is particularly challenging due to the partial observability inherent in projecting a 3D object onto the image space, and the ill-posedness of inferring object shape and pose. However, we can train a neural network to address the problem if we restrict our attention to specific object categories (in our case faces and chairs) for which we can gather ample training data. In this paper, we propose a novel recurrent convolutional encoder-decoder network that is trained end-to-end on the task of rendering rotated objects starting from a single image. The recurrent structure allows our model to capture long-term dependencies along a sequence of transformations. We demonstrate the quality of its predictions for human faces on the Multi-PIE dataset and for a dataset of 3D chair models, and also show its ability to disentangle latent factors of variation (e.g., identity and pose) without using full supervision.
Compress and control.
 In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence
, 2015
Abstract

Cited by 2 (2 self)
This paper describes a new information-theoretic policy evaluation technique for reinforcement learning. This technique converts any compression or density model into a corresponding estimate of value. Under appropriate stationarity and ergodicity conditions, we show that the use of a sufficiently powerful model gives rise to a consistent value function estimator. We also study the behavior of this technique when applied to various Atari 2600 video games, where the use of suboptimal modeling techniques is unavoidable. We consider three fundamentally different models, all too limited to perfectly model the dynamics of the system. Remarkably, we find that our technique provides sufficiently accurate value estimates for effective on-policy control. We conclude with a suggestive study highlighting the potential of our technique to scale to large problems.
Classical planning algorithms on the atari video games
 In Proc. of 2015 AAAI Workshop on Learning for General Competency in Video Games
, 2015
Abstract

Cited by 1 (1 self)
The Atari 2600 games supported in the Arcade Learning Environment (Bellemare et al. 2013) all feature a known initial (RAM) state and actions that have deterministic effects. Classical planners, however, cannot be used for selecting actions for two reasons: first, no compact PDDL model of the games is given, and more importantly, the action effects and goals are not known a priori. Moreover, in these games there is usually no set of goals to be achieved but rewards to be collected. These features do not preclude the use of classical algorithms like breadth-first search or Dijkstra's algorithm, but these methods are not effective over large state spaces. We thus turn to a different class of classical planning algorithms introduced recently that perform a structured exploration of the state space; namely, like breadth-first search and Dijkstra's algorithm they are "blind" and hence do not require prior knowledge of state transitions, costs (rewards) or goals, and yet, like heuristic search algorithms, they have been shown to be effective for solving problems over huge state spaces. The simplest such algorithm, called Iterated Width or IW, consists of a sequence of calls IW(1), IW(2), ..., IW(k) where IW(i) is a breadth-first search in which a state is pruned when it is not the first state in the search to make true some subset of i atoms. The empirical results over 54 games suggest that the performance of IW with the k parameter fixed to 1, i.e., IW(1), is at the level of the state of the art represented by UCT. A simple best-first variation of IW that combines exploration and exploitation proves to be very competitive as well.
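The pruning rule that defines IW(1) can be sketched directly from the description above: a breadth-first search that keeps a successor only if it makes some atom true for the first time in the search. The toy domain (a walk on the integers) and the atom encoding are illustrative, not from the paper:

```python
# Sketch of IW(1): breadth-first search pruning any state that does not
# make at least one atom true for the first time in the search.
from collections import deque

def iw1(initial, successors, atoms, is_goal):
    """initial: start state; successors(s) -> iterable of (action, state);
    atoms(s) -> set of atoms true in s. Returns a plan (action list) or None."""
    seen_atoms = set(atoms(initial))
    queue = deque([(initial, [])])
    while queue:
        state, plan = queue.popleft()
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            novel = atoms(nxt) - seen_atoms
            if novel:                      # keep only novelty-1 states
                seen_atoms |= novel
                queue.append((nxt, plan + [action]))
    return None

# Toy domain: walk on the integer line from 0 to 4.
succ = lambda x: [("+1", x + 1), ("-1", x - 1)]
atoms = lambda x: {("at", x)}
plan = iw1(0, succ, atoms, is_goal=lambda x: x == 4)
print(plan)  # ['+1', '+1', '+1', '+1']
```

Note that IW(1) visits each distinct atom at most once, so its search effort is bounded by the number of atoms rather than the number of states; the full algorithm escalates to IW(2), IW(3), ... only when the lower widths fail.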