Results 11–20 of 114
Reinforcement Learning in Robotics: A Survey
Cited by 34 (2 self)
Reinforcement learning offers robotics a framework and a set of tools for the design of sophisticated and hard-to-engineer behaviors. Conversely, the challenges of robotic problems provide inspiration, impact, and validation for developments in reinforcement learning. The relationship between the disciplines has sufficient promise to be likened to that between physics and mathematics. In this article, we attempt to strengthen the links between the two research communities by providing a survey of work in reinforcement learning for behavior generation in robots. We highlight key challenges in robot reinforcement learning as well as notable successes. We discuss how contributions tamed the complexity of the domain and study the role of algorithms, representations, and prior knowledge in achieving these successes. As a result, a particular focus of our paper lies on the choice between model-based and model-free as well as between value-function-based and policy search methods. By analyzing a simple problem in some detail we demonstrate how reinforcement learning approaches may be profitably applied, and …
A Dynamic Oligopoly Game of the US Airline Industry: Estimation and Policy Experiments
, 2009
Cited by 32 (9 self)
This paper estimates the contribution of demand, cost, and strategic factors to explain why most companies in the US airline industry operate using a hub-spoke network. We postulate and estimate a dynamic oligopoly model where airline companies decide, every quarter, which routes (directional city-pairs) to operate, the type of product (direct flight vs. stop-flight), and the fare of each route-product. The model incorporates three factors which may contribute to the profitability of hub-spoke networks. First, consumers may value the scale of operation of an airline in the origin and destination airports (e.g., more convenient check-in and landing facilities). Second, operating costs and entry costs may depend on the airline's network because of economies of density and scale. And third, a hub-spoke network may be a strategy to deter the entry of non-hub-spoke carriers on some routes. We estimate our dynamic oligopoly model using panel data from the Airline Origin and Destination Survey with information on quantities, prices, and entry and exit decisions for every airline company over more than two thousand city-pair markets and several years. Demand and variable cost parameters are estimated using demand equations and Nash-Bertrand equilibrium conditions for prices. In a second step, we estimate fixed operating costs and sunk costs from the dynamic entry-exit game. Counterfactual experiments show that the hub-size effect on entry costs is, by far, the most important factor explaining hub-spoke networks. Strategic entry deterrence is also significant and more important in explaining hub-spoke networks than hub-size effects on demand, variable costs, or fixed costs.
Solving factored MDPs with hybrid state and action variables
 J. Artif. Intell. Res. (JAIR)
Cited by 29 (4 self)
Efficient representations and solutions for large decision problems with continuous and discrete variables are among the most important challenges faced by the designers of automated decision support systems. In this paper, we describe a novel hybrid factored Markov decision process (MDP) model that allows for a compact representation of these problems, and a new hybrid approximate linear programming (HALP) framework that permits their efficient solution. The central idea of HALP is to approximate the optimal value function by a linear combination of basis functions and optimize its weights by linear programming. We analyze both theoretical and computational aspects of this approach, and demonstrate its scale-up potential on several hybrid optimization problems.
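The central HALP idea, approximating the optimal value function by a linear combination of basis functions whose weights are found by linear programming, can be sketched on a small purely discrete MDP. Everything below (the transition matrices, rewards, basis, and state-relevance weights) is an invented toy instance of plain approximate linear programming, not the paper's hybrid continuous-discrete formulation:

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9  # discount factor

# Toy 3-state, 2-action MDP (all numbers are made up for illustration).
P = {0: np.array([[0.9, 0.1, 0.0],     # P[a][s, s'] = Pr(s' | s, a)
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]]),
     1: np.array([[0.2, 0.8, 0.0],
                  [0.0, 0.2, 0.8],
                  [0.0, 0.0, 1.0]])}
R = {0: np.array([0.0, 0.0, 1.0]),     # R[a][s] = expected reward
     1: np.array([0.1, 0.1, 1.0])}

# Two basis functions: a constant and the (scalar) state index.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])

# ALP: minimize c^T Phi w  s.t.  Phi w >= R_a + gamma P_a Phi w  for every a,
# rewritten as  -(Phi - gamma P_a Phi) w <= -R_a  for linprog's A_ub form.
c = np.ones(3) / 3.0                                  # state-relevance weights
A_ub = np.vstack([-(Phi - gamma * P[a] @ Phi) for a in (0, 1)])
b_ub = np.concatenate([-R[a] for a in (0, 1)])
res = linprog(c @ Phi, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
V = Phi @ res.x   # approximate value function, an upper bound on V* at each state
```

Because the LP constraints enforce that V dominates its Bellman backup at every state-action pair, the resulting approximation lies pointwise above the optimal value function; HALP extends the same construction to hybrid state and action spaces.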
Finite-Time Bounds for Fitted Value Iteration
Cited by 29 (2 self)
In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that, by using a sufficiently large number of samples, arbitrarily good performance can be achieved with high probability for a large class of MDPs. An important feature of our proof technique is that it permits the study of weighted L^p-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is "aligned" with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.
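The sampling-based FVI scheme analyzed here, which draws states, forms empirical Bellman backups from the generative model, and refits a parametric value function, can be illustrated on a one-dimensional toy problem. The "drift left/right" dynamics, reward, and quadratic basis below are our own invented example, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

def step(s, a, shape):
    """Generative model: action 1 drifts the state right, action 0 left,
    with Gaussian noise; the reward is the resulting state."""
    s2 = np.clip(s + (0.1 if a == 1 else -0.1)
                 + 0.02 * rng.standard_normal(shape), 0.0, 1.0)
    return s2, s2  # (next states, rewards)

def phi(s):
    """Quadratic basis features for a scalar state in [0, 1]."""
    return np.stack([np.ones_like(s), s, s ** 2], axis=-1)

S = rng.uniform(0.0, 1.0, 200)   # sampled states
w = np.zeros(3)                  # value-function weights

for _ in range(50):              # fitted value iteration
    backups = []
    for a in (0, 1):
        s2, r = step(S[:, None], a, (200, 10))        # 10 samples per (s, a)
        backups.append((r + gamma * phi(s2) @ w).mean(axis=1))
    targets = np.maximum(*backups)                    # empirical Bellman backup
    w, *_ = np.linalg.lstsq(phi(S), targets, rcond=None)
```

Since moving right yields higher rewards in this toy model, the fitted value function should come out increasing in the state; the paper's bounds quantify how the number of sampled states and next-state samples controls the error of exactly this kind of scheme.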
Estimating semiparametric ARCH(∞) models by kernel smoothing methods
 Econometrica
, 2005
A unifying framework for computational reinforcement learning theory
, 2009
Cited by 22 (6 self)
Computational learning theory studies mathematical models that allow one to formally analyze and compare the performance of supervised-learning algorithms, for example in terms of their sample complexity. While existing models such as PAC (Probably Approximately Correct) have played an influential role in understanding the nature of supervised learning, they have not been as successful in reinforcement learning (RL). Here, the fundamental barrier is the need for active exploration in sequential decision problems. An RL agent tries to maximize long-term utility by exploiting its knowledge about the problem, but this knowledge has to be acquired by the agent itself through exploration of the problem, which may reduce short-term utility. The need for active exploration is common in many problems in daily life, engineering, and the sciences. For example, a Backgammon program strives to make good moves to maximize the probability of winning a game, but sometimes it may try novel and possibly harmful moves to see how the opponent reacts, in the hope of discovering a better game-playing strategy. It has been known since the early days of RL that a good trade-off between exploration and exploitation is critical for the agent to learn fast (i.e., to reach near-optimal strategies) …
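The exploration-exploitation trade-off described above is often illustrated with an ε-greedy bandit agent; the three arm payoffs and ε = 0.1 below are arbitrary choices for the sketch, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.2, 0.5, 0.8])   # true arm payoffs (unknown to the agent)
eps = 0.1                           # exploration probability
n = np.zeros(3)                     # pull counts
q = np.zeros(3)                     # running payoff estimates

for t in range(5000):
    # Explore a random arm with probability eps, else exploit the best estimate.
    a = rng.integers(3) if rng.random() < eps else int(np.argmax(q))
    r = rng.normal(means[a], 0.1)   # noisy reward
    n[a] += 1
    q[a] += (r - q[a]) / n[a]       # incremental mean update
```

With ε = 0 the agent can lock onto the first arm that beats its zero-initialized estimates; the occasional random pulls are what let it discover that the third arm is best.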
Optimal design of sequential realtime communication systems
 IEEE Trans. Inf. Theory
, 2009
Cited by 22 (6 self)
Abstract—Optimal design of sequential real-time communication of a Markov source over a noisy channel is investigated. In such a system, the delay between the source output and its reconstruction at the receiver should equal a fixed, pre-specified amount. An optimal communication strategy must minimize the total expected symbol-by-symbol distortion between the source output and its reconstruction. Design techniques or performance bounds for such real-time communication systems are unknown. In this paper a systematic methodology, based on the concepts of information structures and information states, to search for an optimal real-time communication strategy is presented. This methodology trades off complexity in communication length (linear in contrast to doubly exponential) against complexity in alphabet sizes (doubly exponential in contrast to exponential). As the communication length is usually orders of magnitude bigger …
Random sampling of states in dynamic programming
 in Proc. NIPS Conf., 2007
Cited by 21 (4 self)
Abstract—We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and the use of local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady-state policies for deterministic, time-invariant, discrete-time control problems with continuous states and actions, as often found in robotics. In this paper, we describe our approach and provide initial results on several simulated robotics problems. Index Terms—Dynamic programming, optimal control, random sampling.
The browser war: econometric analysis of Markov perfect equilibrium in markets with network effects
 TRIAL EXHIBIT: MICROSOFT OEM SALES FY ’98 MIDYEAR REVIEW (GX 421) IN UNITED STATES V. MICROSOFT CORPORATION, CIVIL ACTION NO
, 2004
Cited by 19 (3 self)
When demands for heterogeneous goods in a concentrated market shift over time due to network or contagion effects, forward-looking firms consider the strategic impact of investment, pricing, and other conduct. Network effects may be a substantial barrier to entry, giving both entrants and incumbents powerful strategic incentives to "tip" the market. A Markov perfect equilibrium model captures this strategic behavior, and permits the comparison of "as is" market trajectories with "but for" trajectories under counterfactuals where "bad acts" by some firms are eliminated. Our analysis is applied to a stylized description of the browser war between Netscape and Microsoft. Appendices give conditions for econometric identification and estimation of a Markov perfect equilibrium model from observations on partial trajectories, and discuss estimation of the impacts of …
A Simple Nonparametric Estimator for the Distribution of Random Coefficients in Discrete Choice Models
, 2008
Cited by 17 (3 self)
We propose an estimator for discrete choice models, such as the logit, with a nonparametric distribution of random coefficients. The estimator is linear regression subject to linear inequality constraints, and is robust, simple to program, and quick to compute compared to alternative estimators for mixture models. We discuss three methods for proving identification of the distribution of heterogeneity for any given economic model. We prove the identification of the logit mixtures model, which, surprisingly given the wide use of this model over the last 30 years, is a new result. We also derive our estimator's nonstandard asymptotic distribution and demonstrate its excellent small-sample properties in a Monte Carlo study. The estimator we propose can be extended to allow for endogenous prices. The estimator can also be used to reduce the computational burden of nested fixed point methods for complex models like dynamic programming discrete choice.
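The estimator's core recipe, linear regression subject to linear inequality constraints that recovers the heterogeneity distribution as non-negative weights on a grid of coefficient types, can be sketched for a binary logit with one random coefficient. The data-generating process, the grid, and the use of SciPy's non-negative least squares solver are our own illustrative choices, not the paper's exact procedure:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)

def logit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated data: a scalar taste coefficient beta ~ N(1, 0.5^2) varies across
# consumers; we observe aggregate choice shares at 25 covariate values.
beta_true = rng.normal(1.0, 0.5, size=20_000)
x = np.linspace(-2.0, 2.0, 25)
shares = logit(np.outer(x, beta_true)).mean(axis=1)

# Column j holds the choice probabilities a consumer of type grid[j]
# would generate at each covariate value.
grid = np.linspace(-1.0, 3.0, 41)
A = logit(np.outer(x, grid))

w, _ = nnls(A, shares)   # least squares subject to w >= 0
w /= w.sum()             # normalize into a probability distribution
mean_est = grid @ w      # implied mean of the random coefficient
```

The non-negativity constraint (plus normalization) is exactly the kind of linear inequality the abstract refers to; the recovered weights form a discrete approximation to the distribution of the random coefficient.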