### Citations

1902 |
Markov Decision Processes: Discrete Stochastic Dynamic Programming.
- Puterman
- 1994
(Show Context)
Citation Context ...ittman, 1996)). Once Q* is identified the learning agent can act optimally in the underlying MDP simply by choosing the action which maximizes Q*(x, a) when the agent is in state x (Ross, 1970; =-=Puterman, 1994-=-). 3 THE MAIN RESULT Condition (2) on the learning rate αt(x, a) requires only that every state-action pair is visited infinitely often, which is a rather mild condition. In this article we take the st... |
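The context above notes that once Q* is known, acting optimally reduces to picking the action that maximizes Q*(x, a) in the current state. A minimal sketch of that greedy rule (tabular Q stored as a dict; the states, actions, and values below are hypothetical, not from the cited works):

```python
# Greedy action selection from a learned Q* table: in state x,
# pick the action a that maximizes Q*(x, a).
def greedy_action(Q, x, actions):
    """Q maps (state, action) pairs to values; ties go to the first maximizer."""
    return max(actions, key=lambda a: Q[(x, a)])

# Tiny illustration with a made-up two-state, two-action table.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.9,
     ("s1", "left"): 0.5, ("s1", "right"): 0.1}
print(greedy_action(Q, "s0", ["left", "right"]))  # -> right
```

With a lookup table this is exact; with function approximation the same rule is applied to the approximate Q.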

353 |
Applied Probability Models with Optimization Applications
- Ross
- 1970
(Show Context)
Citation Context ...esvari and Littman, 1996)). Once Q* is identified the learning agent can act optimally in the underlying MDP simply by choosing the action which maximizes Q*(x, a) when the agent is in state x (=-=Ross, 1970-=-; Puterman, 1994). 3 THE MAIN RESULT Condition (2) on the learning rate αt(x, a) requires only that every state-action pair is visited infinitely often, which is a rather mild condition. In this articl... |

255 | On the convergence of stochastic iterative dynamic programming.
- Jaakkola, Jordan, et al.
- 1994
(Show Context)
Citation Context ...ing is guaranteed to converge to the only fixed point Q* of the operator T : R^(X×A) → R^(X×A) defined by (TQ)(x, a) = R(x, a) + γ Σ_{y∈X} P(x, a, y) max_b Q(y, b) (convergence proofs can be found in (=-=Jaakkola et al., 1994-=-; Tsitsiklis, 1994; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996)). Once Q* is identified the learning agent can act optimally in the underlying MDP simply by choosing the action whi... |
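The context above defines the Bellman optimality operator T, whose unique fixed point is Q*. A sketch of repeated application of T (value iteration on Q) for a hypothetical two-state, two-action MDP; the MDP itself is illustrative, not taken from the cited papers:

```python
import itertools

# Apply the operator from the context:
#   (TQ)(x, a) = R(x, a) + gamma * sum_y P(x, a, y) * max_b Q(y, b)
def apply_T(Q, R, P, gamma, states, actions):
    TQ = {}
    for x, a in itertools.product(states, actions):
        TQ[(x, a)] = R[(x, a)] + gamma * sum(
            P[(x, a, y)] * max(Q[(y, b)] for b in actions) for y in states)
    return TQ

# Made-up MDP: action 0 stays in the current state, action 1 flips it;
# reward 1 whenever the agent is in state 1.
states, actions, gamma = [0, 1], [0, 1], 0.9
R = {(x, a): float(x == 1) for x in states for a in actions}
P = {(x, a, y): 1.0 if (y == x if a == 0 else y == 1 - x) else 0.0
     for x in states for a in actions for y in states}

Q = {(x, a): 0.0 for x in states for a in actions}
for _ in range(200):  # T is a gamma-contraction, so this converges to Q*
    Q = apply_T(Q, R, P, gamma, states, actions)
print(round(Q[(1, 0)], 3))  # -> 10.0, i.e. 1 / (1 - gamma) for staying in state 1
```

Since T is a contraction with factor gamma in the max norm, the error shrinks geometrically, which is the fixed-point fact the cited convergence proofs build on.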

204 | Asynchronous Stochastic Approximation and Q-Learning.
- Tsitsiklis
- 1994
(Show Context)
Citation Context ...esponding to the stationary distribution. 1 INTRODUCTION Q-learning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; =-=Tsitsiklis, 1994-=-; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-learning algorithms. Although th... |

125 | Reinforcement Learning with Soft State Aggregation
- Singh, Jaakkola, et al.
- 1995
(Show Context)
Citation Context ...sage of Large Deviations Theory would enable us to develop non-asymptotic bounds. Other possible ways to extend the results of this paper may include Q-learning when learning on aggregated states (=-=Singh et al., 1995-=-), Q-learning for alternating/simultaneous Markov games (Littman, 1994; Szepesvári and Littman, 1996) and any other algorithms whose corresponding difference process Δt satisfies an inequality si... |

54 |
Stochastic Approximation.
- Wasan
- 1969
(Show Context)
Citation Context ...nce of this change the process Qt clearly converges to Q* and this convergence may be investigated along each component (x, a) separately using standard stochastic-approximation techniques (see e.g. (=-=Wasan, 1969-=-; Polyak and Tsypkin, 1973)). Using simple devices one can show that the difference process Δt(x, a) = |Q̂t(x, a) − Qt(x, a)| satisfies the following inequality: Δt+1(x, a) ≤ (1 − αt(x, a))Δt(x, a)... |
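The context above bounds the difference process by Δt+1(x, a) ≤ (1 − αt(x, a)) Δt(x, a) (plus terms truncated in the snippet). A numeric sketch of why such a recursion drives Δt to zero when the step sizes satisfy Σ αt = ∞; the step-size schedule αt = 1/(t+1) and the starting value are illustrative choices, not the paper's:

```python
# Iterate delta_{t+1} = (1 - alpha_t) * delta_t with alpha_t = 1/(t+1).
# The product of (1 - alpha_t) telescopes to 1/(T+1), so delta -> 0,
# illustrating the contraction argument sketched in the context
# (the stochastic noise term is omitted here).
delta = 1.0
T = 10_000
for t in range(1, T + 1):
    alpha = 1.0 / (t + 1)
    delta = (1.0 - alpha) * delta
print(delta)  # roughly 1e-4, i.e. 1/(T+1)
```

With a constant step size the product would only shrink geometrically to a noise floor, which is why the classical conditions require the step sizes to decay but not be summable.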

31 |
A generalized reinforcement learning model: Convergence and applications.
- Littman, Szepesvari
- 1996
(Show Context)
Citation Context ... Q* of the operator T : R^(X×A) → R^(X×A) defined by (TQ)(x, a) = R(x, a) + γ Σ_{y∈X} P(x, a, y) max_b Q(y, b) (convergence proofs can be found in (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and =-=Szepesvári, 1996-=-; Szepesvári and Littman, 1996)). Once Q* is identified the learning agent can act optimally in the underlying MDP simply by choosing the action which maximizes Q*(x, a) when the agent is in sta... |

24 | Pseudogradient Adaptation and Training Algorithms. - Polyak, Tsypkin - 1973 |

9 |
Generalized Markov Decision Processes: Dynamic programming and reinforcement learning algorithms. Neural Computation
- Szepesvári, Littman
- 1997
(Show Context)
Citation Context ...r T : R^(X×A) → R^(X×A) defined by (TQ)(x, a) = R(x, a) + γ Σ_{y∈X} P(x, a, y) max_b Q(y, b) (convergence proofs can be found in (Jaakkola et al., 1994; Tsitsiklis, 1994; Littman and Szepesvári, 1996; =-=Szepesvári and Littman, 1996-=-)). Once Q* is identified the learning agent can act optimally in the underlying MDP simply by choosing the action which maximizes Q*(x, a) when the agent is in state x (Ross, 1970; Puterman, 19... |

3 | A modified form of the iterative method of dynamic programming. Annals of Statistics - Hordijk, A, et al. - 1975 |

2 |
To discount or not to discount in reinforcement learning: A case study comparing R learning and Q learning
- unknown authors
- 1994
(Show Context)
Citation Context ... to the stationary distribution. 1 INTRODUCTION Q-learning is a popular reinforcement learning (RL) algorithm whose convergence is well demonstrated in the literature (Jaakkola et al., 1994; Tsitsikli=-=s, 1994-=-; Littman and Szepesvári, 1996; Szepesvári and Littman, 1996). Our aim in this paper is to provide an upper bound for the convergence rate of (lookup-table based) Q-learning algorithms. Although th... |

2 | Csaba Szepesvári - Singh, Jaakkola, et al. - 1997 |

1 |
Markov games as a framework for multi-agent reinforcement learning
- Littman
- 1994
(Show Context)
Citation Context ...n, 1996)). Once Q* is identified the learning agent can act optimally in the underlying MDP simply by choosing the action which maximizes Q*(x, a) when the agent is in state x (Ross, 1970; Puter=-=man, 1994-=-). 3 THE MAIN RESULT Condition (2) on the learning rate αt(x, a) requires only that every state-action pair is visited infinitely often, which is a rather mild condition. In this article we take the st... |

1 | Average reward reinforcement learning: Foundations, algorithms, and empirical results - Mahadevan - 1996 |

1 | A law of the iterated logarithm for the Robbins-Monro method - Major - 1973 |