Online learning in Markov decision processes with adversarially chosen transition probability distributions (2013)

by Y Abbasi-Yadkori, P Bartlett, V Kanade, Y Seldin, C Szepesvari
Venue: In Advances in Neural Information Processing Systems 26