We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and from reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality, and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio) while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.
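The abstract names two ingredients without defining them: trading returns net of transaction costs, and the differential Sharpe ratio used as an online performance signal. The sketch below illustrates both under assumptions that the abstract itself does not spell out. It assumes the commonly cited form of the differential Sharpe ratio, D_t = (B_{t-1} ΔA_t - ½ A_{t-1} ΔB_t) / (B_{t-1} - A_{t-1}²)^{3/2}, where A and B are exponential moving estimates of the first and second moments of returns with adaptation rate η, and a single-asset return R_t = F_{t-1} r_t - δ |F_t - F_{t-1}|. The function names, parameter names, and default values are hypothetical and chosen only for illustration.

```python
def trading_returns(positions, price_changes, cost=0.0):
    """Per-period returns of a single-asset trader, net of transaction costs.

    positions[t]     -- position F_t in [-1, 1] chosen at time t (assumed convention)
    price_changes[t] -- price change r_t realized over period t
    cost             -- proportional transaction cost (delta), an assumed parameter
    """
    prev = 0.0  # assume the trader starts flat
    out = []
    for f, r in zip(positions, price_changes):
        # Profit from the position held over the period, minus the cost of rebalancing.
        out.append(prev * r - cost * abs(f - prev))
        prev = f
    return out


def differential_sharpe_ratios(returns, eta=0.01, eps=1e-12):
    """Differential Sharpe ratio D_t for each return, using moving moment estimates.

    eta -- adaptation rate of the moving averages (illustrative default)
    eps -- guard against a zero or negative variance estimate
    """
    a = 0.0  # moving estimate of E[R]
    b = 0.0  # moving estimate of E[R^2]
    out = []
    for r in returns:
        delta_a = r - a
        delta_b = r * r - b
        var = b - a * a
        if var > eps:
            d = (b * delta_a - 0.5 * a * delta_b) / var ** 1.5
        else:
            d = 0.0  # undefined until a variance estimate exists
        out.append(d)
        # Update the moment estimates only after D_t has been computed.
        a += eta * delta_a
        b += eta * delta_b
    return out


if __name__ == "__main__":
    rets = trading_returns([1, 1, -1], [0.5, 1.0, -2.0], cost=0.1)
    print(rets)
    print(differential_sharpe_ratios(rets, eta=0.5))
```

Because D_t depends only on the current return and the previous moment estimates, it can serve as a per-step objective for an online learner such as RRL; a gradient-based trader would differentiate D_t with respect to its parameters rather than just evaluating it as done here.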