Short-term gains, long-term pains: How cues about state aid learning in dynamic environments (2009)
Citations: 31 (6 self)
Citations
4630 | A new look at the statistical model identification - Akaike - 1974
Citation Context: ...f the fit quality between models requires a correction. We used the Akaike Information Criterion (AIC), which compares the fit quality of each model while correcting for the number of free parameters (Akaike, 1974). The value of the AIC for subject i and model m can be computed as follows: AIC_i^m = 2 · L_i^m − 2 · k_i^m (8), where k_i^m is the number of free parameters in the model. Larger values of AIC_i^m ...
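The correction in this excerpt is simple to compute. A minimal sketch in Python, assuming `loglik` holds the maximized log-likelihood L_i^m for one subject and model and `k` its free-parameter count; the sign convention follows the excerpt (larger values indicate better penalized fit), which flips the more common textbook form:

```python
def aic(loglik, k):
    """AIC as written in the excerpt: 2 * log-likelihood minus a
    penalty of 2 per free parameter (here, larger = better fit).
    The textbook form is 2*k - 2*loglik, where smaller is better;
    only the sign convention differs."""
    return 2.0 * loglik - 2.0 * k

# Hypothetical comparison of two fitted models for one subject:
print(aic(loglik=-410.3, k=2))   # simpler model
print(aic(loglik=-402.9, k=4))   # more flexible model
```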
2550 | A simplex method for function minimization - Nelder, Mead - 1965
Citation Context: ...ject i. A parameter search was conducted to find the free parameters that maximized the value of L_i^m for each subject and model, using the Nelder-Mead simplex method with 200 random starting points (Nelder & Mead, 1965). Because some of the models tested differed in the number of free parameters they possessed, direct comparison of the fit quality between models requires a correction. We used the Ak...
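A common way to implement this kind of search is SciPy's Nelder-Mead routine wrapped in random restarts. A minimal sketch, not the authors' actual code, assuming a hypothetical `neg_loglik(params)` that returns the negative log-likelihood of one subject's choices under one model:

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(neg_loglik, bounds, n_starts=200, seed=0):
    """Maximize the log-likelihood by minimizing its negative from
    many random starting points, keeping the best solution found."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        # Draw a random starting point inside the parameter bounds.
        x0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        res = minimize(neg_loglik, x0, method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return best  # best.x = parameters, -best.fun = maximized log-likelihood
```

The restarts matter because the simplex method only finds local optima; 200 random starts (as in the excerpt) makes it far more likely the global maximum is among them.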
766 | Adaptive switching circuits - Widrow, Hoff, et al. - 1960
Citation Context: ...erm rewards is determined by a simple discounting parameter, γ. Note that when γ = 0, the error term in the model reduces to the standard delta rule (Rescorla & Wagner, 1972; Wagner & Rescorla, 1972; Widrow & Hoff, 1960). Accordingly, under these conditions, the model strongly favors immediate rewards and thus predicts melioration behavior in the task. As the value of γ increases, the model gives more weight to futu...
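To make the role of γ concrete, here is a sketch (not the authors' exact model) of a temporal-difference style error term: with γ = 0 it collapses to the one-step delta rule on immediate reward, and growing γ mixes in the estimated value of what comes next:

```python
def td_error(reward, q_next, q_current, gamma):
    """TD-style prediction error. gamma = 0 reduces this to the
    delta rule (reward - q_current), i.e., pure short-term learning;
    gamma near 1 credits actions for their long-run consequences."""
    return reward + gamma * q_next - q_current

def update(q_current, reward, q_next, gamma, alpha=0.1):
    """Move the current estimate toward the target with learning rate alpha."""
    return q_current + alpha * td_error(reward, q_next, q_current, gamma)
```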
455 | Individual choice behavior: A theoretical analysis - Luce - 1959
Citation Context: ...ties are biased in favor of the value of Q(a_i). In general, the probability of choosing option a_i is an increasing function of the estimated value of that action, Q(a_i), relative to the other action (Luce, 1959). However, the τ parameter controls how deterministic responding is. When τ → 0 each option is chosen randomly (the impact of learned values is effectively eliminated). Alternatively, as τ → ∞ the mo...
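A minimal sketch of this choice rule, using the excerpt's convention for τ (τ → 0 random, τ → ∞ deterministic), so τ multiplies the values inside the exponential; the variable names are illustrative only:

```python
import numpy as np

def choice_probs(q_values, tau):
    """Softmax (Luce) choice rule: tau -> 0 gives random choice,
    tau -> infinity deterministically picks the highest-valued option."""
    z = tau * np.asarray(q_values, dtype=float)
    z -= z.max()                # guard against overflow in exp
    expz = np.exp(z)
    return expz / expz.sum()

print(choice_probs([1.0, 2.0], tau=0.0))   # [0.5, 0.5] -> random
print(choice_probs([1.0, 2.0], tau=10.0))  # ~[0.0, 1.0] -> deterministic
```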
433 | Generalization in reinforcement learning: Successful examples using sparse coarse coding - Sutton - 1996
Citation Context: ...nd engineering (cf. Littman, Sutton, & Singh, 2002). Indeed, many popular algorithms for learning sequential decision strategies in complex environments, such as Q-learning (Watkins, 1989) and SARSA (Sutton, 1996; Sutton & Barto, 1998), require learning agents to correctly identify changes in the state of the environment as a consequence of their actions. The experimental manipulations that follow were inspire...
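The two algorithms named here differ only in which next action enters the update target. A minimal tabular sketch under assumed states and actions (not tied to the task in the paper):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy (Watkins): bootstrap from the best next action."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy (SARSA): bootstrap from the action actually taken next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

# Hypothetical two-state, two-action table:
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Note that both updates index the table by state, which is why, as the excerpt says, the agent must correctly identify which state it is in for learning to work.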
322 | Reinforcement learning with selective perception and hidden state - McCallum - 1996
286 | TD-Gammon, a self-teaching backgammon program, achieves master-level play - Tesauro - 1994
Citation Context: ...approach to learning through interaction with the environment in pursuit of reward-maximizing behavior. The RL approach has been successful in both practical applications (Bagnell & Schneider, 2001; Tesauro, 1994) and in the modeling of biological systems (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Montague, Dayan, Person, & Sejnowski, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & A...
273 | The art of adaptive pattern recognition by a self-organizing neural network - Carpenter, Grossberg - 1988
Citation Context: ...ion or generalization of the experience in one part of the state space to others could improve performance. This opens opportunities for evaluating the role that generalization and category creation (Carpenter & Grossberg, 1988; Love, Medin, & Gureckis, 2004; Sutton, 1996) have on performance in online, sequential choice tasks. In our experiments, we manipulated how apparent particular representations of the world were to p...
228 | Cortical substrates for exploratory decisions in humans - Daw, O'Doherty, et al. - 2006
226 | Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control - Daw, Niv, et al. - 2005
Citation Context: ...alized that the most effective strategy near the end of the experiment was to select the impulsive, short-term option (a behavior that might be accounted for with RL models that incorporate planning, Daw, Niv, & Dayan, 2005). Similarly, we are unable to assess yet the role that explicit reasoning processes may have played in accounting for the differences between the shuffled-cue and consistent-cue conditions of Experime...
196 | Neural economics and the biological substrates of valuation - Montague, Berns - 2002
Citation Context: ...dies that place short-term and long-term response strategies in conflict (Bogacz, McClure, Li, Cohen, & Montague, 2007; Egelman, Person, & Montague, 1998; Herrnstein, 1991; Herrnstein & Prelec, 1991; Montague & Berns, 2002; Neth, Sims, & Gray, 2006; Tunney & Shanks, 2002). Interestingly, the conclusion from much of this work has been that humans and other animals often fail to inhibit the tendency to select an initiall...
187 | SUSTAIN: A network model of category learning - Love, Medin, et al. - 2004
Citation Context: ...experience in one part of the state space to others could improve performance. This opens opportunities for evaluating the role that generalization and category creation (Carpenter & Grossberg, 1988; Love, Medin, & Gureckis, 2004; Sutton, 1996) have on performance in online, sequential choice tasks. In our experiments, we manipulated how apparent particular representations of the world were to participants; however, additiona...
153 | Input generalization in delayed reinforcement learning: An algorithm and performance comparisons (IJCAI-91) - Chapman, Kaelbling - 1991
Citation Context: ..., or if they simply served as a memory cue about recent actions. RL theorists have long recognized how an effective memory may help agents overcome some of the issues surrounding perceptual aliasing (Chapman & Kaelbling, 1991; McCallum, 1993, 1995). The intuition is that two states which appear identical (for example, the highly confusable floors of a hotel) may be distinguished by the recent beh...
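One crude way to express that intuition in code, a sketch only and a stand-in for the richer memory mechanisms cited above, is to tag each percept with the last few actions so that two aliased states become distinguishable:

```python
def augmented_state(percept, recent_actions, k=2):
    """Disambiguate perceptually aliased states by pairing the current
    percept with the last k actions; identical percepts reached by
    different recent behavior map to different augmented states."""
    return (percept, tuple(recent_actions[-k:]))

# Two identical hotel floors, distinguished by how we got there:
print(augmented_state("hallway", ["up", "left"]))   # ('hallway', ('up', 'left'))
print(augmented_state("hallway", ["down", "left"])) # ('hallway', ('down', 'left'))
```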
119 | Overcoming incomplete perception with utile distinction memory - McCallum - 1993
Citation Context: ...navigate a simple maze, distinct locations can be perceptually identical (e.g., two hallways which have the same junctions). In this case, the agent must deal with the problem of perceptual aliasing (McCallum, 1993; Whitehead & Ballard, 1991), where multiple states or situations in the world may map to a single percept. When an agent is unsure of its current state, it is difficult to determine the most effect...
117 | Autonomous helicopter control using reinforcement learning policy search methods - Bagnell, Schneider - 2001
Citation Context: ...1998). RL is an agent-based approach to learning through interaction with the environment in pursuit of reward-maximizing behavior. The RL approach has been successful in both practical applications (Bagnell & Schneider, 2001; Tesauro, 1994) and in the modeling of biological systems (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Montague, Dayan, Person, & Sejnowski, 1995; Schultz, Dayan, & Montague, 1997; Su...
114 | Learning to perceive and act by trial and error - Whitehead, Ballard - 1991
Citation Context: ...e maze, distinct locations can be perceptually identical (e.g., two hallways which have the same junctions). In this case, the agent must deal with the problem of perceptual aliasing (McCallum, 1993; Whitehead & Ballard, 1991), where multiple states or situations in the world may map to a single percept. When an agent is unsure of its current state, it is difficult to determine the most effective course of action, and t...
113 | Interactive tasks and the implicit-explicit distinction - Berry, Broadbent - 1988
Citation Context: ...the robots, and thus could only arrive at the optimal strategy by interactively exploring the behavior of the system (cf. Berry & Broadbent, 1988; Stanley, Mathews, Buss, & Kotler-Cope, 1989). The Importance of State and Problem of Perceptual Aliasing: An important challenge facing any learning agent is adopting an appropriate mental representati...
108 | Bee foraging in uncertain environments using predictive Hebbian learning - Montague - 1995
Citation Context: ...ch has been successful in both practical applications (Bagnell & Schneider, 2001; Tesauro, 1994) and in the modeling of biological systems (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Montague, Dayan, Person, & Sejnowski, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & Arbib, 2001). An attractive feature of RL for the present report is that it emphasizes the concept of a situated learner interacting with a responsi...
103 | A contribution of cognitive decision models to clinical assessment: Decomposing performance on the Bechara gambling task - Busemeyer, Stout - 2002
Citation Context: ...nt fit. Most importantly, this baseline comparison allows us to evaluate the degree to which the trial-by-trial dynamics generated by individual participants are explained by our learning models (see Busemeyer & Stout, 2002, for a similar approach and motivation). Softmax Model. Next, we considered a version of the Softmax action selection model (Sutton & Barto, 1998; Daw, O’Doherty, Seymour, Dayan, & Dolan, 2006; Worthy...
102 | The interaction of the explicit and the implicit in skill learning: A dual process approach - Sun, Slusarz, et al. - 2005
Citation Context: ...our studies and join a number of recent papers providing encouraging support for using RL methods to model human behavior in sequential decision making tasks (Fu & Anderson, 2006; Neth et al., 2006; Sun, Slusarz, & Terry, 2005). In the following sections we highlight some of the contributions and implications of our results. The Importance of “State”: First, while the concept of “state” is central to RL systems that are roo...
93 | Decision-making deficits, linked to a dysfunctional ventromedial prefrontal cortex, revealed in alcohol and stimulant abusers - Bechara, Dolan, et al. - 2001
Citation Context: ...pathologies associated with substance abusing populations are often characterized by the impulsive desire for immediate rewards over higher utility future outcomes (Bechara et al., 2001; Bechara & Damasio, 2002; Grant, Contoreggi, & London, 2000). In this report, we examine how people learn strategies that maximize their long-term well-being in a dynamic decision making task that w...
70 | Insight without awareness: On the interaction of verbalization, instruction and practice in a simulated process control task - Stanley, Mathews, et al. - 1989
Citation Context: ...ses an amazing capacity for interacting with and controlling the ongoing dynamics of their environment across a variety of tasks and situations (cf. Berry & Broadbent, 1988; Chhabra & Jacobs, 2006; Stanley et al., 1989). However, like Aesop’s grasshopper, when we fail to take into account how immediately attractive options might conflict with our longer-term well-being, we often suffer the consequences. For example...
66 | What you see is what you need - Triesch, Ballard, et al. - 2003
Citation Context: ...appear to support the state-based interpretation, these two perspectives may not be completely at odds. In a sense, the perceptual cues provided in the task may serve as a kind of externalized memory (Triesch, Ballard, Hayhoe, & Sullivan, 2003), helping to reduce the load on cognitive resources by offloading memory into the environment. One prediction following from this idea is that performance in the Farming on Mars task may be more resi...
56 | The mental representation of parity and numerical magnitude - Dehaene, Bossini, et al. - 1993
Citation Context: ...ions was counterbalanced along with the location of the response button itself. However, participants likely brought with them pre-existing associations concerning the left-right axis of the display (Dehaene, Bossini, & Giraux, 1993). Thus, one possibility is that the effect of the consistent light cue increased in conditions where the response button and direction of movement were compatible. However, in our data, we failed to ...
55 | Melioration: A theory of distributed choice - Herrnstein, Prelec - 1991
Citation Context: ...from a number of recent studies that place short-term and long-term response strategies in conflict (Bogacz, McClure, Li, Cohen, & Montague, 2007; Egelman, Person, & Montague, 1998; Herrnstein, 1991; Herrnstein & Prelec, 1991; Montague & Berns, 2002; Neth, Sims, & Gray, 2006; Tunney & Shanks, 2002). Interestingly, the conclusion from much of this work has been that humans and other animals often fail to inhibit the tenden...
54 | From recurrent choice to skill learning: A reinforcement learning model - Fu, Anderson - 2006
Citation Context: ...the simple network model that motivated our studies and join a number of recent papers providing encouraging support for using RL methods to model human behavior in sequential decision making tasks (Fu & Anderson, 2006; Neth et al., 2006; Sun, Slusarz, & Terry, 2005). In the following sections we highlight some of the contributions and implications of our results. The Importance of ...
46 | Inhibition in Pavlovian conditioning: Application of a theory - Wagner, Rescorla - 1972
Citation Context: ...ners value short- or long-term rewards is determined by a simple discounting parameter, γ. Note that when γ = 0, the error term in the model reduces to the standard delta rule (Rescorla & Wagner, 1972; Wagner & Rescorla, 1972; Widrow & Hoff, 1960). Accordingly, under these conditions, the model strongly favors immediate rewards and thus predicts melioration behavior in the task. As the value of γ increases, the model give...
43 | A computational role for dopamine delivery in human decision-making - Egelman - 1998
Citation Context: ...y reach it. The reward structure of our task borrows from a number of recent studies that place short-term and long-term response strategies in conflict (Bogacz, McClure, Li, Cohen, & Montague, 2007; Egelman, Person, & Montague, 1998; Herrnstein, 1991; Herrnstein & Prelec, 1991; Montague & Berns, 2002; Neth, Sims, & Gray, 2006; Tunney & Shanks, 2002). Interestingly, the conclusion from much of this work has been that humans and o...
41 | Reconciling reinforcement learning models with behavioral extinction and renewal: Implications for addiction, relapse, and problem gambling - Redish, Jensen, et al. - 2007
Citation Context: ...to participants’ trial-by-trial choices). On the theoretical side, our results provide a first step in linking work in category learning and generalization into models of sequential choice (see also Redish, Jensen, Johnson, & Kurth-Nelson, 2007, for recent work on this issue). In our simulations, we showed how cues that allowed extrapolation or generalization of the experience in one part of the state space to others could improve performanc...
40 | Estimating the dimension of a model (The Annals of Statistics) - Schwarz - 1978
Citation Context: ...005) for complete details (our procedure followed the “Prediction Method” described on page 392 of their paper). 5 Similar results were found using the Bayesian Information Criterion, or BIC, measure (Schwarz, 1978). 6 The curves shown in Figure 6 for each model were created by finding the best-fit parameters for each subject (as described above). Next, for each model, we found the predicted probability of sele...
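For reference, the BIC mentioned in this footnote replaces AIC's flat penalty with one that grows with sample size. A sketch matching the larger-is-better orientation used for AIC above (an assumption on our part, since the excerpt does not show the formula):

```python
import math

def bic(loglik, k, n):
    """BIC with the same larger-is-better orientation as the AIC sketch
    above: log-likelihood minus a penalty growing with both the
    parameter count k and the number of observations n.
    (Textbook form: k*ln(n) - 2*loglik, where smaller is better.)"""
    return 2.0 * loglik - k * math.log(n)
```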
30 | Short-term memory traces for action bias in human reinforcement - Bogacz, McClure, et al. - 2007
Citation Context: ...from their current goal in order to ultimately reach it. The reward structure of our task borrows from a number of recent studies that place short-term and long-term response strategies in conflict (Bogacz, McClure, Li, Cohen, & Montague, 2007; Egelman, Person, & Montague, 1998; Herrnstein, 1991; Herrnstein & Prelec, 1991; Montague & Berns, 2002; Neth, Sims, & Gray, 2006; Tunney & Shanks, 2002). Interestingly, the conclusion from much of t...
26 | Long-term reward prediction in TD models of the dopamine system - Daw, Touretzky - 2002
Citation Context: ...ent in pursuit of reward-maximizing behavior. The RL approach has been successful in both practical applications (Bagnell & Schneider, 2001; Tesauro, 1994) and in the modeling of biological systems (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Montague, Dayan, Person, & Sejnowski, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & Arbib, 2001). An attractive feature of RL for the present report is ...
25 | Modeling functions of striatal dopamine modulation in learning and planning - Suri, Bargas, et al. - 2001
Citation Context: ...01; Tesauro, 1994) and in the modeling of biological systems (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Montague, Dayan, Person, & Sejnowski, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & Arbib, 2001). An attractive feature of RL for the present report is that it emphasizes the concept of a situated learner interacting with a responsive environment, making it an ideal framework for studying human...
21 | Learning from delayed rewards (unpublished doctoral dissertation) - Watkins - 1989
Citation Context: ...arch in computer science and engineering (cf. Littman, Sutton, & Singh, 2002). Indeed, many popular algorithms for learning sequential decision strategies in complex environments, such as Q-learning (Watkins, 1989) and SARSA (Sutton, 1996; Sutton & Barto, 1998), require learning agents to correctly identify changes in the state of the environment as a consequence of their actions. The experimental manipulations...
18 | Experiments on stable suboptimality in individual behavior - Herrnstein - 1991
Citation Context: ...our task borrows from a number of recent studies that place short-term and long-term response strategies in conflict (Bogacz, McClure, Li, Cohen, & Montague, 2007; Egelman, Person, & Montague, 1998; Herrnstein, 1991; Herrnstein & Prelec, 1991; Montague & Berns, 2002; Neth, Sims, & Gray, 2006; Tunney & Shanks, 2002). Interestingly, the conclusion from much of this work has been that humans and other animals often...
12 | A re-examination of melioration and rational choice - Tunney, Shanks - 2002
Citation Context: ..., a phenomenon referred to as melioration. Melioration appears at odds with rational accounts, which dictate that decision makers follow a strategy that maximizes their long-term expected utility (see Tunney and Shanks, 2002, for a similar discussion). However, the rational account fails to specify how this optimal strategy is discovered in an unknown environment. In this paper, we attempt to better understand the learnin...
11 | Melioration dominates maximization: Stable suboptimal performance despite global feedback - Neth, Sims, et al. - 2006
Citation Context: ...rm and long-term response strategies in conflict (Bogacz, McClure, Li, Cohen, & Montague, 2007; Egelman, Person, & Montague, 1998; Herrnstein, 1991; Herrnstein & Prelec, 1991; Montague & Berns, 2002; Neth, Sims, & Gray, 2006; Tunney & Shanks, 2002). Interestingly, the conclusion from much of this work has been that humans and other animals often fail to inhibit the tendency to select an initially attractive option even w...
7 | Near-optimal human adaptive control across different noise environments - Chhabra, Jacobs - 2006
Citation Context: ...Implications. Humans possess an amazing capacity for interacting with and controlling the ongoing dynamics of their environment across a variety of tasks and situations (cf. Berry & Broadbent, 1988; Chhabra & Jacobs, 2006; Stanley et al., 1989). However, like Aesop’s grasshopper, when we fail to take into account how immediately attractive options might conflict with our longer-term well-being, we often suffer the con...
6 | A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement - Rescorla, Wagner - 1972
Citation Context: ...the degree to which learners value short- or long-term rewards is determined by a simple discounting parameter, γ. Note that when γ = 0, the error term in the model reduces to the standard delta rule (Rescorla & Wagner, 1972; Wagner & Rescorla, 1972; Widrow & Hoff, 1960). Accordingly, under these conditions, the model strongly favors immediate rewards and thus predicts melioration behavior in the task. As the value of γ ...
6 | Learning by pigeons playing against tit-for-tat in an operant prisoner’s dilemma - Sanabria, Baker, et al. - 2003
6 | Regulatory fit effects in a choice task - Worthy, Maddox, et al. - 2007
Citation Context: ..., 2002, for a similar approach and motivation). Softmax Model. Next, we considered a version of the Softmax action selection model (Sutton & Barto, 1998; Daw, O’Doherty, Seymour, Dayan, & Dolan, 2006; Worthy, Maddox, & Markman, 2007). In this model, the probability of selecting either the Short-term or Long-term option is based on an estimate of the value of each action, which is learned through experience. The model’s current estima...
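Putting the pieces of these excerpts together, a model of this kind is typically scored by the log-probability it assigns to each observed choice, trial by trial. A sketch under assumed variable names (not the authors' code), combining a delta-rule value learner with the softmax rule from above; this simple γ = 0 variant is the melioration-prone case discussed earlier:

```python
import numpy as np

def softmax_loglik(params, choices, rewards, n_actions=2):
    """Log-likelihood of one subject's choice sequence under a simple
    softmax + delta-rule value learner. params = (alpha, tau);
    choices[t] is the action taken on trial t, rewards[t] its payoff."""
    alpha, tau = params
    q = np.zeros(n_actions)
    loglik = 0.0
    for a, r in zip(choices, rewards):
        z = tau * q - (tau * q).max()
        p = np.exp(z) / np.exp(z).sum()
        loglik += np.log(p[a])          # credit for the observed choice
        q[a] += alpha * (r - q[a])      # delta-rule value update
    return loglik
```

Negating this function gives exactly the objective that the Nelder-Mead search sketched earlier would minimize, and its maximized value is the L_i^m that enters the AIC/BIC comparisons.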
5 | Decision-making and addiction (part I): Impaired activation of somatic states in substance dependent individuals when pondering decisions with negative future consequences - Bechara, Damasio - 2002
Citation Context: ...pathologies associated with substance abusing populations are often characterized by the impulsive desire for immediate rewards over higher utility future outcomes (Bechara et al., 2001; Bechara & Damasio, 2002; Grant, Contoreggi, & London, 2000). In this report, we examine how people learn strategies that maximize their long-term well-being in a dynamic decision making task that we refer to as the “Farmin...
2 | A framework for mesencephalic dopamine systems based on predictive Hebbian learning - Montague, Dayan, et al. - 1996
Citation Context: ...d-maximizing behavior. The RL approach has been successful in both practical applications (Bagnell & Schneider, 2001; Tesauro, 1994) and in the modeling of biological systems (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Montague, Dayan, Person, & Sejnowski, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & Arbib, 2001). An attractive feature of RL for the present report is that it emphasizes the concept of a ...
2 | A neural substrate of prediction and reward - Schultz, Dayan, et al. - 1997
Citation Context: ...lications (Bagnell & Schneider, 2001; Tesauro, 1994) and in the modeling of biological systems (Daw & Touretzky, 2002; Montague, Dayan, & Sejnowski, 1996; Montague, Dayan, Person, & Sejnowski, 1995; Schultz, Dayan, & Montague, 1997; Suri, Bargas, & Arbib, 2001). An attractive feature of RL for the present report is that it emphasizes the concept of a situated learner interacting with a responsive environment, making it an ideal...
1 | Drug abusers show impaired performance in a laboratory test of decision making - Grant, Contoreggi, et al. - 2000
Citation Context: ...associated with substance abusing populations are often characterized by the impulsive desire for immediate rewards over higher utility future outcomes (Bechara et al., 2001; Bechara & Damasio, 2002; Grant, Contoreggi, & London, 2000). In this report, we examine how people learn strategies that maximize their long-term well-being in a dynamic decision making task that we refer to as the “Farming on Mars” task. In our experiments,...