## Locally Bayesian Learning with Applications to Retrospective Revaluation and Highlighting (2006)

Venue: Psychological Review

Citations: 28 (7 self)

### BibTeX

@ARTICLE{Kruschke06locallybayesian,

author = {John K. Kruschke},

title = {Locally Bayesian Learning with Applications to Retrospective Revaluation and Highlighting},

journal = {Psychological Review},

year = {2006},

pages = {677--699}

}

### Abstract

A scheme is described for locally Bayesian parameter updating in models structured as successions of component functions. The essential idea is to back-propagate the target data to interior modules, such that an interior component’s target is the input to the next component that maximizes the probability of the next component’s target. Each layer then does locally Bayesian learning. The approach assumes online trial-by-trial learning. The resulting parameter updating is not globally Bayesian but can better capture human behavior. The approach is implemented for an associative learning model that first maps inputs to attentionally filtered inputs and then maps attentionally filtered inputs to outputs. The Bayesian updating allows the associative model to exhibit retrospective revaluation effects such as backward blocking and unovershadowing, which have been challenging for associative learning models. The back-propagation of target values to attention allows the model to show trial-order effects, including highlighting and differences in magnitude of forward and backward blocking, which have been challenging for Bayesian learning models.
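The updating scheme in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class name, the three-valued weight grids, the per-cue attention gains, the sigmoid likelihoods, and the tie-breaking rule (prefer attending to all present cues) are all assumptions made for illustration. On each trial the sketch (1) picks the attended-cue vector that maximizes the probability of the outcome target under current outcome-layer beliefs, (2) updates the outcome layer by Bayes' rule given that vector, and (3) updates the attention layer by Bayes' rule with that vector as its target.

```python
import itertools
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def normalize(ps):
    total = sum(ps)
    return [p / total for p in ps]


class LocallyBayesianSketch:
    """Toy two-layer model: cues -> attended cues -> outcome (hypothetical names)."""

    def __init__(self, n_cues, out_values=(-4.0, 0.0, 4.0), gain_values=(-2.0, 0.0, 2.0)):
        # Assumption: small discrete hypothesis grids with uniform priors.
        self.w_out = list(itertools.product(out_values, repeat=n_cues))
        self.p_out = [1.0 / len(self.w_out)] * len(self.w_out)
        self.w_hid = list(itertools.product(gain_values, repeat=n_cues))
        self.p_hid = [1.0 / len(self.w_hid)] * len(self.w_hid)

    def _lik_out(self, w, y, t):
        # p(t | outcome weights w, attended cues y)
        p1 = sigmoid(sum(wi * yi for wi, yi in zip(w, y)))
        return p1 if t == 1 else 1.0 - p1

    def _lik_hid(self, g, x, y):
        # p(y | attention gains g, cues x); absent cues cannot be attended.
        lik = 1.0
        for gi, xi, yi in zip(g, x, y):
            p_on = sigmoid(gi) if xi else 0.0
            lik *= p_on if yi else 1.0 - p_on
        return lik

    def _candidates(self, x):
        # Candidate attended vectors; full attention is listed first so ties favor it.
        options = [(1, 0) if xi else (0,) for xi in x]
        return list(itertools.product(*options))

    def _p_outcome(self, y, t=1):
        # p(t | y), marginalized over current outcome-layer beliefs
        return sum(p * self._lik_out(w, y, t) for p, w in zip(self.p_out, self.w_out))

    def learn(self, x, t):
        # 1. Back-propagate the target: choose y* maximizing p(t | y).
        #    Rounding makes exact ties immune to floating-point noise.
        y_star = max(self._candidates(x), key=lambda y: round(self._p_outcome(y, t), 12))
        # 2. Locally Bayesian update of the outcome layer given (y*, t).
        self.p_out = normalize(
            [p * self._lik_out(w, y_star, t) for p, w in zip(self.p_out, self.w_out)])
        # 3. Locally Bayesian update of the attention layer given (x, y*).
        self.p_hid = normalize(
            [p * self._lik_hid(g, x, y_star) for p, g in zip(self.p_hid, self.w_hid)])
        return y_star

    def predict(self, x):
        # p(outcome = 1 | x), marginalizing over attended vectors and both layers
        total = 0.0
        for y in self._candidates(x):
            p_y = sum(p * self._lik_hid(g, x, y) for p, g in zip(self.p_hid, self.w_hid))
            total += p_y * self._p_outcome(y, 1)
        return total


def backward_blocking_demo(n_trials=10):
    """Train on A.B -> X, then A -> X; return p(X | B alone) before and after phase 2."""
    model = LocallyBayesianSketch(n_cues=2)
    for _ in range(n_trials):
        model.learn((1, 1), 1)  # phase 1: cues A and B together, outcome X
    before = model.predict((0, 1))
    for _ in range(n_trials):
        model.learn((1, 0), 1)  # phase 2: cue A alone, outcome X
    after = model.predict((0, 1))
    return before, after
```

Calling `backward_blocking_demo()` returns the predicted outcome probability for cue B alone before and after the A-only phase. In this toy setup the second value is lower than the first, even though B never appears in phase 2, which is the qualitative signature of backward blocking. Because the updates are applied trial by trial with a chosen internal target, the result can depend on trial order, unlike a globally Bayesian update over the joint hypothesis space.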

### Citations

8842 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...ting of a single parameter value as done in some analyses (e.g., Russell, Binder, Koller, & Kanazawa, 1995). Locally Bayesian learning might be related to the expectation maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), which is often applied to situations in which the data have missing values, analogous to the missing internal targets in the present scenario. Unlike locally Bayesian updating, however, the typical...

1272 | Information Theory, Inference, and Learning Algorithms
- MacKay
- 2003
Citation Context ...ant Bayesian learning models. Bayesian Modeling Generally The benefits of Bayesian approaches to model fitting and model comparison have been compellingly discussed and demonstrated (e.g., Lee, 2004; MacKay, 2003; Myung & Pitt, 1997). Here, I provide a brief overview of Bayesian modeling as background for discussing Bayesian models of learning. Suppose we have data that we are trying to model. Each datum repr...

642 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context ...nomena explored in subsequent sections, will not be exhibited by such Bayesian models. Existing models of learning that are trial-order invariant include Bayesian neural networks (e.g., MacKay, 2003; Neal, 1996), sigmoid belief networks (Courville et al., 2004), and noisy-OR causal models (e.g., Sobel, Tenenbaum, & Gopnik, 2004; Tenenbaum & Griffiths, 2003). Trial-order invariance is a deficiency for many e...

614 | A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement
- Rescorla, Wagner
- 1972
Citation Context ...earning theory, I am specifically assuming online learning, not batch learning. This degree of temporal resolution is typical in theories of associative learning, such as the classic Rescorla–Wagner (Rescorla & Wagner, 1972) model. It is possible to address temporal processes at a finer resolution (e.g., the duration between stimulus offset and outcome onset within a single trial), or at a coarser resolution (e.g., temp...

481 | The adaptive character of thought
- Anderson
- 1990
Citation Context ...capture the ordinal trends of the inverse base rate effect but relies on the categories having different base rates: “[feature] mismatches will be weighed more seriously for high-frequency diseases” (Anderson, 1990, p. 119). Specifically, for probe PE.PL, the missing feature I mismatches both categories but is weighed more heavily for the high-frequency E category. This difference will not occur in the canonica...

427 | ALCOVE: an exemplar-based connectionist model of category learning
- Kruschke
- 1992
Citation Context ...s of traditional or Bayesian back-propagation networks (MacKay, 2003; Neal, 1996; Rumelhart et al., 1995; Rumelhart, Hinton, & Williams, 1986) or the random covering map in the original ALCOVE model (Kruschke, 1992). To qualitatively reproduce the simulation results reported above, I believe that the hypothesis space must contain hypotheses that implement the notion of selective attention, whereby cues can be s...

332 | Individual choice behavior
- Luce
- 1959
Citation Context ...,k)/[$\sum_j \exp(\mathrm{net}_{\mathrm{out},j})$]. This formula is simply the well-known softmax function from the connectionist literature (Bridle, 1990) and is also an often-used exponentiated version of Luce’s choice rule (Luce, 1959). In the particular applications here, there are only two outcome categories. In this situation, the second outcome node is redundant with the first outcome node (because the outcomes are mutually ex...

264 | Q learning
- Watkins, Dayan
- 1992
Citation Context ... model, which was introduced to associative learning researchers by Sutton (1992) and has been further developed by Dayan, Kakade, et al. (e.g., Dayan & Kakade, 2001; Dayan, Kakade, & Montague, 2000; Dayan & Yu, 2003; Kakade & Dayan, 2002). In a Kalman filter, outcomes are computed as a weighted sum of input cues. The weighting coefficients are the values that are being learned. The weights have prior belief dist...

241 | Probabilistic interpretation of feed forward classification network outputs with relationships to statistical pattern recognition
- Bridle
- 1990
Citation Context ... th outcome node is activated is defined as $p(y_{\mathrm{out},k} = 1 \mid W_{\mathrm{out}}, x_{\mathrm{out}}) = \exp(\mathrm{net}_{\mathrm{out},k}) / \sum_j \exp(\mathrm{net}_{\mathrm{out},j})$. This formula is simply the well-known softmax function from the connectionist literature (Bridle, 1990) and is also an often-used exponentiated version of Luce’s choice rule (Luce, 1959). In the particular applications here, there are only two outcome categories. In this situation, the second outcome ...

238 | The adaptive nature of human categorization - Anderson - 1991

181 | A theory of attention: Variations in the associability of stimuli with reinforcement
- Mackintosh
- 1975
Citation Context ... in theories of learning is that people and animals do, in fact, learn to attend to cues that are diagnostic for correct responses and learn to ignore cues that are irrelevant (e.g., Kruschke, 2003a; Mackintosh, 1975; Trabasso & Bower, 1968). Thus, when learning to associate cues with (overt) responses, people are also learning to associate cues with (covert) attentional distributions over those cues. The model i...

177 | A theory of causal learning in children: causal maps and Bayes nets
- Gopnik, Glymour, et al.
- 2004
Citation Context ...act, but the exact computations regarding which cues are expected, and their magnitude of negativity, have been left unspecified. The Bayesian models of backward blocking (e.g., Dayan & Kakade, 2001; Gopnik et al., 2004; Sobel et al., 2004; Tenenbaum & Griffiths, 2003) show the effects by shifting belief probability over hypotheses about cue-outcome correspondences. For example, suppose the model has three hypothese...

166 | The evidence framework applied to classification networks
- MacKay
- 1992
Citation Context ...mbinations of outcome weights; that is, there are $3^N$ hypotheses in the hypothesis space for mapping attended cues to outcomes. As is typical in applications of Bayesian connectionist networks (e.g., MacKay, 1992, 2003; Rumelhart, Durbin, Golden, & Chauvin, 1995), I specify a normal (i.e., Gaussian) prior on the outcome-weight hypothesis space. In situations when the two categorical outcomes should have no pr...

115 | Conservatism in human information processing
- Edwards
- 1968
Citation Context ...d particular model functions for the mental hypotheses, then another challenge is showing that Bayesian updating of belief probabilities matches human learning. Research in the 1960s and 1970s (e.g., Edwards, 1968; Godden, 1976; Shanteau, 1975) tried to make the hypotheses utterly simple and explicit. For example, subjects were told the numbers of red and blue chips sampled so far from an unknown bag and were ...

114 | Learning internal representations by backpropagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context ...a random smattering of weighted cue combinations, much like the typical random starting weights of traditional or Bayesian back-propagation networks (MacKay, 2003; Neal, 1996; Rumelhart et al., 1995; Rumelhart, Hinton, & Williams, 1986) or the random covering map in the original ALCOVE model (Kruschke, 1992). To qualitatively reproduce the simulation results reported above, I believe that the hypothesis space must contain hypothese...

100 | “Attention-like” processes in classical conditioning
- Kamin
- 1968
Citation Context ...ghlighting. The point is that the locally Bayesian processing not only shows highlighting but also happens to accelerate learning in this case. Application to Blocking and Backward Blocking Blocking (Kamin, 1968) occurs when the early phase of learning has trials of A→X and the later phase has trials of A.B→X. Notice that in the second phase, the same outcome is indicated by an additional cue. In subsequent ...

98 | Backpropagation: the basic theory
- Rumelhart, Durbin, et al.
- 1996
Citation Context ...e weights; that is, there are $3^N$ hypotheses in the hypothesis space for mapping attended cues to outcomes. As is typical in applications of Bayesian connectionist networks (e.g., MacKay, 1992, 2003; Rumelhart, Durbin, Golden, & Chauvin, 1995), I specify a normal (i.e., Gaussian) prior on the outcome-weight hypothesis space. In situations when the two categorical outcomes should have no prior bias, the prior probability of $W_{\mathrm{out}}$ is set pr...

96 | Inferring causal networks from observations and interventions
- Steyvers, Tenenbaum, et al.
- 2003
Citation Context ... applied to a range of phenomena from low-level perceptual learning (e.g., Eckstein, Abbey, Pham, & Shimozaki, 2004) to high-level causal induction and language acquisition (e.g., Regier & Gahl, 2004; Steyvers, Tenenbaum, Wagenmakers, & Blum, 2003). If the Bayesian approach to learning is to be a general principle for modeling the mind, then it is logical to attempt Bayesian learning for the entire hierarchy of representations simultaneously. ...

94 | Forward and backward blocking in human contingency judgement
- Shanks
- 1985
Citation Context ...of reduction in forward blocking (see e.g., Beckers, De Houwer, Pineño, & Miller, 2005; Kruschke & Blair, 2000; Lovibond, Been, Mitchell, Bouton, & Frohardt, 2003; Pineño, Urushihara, & Miller, 2005; Shanks, 1985). This asymmetry in strengths of forward and backward blocking is a trial-order effect that is challenging for extant time-independent Bayesian approaches. There have been several previous theories of...

79 | Toward a unified model of attention in associative learning - Kruschke - 2001

79 | Local learning in probabilistic networks with hidden variables
- Russell, Binder, et al.
- 1995
Citation Context ...ted in the present scheme is the probability distribution over possible parameter values within a layer; this process is not local updating of a single parameter value as done in some analyses (e.g., Russell, Binder, Koller, & Kanazawa, 1995). Locally Bayesian learning might be related to the expectation maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), which is often applied to situations in which the data have missing value...

76 | Cue competition in causality judgments: The role of nonpresentation of compound stimulus elements - Van Hamme, Wasserman - 1994

75 | Applying Occam’s Razor in Modeling Cognition: A Bayesian approach. Psychonomic Bulletin & Review
- Myung, Pitt
- 1997
Citation Context ...earning models. Bayesian Modeling Generally The benefits of Bayesian approaches to model fitting and model comparison have been compellingly discussed and demonstrated (e.g., Lee, 2004; MacKay, 2003; Myung & Pitt, 1997). Here, I provide a brief overview of Bayesian modeling as background for discussing Bayesian models of learning. Suppose we have data that we are trying to model. Each datum represents the target re...

73 | Within-compound associations mediate the retrospective revaluation of causality judgements
- Dickinson, Burke
- 1996
Citation Context ...nfluence on learning, it cannot account for backward blocking. Extensions of the model, that assume absent cues have a negative impact on learning, can account for backward blocking or other effects (Dickinson & Burke, 1996; Ghirlanda, 2005; Markman, 1989; Tassoni, 1995; Van Hamme & Wasserman, 1994). The models assert that only absent cues that are expected to be present should have negative impact, but the exact comput...

62 | The comparator hypothesis: A response rule for the expression of associations
- Miller, Matzel
Citation Context ...models of learning to address because it is observed in a wide variety of procedures and species, and it appears to disconfirm any model that merely counts co-occurrences of cues and outcomes (but cf. Miller & Matzel, 1988). Backward blocking is an analogous phenomenon that occurs when the phases of training are reversed. The first phase involves A.B→X trials and the later phase involves A→X trials. Despite the fact th...

60 | Vision sciences - Palmer - 1999

56 | Base Rates in Category Learning
- Kruschke
- 1996
Citation Context ...l subjects). Highlighting or the inverse base rate effect has been obtained in many different experiments using different stimuli, procedures, and cover stories, such as fictitious disease diagnosis (Kruschke, 1996; Medin & Edelson, 1988), random word association (Dennis & Kruschke, 1998), and geometric figure association (Fagot, Kruschke, Dépy, & Vauclair, 1998). Many other published experiments have obtained ...

56 | Children’s causal inferences from indirect evidence: Backwards blocking and Bayesian reasoning in preschoolers
- Sobel, Tenenbaum, et al.
- 2004
Citation Context ...odels of learning that are trial-order invariant include Bayesian neural networks (e.g., MacKay, 2003; Neal, 1996), sigmoid belief networks (Courville et al., 2004), and noisy-OR causal models (e.g., Sobel, Tenenbaum, & Gopnik, 2004; Tenenbaum & Griffiths, 2003). Trial-order invariance is a deficiency for many existing Bayesian models that are intended to address human and animal learning, which can be highly sensitive to trial ...

54 | Bayesian Methods: A Social and Behavioral Sciences Approach
- Gill
- 2002
Citation Context ...d show that the approximations are good under the circumstances we use. Much previous work has gone into each of these approaches (see, for example, textbooks by Gelman, Carlin, Stern, & Rubin, 2004; Gill, 2002; MacKay, 2003). The New Approach: Locally Bayesian Updating There is another approach to the problem: Jettison the goal of being globally Bayesian and instead assume only that each module is Bayesian...

49 | Problem structure and the use of base-rate information from experience
- Medin, Edelson
- 1988
Citation Context ...hlighting or the inverse base rate effect has been obtained in many different experiments using different stimuli, procedures, and cover stories, such as fictitious disease diagnosis (Kruschke, 1996; Medin & Edelson, 1988), random word association (Dennis & Kruschke, 1998), and geometric figure association (Fagot, Kruschke, Dépy, & Vauclair, 1998). Many other published experiments have obtained the inverse base rate e...

44 | Attention in learning - Kruschke - 2003

43 | On Optimally Combining Pieces of Information, with Application to Estimating 3-D Complex-Object Position from Range Data
- Bolle
- 1986
Citation Context ...y Bayesian. This sort of functional localization of Bayesian updating should not be confused with spatially local, but functionally parallel, Bayesian updating in models of pattern recognition (e.g., Bolle & Cooper, 1986). Notice also that what is being locally updated in the present scheme is the probability distribution over possible parameter values within a layer; this process is not local updating of a single pa...

43 | Gain adaptation beats least squares - Sutton - 1992

42 | Learning and selective attention
- Dayan, Kakade, et al.
- 2000
Citation Context ... trial order is the Kalman filter model, which was introduced to associative learning researchers by Sutton (1992) and has been further developed by Dayan, Kakade, et al. (e.g., Dayan & Kakade, 2001; Dayan, Kakade, & Montague, 2000; Dayan & Yu, 2003; Kakade & Dayan, 2002). In a Kalman filter, outcomes are computed as a weighted sum of input cues. The weighting coefficients are the values that are being learned. The weights have...

41 | Blocking and backward blocking involve learned inattention
- Kruschke, Blair
- 2000
Citation Context ..., however, the amount of reduction in backward blocking is weaker than the amount of reduction in forward blocking (see e.g., Beckers, De Houwer, Pineño, & Miller, 2005; Kruschke & Blair, 2000; Lovibond, Been, Mitchell, Bouton, & Frohardt, 2003; Pineño, Urushihara, & Miller, 2005; Shanks, 1985). This asymmetry in strengths of forward and backward blocking is a trial-order effect that is cha...

40 | Trial order affects cue interaction in contingency judgment
- Chapman
- 1991
Citation Context ...ining has trials of A.B→¬X. The result is that B becomes an inhibitor of response X. In backward conditioned inhibition, the phases of training are reversed. B becomes an inhibitor in this case too (Chapman, 1991; Larkin et al., 1998; Melchers et al., 2004). Backward conditioned inhibition is the same structure as unovershadowing but with the roles of the outcomes reversed: What was a present outcome is now a...

36 | Understanding the Kalman Filter
- Meinhold, Singpurwalla
- 1983
Citation Context ...ues that are being learned. The weights have prior belief distributions defined as multivariate normal. The Kalman filter uses Bayesian updating to adjust the probability distribution on the weights (Meinhold & Singpurwalla, 1983). Because the model is linear, the posterior distributions on the weights are also multivariate normal, and the Kalman filter equations elegantly express the posterior mean and covariance as a simple...

36 | Backward blocking and recovery from overshadowing in human causal judgment: The role of within-compound associations
- Wasserman, Berglan
- 1998
Citation Context ...o account here of learned attention in backward blocking (Kruschke & Blair, 2000). There is no account of the dependence of backward blocking on within-compound associations (Dickinson & Burke, 1996; Wasserman & Berglan, 1998). There is no account of second-order blocking (De Houwer & Beckers, 2002b), or of the influence of additive targets (Lovibond et al., 2003), or of spontaneous recovery from blocking (Pineño et al., ...

32 | Learning Bayesian Networks. Upper Saddle River
- Neapolitan
- 2003
Citation Context ...rsive procedure is tractable when the integral at each level can be analytically formulated. An example is when each probability density function is a linear transformation with Gaussian noise (e.g., Neapolitan, 2004, Ch. 4). If instead each integral must be numerically approximated, the situation becomes very computationally demanding. It may be the case that the computations can be significantly economized by u...

29 | Acquisition and extinction in autoshaping
- Kakade, Dayan
- 2002
Citation Context ...introduced to associative learning researchers by Sutton (1992) and has been further developed by Dayan, Kakade, et al. (e.g., Dayan & Kakade, 2001; Dayan, Kakade, & Montague, 2000; Dayan & Yu, 2003; Kakade & Dayan, 2002). In a Kalman filter, outcomes are computed as a weighted sum of input cues. The weighting coefficients are the values that are being learned. The weights have prior belief distributions defined as m...

29 | Eye Gaze and Individual Differences Consistent with Learned Attention in Associative Blocking and Highlighting
- Kruschke, Kappenman, et al.
- 2005
Citation Context ...ntion during learning and the learning (i.e., retention) of those shifts. The theory has been implemented in error-driven connectionist models called ADIT and EXIT (e.g., Kruschke, 1996, 2001b, 2005; Kruschke, Kappenman, & Hetrick, 2005). The present model captures some of the same ideas as the EXIT model but with associative weight changes driven by Bayesian updating and with attentional shifts driven by maximization of outcome pro...

29 | Theory-based causal inference
- Tenenbaum, Griffiths
- 2002
Citation Context ...rder invariant include Bayesian neural networks (e.g., MacKay, 2003; Neal, 1996), sigmoid belief networks (Courville et al., 2004), and noisy-OR causal models (e.g., Sobel, Tenenbaum, & Gopnik, 2004; Tenenbaum & Griffiths, 2003). Trial-order invariance is a deficiency for many existing Bayesian models that are intended to address human and animal learning, which can be highly sensitive to trial order. One notable Bayesian m...

27 | LMS rules and the inverse base-rate effect: Comment on Gluck and Bower
- Markman
- 1989
Citation Context ...r backward blocking. Extensions of the model, that assume absent cues have a negative impact on learning, can account for backward blocking or other effects (Dickinson & Burke, 1996; Ghirlanda, 2005; Markman, 1989; Tassoni, 1995; Van Hamme & Wasserman, 1994). The models assert that only absent cues that are expected to be present should have negative impact, but the exact computations regarding which cues are ...

27 | Attention in learning: Theory and research
- Trabasso, Bower
- 1968
Citation Context ...arning is that people and animals do, in fact, learn to attend to cues that are diagnostic for correct responses and learn to ignore cues that are irrelevant (e.g., Kruschke, 2003a; Mackintosh, 1975; Trabasso & Bower, 1968). Thus, when learning to associate cues with (overt) responses, people are also learning to associate cues with (covert) attentional distributions over those cues. The model is a simplistic instantia...

26 | Explaining away in weight space
- Dayan, Kakade
- 2000
Citation Context ...l that is sensitive to trial order is the Kalman filter model, which was introduced to associative learning researchers by Sutton (1992) and has been further developed by Dayan, Kakade, et al. (e.g., Dayan & Kakade, 2001; Dayan, Kakade, & Montague, 2000; Dayan & Yu, 2003; Kakade & Dayan, 2002). In a Kalman filter, outcomes are computed as a weighted sum of input cues. The weighting coefficients are the values that ar...

26 | A review of recent developments in research and theories on human contingency learning - De Houwer, Beckers - 2002

26 | Pavlovian conditioned inhibition
- Rescorla
- 1969
Citation Context ...backward conditioned inhibition results in a lack of response, which can only be indirectly observed. Traditional indirect tests of conditioned inhibition include the summation and retardation tests (Rescorla, 1969). Fortunately, when assessing the model, we do not need to rely merely on overt responses to infer hidden inhibitory links because we can peer inside the model and see the actual associative strength...

25 | Dynamical causal learning - Danks, Griffiths, et al. - 2003

23 | Model uncertainty in classical conditioning
- Courville, Daw, et al.
- 2004
Citation Context ..., will not be exhibited by such Bayesian models. Existing models of learning that are trial-order invariant include Bayesian neural networks (e.g., MacKay, 2003; Neal, 1996), sigmoid belief networks (Courville et al., 2004), and noisy-OR causal models (e.g., Sobel, Tenenbaum, & Gopnik, 2004; Tenenbaum & Griffiths, 2003). Trial-order invariance is a deficiency for many existing Bayesian models that are intended to addre...

22 | A nonassociative aspect of overshadowing
- Kaufman, Bolles
- 1981
Citation Context ...one leading to the absence of outcome X. When B is then tested alone, it elicits outcome X more strongly, despite the fact that it never occurred in the later phase of training (Beckers et al., 2005; Kaufman & Bolles, 1981; Larkin, Aitken, & Dickinson, 1998; Lovibond et al., 2003; Matzel, Schachtman, & Miller, 1985; Melchers, Lachnit, & Shanks, 2004; Wasserman & Berglan, 1998). Unovershadowing is the Holmesian logic of...