## Toward a method of selecting among computational models of cognition (2002)


Venue: Psychological Review

Citations: 75 (4 self)

### BibTeX

@ARTICLE{Pitt02towarda,
  author  = {Mark A. Pitt and In Jae Myung and Shaobo Zhang},
  title   = {Toward a method of selecting among computational models of cognition},
  journal = {Psychological Review},
  year    = {2002},
  volume  = {109},
  pages   = {472--491}
}


### Abstract

The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models of cognition are increasingly being advanced as explanations of behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. This article introduces a method of selecting among mathematical models of cognition known as minimum description length, which provides an intuitive and theoretically well-grounded understanding of why one model should be chosen. A central but elusive concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in 3 areas of cognitive modeling: psychophysics, information integration, and categorization.

How should one choose among competing theoretical explanations of data? This question is at the heart of the scientific enterprise, regardless of whether verbal models are being tested in an experimental setting or computational models are being evaluated in simulations. A number of criteria have been proposed to assist in this endeavor, summarized nicely by Jacobs and Grainger (1994).

### Citations

2299 | Estimating the dimension of a model - Schwarz - 1978

Citation Context: ...experimental task. Six representative selection methods currently in use are shown in Table 2. They are the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), the root mean squared deviation (RMSD), the information-theoretic measure of complexity (ICOMP; Bozdogan, 1990), cross-validation (CV; Stone, 1974), and Bayesian model selection (BMS; Kass & Raftery... |
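The criteria named in this context all penalize goodness of fit by some measure of complexity. As a minimal illustration (the log-likelihood values below are hypothetical, not taken from the paper), AIC and BIC can be computed from a model's maximized log-likelihood:

```python
import math

def aic(log_lik: float, k: int) -> float:
    # AIC = -2 ln L + 2k (Akaike, 1973)
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik: float, k: int, n: int) -> float:
    # BIC = -2 ln L + k ln n (Schwarz, 1978)
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical example: a 2-parameter model vs. a 4-parameter model
# fit to the same n = 100 observations.
simple = {"log_lik": -120.0, "k": 2}
complex_ = {"log_lik": -118.5, "k": 4}
n = 100

for m in (simple, complex_):
    m["aic"] = aic(m["log_lik"], m["k"])
    m["bic"] = bic(m["log_lik"], m["k"], n)

# The slightly better fit of the complex model does not pay for its
# two extra parameters under either criterion here.
```

Both criteria prescribe choosing the model with the smaller score; BIC's ln n penalty grows with sample size, so it punishes extra parameters more harshly than AIC for large n.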

1680 | An introduction to Kolmogorov complexity and its applications - Li, Vitányi - 1997 |

1234 | Bayesian Data Analysis - Gelman, Carlin, et al. - 1995 |

1225 | Information theory and an extension of the maximum likelihood principle - Akaike - 1973

Citation Context: ...e shape of the error function is completely specified by the experimental task. Six representative selection methods currently in use are shown in Table 2. They are the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), the root mean squared deviation (RMSD), the information-theoretic measure of complexity (ICOMP; Bozdogan, 1990), cross-validation (CV; Stone,... |

979 | Bayes factors - Kass, Raftery - 1995

Citation Context: ...Schwarz, 1978), the root mean squared deviation (RMSD), the information-theoretic measure of complexity (ICOMP; Bozdogan, 1990), cross-validation (CV; Stone, 1974), and Bayesian model selection (BMS; Kass & Raftery, 1995; Myung & Pitt, 1997). Each of these methods assesses a model's generalizability by combining a measure of GOF with a measure of complexity. Each prescribes that the model that minimizes the given cri... |

720 | Cross-validatory choice and assessment of statistical predictions - Stone - 1974

Citation Context: ..., 1973), the Bayesian information criterion (BIC; Schwarz, 1978), the root mean squared deviation (RMSD), the information-theoretic measure of complexity (ICOMP; Bozdogan, 1990), cross-validation (CV; Stone, 1974), and Bayesian model selection (BMS; Kass & Raftery, 1995; Myung & Pitt, 1997). Each of these methods assesses a model's generalizability by combining a measure of GOF with a measure of complexity. E... |
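Cross-validation, as cited here, estimates generalizability directly: calibrate the model on one part of the data and score its predictions on the held-out part. A minimal split-half sketch with a toy one-parameter model (the function names and data are illustrative, not from the paper):

```python
import random

def cross_validate(fit, score, data, seed=0):
    # Split-half cross-validation: calibrate on one half,
    # evaluate prediction error on the held-out half.
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    params = fit(train)
    return score(params, test)

# Toy model: predict every observation with the training mean
# (a single "parameter"); score by mean squared prediction error.
def fit_mean(train):
    return sum(train) / len(train)

def mse(mean, test):
    return sum((y - mean) ** 2 for y in test) / len(test)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
err = cross_validate(fit_mean, mse, data)
```

Because the held-out half never influences the fitted parameters, an overly complex model that merely absorbs noise in the calibration half is penalized automatically.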

544 | Markov Chain Monte Carlo in Practice - Gilks, Richardson, et al. - 1996 |

531 | Probability and Statistics - DeGroot - 1986

Citation Context: ...ormation matrix of a sample of size 1, det(I) is the determinant of the matrix I, and dθ the infinitesimal parameter volume (see Footnote 6 for a definition of the Fisher information matrix; see also Schervish, 1995). The number of all distinguishable probability distributions that a model can generate or describe is obtained by integrating dθ {det[I(θ)]}^(1/2) over the entire parameter manifold as follows: V_M = ∫ dθ {det[I(θ)]}^(1/2)... |

397 | A universal prior for integers and estimation by minimum description length - Rissanen - 1983

Citation Context: ...on method. What is missing is a measure of how well the model fits the data (i.e., a measure of GOF). MDL, a model selection method from algorithmic coding theory in computer science (Grunwald, 2000; Rissanen, 1983, 1996), combines both of these measures. The MDL approach to model selection was developed within the domain of information theory, where the goal of model selection is to choose the model that permit... |

380 | A theory of memory retrieval - Ratcliff - 1978

Citation Context: ...d js. As noted above, different parameter ranges will yield different complexity values. A challenge in computing geometric complexity arises for algorithmic models, such as random-walk models (e.g., Ratcliff, 1978). The likelihood function that predicts a model's performance for any given stimulus condition is not defined a priori. Rather, a prediction is obtained only by simulating the model for each given st... |

312 | Dynamic Patterns: The self-organization of brain and behavior - Kelso - 1995

Citation Context: ...mposed of (y_t1, y_t2) created by plotting the y values at t_1 against the corresponding y values at t_2 for the full range of the parameter a, similar to phase plots in dynamical systems research (Kelso, 1995). In essence, a model is represented graphically as a plot of y_t1 versus y_t2 in data space. For example, for the parameter a = 1, the y value at t_1 = 2 is obtained as y_t1 = (t_1)^(−a) = (2)^(−1) = 0.5... |

275 | Methods of Information Geometry - Amari, Nagaoka - 2000

Citation Context: ...Myung, Balasubramanian, and Pitt (2000). Within differential geometry, a model forms a geometric object known as a Riemannian manifold that is embedded in the space of all probability distributions (Amari, 1983, 1985; Rao, 1945). As in the data space depicted in Figure 3, every distribution is a point in this space, and the collection of points created by varying the parameters of the... |

275 | Fisher information and stochastic complexity - Rissanen - 1996

Citation Context: ...verning the cognitive process of interest. The full form of the measure is shown below. The first term is the GOF measure, and the second and third together form the intrinsic complexity of the model (Rissanen, 1996):

MDL = −ln f(y | θ̂) + (k/2) ln(n/2π) + ln ∫ dθ √det I(θ),  (7)

where y = (y_1, ..., y_n) is a data sample of size n, θ̂ is the maximum likelihood parameter estimate, and ln is the natural logarithm of base... |
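The third term of Equation 7 is the geometric complexity: the log-volume of distinguishable distributions the model can generate. As a sketch of how this term can be evaluated numerically (an illustration under stated assumptions, not the paper's code), consider a one-parameter Bernoulli model, whose Fisher information is I(θ) = 1/(θ(1 − θ)); the integral ∫₀¹ √I(θ) dθ has the known closed form π, so the term equals ln π:

```python
import math

# Geometric-complexity term of Eq. (7), ln ∫ dθ sqrt(det I(θ)),
# evaluated for a one-parameter Bernoulli model, where the Fisher
# information of a single observation is I(θ) = 1/(θ(1 - θ)).

def fisher_info(theta: float) -> float:
    return 1.0 / (theta * (1.0 - theta))

def riemann_volume(n_steps: int = 200_000) -> float:
    # Midpoint rule over (0, 1); the integrand is integrable despite
    # the singularities at the endpoints.
    h = 1.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        theta = (i + 0.5) * h
        total += math.sqrt(fisher_info(theta)) * h
    return total

vol = riemann_volume()           # ≈ π
complexity_term = math.log(vol)  # ≈ ln π ≈ 1.14
```

For richer models the same recipe applies with det I(θ) over a multidimensional parameter manifold, which is where the numerical integration becomes the hard part.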

229 | Nonlinear Regression Analysis and Its Applications - Bates, Watts - 1988

Citation Context: ...ntitative measure of complexity. RSA is a method for studying geometric relations among responses generated by a mathematical model, often used in nonlinear regression (Bates & Watts, 1988). For a model with k parameters and N observations, the response surface is defined as a k-dimensional surface, formed by all possible response vectors that the model can describe. The response surfa... |

205 | Minimum complexity density estimation - Barron, Cover - 1991 |

186 | Discrete Multivariate Analysis - Bishop, Fienberg, et al. - 1975

Citation Context: ...ed to a special case of the latter by setting one or more of its parameters to fixed values.) On the other hand, the generalized likelihood ratio test based on the G² or chi-square statistics (e.g., Bishop, Fienberg, & Holland, 1975, pp. 125–127), which are often used to compare two models, assumes that the models are nested and, further, that the reduced model is correct. When these assumptions are met, both types of selection... |

151 | Pattern recognition and categorization - Reed - 1972

Citation Context: ...re complex model. Categorization. Two models of categorization were considered in the present demonstration. They were the generalized context model (GCM; Nosofsky, 1986) and the prototype model (PRT; Reed, 1972). Each model assumes that categorization responses follow a multinomial probability distribution with p_iJ (the probability of a category C_J response given stimulus X_i), which is given by: GCM: p_iJ = ... |
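The equations are truncated in this context, but the contrast is between a summed-similarity (exemplar) rule and a prototype rule. A schematic sketch of the two choice rules, using a simplified one-dimensional stimulus space, exponential similarity, and illustrative parameter values rather than the paper's exact specification:

```python
import math

def similarity(x: float, y: float, c: float) -> float:
    # Exponential similarity decay with distance (scale parameter c).
    return math.exp(-c * abs(x - y))

def gcm_prob(stim, categories, c=1.0):
    # GCM-style rule: each category's evidence is its summed similarity
    # to all stored exemplars, normalized across categories.
    sums = {J: sum(similarity(stim, ex, c) for ex in exemplars)
            for J, exemplars in categories.items()}
    total = sum(sums.values())
    return {J: s / total for J, s in sums.items()}

def prt_prob(stim, categories, c=1.0):
    # Prototype-style rule: similarity to each category's mean
    # exemplar (the prototype) only.
    sims = {J: similarity(stim, sum(ex) / len(ex), c)
            for J, ex in categories.items()}
    total = sum(sims.values())
    return {J: s / total for J, s in sims.items()}

cats = {"A": [0.0, 1.0], "B": [4.0, 5.0]}
p_gcm = gcm_prob(2.0, cats)  # stimulus nearer category A's exemplars
p_prt = prt_prob(2.0, cats)
```

Because the GCM stores every exemplar while PRT collapses each category to a single point, the GCM is the more flexible (and, in the paper's analysis, more complex) of the two.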

147 | Information and accuracy attainable in the estimation of statistical parameters - Rao - 1945

Citation Context: ...nian, and Pitt (2000). Within differential geometry, a model forms a geometric object known as a Riemannian manifold that is embedded in the space of all probability distributions (Amari, 1983, 1985; Rao, 1945). As in the data space depicted in Figure 3, every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise... |

144 | Foundations of Information Integration Theory - Anderson - 1981

Citation Context: ...stimulus dimensions. For this comparison, we consider two models of information integration, the fuzzy logical model of perception (FLMP; Oden & Massaro, 1978) and the linear integration model (LIM; Anderson, 1981). Each assumes that the response probability (p_ij) of one category, say A, on the presentation of a stimulus of the specific i and j feature dimensions in a two-factor information integration exper... |
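A minimal sketch of the two integration rules, assuming the standard statements of the models (FLMP: multiplicative combination of the two support values, normalized; LIM: their average) with illustrative support values, since the context truncates before the equations:

```python
def flmp(theta_i: float, lam_j: float) -> float:
    # Fuzzy logical model of perception: multiplicative integration of
    # the support from the two feature dimensions, normalized.
    num = theta_i * lam_j
    return num / (num + (1.0 - theta_i) * (1.0 - lam_j))

def lim(theta_i: float, lam_j: float) -> float:
    # Linear integration model: simple average of the two supports.
    return (theta_i + lam_j) / 2.0

# Both rules agree at ambivalent support (0.5, 0.5), but FLMP responds
# far more sharply when the two sources jointly favor category A.
p_flmp = flmp(0.8, 0.9)  # 0.72 / (0.72 + 0.02) ≈ 0.973
p_lim = lim(0.8, 0.9)    # 0.85
```

This sharper, interactive response pattern is part of what makes FLMP the more flexible of the pair, which is exactly the kind of functional-form difference the paper's complexity measure is designed to quantify.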

119 | How persuasive is a good fit? A comment on theory testing - Roberts, Pashler - 2000

Citation Context: ...ous results and the choice of an inferior model. Just because a model fits data well does not necessarily imply that the regularity one seeks to capture in the data is well approximated by the model (Roberts & Pashler, 2000). Properties of the model itself can enable it to provide a good fit to the data for reasons that have nothing to do with the model's approximation to the cognitive process (Myung, 2000). Two of thes... |

115 | Model Selection - Linhart, Zucchini - 1986

Citation Context: ...of a particular data sample. More formally, generalizability can be defined in terms of a discrepancy function that measures the expected error in predicting future data given the model of interest (Linhart & Zucchini, 1986; also see their work for a discussion of the theoretical underpinnings of generalizability). The results of a second simulation illustrate the superiority of generalizability as a model selection cri... |

79 | Attention, similarity, and the identification–categorization relationship - Nosofsky - 1986

Citation Context: ...ce and minimizing overgeneralization of the more complex model. Categorization. Two models of categorization were considered in the present demonstration. They were the generalized context model (GCM; Nosofsky, 1986) and the prototype model (PRT; Reed, 1972). Each model assumes that categorization responses follow a multinomial probability distribution with p_iJ (the probability of a category C_J response given stimul... |

78 | On the nature of expected utility - Fishburn - 1979

Citation Context: ...frequency conditions, how frequency is related to response latency (e.g., linearly or logarithmically), or the shape of the response time distribution. The axiomatic theory of decision making (e.g., Fishburn, 1982) is another example of qualitative modeling. The theory is formulated in rigorous mathematical language and makes precise predictions about choice behavior given a set of hypothetical gambles, but it... |

70 | Applying Occam's razor in modeling cognition: A Bayesian approach - Myung, Pitt - 1997

Citation Context: ...most important (and new) technical advances are discussed. A more thorough treatment of the mathematics can be found in other sources (Myung, Balasubramanian, & Pitt, 2000; Myung, Kim, & Pitt, 2000; Myung & Pitt, 1997, 1998). After introducing the problem of model selection and identifying model complexity as a key property of a model that must be considered by any selection method, we introduce an intuitive stati... |

57 | Hypothesis selection and testing by the MDL principle - Rissanen - 1999

Citation Context: ...selects the one model, among a set of competing models, that minimizes the expected error in predicting future data, in which the prediction error is measured using a logarithmic discrepancy function (Rissanen, 1999; Yamanishi, 1998). It turns out that minimization of MDL corresponds to maximization of the posterior probability within the Bayesian statistics framework (i.e., BMS). Balasubramanian (1997) showed t... |

56 | The Importance of Complexity in Model Selection - Myung - 2000

Citation Context: ...(Roberts & Pashler, 2000). Properties of the model itself can enable it to provide a good fit to the data for reasons that have nothing to do with the model's approximation to the cognitive process (Myung, 2000). Two of these properties are the number of parameters in the model and its functional form (i.e., the way in which the model's parameters and data are combined in the model equation). Together they... |

54 | Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions - Balasubramanian - 1997

Citation Context: ...approach, this corresponds to a volume measure in the space of probability distributions. The following volume, under the assumption of large sample size, is shown to be a valid measure of proximity (Balasubramanian, 1997; Myung, Balasubramanian, & Pitt, 2000): C_M = (2π/n)^(k/2) h(θ̂), where k is the number of parameters in the model and h(θ̂) is a data-dependent factor that goes to 1 as n grows large (some additional... |

49 | Differential geometric methods in statistics - Amari - 1985 |

49 | Integration of featural information in speech perception - Oden, Massaro - 1978

Citation Context: ...responses in one category across the various combinations of stimulus dimensions. For this comparison, we consider two models of information integration, the fuzzy logical model of perception (FLMP; Oden & Massaro, 1978) and the linear integration model (LIM; Anderson, 1981). Each assumes that the response probability (p_ij) of one category, say A, on the presentation of a stimulus of the specific i and j feature d... |

49 | A decision-theoretic extension of stochastic complexity and its applications to learning - Yamanishi - 1998

Citation Context: ...model, among a set of competing models, that minimizes the expected error in predicting future data, in which the prediction error is measured using a logarithmic discrepancy function (Rissanen, 1999; Yamanishi, 1998). It turns out that minimization of MDL corresponds to maximization of the posterior probability within the Bayesian statistics framework (i.e., BMS). Balasubramanian (1997) showed that the MDL crite... |

44 | Measurement theory with applications to decision making, utility and the social sciences - Roberts - 1979

Citation Context: ...lihoods. [Table 3: Comparison of four selection methods on their ability to generalize accurately, using two psychophysical models; rows give selection method and model fitted.] Psychophysics. Models of psychophysics (Roberts, 1979) were developed to describe the relationship between physical dimensions (e.g., light intensity) and their psychological counterparts (e.g., brightness). Two of the most influential have been Stevens... |
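The paper's psychophysics demonstration compares Stevens's power law with a logarithmic (Fechner-style) form. A minimal sketch of the two functional forms, with illustrative (not fitted) parameter values; the exact parameterizations used in the paper are assumptions here:

```python
import math

def stevens(x: float, a: float, b: float) -> float:
    # Stevens-style power law: psychological magnitude grows as a
    # power of physical intensity.
    return a * x ** b

def fechner(x: float, a: float, b: float) -> float:
    # Fechner-style logarithmic law.
    return a * math.log(x + b)

# Evaluate both forms over a doubling series of intensities.
intensities = [1.0, 2.0, 4.0, 8.0, 16.0]
power_curve = [stevens(x, a=1.0, b=0.5) for x in intensities]
log_curve = [fechner(x, a=1.0, b=1.0) for x in intensities]
```

Although both are two-parameter models, the paper's point is that they differ in functional form, and hence in complexity, which is why counting parameters alone cannot adjudicate between them.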

41 | Counting probability distributions: Differential geometry and model selection - Myung, Balasubramanian, et al. - 2000

Citation Context: ...lection and the solution being advocated. Consequently, only the most important (and new) technical advances are discussed. A more thorough treatment of the mathematics can be found in other sources (Myung, Balasubramanian, & Pitt, 2000; Myung, Kim, & Pitt, 2000; Myung & Pitt, 1997, 1998). After introducing the problem of model selection and identifying model complexity as a key property of a model that must be considered by any sel... |

38 | Model selection based on minimum description length - Grünwald - 2000

Citation Context: ...a model selection method. What is missing is a measure of how well the model fits the data (i.e., a measure of GOF). MDL, a model selection method from algorithmic coding theory in computer science (Grunwald, 2000; Rissanen, 1983, 1996), combines both of these measures. The MDL approach to model selection was developed within the domain of information theory, where the goal of model selection is to choose the m... |

33 | Flat minima - Hochreiter, Schmidhuber - 1997

Citation Context: ...e sort of an algorithm-based estimate of geometric complexity that in essence implements MDL in principle but does not require the derivation of the Fisher information matrix or its integration (e.g., Hochreiter & Schmidhuber, 1997). Future Work and Other Issues. Testing qualitative models of cognition. Application of MDL and geometric complexity requires that each of the models being compared be quantitative models that can be e... |

27 | Models of visual word recognition: Sampling the state of the art - Jacobs, Grainger - 1994 |

25 | On the information-based measure of covariance complexity and its application to the evaluation of multivariate linear models - Bozdogan - 1990

Citation Context: ...aike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), the root mean squared deviation (RMSD), the information-theoretic measure of complexity (ICOMP; Bozdogan, 1990), cross-validation (CV; Stone, 1974), and Bayesian model selection (BMS; Kass & Raftery, 1995; Myung & Pitt, 1997). Each of these methods assesses a model's generalizability by combining a measure of... |

25 | Toward an explanation of the power law artifact: Insights from response surface analysis - Myung, Kim, et al. - 2000

Citation Context: ...ed. Consequently, only the most important (and new) technical advances are discussed. A more thorough treatment of the mathematics can be found in other sources (Myung, Balasubramanian, & Pitt, 2000; Myung, Kim, & Pitt, 2000; Myung & Pitt, 1997, 1998). After introducing the problem of model selection and identifying model complexity as a key property of a model that must be considered by any selection method, we introduc... |

22 | Selectivity, scope, and simplicity of models: A lesson from fitting judgments of perceived depth - Cutting, Bruno, et al. - 1992

Citation Context: ...fferent functional forms. For example, how does one compare the functional forms of the logarithmic and exponential models in Table 1? The literature has been relatively silent on this issue (but see Cutting et al., 1992; Townsend, 1975). As we show above, differential geometry not only provides a solution, but the solution is intuitive. Complexity is conceptualized as counting explanations (i.e., distinguishable pro... |

17 | Model selection [Special issue] - Myung, Forster, et al. - 2000 |

16 | A comparison of learning models - Friedman, Massaro, et al. - 1995

Citation Context: ...sample size, n, will now be equal to 1 whereas the data size, N, remains unchanged. 4 The RMSD defined in Table 2 differs from the RMSD that has often been used in the psychological literature (e.g., Friedman, Massaro, Kitzis, & Cohen, 1995), where it is defined as RMSD = √(SSE/N), in which (N − k) is replaced by N, and therefore does not take into account the number of parameters. This form of RMSD is nothing more than RMSE. As such, it i... |
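The footnote's distinction can be made concrete: the Table 2 RMSD divides the summed squared error by N − k, penalizing free parameters, while the commonly reported RMSE divides by N and does not. A small sketch with hypothetical error values:

```python
import math

def rmsd(sse: float, n_obs: int, k: int) -> float:
    # RMSD as defined in the paper's Table 2: the summed squared error
    # is divided by N - k, so extra free parameters raise the score.
    return math.sqrt(sse / (n_obs - k))

def rmse(sse: float, n_obs: int) -> float:
    # The unpenalized form often reported in the literature.
    return math.sqrt(sse / n_obs)

# With the same raw error, the parameter-adjusted RMSD is larger for
# the model with more free parameters; RMSE cannot tell them apart.
sse, n_obs = 2.0, 20
r2 = rmsd(sse, n_obs, k=2)   # sqrt(2/18) ≈ 0.333
r5 = rmsd(sse, n_obs, k=5)   # sqrt(2/15) ≈ 0.365
```

The adjustment matters most when k is an appreciable fraction of N, which is common in cognitive modeling experiments with few conditions.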

10 | Statistical tests for comparing possibly misspecified and nonnested models - Golden - 2000 |

9 | Using parameter sensitivity and interdependence to predict model scope and falsifiability - Li, Lewandowsky, et al. - 1996 |

8 | Psychophysics of sensory function - Stevens - 1960 |

5 | How many parameters can a model have and still be testable - Bamber, van Santen - 1985 |

5 | Model complexity: The fit to random data reconsidered - Dunn - 2000 |

5 | Issues in selecting mathematical models of cognition - Myung, Pitt - 1998 |

5 | Mathematical modeling - Myung, Pitt - 2002

Citation Context: ...or from a probability distribution with a mean of zero. Quite often the means... [Figure 2 caption: Illustration of the relationship between goodness of fit and generalizability as a function of model complexity (Myung & Pitt, 2001). From Stevens' Handbook of Experimental Psychology (p. 449, Figure 11.4), by J. Wixted (Editor), 2001, New York: Wiley. Copyright 2001 by Wiley. Adapted with permission.] ...function g(θ, x) itself is ... |

2 | Similarity-scaling studies of dot-pattern classification and recognition - Shin - 1992 |

1 | Determining the complexity of arbitrary model classes. Paper presented at the 32nd annual meeting of the Society for Mathematical Psychology - Grunwald - 1999 |

1 | The mind–body problem revisited - Townsend - 1975

Citation Context: ...f a model is essential to model selection. Furthermore, the geometric complexity results validate a long-held suspicion regarding the source of the superior data-fitting abilities of Stevens's model (Townsend, 1975). Information Integration. In a typical information integration experiment, a range of stimuli is generated from a factorial manipulation of two or more stimulus dimensions (e.g., visual and auditory)... |