## Modeling for Optimal Probability Prediction (2002)


Venue: | Proceedings of the Nineteenth International Conference on Machine Learning |

Citations: | 7 - 0 self |

### BibTeX

@INPROCEEDINGS{Wang02modelingfor,

author = {Yong Wang and Ian H. Witten},

title = {Modeling for Optimal Probability Prediction},

booktitle = {Proceedings of the Nineteenth International Conference on Machine Learning},

year = {2002},

pages = {650--657},

publisher = {Morgan Kaufmann}

}


### Abstract

We present a general modeling method for optimal probability prediction over future observations, in which model dimensionality is determined as a natural by-product. This new method yields several estimators, and we establish theoretically that they are optimal (either overall or under stated restrictions) when the number of free parameters is infinite.

### Citations

4934 | C4.5: Programs for Machine Learning - Quinlan - 1993 |

Citation Context: ...ribes how to handle mixtures of non-central $\chi^2_1$'s. There are many applications for techniques that select amongst nested models, for example pruning tree-structured models (Breiman et al., 1984; Quinlan, 1993). 4. Case study: Fitting logistic models Now it is time to apply the general results we have established to the special case of logistic regression models, and present simulation results. 4.1 Logisti...

3909 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984 |

Citation Context: ...Wang (2000) describes how to handle mixtures of non-central $\chi^2_1$'s. There are many applications for techniques that select amongst nested models, for example pruning tree-structured models (Breiman et al., 1984; Quinlan, 1993). 4. Case study: Fitting logistic models Now it is time to apply the general results we have established to the special case of logistic regression models, and present simulation resul...

2868 | UCI Repository of Machine Learning Databases - Merz, Murphy - 1996 |

2307 | Estimating the dimension of a model - Schwarz - 1978 |

Citation Context: ...ace regression" (Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/$C_p$ (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROT...

1832 | Regression shrinkage and selection via the lasso - Tibshirani - 1996 |

Citation Context: ... and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995), and LASSO (Tibshirani, 1996). The new approach, which adopts a methodology that resembles empirical Bayes (Robbins, 1955, 1964), was shown to always rival, and generally outperform, all these techniques in terms of predictive s...

1235 | Information theory and an extension of the maximum likelihood principle - Akaike - 1973 |

1203 | Categorical Data Analysis - Agresti - 1990 |

1160 | Modeling by shortest data description - Rissanen - 1978 |

Citation Context: ... (Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/$C_p$ (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995...

550 | Ideal spatial adaptation by wavelet shrinkage, Biometrika - Donoho, Johnstone - 1994 |

495 | Ridge regression: Biased estimation for nonorthogonal problems - Hoerl, Kennard - 1970 |

209 | Some comments on $C_p$ - Mallows - 1973 |

Citation Context: ...g linear models called "pace regression" (Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/$C_p$ (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and ...

139 | Fitting autoregressive models for prediction - Akaike - 1969 |

Citation Context: ...w approach to fitting linear models called "pace regression" (Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/$C_p$ (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge r...

126 | The risk inflation criterion for multiple regression - Foster, George - 1994 |

103 | Better subset regression using the nonnegative garrote - Breiman - 1995 |

Citation Context: ...issanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995), and LASSO (Tibshirani, 1996). The new approach, which adopts a methodology that resembles empirical Bayes (Robbins, 1955, 1964), was shown to always rival, and generally outperform, all these techn...

70 | An empirical Bayes approach to statistics - Robbins - 1955 |

Citation Context: ...kage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995), and LASSO (Tibshirani, 1996). The new approach, which adopts a methodology that resembles empirical Bayes (Robbins, 1955, 1964), was shown to always rival, and generally outperform, all these techniques in terms of predictive squared error. Moreover, it determines the dimensionality of the model as a natural by-product...

69 | Nonparametric maximum likelihood estimation of a mixing distribution - Laird - 1978 |

Citation Context: ...timate of this function. This interpretation leads to the same result as the Bayesian framework. Many consistent estimators of an arbitrary G are available in the literature, including the MLE (e.g., Laird, 1978; Bohning et al., 1992; Lesperance & Kalbfleisch, 1992) and some minimum distance estimators (e.g., Choi & Bulgren, 1968; Deely & Kruse, 1968; Macdonald, 1971; Blum & Susarla, 1977; Wang, 2000). Here, ...

38 | On information and sufficiency - Kullback, Leibler - 1951 |

35 | The empirical Bayes approach to statistical decision problems - Robbins - 1964 |

25 | The covariance inflation criterion for adaptive model selection - Tibshirani, Knight - 1999 |

23 | An algorithm for computing the nonparametric MLE of a mixing distribution - Lesperance, Kalbfleisch - 1992 |

10 | Construction of sequences estimating the mixing distribution - Deely, Kruse - 1968 |

9 | An estimation procedure for mixtures of distributions - Choi, Bulgren - 1968 |

Citation Context: ...mators of an arbitrary G are available in the literature, including the MLE (e.g., Laird, 1978; Bohning et al., 1992; Lesperance & Kalbfleisch, 1992) and some minimum distance estimators (e.g., Choi & Bulgren, 1968; Deely & Kruse, 1968; Macdonald, 1971; Blum & Susarla, 1977; Wang, 2000). Here, consistency means that $\Pr\bigl(\lim_{k\to\infty} G_k(\theta) = G(\theta),\ \theta \text{ any continuity point of } G\bigr) = 1$. (3) 2.2 Estimation for multinormal...

6 | Estimation of a mixing distribution function - Blum, Susarla - 1977 |

5 | Information Theory and Statistics, 2nd ed - Kullback - 1968 |

Citation Context: ...e two distribution functions $\hat f$ and $f$, where $\hat f$ is an estimate of the probability density function (pdf) $f(x)$. This measure is $\Delta_{KL}(f, \hat f) = E_f \log(f/\hat f) = \int \log(f/\hat f)\, f\, dx$ (Kullback and Leibler, 1951; Kullback, 1968). This is appropriate because, almost surely, $\left(\prod_{i=1}^{n} \frac{\hat f(x_i)}{f(x_i)}\right)^{1/n} \to e^{-\Delta_{KL}(f, \hat f)}$ as $n \to \infty$, (1) where $x_1, \ldots, x_n$ are iid from $f(X)$. The goal of the modeling process that is most int...

4 | Comment on a paper by Choi and - Macdonald - 1971 |

Citation Context: ...le in the literature, including the MLE (e.g., Laird, 1978; Bohning et al., 1992; Lesperance & Kalbfleisch, 1992) and some minimum distance estimators (e.g., Choi & Bulgren, 1968; Deely & Kruse, 1968; Macdonald, 1971; Blum & Susarla, 1977; Wang, 2000). Here, consistency means that $\Pr\bigl(\lim_{k\to\infty} G_k(\theta) = G(\theta),\ \theta \text{ any continuity point of } G\bigr) = 1$. (3) 2.2 Estimation for multinormal distributions A special case of the a...

3 | C.A.MAN (computer assisted analysis of mixtures): Statistical algorithms - Bohning, Schlattmann - 1992 |

3 | Classical Inference and the Linear Model, Volume 2A of Kendall's Advanced Theory of Statistics - Stuart, Ord - 1999 |
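
The citation context for Kullback (1968) above quotes a limit relation: for iid draws from the true density, the geometric mean of the ratio of estimated to true density tends to the exponential of the negative Kullback-Leibler divergence. The following is a minimal numerical sketch of that relation, not code from the paper. It assumes, purely for illustration, that both densities are univariate normals; the function names, parameter values, and seed are hypothetical.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log density of a univariate normal N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def kl_normal(mean_f, sd_f, mean_g, sd_g):
    """Closed-form KL(f, g) = E_f log(f/g) for two univariate normals."""
    return (math.log(sd_g / sd_f)
            + (sd_f ** 2 + (mean_f - mean_g) ** 2) / (2.0 * sd_g ** 2)
            - 0.5)


def geometric_mean_ratio(n, mean_f, sd_f, mean_g, sd_g, seed=0):
    """Geometric mean of g(x_i)/f(x_i) over n iid draws x_i from f.

    Here f plays the role of the true density and g the estimate.
    The mean is computed in log space to avoid underflow.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mean_f, sd_f)
        total += normal_logpdf(x, mean_g, sd_g) - normal_logpdf(x, mean_f, sd_f)
    return math.exp(total / n)


if __name__ == "__main__":
    # Assumed example: true density N(0, 1), estimate N(0.5, 1).
    kl = kl_normal(0.0, 1.0, 0.5, 1.0)
    empirical = geometric_mean_ratio(200_000, 0.0, 1.0, 0.5, 1.0)
    print("exp(-KL):", math.exp(-kl))
    print("empirical geometric mean:", empirical)
```

With a large sample the two printed values should agree closely, which is the sense in which the divergence measures expected predictive loss over future observations.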