## Modeling for Optimal Probability Prediction (2002)


Venue: Proceedings of the Nineteenth International Conference on Machine Learning

Citations: 8 (0 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Wang02modelingfor,
  author    = {Yong Wang and Ian H. Witten},
  title     = {Modeling for Optimal Probability Prediction},
  booktitle = {Proceedings of the Nineteenth International Conference on Machine Learning},
  year      = {2002},
  pages     = {650--657},
  publisher = {Morgan Kaufmann}
}
```


### Abstract

We present a general modeling method for optimal probability prediction over future observations, in which model dimensionality is determined as a natural by-product. This new method yields several estimators, and we establish theoretically that they are optimal (either overall or under stated restrictions) when the number of free parameters is infinite.

### Citations

5328 | C4.5: Programs for Machine Learning - Quinlan - 1993
Citation Context: ...describes how to handle mixtures of non-central χ₁²'s. There are many applications for techniques that select amongst nested models, for example, pruning tree-structured models (Breiman et al., 1984; Quinlan, 1993). 4. Case study: Fitting logistic models. Now it is time to apply the general results we have established to the special case of logistic regression models, and present simulation results. 4.1 Logisti... |
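The context above refers to the paper's case study on fitting logistic models. As a hedged sketch of the standard maximum-likelihood step such a study builds on (Newton-Raphson on one feature plus intercept; `fit_logistic` is an illustrative helper, not code from the paper):

```python
import math
import random

def fit_logistic(xs, ys, iters=25):
    """Fit P(y=1|x) = sigmoid(a + b*x) by Newton-Raphson maximum
    likelihood. A minimal sketch only; the paper's estimators add
    an optimal-prediction adjustment not reproduced here."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        # Gradient (ga, gb) and Hessian entries of the log-likelihood.
        ga = gb = haa = hab = hbb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p
            gb += (y - p) * x
            w = p * (1.0 - p)
            haa += w
            hab += w * x
            hbb += w * x * x
        det = haa * hbb - hab * hab
        if det == 0:
            break
        # Newton step: solve the 2x2 system H d = g and move by d.
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(2000)]
# Simulate labels from a true model with intercept 0.5 and slope 2.
ys = [1 if random.random() < 1 / (1 + math.exp(-(0.5 + 2 * x))) else 0
      for x in xs]
a, b = fit_logistic(xs, ys)
print(a, b)  # estimates should land near the true (0.5, 2.0)
```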

4315 | Classification and Regression Trees - Breiman, Friedman, et al. - 1984
Citation Context: ...of θ. Wang (2000) describes how to handle mixtures of non-central χ₁²'s. There are many applications for techniques that select amongst nested models, for example, pruning tree-structured models (Breiman et al., 1984; Quinlan, 1993). 4. Case study: Fitting logistic models. Now it is time to apply the general results we have established to the special case of logistic regression models, and present simulation resul... |

3029 | UCI repository of machine learning databases. Available on-line: http://www.ics.uci.edu/~mlearn/MLRepository.html - Blake, Merz - 1998 |

2647 | Estimating the dimension of a model - Schwarz - 1978
Citation Context: ..."pace regression" (Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/Cp (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROT... |
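The context lists subset-selection criteria such as AIC and BIC. A minimal sketch of how these criteria score two nested Gaussian linear models (the helper `gaussian_aic_bic` and the simulated data are assumptions for illustration, not the paper's method):

```python
import math
import random

def gaussian_aic_bic(rss, n, k):
    """AIC and BIC for a Gaussian linear model with k free mean
    parameters, error variance profiled out; smaller is better.
    Illustrative helper, not from any cited work."""
    ll_term = n * math.log(rss / n)
    return ll_term + 2 * k, ll_term + k * math.log(n)

random.seed(1)
n = 200
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [1.0 + random.gauss(0, 1) for _ in xs]  # true slope is zero

# Model 1: intercept only.
mean_y = sum(ys) / n
rss1 = sum((y - mean_y) ** 2 for y in ys)

# Model 2: intercept + slope, fitted by ordinary least squares.
mean_x = sum(xs) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b = sxy / sxx
a = mean_y - b * mean_x
rss2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

aic1, bic1 = gaussian_aic_bic(rss1, n, 1)
aic2, bic2 = gaussian_aic_bic(rss2, n, 2)
print(rss2 <= rss1)  # prints True: an extra parameter never increases RSS
```

Note how BIC's per-parameter penalty log(n) exceeds AIC's constant 2 once n > e², which is why BIC prunes harder in large samples.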

2024 | Regression shrinkage and selection via the LASSO - Tibshirani - 1996
Citation Context: ...and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995), and LASSO (Tibshirani, 1996). The new approach, which adopts a methodology that resembles empirical Bayes (Robbins, 1955, 1964), was shown to always rival, and generally outperform, all these techniques in terms of predictive s... |
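The context also lists shrinkage methods such as ridge regression. A minimal one-parameter sketch of the shrinkage effect (the `ridge_slope` helper is hypothetical, for illustration only):

```python
import random

def ridge_slope(xs, ys, lam):
    """Ridge estimate of a no-intercept slope:
    argmin_b sum (y - b*x)^2 + lam * b^2  =>  b = Sxy / (Sxx + lam).
    lam = 0 recovers ordinary least squares."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [3.0 * x + random.gauss(0, 1) for x in xs]

b_ols = ridge_slope(xs, ys, 0.0)     # ordinary least squares
b_ridge = ridge_slope(xs, ys, 50.0)  # heavily shrunk toward zero
print(abs(b_ridge) < abs(b_ols))  # prints True: shrinkage reduces magnitude
```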

1577 | Categorical Data Analysis - Agresti - 2002 |

1541 | Information theory and an extension of the maximum likelihood principle - Akaike - 1973 |

1235 | Modeling by shortest data description - Rissanen - 1978
Citation Context: ...(Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/Cp (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995... |

597 | Ideal spatial adaptation via wavelet shrinkage - Donoho, Johnstone - 1994 |

555 | Ridge Regression: Biased Estimation for Non-orthogonal Problems - Hoerl, Kennard - 1970 |

246 | Some comments on Cp - Mallows - 1973
Citation Context: ...g linear models called "pace regression" (Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/Cp (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and ... |

167 | Fitting Autoregressive Models for Prediction - Akaike - 1969
Citation Context: ...w approach to fitting linear models called "pace regression" (Wang, 2000). Standard techniques for this problem include ordinary least squares (OLS); OLS subset selection methods such as FPE/AIC/Cp (Akaike, 1969, 1973; Mallows, 1973), BIC/MDL (Schwarz, 1978; Rissanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge r... |

138 | The Risk Inflation Criterion for Multiple Regression - Foster, George - 1994 |

115 | Better subset regression using the nonnegative garrote - Breiman - 1995
Citation Context: ...issanen, 1978), RIC (Donoho and Johnstone, 1994; Foster and George, 1994), and CIC (Tibshirani and Knight, 1999); and shrinkage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995), and LASSO (Tibshirani, 1996). The new approach, which adopts a methodology that resembles empirical Bayes (Robbins, 1955, 1964), was shown to always rival, and generally outperform, all these techn... |

85 | Nonparametric maximum likelihood estimation of a mixing distribution - Laird - 1978
Citation Context: ...timate of this function. This interpretation leads to the same result as the Bayesian framework. Many consistent estimators of an arbitrary G are available in the literature, including the MLE (e.g., Laird, 1978; Bohning et al., 1992; Lesperance & Kalbfleisch, 1992) and some minimum distance estimators (e.g., Choi & Bulgren, 1968; Deely & Kruse, 1968; Macdonald, 1971; Blum & Susarla, 1977; Wang, 2000). Here, ... |
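The context concerns consistent estimation of an arbitrary mixing distribution G. One common approximation to the nonparametric MLE fixes a support grid and runs EM on the weights; the sketch below assumes unit-variance normal components and a hand-picked grid (generic EM, not Laird's own algorithm):

```python
import math
import random

def mixing_weights_em(data, grid, iters=200):
    """EM updates for mixture weights on a fixed support grid: an
    approximation to the nonparametric MLE of the mixing
    distribution G. The grid itself is an assumption here."""
    k = len(grid)
    w = [1.0 / k] * k
    # Unnormalized N(t, 1) densities of each observation at each grid point.
    dens = [[math.exp(-0.5 * (x - t) ** 2) for t in grid] for x in data]
    for _ in range(iters):
        new = [0.0] * k
        for row in dens:
            total = sum(wj * dj for wj, dj in zip(w, row))
            for j in range(k):
                new[j] += w[j] * row[j] / total  # responsibility of point j
        w = [v / len(data) for v in new]
    return w

random.seed(3)
# Half the observations centered at 0, half at 4 (unit variance).
data = ([random.gauss(0, 1) for _ in range(300)]
        + [random.gauss(4, 1) for _ in range(300)])
grid = [0.0, 2.0, 4.0]
w = mixing_weights_em(data, grid)
print(w)  # mass should concentrate on the grid points near 0 and 4
```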

83 | An empirical Bayes approach to statistics - Robbins - 1955
Citation Context: ...kage methods such as ridge regression (Hoerl and Kennard, 1970), NN-GARROTE (Breiman, 1995), and LASSO (Tibshirani, 1996). The new approach, which adopts a methodology that resembles empirical Bayes (Robbins, 1955, 1964), was shown to always rival, and generally outperform, all these techniques in terms of predictive squared error. Moreover, it determines the dimensionality of the model as a natural by-product... |

42 | On Information and Sufficiency - Kullback, Leibler - 1951 |

40 | The empirical Bayes approach to statistical decision problems - Robbins - 1964 |

26 | The covariance inflation criterion for adaptive model selection - Tibshirani, Knight - 1999 |

24 | An algorithm for computing the nonparametric MLE of a mixing distribution - Lesperance, Kalbfleisch - 1992 |

10 | Construction of sequences estimating the mixing distribution - Deely, Kruse - 1968 |

9 | An estimation procedure for mixtures of distributions - Choi, Bulgren - 1968
Citation Context: ...mators of an arbitrary G are available in the literature, including the MLE (e.g., Laird, 1978; Bohning et al., 1992; Lesperance & Kalbfleisch, 1992) and some minimum distance estimators (e.g., Choi & Bulgren, 1968; Deely & Kruse, 1968; Macdonald, 1971; Blum & Susarla, 1977; Wang, 2000). Here, consistency means that Pr(lim_{k→∞} Ĝ_k(θ) = G(θ) for θ any continuity point of G) = 1. (3) 2.2 Estimation for multinormal... |

6 | Estimation of a mixing distribution function - Blum, Susarla - 1977 |

5 | Information Theory and Statistics, 2nd ed. - Kullback - 1968
Citation Context: ...e two distribution functions f̂ and f, where f̂ is an estimate of the probability density function (pdf) f(x). This measure is Δ_KL(f, f̂) = E_f log(f/f̂) = ∫ log(f/f̂) f dx (Kullback and Leibler, 1951; Kullback, 1968). This is appropriate because, almost surely, (∏_{i=1}^{n} f̂(x_i)/f(x_i))^{1/n} → e^{−Δ_KL(f, f̂)} as n → ∞, (1) where x_1, ..., x_n are iid from f(X). The goal of the modeling process that is most int... |
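The context defines the KL divergence Δ_KL and the limit in equation (1): the geometric-mean likelihood ratio tends to e^{−Δ_KL(f, f̂)}. A numeric sanity check under the assumption that both the true f and its estimate are Gaussian (`kl_gauss` and the chosen parameters are illustrative):

```python
import math
import random

def kl_gauss(mu0, s0, mu1, s1):
    """Closed-form KL divergence between N(mu0, s0^2) (true f)
    and N(mu1, s1^2) (the estimate f-hat)."""
    return (math.log(s1 / s0)
            + (s0 ** 2 + (mu0 - mu1) ** 2) / (2 * s1 ** 2) - 0.5)

def gauss_logpdf(x, mu, s):
    return -0.5 * ((x - mu) / s) ** 2 - math.log(s * math.sqrt(2 * math.pi))

random.seed(4)
n = 100_000
f = (0.0, 1.0)     # true distribution
fhat = (0.5, 1.2)  # some estimate of it
xs = [random.gauss(*f) for _ in range(n)]

# (1/n) * sum log(fhat(x_i)/f(x_i)) -> -KL(f, fhat), which is the
# log of the geometric-mean likelihood ratio in equation (1).
avg = sum(gauss_logpdf(x, *fhat) - gauss_logpdf(x, *f) for x in xs) / n
print(-avg, kl_gauss(0.0, 1.0, 0.5, 1.2))  # the two should nearly agree
```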

4 | C.A.MAN: computer assisted analysis of mixtures: statistical algorithms (Biometrics 48) - Bohning, Schlattman, et al. - 1992 |

4 | Comment on a paper by Choi and Bulgren - Macdonald - 1971
Citation Context: ...le in the literature, including the MLE (e.g., Laird, 1978; Bohning et al., 1992; Lesperance & Kalbfleisch, 1992) and some minimum distance estimators (e.g., Choi & Bulgren, 1968; Deely & Kruse, 1968; Macdonald, 1971; Blum & Susarla, 1977; Wang, 2000). Here, consistency means that Pr(lim_{k→∞} Ĝ_k(θ) = G(θ) for θ any continuity point of G) = 1. (3) 2.2 Estimation for multinormal... |

3 | Classical Inference and the Linear Model, Volume 2A of Kendall's Advanced Theory of Statistics - Stuart, Ord - 1999 |