## Quick Training of Probabilistic Neural Nets by Importance Sampling (2003)

Citations: 11 (5 self)

### BibTeX

@MISC{Bengio03quicktraining,
  author = {Yoshua Bengio and Jean-Sébastien Senécal},
  title  = {Quick Training of Probabilistic Neural Nets by Importance Sampling},
  year   = {2003}
}

### Abstract

Our previous work on statistical language modeling introduced the use of probabilistic feedforward neural networks to help deal with the curse of dimensionality. Training this model by maximum likelihood, however, requires as many network passes per example as there are words in the vocabulary. Inspired by the contrastive divergence model, we propose and evaluate sampling-based methods which require network passes only for the observed "positive example" and a few sampled negative example words. A very significant speed-up is obtained with an adaptive importance sampling scheme.
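The core idea can be sketched concretely. The following numpy snippet is an illustration, not the authors' implementation: it estimates the expensive "negative" term of the log-likelihood gradient (an expectation under the model's softmax) from a handful of words sampled from a proposal distribution Q, instead of evaluating all V energies. The toy energy E(w) = -theta[w], the uniform proposal, and the name `sampled_grad_logZ` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10_000                  # vocabulary size
theta = rng.normal(size=V)  # toy scores: energy E(w) = -theta[w]
Q = np.full(V, 1.0 / V)     # proposal (uniform here; a unigram model is typical)

def sampled_grad_logZ(theta, Q, k=100):
    """Self-normalized importance-sampling estimate of the model
    distribution P(w), which weights the 'negative' term of the
    log-likelihood gradient. Only k energies are evaluated, not V."""
    idx = rng.choice(len(Q), size=k, p=Q)  # k sampled negative words
    r = np.exp(theta[idx]) / Q[idx]        # ratios e^{-E(w)} / Q(w)
    w = r / r.sum()                        # self-normalize the ratios
    est = np.zeros_like(theta)
    np.add.at(est, idx, w)                 # scatter-add repeated samples
    return est

est = sampled_grad_logZ(theta, Q, k=2000)  # nonnegative, sums to 1
```

The full gradient for an observed word then combines one "positive" pass on that word with this estimate, avoiding the V network passes the exact softmax normalization would require.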

### Citations

1083 | A maximum entropy approach to natural language processing - Berger, Della Pietra, et al. - 1996 |

953 | Head-Driven Statistical Models for Natural Language Parsing - Collins - 1999 |

Citation Context: ...se prevent us from capturing some dependencies that we know to matter. This well-known problem plagues n-grams (Katz, 1987; Jelinek and Mercer, 1980), lexicalized stochastic grammars (Charniak, 1999; Collins, 1999; Chelba and Jelinek, 2000), and many other probabilistic models (e.g. graphical models with dense cycles, in general, see (Jordan, 1998)), and it is well described, with many examples and analyses, i... |

822 | A Maximum-Entropy-Inspired Parser - Charniak - 1999 |

Citation Context: ...mptions, but these prevent us from capturing some dependencies that we know to matter. This well-known problem plagues n-grams (Katz, 1987; Jelinek and Mercer, 1980), lexicalized stochastic grammars (Charniak, 1999; Collins, 1999; Chelba and Jelinek, 2000), and many other probabilistic models (e.g. graphical models with dense cycles, in general, see (Jordan, 1998)), and it is well described, with many examples ... |

664 | Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer - Katz - 1987 |

Citation Context: ...of words. This can be avoided with drastic conditional independence assumptions, but these prevent us from capturing some dependencies that we know to matter. This well-known problem plagues n-grams (Katz, 1987; Jelinek and Mercer, 1980), lexicalized stochastic grammars (Charniak, 1999; Collins, 1999; Chelba and Jelinek, 2000), and many other probabilistic models (e.g. graphical models with dense cycles, in... |

611 | Learning in Graphical Models - Jordan, editor - 1999 |

508 | Training products of experts by minimizing contrastive divergence - Hinton - 2002 |

Citation Context: ...ty of the next word, the computation of the partition function is feasible (it only grows linearly with the vocabulary size) but is still quite expensive. Hinton has proposed the contrastive divergence (Hinton, 2002) approach to approximate the gradient of the log-likelihood with respect to these parameters, in the unsupervised case, to make learning computationally feasible. Contrastive Divergence is based on a... |

336 | Interpolated estimation of Markov source parameters from sparse data - Jelinek, Mercer - 1980 |

179 | Combining probability distributions: a critique and an annotated bibliography - Genest, Zidek - 1986 |

Citation Context: ...mate the gradient. In this paper, we propose several new variants of that key idea. 1.4 Products vs Sums of Probabilities Like previous probabilistic models (e.g. weighted averages in the log-domain (Genest and Zidek, 1986; Heskes, 1998)), the Maximum Entropy model (Berger, Della Pietra and Della Pietra, 1996) and the neural models described above, can be interpreted as energy-based models that correspond to normalized ... |

173 | Learning distributed representations of concepts - Hinton - 1986 |

Citation Context: .... 1.2 Distributed Representations for High-Order Dependencies The idea of using distributed representations has been one of the early contributions of the connectionist researchers of the 80s (cf. (Hinton, 1986)). The hidden units of an artificial neural network encode information in a very efficient way; as an example, n binary neurons can represent 2^n different objects. More importantly, these hidden unit... |
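The capacity claim quoted in that context can be checked directly; this tiny snippet (purely illustrative) enumerates the joint states of n binary units:

```python
from itertools import product

n = 4
# All joint states of n binary units: there are 2**n of them,
# so n units can index 2**n distinct objects (16 for n = 4).
states = list(product([0, 1], repeat=n))
```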

146 | A Neural Probabilistic Language Model - Bengio, Ducharme, et al. |

34 | Bias/variance decompositions for likelihood-based estimators - Heskes - 1998 |

Citation Context: ...s paper, we propose several new variants of that key idea. 1.4 Products vs Sums of Probabilities Like previous probabilistic models (e.g. weighted averages in the log-domain (Genest and Zidek, 1986; Heskes, 1998)), the Maximum Entropy model (Berger, Della Pietra and Della Pietra, 1996) and the neural models described above, can be interpreted as energy-based models that correspond to normalized products, as i... |

27 | User-Friendly Text Prediction for Translators - Foster, Langlais, et al. - 2002 |

Citation Context: ...rts have been carried out in different instances, and they suggest that products of experts can yield significant improvements in terms of out-of-sample likelihood. For example, Foster's experiments (Foster, 2002) confront head-to-head a normalized product of probabilities (implemented by a Maximum Entropy model) with a weighted sum of probabilities. In this case the application is to statistical translation;... |

10 | Taking on the curse of dimensionality in joint distributions using neural networks - Bengio, Bengio - 2000 |

Citation Context: ...more computation because the objective function is not convex and can in fact be quite complex. These ideas have been exploited to learn the probability function of high-dimensional discrete data in (Bengio and Bengio, 2000), where comparisons have been made with polynomial learners and table-based graphical models. In terms of representation, distributed models (e.g. neural networks, distributed representation graphica... |

1 | Structured language modeling - Chelba, Jelinek - 2000 |

Citation Context: ...rom capturing some dependencies that we know to matter. This well-known problem plagues n-grams (Katz, 1987; Jelinek and Mercer, 1980), lexicalized stochastic grammars (Charniak, 1999; Collins, 1999; Chelba and Jelinek, 2000), and many other probabilistic models (e.g. graphical models with dense cycles, in general, see (Jordan, 1998)), and it is well described, with many examples and analyses, in the book (Manning and Sc... |

1 | A theoretical framework for sequential importance sampling and resampling - Logvinenko - 2001 |

Citation Context: ...radually increased according to a diagnostic. It uses Algorithm 3 in the inner loop. The measure used for the diagnostic is known as the "effective sample size" S of the importance sampling estimator (Logvinenko, 2001): $S = \left(\sum_{j=1}^{N} r_j\right)^2 / \sum_{j=1}^{N} r_j^2$, where $r_j = P(w'_j)/Q(w'_j)$ is the importance sampling ratio for the j-th sample, which is estimated here with $r_j \approx e^{-E(w'_j, h_t)} / \big(Q(w'_j \mid h_t) \sum_{w' \in J} e^{-E(w...}$ |
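The effective-sample-size diagnostic quoted in that context is straightforward to compute from the importance ratios. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def effective_sample_size(r):
    """S = (sum_j r_j)^2 / sum_j r_j^2 for importance ratios r_j.
    S equals N when all ratios are equal and falls toward 1 when
    a single sample dominates the estimator."""
    r = np.asarray(r, dtype=float)
    return r.sum() ** 2 / (r ** 2).sum()

equal = effective_sample_size([1.0, 1.0, 1.0, 1.0])     # 4.0: every sample counts
skewed = effective_sample_size([100.0, 1.0, 1.0, 1.0])  # close to 1: one sample dominates
```

In the paper's adaptive scheme, a small value of S signals that the proposal Q is a poor match for the model distribution, triggering more samples or an updated proposal.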