## Hierarchical distributed representations for statistical language modeling (2004)


### Other Repositories/Bibliography

Venue: Advances in Neural Information Processing Systems 17

Citations: 13 (5 self)

### BibTeX

@INPROCEEDINGS{Blitzer04hierarchicaldistributed,
  author = {John Blitzer and Kilian Q. Weinberger and Lawrence K. Saul and Fernando C. N. Pereira},
  title = {Hierarchical distributed representations for statistical language modeling},
  booktitle = {Advances in Neural Information Processing Systems 17},
  year = {2004},
  publisher = {MIT Press}
}


### Abstract

Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts.
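The abstract's core prediction step can be sketched in a few lines: a gating network maps a low-dimensional context vector to mixing weights, and each expert contributes a multinomial distribution over the vocabulary. This is a minimal illustration with toy sizes and random parameters, using a flat mixture rather than the paper's tree-structured (hierarchical) gating; all names and shapes here are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mixture_word_probs(x, gate_W, expert_logits):
    """P(w|x) = sum_k g_k(x) * p_k(w): gating weights computed from the
    context vector x mix per-expert multinomial distributions over words."""
    g = softmax(gate_W @ x)                                   # (K,) mixing weights
    experts = np.array([softmax(l) for l in expert_logits])   # (K, V) multinomials
    return g @ experts                                        # (V,) distribution

rng = np.random.default_rng(0)
d, K, V = 4, 3, 10            # context dim, experts, vocabulary (toy sizes)
x = rng.normal(size=d)        # low-dimensional context representation
gate_W = rng.normal(size=(K, d))
expert_logits = rng.normal(size=(K, V))
p = mixture_word_probs(x, gate_W, expert_logits)
print(round(p.sum(), 6))      # prints 1.0: a valid distribution over words
```

Because each expert is itself a normalized multinomial and the gating weights sum to one, the mixture is automatically a proper distribution over the vocabulary, whatever the context vector.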

### Citations

1287 | The mathematics of statistical machine translation: Parameter estimation
- Brown, Pietra, et al.
- 1993
Citation Context: ...ion Statistical language models are essential components of natural language systems for human-computer interaction. They play a central role in automatic speech recognition [11], machine translation [5], statistical parsing [8], and information retrieval [15]. These models estimate the probability that a word will occur in a given context, where in general a context specifies a relationship to one o...

938 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1996
Citation Context: ...oothing methods are based on simple back-off formulas or interpolation schemes that discount the probability of observed events and assign the “leftover” probability mass to events unseen in training [7]. Unfortunately, these methods do not typically represent or take advantage of statistical regularities among contexts. One expects the probabilities of rare or unseen events in one context to be rela...

797 | Statistical methods for speech recognition
- Jelinek
- 1997
Citation Context: ...word contexts. 1 Introduction Statistical language models are essential components of natural language systems for human-computer interaction. They play a central role in automatic speech recognition [11], machine translation [5], statistical parsing [8], and information retrieval [15]. These models estimate the probability that a word will occur in a given context, where in general a context specifie...

764 | Hierarchical mixtures of experts and the em algorithm
- Jordan, Jacobs
- 1994
Citation Context: ...vised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly ...

755 | A study of smoothing methods for language models applied to ad hoc information retrieval
- Zhai, Lafferty
- 2001

738 | Class-based n-gram models of natural language
- Brown, Pietra, et al.
- 1992
Citation Context: ...text grows exponentially with the context length. The neural probabilistic language model (NPLM) of Bengio et al. [2, 3] achieved significant improvements over state-of-the-art smoothed n-gram models [6]. The NPLM encodes contexts as low-dimensional continuous vectors. These are fed to a multilayer neural network that outputs a probability distribution over words. The low-dimensional vectors and the ...

529 | Three generative, lexicalised models for statistical parsing
- Collins
- 1997
Citation Context: ...models are essential components of natural language systems for human-computer interaction. They play a central role in automatic speech recognition [11], machine translation [5], statistical parsing [8], and information retrieval [15]. These models estimate the probability that a word will occur in a given context, where in general a context specifies a relationship to one or more words that have al...

165 | A neural probabilistic language model
- Bengio, Ducharme, et al.
- 2003
Citation Context: ... where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabul...

153 | A C library for semidefinite programming
- CSDP
- 1999
Citation Context: ... The eigenvalues, shown normalized by their sum, measure the relative variance captured by individual dimensions. The optimization is convex, and its global maximum can be computed in polynomial time [4]. The optimization here differs slightly from the one used by Weinberger et al. [14] in that here we only preserve local distances, as opposed to local distances and angles. After computing the matrix...

123 | Learning a kernel matrix for nonlinear dimensionality reduction
- Weinberger, Sha, et al.
- 2004
Citation Context: ...s of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired b...

116 | A kernel view of the dimensionality reduction of manifolds
- Ham, Lee, et al.
- 2004
Citation Context: ...e local distances, as opposed to local distances and angles. After computing the matrix D_ij by semidefinite programming, a low-dimensional embedding x_i is obtained by metric multidimensional scaling [1, 9, 14]. The top eigenvalues of the Gram matrix measure the variance captured by the leading dimensions of this embedding. Thus, one can compare the eigenvalue spectra from this method and PCA to ascertain i...
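The metric multidimensional scaling step mentioned in this excerpt has a standard closed form: double-center the squared-distance matrix to recover a Gram matrix, then embed using its top eigenvectors scaled by the square roots of the eigenvalues. The sketch below shows the classical MDS recipe on a toy distance matrix; it is a generic illustration, not the paper's code.

```python
import numpy as np

def embed_from_distances(D, dim):
    """Classical metric MDS: double-center the squared-distance matrix D
    into a Gram matrix, then embed with its top `dim` eigenvectors."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ D @ J                  # Gram matrix of centered points
    vals, vecs = np.linalg.eigh(G)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]    # take the largest `dim`
    scale = np.sqrt(np.clip(vals[idx], 0, None))
    return vecs[:, idx] * scale

# toy check: points on a line are recovered up to sign
X = np.array([[0.0], [1.0], [3.0]])
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
Y = embed_from_distances(D, 1)
D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
print(np.allclose(D, D2))  # prints True: distances are preserved
```

The eigenvalue spectrum of the Gram matrix computed inside this routine is exactly what the excerpt proposes comparing against the PCA spectrum, since it measures how much variance each embedding dimension carries.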

81 | Aggregate and mixed-order Markov models for statistical language processing
- Saul, Pereira
- 1997
Citation Context: ...on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1 Introduction Statistical language models are essential components of natural language systems for human-computer interactio...

69 | Solving Euclidean distance matrix completion problems via semidefinite programming
- Alfakih, Khandani, et al.
- 1999
Citation Context: ... to this problem based on semidefinite programming [14]. Let x_i denote the image of p_i under this mapping. The mapping is discovered by first learning the V×V matrix of Euclidean squared distances [1] given by D_ij = |x_i − x_j|^2. This is done by balancing two competing goals: (i) to co-locate semantically similar words, and (ii) to separate semantically dissimilar words. The first goal is achiev...

3 | Statistical models for co-occurrence and histogram data
- Hofmann, Puzicha
- 1998
Citation Context: ...on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1 Introduction Statistical language models are essential components of natural language systems for human-computer interactio...