Results 11–20 of 132
A fast and simple algorithm for training neural probabilistic language models
 In Proceedings of the International Conference on Machine Learning, 2012
Decoding with Large-Scale Neural Language Models Improves Translation
Abstract

Cited by 28 (4 self)
We explore the application of neural language models to machine translation. We develop a new model that combines the neural probabilistic language model of Bengio et al., rectified linear units, and noise-contrastive estimation, and we incorporate it into a machine translation system both by reranking k-best lists and by direct integration into the decoder. Our large-scale, large-vocabulary experiments across four language pairs show that our neural language model improves translation quality by up to 1.1 BLEU.
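The noise-contrastive estimation component mentioned in this abstract can be illustrated with a minimal sketch (hypothetical scores and a uniform noise distribution; not the paper's actual implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(s_data, s_noise, logp_noise_data, logp_noise_samples, k):
    """Noise-contrastive estimation loss for one observed word.

    s_data: the model's unnormalised score for the observed word
    s_noise: scores for the k words sampled from the noise distribution
    logp_noise_*: log-probabilities of those words under the noise distribution
    Training reduces language modelling to classifying data vs. noise samples,
    avoiding the normalisation sum over the whole vocabulary.
    """
    # Posterior probability that the observed word came from the data
    p_data = sigmoid(s_data - np.log(k) - logp_noise_data)
    # Posterior probability that each sample came from the noise distribution
    p_noise = sigmoid(-(s_noise - np.log(k) - logp_noise_samples))
    return -(np.log(p_data) + np.sum(np.log(p_noise)))

# Toy example: uniform noise over a 10-word vocabulary, k = 2 noise samples.
loss = nce_loss(2.0, np.array([-1.0, 0.5]), -np.log(10.0),
                np.full(2, -np.log(10.0)), k=2)
```

A well-trained model drives this loss towards zero by scoring observed words far above noise words.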
A Stochastic Memoizer for Sequence Data
Abstract

Cited by 25 (7 self)
We propose an unbounded-depth, hierarchical, Bayesian nonparametric model for discrete sequence data. This model can be estimated from a single training sequence, yet shares statistical strength between subsequent symbol predictive distributions in such a way that predictive performance generalizes well. The model builds on a specific parameterization of an unbounded-depth hierarchical Pitman-Yor process. We introduce analytic marginalization steps (using coagulation operators) to reduce this model to one that can be represented in time and space linear in the length of the training sequence. We show how to perform inference in such a model without truncation approximation and introduce fragmentation operators necessary to do predictive inference. We demonstrate the sequence memoizer by using it as a language model, achieving state-of-the-art results.
Dependency-based word embeddings
 In ACL, 2014
Abstract

Cited by 22 (0 self)
While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts, and show that they produce markedly different embeddings. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.
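A toy sketch of the dependency-based contexts this abstract describes (hand-made parse with simplified arc labelling; the paper's preprocessing, such as collapsing prepositions, is omitted):

```python
# Each token: (index, word, head_index, relation); head_index -1 marks the root.
def dependency_contexts(sentence):
    """Yield (word, context) pairs from dependency arcs instead of a linear
    window: a word's contexts are its head and its modifiers, each tagged
    with the arc label (the inverse direction gets a '-1' suffix)."""
    pairs = []
    for _, word, head, rel in sentence:
        if head >= 0:
            head_word = sentence[head][1]
            pairs.append((word, f"{head_word}/{rel}"))      # modifier -> head
            pairs.append((head_word, f"{word}/{rel}-1"))    # head -> modifier
    return pairs

sent = [
    (0, "scientist", 1, "nsubj"),
    (1, "discovers", -1, "root"),
    (2, "star", 1, "dobj"),
]
pairs = dependency_contexts(sent)
```

These (word, context) pairs are then fed to skip-gram with negative sampling in place of linear-window pairs.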
Multilingual Models for Compositional Distributional Semantics
 In Proceedings of ACL, 2014
Abstract

Cited by 21 (1 self)
We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings. Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences. The models do not rely on word alignments or any syntactic information and are successfully applied to a number of diverse languages. We extend our approach to learn semantic representations at the document level, too. We evaluate these models on two cross-lingual document classification tasks, outperforming the prior state of the art. Through qualitative analysis and the study of pivoting effects we demonstrate that our representations are semantically plausible and can capture semantic relationships across languages without parallel data.
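The "strongly align ... while maintaining sufficient distance" objective is a margin-based loss; a minimal sketch (squared Euclidean distance and a unit margin are assumptions, not the paper's exact settings):

```python
import numpy as np

def margin_loss(src, tgt, noise, margin=1.0):
    """Hinge loss that pulls a sentence embedding towards its translation
    and pushes it at least `margin` further from a sampled non-translation."""
    d_pos = np.sum((src - tgt) ** 2)     # distance to the aligned sentence
    d_neg = np.sum((src - noise) ** 2)   # distance to a sampled noise sentence
    return max(0.0, margin + d_pos - d_neg)
```

During training, the gradient of this loss is back-propagated into the sentence (and hence word) embeddings of both languages.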
Conditional Probability Tree Estimation Analysis and Algorithms
Abstract

Cited by 20 (3 self)
We consider the problem of estimating the conditional probability of a label in time O(log n), where n is the number of possible labels. We analyze a natural reduction of this problem to a set of binary regression problems organized in a tree structure, proving a regret bound that scales with the depth of the tree. Motivated by this analysis, we propose the first online algorithm which provably constructs a logarithmic-depth tree on the set of labels to solve this problem. We test the algorithm empirically, showing that it works successfully on a dataset with roughly 10^6 labels.
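The tree reduction can be sketched as follows: each internal node holds a binary regressor, and a label's conditional probability is the product of branch probabilities on its root-to-leaf path (fixed branch probabilities stand in for trained regressors here):

```python
def leaf_probability(label, n_labels, p_right):
    """P(label | x) in a complete binary tree over n_labels leaves,
    visiting only O(log n) nodes. p_right maps an internal node id
    (heap-style indexing: root = 1) to P(go right | x); in the paper this
    is a learned binary regressor, here a fixed stand-in."""
    node, prob = 1, 1.0
    depth = n_labels.bit_length() - 1      # assumes n_labels is a power of two
    for d in range(depth - 1, -1, -1):
        go_right = (label >> d) & 1        # follow the label's bit path
        p = p_right[node]
        prob *= p if go_right else (1.0 - p)
        node = 2 * node + go_right
    return prob
```

Because each leaf's probability multiplies one branch per level, the probabilities over all n labels sum to one by construction.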
A neural autoregressive topic model
 In Advances in Neural Information Processing Systems 25, 2012
Abstract

Cited by 18 (6 self)
We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm.
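The hierarchical distribution over tree paths that replaces the flat softmax can be sketched like this (the word codes and node vectors are made up; in the paper the conditioning vector h is computed from the previously observed words of the document):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_word_prob(h, code, node_vectors):
    """P(word | h) as a product of logistic decisions along the word's
    root-to-leaf path, costing O(log V) instead of a softmax over V words.
    code: list of (internal_node_id, branch_bit) pairs for the word."""
    prob = 1.0
    for node_id, bit in code:
        p_right = sigmoid(node_vectors[node_id] @ h)
        prob *= p_right if bit else (1.0 - p_right)
    return prob

# A depth-2 tree over four words: node 0 is the root, nodes 1 and 2 its children.
codes = {"a": [(0, 0), (1, 0)], "b": [(0, 0), (1, 1)],
         "c": [(0, 1), (2, 0)], "d": [(0, 1), (2, 1)]}
```

Since every internal node splits its probability mass between its two children, the leaf probabilities always sum to one, with no normalisation over the vocabulary.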
Exploiting similarities among languages for machine translation
 2013
Abstract

Cited by 17 (1 self)
Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% …
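The linear-mapping step this abstract describes amounts to a least-squares fit between the two embedding spaces; a toy sketch with synthetic, noise-free "embeddings" (real use would take word2vec-style vectors for a seed dictionary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows are aligned: X[i] is the source-language vector of a dictionary
# entry and Y[i] the vector of its known translation.
X = rng.normal(size=(50, 4))       # source-language embeddings
W_true = rng.normal(size=(4, 4))   # ground-truth mapping (toy setup only)
Y = X @ W_true                     # target-language embeddings

# Learn the translation matrix W by minimising ||XW - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# A new source word is then translated by mapping its vector with W and
# taking the nearest neighbour among target-language word vectors.
```

With real embeddings the fit is not exact, but the learned W still ranks the correct translation near the top of the neighbour list.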
Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation
 In Proceedings of NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT
Abstract

Cited by 16 (1 self)
Language models play an important role in large-vocabulary speech recognition and statistical machine translation systems. For several decades, the dominant approach has been back-off language models. Some years ago, there was a clear tendency to build huge language models trained on hundreds of billions of words. Lately, this tendency has changed and recent works concentrate on data selection. Continuous-space methods are a very competitive approach, but they have a high computational complexity and are not yet in widespread use. This paper presents an experimental comparison of all these approaches on a large statistical machine translation task. We also describe an open-source implementation to train and use continuous space language models (CSLM) for such large tasks. We describe an efficient implementation of the CSLM using graphical processing units from Nvidia. By these means, we are able to train a CSLM on more than 500 million words in 20 hours. This CSLM provides an improvement of up to 1.8 BLEU points with respect to the best back-off language model that we were able to build.
Deep neural network language models
 In Proceedings of NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, 2012
Abstract

Cited by 16 (1 self)
In recent years, neural network language models (NNLMs) have shown improvements in both perplexity and word error rate (WER) compared to conventional n-gram language models. Most NNLMs are trained with one hidden layer. Deep neural networks (DNNs) with more hidden layers have been shown to capture higher-level discriminative information about input features, and thus produce better networks. Motivated by the success of DNNs in acoustic modeling, we explore deep neural network language models (DNN LMs) in this paper. Results on a Wall Street Journal (WSJ) task demonstrate that DNN LMs offer improvements over a single-hidden-layer NNLM. Furthermore, our preliminary results are competitive with a model M language model, considered to be one of the current state-of-the-art techniques for language modeling.