Results 11-20 of 113
Dependency-based word embeddings
In ACL, 2014
Abstract

Cited by 19 (0 self)
While continuous word embeddings are gaining popularity, current models are based solely on linear contexts. In this work, we generalize the skip-gram model with negative sampling introduced by Mikolov et al. to include arbitrary contexts. In particular, we perform experiments with dependency-based contexts, and show that they produce markedly different embeddings. The dependency-based embeddings are less topical and exhibit more functional similarity than the original skip-gram embeddings.
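The generalization the abstract describes can be illustrated with a short sketch: each dependency arc (head, relation, modifier) yields a (word, context) training pair in both directions, with the inverse direction marked. This is an illustrative sketch under assumed conventions (the triple format and the `I` suffix for inverse relations are assumptions), not the authors' code:

```python
def dependency_contexts(triples):
    """Turn dependency triples (head, relation, modifier) into
    (word, context) training pairs: the modifier becomes a context
    of the head labelled with the relation, and the head becomes a
    context of the modifier labelled with the inverse relation."""
    pairs = []
    for head, rel, mod in triples:
        pairs.append((head, f"{rel}_{mod}"))       # head sees modifier
        pairs.append((mod, f"{rel}I_{head}"))      # modifier sees head (inverse)
    return pairs
```

Pairs produced this way would then be fed to a standard skip-gram trainer with negative sampling in place of linear-window pairs.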
Multilingual Models for Compositional Distributional Semantics
In Proceedings of ACL, 2014
Abstract

Cited by 18 (1 self)
We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings. Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences. The models do not rely on word alignments or any syntactic information and are successfully applied to a number of diverse languages. We extend our approach to learn semantic representations at the document level, too. We evaluate these models on two cross-lingual document classification tasks, outperforming the prior state of the art. Through qualitative analysis and the study of pivoting effects we demonstrate that our representations are semantically plausible and can capture semantic relationships across languages without parallel data.
Conditional Probability Tree Estimation Analysis and Algorithms
Abstract

Cited by 17 (2 self)
We consider the problem of estimating the conditional probability of a label in time O(log n), where n is the number of possible labels. We analyze a natural reduction of this problem to a set of binary regression problems organized in a tree structure, proving a regret bound that scales with the depth of the tree. Motivated by this analysis, we propose the first online algorithm which provably constructs a logarithmic depth tree on the set of labels to solve this problem. We test the algorithm empirically, showing that it works successfully on a dataset with roughly 10^6 labels.
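The O(log n) claim comes from factoring a label's probability into a product of binary decisions along its root-to-leaf path. A minimal sketch, assuming labels are addressed by bit strings and each internal node stores a learned probability of branching right (this data layout is hypothetical, chosen only to make the scaling concrete):

```python
def label_probability(leaf_code, node_probs):
    """Probability of a label as a product of binary decisions along
    its root-to-leaf path. leaf_code is the label's bit string
    (left='0', right='1'); node_probs[prefix] is the probability of
    going right at the internal node reached by following 'prefix'.
    Cost is O(depth), i.e. O(log n) for a balanced tree over n labels."""
    p = 1.0
    for i, bit in enumerate(leaf_code):
        p_right = node_probs[leaf_code[:i]]
        p *= p_right if bit == "1" else (1.0 - p_right)
    return p
```

Because each node's two branch probabilities sum to one, the leaf probabilities form a valid distribution over all labels without ever enumerating them.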
Deep neural network language models
In Proceedings of NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT, 2012
Abstract

Cited by 15 (1 self)
In recent years, neural network language models (NNLMs) have shown improvements in both perplexity and word error rate (WER) over conventional n-gram language models. Most NNLMs are trained with one hidden layer. Deep neural networks (DNNs) with more hidden layers have been shown to capture higher-level discriminative information about input features, and thus produce better networks. Motivated by the success of DNNs in acoustic modeling, we explore deep neural network language models (DNN LMs) in this paper. Results on a Wall Street Journal (WSJ) task demonstrate that DNN LMs offer improvements over a single hidden layer NNLM. Furthermore, our preliminary results are competitive with a Model M language model, considered to be one of the current state-of-the-art techniques for language modeling.
A neural autoregressive topic model
In Advances in Neural Information Processing Systems 25, 2012
Abstract

Cited by 15 (5 self)
We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm.
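The "probability of a new word given the previously observed words" amounts to the standard autoregressive factorization of a document's likelihood. A minimal sketch, assuming any callable conditional model (the paper realises it with a neural network and a binary word tree, neither of which is reproduced here):

```python
import math

def doc_log_likelihood(words, cond_prob):
    """Autoregressive factorization:
    log p(w_1..w_T) = sum_i log p(w_i | w_1..w_{i-1}).
    cond_prob is any callable (word, history) -> probability;
    a hierarchical softmax would make each call O(log |vocab|)."""
    ll = 0.0
    for i, w in enumerate(words):
        ll += math.log(cond_prob(w, words[:i]))
    return ll
```

Swapping a flat softmax for tree-structured conditionals changes only the cost of each `cond_prob` call, not this factorization.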
Pruned or continuous space language models on a GPU for statistical machine translation
In Proceedings of NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT
Abstract

Cited by 14 (1 self)
Language models play an important role in large-vocabulary speech recognition and statistical machine translation systems. For several decades, the dominant approach has been back-off language models. Some years ago, there was a clear tendency to build huge language models trained on hundreds of billions of words; lately, this tendency has changed and recent work concentrates on data selection. Continuous space methods are a very competitive approach, but they have a high computational complexity and are not yet in widespread use. This paper presents an experimental comparison of all these approaches on a large statistical machine translation task. We also describe an open-source implementation to train and use continuous space language models (CSLM) for such large tasks. We describe an efficient implementation of the CSLM using graphical processing units from Nvidia. By these means, we are able to train a CSLM on more than 500 million words in 20 hours. This CSLM provides an improvement of up to 1.8 BLEU points with respect to the best back-off language model that we were able to build.
Training continuous space language models: some practical issues
Abstract

Cited by 13 (0 self)
Using multilayer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging and does not scale easily to the huge corpora that are available nowadays. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms of perplexity and on a large-scale translation task.
Training Restricted Boltzmann Machines on Word Observations
Abstract

Cited by 11 (2 self)
The restricted Boltzmann machine (RBM) is a flexible model for complex data. However, using RBMs for high-dimensional multinomial observations poses significant computational difficulties. In natural language processing applications, words are naturally modeled by K-ary discrete distributions, where K is determined by the vocabulary size and can easily be in the hundreds of thousands. The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue with a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K. We demonstrate the success of our approach by training RBMs on hundreds of millions of word n-grams using larger vocabularies than previously feasible with RBMs and by using the learned features to improve performance on chunking and sentiment classification tasks, achieving state-of-the-art results on the latter.
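The key computational idea, an MCMC update on a K-way softmax unit whose per-step cost does not depend on K, can be sketched with the simplest such operator, a Metropolis-Hastings step with a uniform proposal. This is an illustrative sketch only; the paper develops more refined operators, and the uniform proposal here is an assumption:

```python
import math
import random

def mh_softmax_step(current, scores, rng):
    """One Metropolis-Hastings step whose stationary distribution is
    softmax(scores), using a uniform proposal over the K outcomes.
    The acceptance test compares only two unnormalised scores, so a
    step never computes the full K-way normaliser."""
    proposal = rng.randrange(len(scores))
    accept = min(1.0, math.exp(scores[proposal] - scores[current]))
    return proposal if rng.random() < accept else current
```

Running such a chain for a few steps per visible unit replaces the O(K) exact softmax sample inside block Gibbs updates.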
Re-embedding Words
Abstract

Cited by 11 (0 self)
We present a fast method for repurposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding and some labeled data, and produces an embedding in the same space but with better predictive performance in the supervised task. We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.