Results 1  10
of
277
Efficient estimation of word representations in vector space
, 2013
"... We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types ..."
Abstract

Cited by 311 (6 self)
 Add to MetaCart
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide stateoftheart performance on our test set for measuring syntactic and semantic word similarities.
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
"... Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of compo ..."
Abstract

Cited by 166 (7 self)
 Add to MetaCart
(Show Context)
Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single sentence positive/negative classification from 80 % up to 85.4%. The accuracy of predicting finegrained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7 % over bag of features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases. 1
Dual averaging methods for regularized stochastic learning and online optimization
 In Advances in Neural Information Processing Systems 23
, 2009
"... We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1norm for promoting sparsity. We develop extensions of Nes ..."
Abstract

Cited by 131 (7 self)
 Add to MetaCart
We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.
GloVe: Global Vectors for Word Representation
"... Recent methods for learning vector space representations of words have succeeded in capturing finegrained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regular ..."
Abstract

Cited by 99 (9 self)
 Add to MetaCart
(Show Context)
Recent methods for learning vector space representations of words have succeeded in capturing finegrained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a wordword cooccurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75 % on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition. 1
Parsing with Compositional Vector Grammars
"... Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only par ..."
Abstract

Cited by 96 (4 self)
 Add to MetaCart
(Show Context)
Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic nor semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntacticosemantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8 % to obtain an F1 score of 90.4%. It is fast to train and implemented approximately as an efficient reranker it is about 20 % faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information such as PP attachments. 1
Large Scale Distributed Deep Networks
"... Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a ..."
Abstract

Cited by 92 (11 self)
 Add to MetaCart
(Show Context)
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for largescale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of LBFGS. Downpour SGD and Sandblaster LBFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves stateoftheart performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradientbased machine learning algorithm. 1
Parallelized stochastic gradient descent
 Advances in Neural Information Processing Systems 23
, 2010
"... With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms ..."
Abstract

Cited by 85 (3 self)
 Add to MetaCart
With the increase in available data parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8]. 1
Adaptive Regularization of Weight Vectors
 Advances in Neural Information Processing Systems 22
, 2009
"... We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle nonseparable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform ..."
Abstract

Cited by 65 (14 self)
 Add to MetaCart
We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle nonseparable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidenceweighted online learning techniques and show empirically that AROW achieves stateoftheart performance and notable robustness in the case of nonseparable data. 1