Results 1 - 10
of
11
Three New Graphical Models for Statistical Language Modelling
"... The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next ..."
Abstract
-
Cited by 21 (3 self)
- Add to MetaCart
The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence given several preceding words by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models. 1.
Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling
"... Supervised sequence-labeling systems in natural language processing often suffer from data sparsity because they use word types as features in their prediction tasks. Consequently, they have difficulty estimating parameters for types which appear in the test set, but seldom (or never) appear in the ..."
Abstract
-
Cited by 17 (5 self)
- Add to MetaCart
Supervised sequence-labeling systems in natural language processing often suffer from data sparsity because they use word types as features in their prediction tasks. Consequently, they have difficulty estimating parameters for types which appear in the test set, but seldom (or never) appear in the training set. We demonstrate that distributional representations of word types, trained on unannotated text, can be used to improve performance on rare words. We incorporate aspects of these representations into the feature space of our sequence-labeling systems. In an experiment on a standard chunking dataset, our best technique improves a chunker from 0.76 F1 to 0.86 F1 on chunks beginning with rare words. On the same dataset, it improves our part-of-speech tagger from 74 % to 80 % accuracy on rare words. Furthermore, our system improves significantly over a baseline system when applied to text from a different domain, and it reduces the sample complexity of sequence labeling. 1
Exploring Representation-Learning Approaches to Domain Adaptation
"... Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Sequence labeling systems like partof-speech taggers are typically trained on newswire text, and ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Sequence labeling systems like partof-speech taggers are typically trained on newswire text, and in tests their error rate on, for example, biomedical data can triple, or worse. We investigate techniques for building open-domain sequence labeling systems that approach the ideal of a system whose accuracy is high and constant across domains. In particular, we investigate unsupervised techniques for representation learning that provide new features which are stable across domains, in that they are predictive in both the training and out-of-domain test data. In experiments, our novel techniques reduce error by as much as 29 % relative to the previous state of the art on out-of-domain text. 1
Factored neural language models
- In HLT-NAACL
, 2006
"... We present a new type of neural probabilistic language model that learns a mapping from both words and explicit word features into a continuous space that is then used for word prediction. Additionally, we investigate several ways of deriving continuous word representations for unknown words from th ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
We present a new type of neural probabilistic language model that learns a mapping from both words and explicit word features into a continuous space that is then used for word prediction. Additionally, we investigate several ways of deriving continuous word representations for unknown words from those of known words. The resulting model significantly reduces perplexity on sparse-data tasks when compared to standard backoff models, standard neural language models, and factored language models. 1
Efficient subsampling for training complex language models
- in Proceedings of EMNLP
, 2011
"... We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of bi ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1 % ∼ 5 % of the standard complexity with no loss in performance. 1
Sentiment Classification Based on Supervised Latent n-gram Analysis
"... In this paper, we propose an efficient embedding for modeling higherorder (n-gram) phrases that projects the n-grams to low-dimensional latent semantic space, where a classification function can be defined. We utilize a deep neural network to build a unified discriminative framework that allows for ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
In this paper, we propose an efficient embedding for modeling higherorder (n-gram) phrases that projects the n-grams to low-dimensional latent semantic space, where a classification function can be defined. We utilize a deep neural network to build a unified discriminative framework that allows for estimating the parameters of the latent space as well as the classification function with a bias for the target classification task at hand. We apply the framework to large-scale sentimental classification task. We present comparative evaluation of the proposed method on two (large) benchmark data sets for online product reviews. The proposed method achieves superior performance in comparison to the state of the art.
AProbabilisticModel for SemanticWord Vectors
"... Vector representations of words capture relationships in words ’ functions and meanings. Many existing techniques for inducing such representations from data use a pipeline of hand-coded processing techniques. Neural language models offerprincipled techniques tolearnwordvectors usingaprobabilisticmo ..."
Abstract
- Add to MetaCart
Vector representations of words capture relationships in words ’ functions and meanings. Many existing techniques for inducing such representations from data use a pipeline of hand-coded processing techniques. Neural language models offerprincipled techniques tolearnwordvectors usingaprobabilisticmodeling approach. However, learning word vectors via language modeling produces representationswithasyntacticfocus,wherewordsimilarityisbaseduponhowwords are used in sentences. In this work we wish to learn word representations to encodewordmeaning–semantics. Weintroduceamodelwhichlearnssemantically focused word vectors using a probabilistic model of documents. We evaluate the model’s word vectors intwotasks ofsentiment analysis. 1
Time Series Modeling with Hidden Variables and Gradient-Based Algorithms
, 2011
"... who laid the foundations for this research iii Acknowledgements These past five and a half years of doctoral studies at New York University have constituted a personally transformative experience (and I am claiming this independently of the jazz clubs, concert halls and vibrant community populating ..."
Abstract
- Add to MetaCart
who laid the foundations for this research iii Acknowledgements These past five and a half years of doctoral studies at New York University have constituted a personally transformative experience (and I am claiming this independently of the jazz clubs, concert halls and vibrant community populating the greater Greenwich Village area). During these years, I have benefited from countless contributions that are impossible to acknowledge in a few lines. I will limit myself to mentioning a few individuals who directly enabled this work, hoping to eventually have the opportunity to contribute to someone else’s development in return. I would like to immensely thank my adviser, Prof. Yann LeCun, for providing me with resources, guidance, and freedom to pursue my research. Merci beaucoup pour avoir cru en moi, Yann. Yann LeCun’s lab is an intellectual hub with connections far beyond the field of Machine Learning, and therefore a very exciting research environment.
Strategies for Training Large Scale Neural Network Language Models
"... Abstract—We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance is observed when the training data are sorted by their relevance. We introduce hash-based implementation of a maximum entropy model, ..."
Abstract
- Add to MetaCart
Abstract—We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance is observed when the training data are sorted by their relevance. We introduce hash-based implementation of a maximum entropy model, that can be trained as a part of the neural network model. This leads to significant reduction of computational complexity. We achieved around 10 % relative reduction of word error rate on English Broadcast News speech recognition task, against large 4-gram model trained on 400M tokens. I.

