Results 1–10 of 15
Generating Text with Recurrent Neural Networks
Abstract

Cited by 67 (3 self)
Recurrent Neural Networks (RNNs) are very powerful sequence models that do not enjoy widespread use because it is extremely difficult to train them properly. Fortunately, recent advances in Hessian-free optimization have been able to overcome the difficulties associated with training RNNs, making it possible to apply them successfully to challenging sequence problems. In this paper we demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. The standard RNN architecture, while effective, is not ideally suited for such tasks, so we introduce a new RNN variant that uses multiplicative (or “gated”) connections which allow the current input character to determine the transition matrix from one hidden state vector to the next. After training the multiplicative RNN with the HF optimizer for five days on 8 high-end Graphics Processing Units, we were able to surpass the performance of the best previous single method for character-level language modeling – a hierarchical non-parametric sequence model. To our knowledge this represents the largest recurrent neural network application to date.
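The multiplicative connections described above can be sketched in a few lines. This is a minimal illustration of the factored mRNN hidden-state update, not the paper's implementation; all dimensions and weight names here are hypothetical.

```python
import numpy as np

# Illustrative sizes: vocabulary, hidden units, factors (all hypothetical).
rng = np.random.default_rng(0)
V, H, F = 27, 64, 64
W_fx = rng.normal(0, 0.1, (F, V))   # input -> factors
W_fh = rng.normal(0, 0.1, (F, H))   # hidden -> factors
W_hf = rng.normal(0, 0.1, (H, F))   # factors -> hidden
W_hx = rng.normal(0, 0.1, (H, V))   # input -> hidden

def mrnn_step(x, h_prev):
    # The elementwise product makes the effective hidden-to-hidden
    # transition matrix, W_hf @ diag(W_fx @ x) @ W_fh, depend on the
    # current input character x.
    f = (W_fx @ x) * (W_fh @ h_prev)
    return np.tanh(W_hf @ f + W_hx @ x)

x = np.zeros(V); x[3] = 1.0          # one-hot character
h = mrnn_step(x, np.zeros(H))
```

The factored form keeps the per-character transition matrix implicit, so the parameter count grows with V·F rather than V·H².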
Training Recurrent Neural Networks
, 2013
Abstract

Cited by 13 (0 self)
Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, and applications of RNNs to challenging problems. We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being less difficult to train. Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks that have extreme long-range temporal dependencies, which were previously considered to be impossibly hard. We then apply HF to character-level language modelling and get excellent results. We also apply HF to optimal control and obtain RNN control laws that can successfully operate under conditions of delayed feedback and unknown disturbances. Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts widespread beliefs about the inability of first-order methods to do so, and suggests that previous attempts at training RNNs failed partly due to flaws in the random initialization.
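One common family of careful recurrent initializations rescales the recurrent weight matrix so its spectral radius sits near 1, keeping hidden activations from exploding or vanishing early in training. This sketch illustrates that idea only; the exact scheme in the thesis may differ, and the target radius below is an assumption.

```python
import numpy as np

def init_recurrent(n, spectral_radius=1.1, seed=0):
    # Draw a random matrix, then rescale it so its largest eigenvalue
    # modulus equals the target spectral radius (1.1 is illustrative).
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0, (n, n))
    radius = max(abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / radius)

W = init_recurrent(100)
```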
Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process
Abstract

Cited by 9 (4 self)
We propose a novel dependent hierarchical Pitman-Yor process model for discrete data. An incremental Monte Carlo inference procedure for this model is developed. We show that inference in this model can be performed in constant space and linear time. The model is demonstrated in a discrete sequence prediction task where it is shown to achieve state-of-the-art sequence prediction performance while using significantly less memory.
Context Tree Switching
Abstract

Cited by 8 (4 self)
This paper describes the Context Tree Switching technique, a modification of Context Tree Weighting for the prediction of binary, stationary, n-Markov sources. By modifying Context Tree Weighting’s recursive weighting scheme, it is possible to mix over a strictly larger class of models without increasing the asymptotic time or space complexity of the original algorithm. We prove that this generalization preserves the desirable theoretical properties of Context Tree Weighting on stationary n-Markov sources, and show empirically that this new technique leads to consistent improvements over Context Tree Weighting as measured on the Calgary Corpus.
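The per-context base predictor that Context Tree Weighting (and hence Context Tree Switching) mixes over is the Krichevsky-Trofimov (KT) estimator. The sketch below shows just that building block, not the full tree recursion or the switching scheme.

```python
def kt_prob(counts, symbol):
    # KT estimate for a binary symbol given the counts seen so far
    # in this context: P(s) = (c_s + 1/2) / (c_0 + c_1 + 1).
    return (counts[symbol] + 0.5) / (counts[0] + counts[1] + 1.0)

counts = [3, 1]            # three 0s and one 1 observed in this context
p1 = kt_prob(counts, 1)    # (1 + 0.5) / 5 = 0.3
```

CTW then recursively mixes each node's KT estimate with the mixture of its children; CTS replaces that fixed-weight mixture with a switching prior over the same tree.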
Bayesian Variable Order Markov Models
Abstract

Cited by 6 (5 self)
We present a simple, effective generalisation of variable order Markov models to full online Bayesian estimation. The mechanism used is close to that employed in context tree weighting. The main contribution is the addition of a prior, conditioned on context, on the Markov order. The resulting construction uses a simple recursion and can be updated efficiently. This allows the model to make predictions using more complex contexts, as more data is acquired, if necessary. In addition, our model can be alternatively seen as a mixture of tree experts. Experimental results show that the predictive model exhibits consistently good performance in a variety of domains. We consider Bayesian estimation of variable order Markov models (see Begleiter et al., 2004, for an overview). Such models create a tree of partitions, where the disjoint sets of every partition correspond to different contexts. We can associate a submodel or expert with each context in order to make predictions. The main contribution of this paper is a conditional prior on the Markov order—or equivalently the context depth. This is based on a recursive construction that estimates, for each context at a certain depth k, whether it makes better predictions than the predictions of contexts at depths smaller than k. This simple model defines a mixture of variable order Markov models and its parameters can be updated in closed form in time O(D) for trees of depth D with each new observation. For unbounded length contexts, the complexity of the algorithm is O(T^2) for an input sequence of length T. Furthermore, it exhibits robust performance in a variety of tasks. Finally, the model is easily extensible to controlled processes.
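The recursion along a context path can be sketched as follows. This is an illustration of the general shape of such a depth-wise mixture, with hypothetical names; in the paper the weights are updated in closed form from each depth's predictive performance.

```python
def predict(experts, weights):
    # experts[k]: probability of the next symbol from the depth-k
    #             context's expert along the current context path.
    # weights[k]: probability of stopping at depth k rather than
    #             falling back to shallower contexts (illustrative).
    p = experts[0]
    for k in range(1, len(experts)):
        # Each depth either trusts its own expert or backs off to
        # the mixture over all shallower depths.
        p = weights[k] * experts[k] + (1 - weights[k]) * p
    return p

# Depth 1: 0.5*0.7 + 0.5*0.5 = 0.6; depth 2: 0.5*0.9 + 0.5*0.6 = 0.75
p = predict([0.5, 0.7, 0.9], [1.0, 0.5, 0.5])
```

One pass over the D contexts on the path gives the O(D) per-observation cost the abstract mentions.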
Neural Probabilistic Language Model for System Combination
Abstract

Cited by 3 (3 self)
This paper gives the system description of the neural probabilistic language modeling (NPLM) team of Dublin City University for our participation in the system combination task in the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT (ML4HMT-12). We used the information obtained by NPLM as meta information to the system combination module. For the Spanish-English data, our paraphrasing approach achieved 25.81 BLEU points, which lost 0.19 BLEU points absolute compared to the standard confusion network-based system combination. We note that our current usage of NPLM is very limited due to the difficulty in combining NPLM and system combination.
Skip Context Tree Switching
Abstract

Cited by 3 (1 self)
Context Tree Weighting is a powerful probabilistic sequence prediction technique that efficiently performs Bayesian model averaging over the class of all prediction suffix trees of bounded depth. In this paper we show how to generalize this technique to the class of K-skip prediction suffix trees. Contrary to regular prediction suffix trees, K-skip prediction suffix trees are permitted to ignore up to K contiguous portions of the context. This allows for significant improvements in predictive accuracy when irrelevant variables are present, a case which often occurs within record-aligned data and images. We provide a regret-based analysis of our approach, and empirically evaluate it on the Calgary corpus and a set of Atari 2600 screen prediction tasks.
Improvements to the Sequence Memoizer
Abstract

Cited by 2 (2 self)
The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the “mysterious” coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements.
A Machine Learning Perspective on Predictive Coding with PAQ8 and New Applications
, 2011
Abstract

Cited by 2 (0 self)
The goal of this thesis is to describe a state-of-the-art compression method called PAQ8 from the perspective of machine learning. We show both how PAQ8 makes use of several simple, well known machine learning models and algorithms, and how it can be improved by exchanging these components for more sophisticated models and algorithms. We also present a broad range of new applications of PAQ8 to machine learning tasks including language modeling and adaptive text prediction, adaptive game playing, classification, and lossy compression using features from the field of deep learning.
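One of the simple machine learning components at PAQ8's core is logistic mixing: the predictions of many models are combined in the logit domain, and the mixer's weights are adapted online by gradient descent on coding loss. The sketch below illustrates that idea for a single bit; the learning rate and weight values are illustrative, not PAQ8's actual settings.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix_and_update(probs, weights, bit, lr=0.02):
    # Combine model probabilities in the logit domain, then nudge
    # each weight in proportion to the prediction error and that
    # model's stretched prediction (online gradient step).
    x = [logit(p) for p in probs]
    p = squash(sum(w * xi for w, xi in zip(weights, x)))
    err = bit - p
    new_weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return p, new_weights

# Two models both lean toward bit=1; the mixed prediction does too,
# and observing a 1 strengthens both weights.
p, w = mix_and_update([0.6, 0.8], [0.3, 0.3], bit=1)
```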
Deplump for streaming data
 In Data Compression Conference
, 2011
Abstract

Cited by 1 (1 self)
We present a general-purpose, lossless compressor for streaming data. This compressor is based on the deplump probabilistic compressor for batch data. Approximations to the inference procedure used in the probabilistic model underpinning deplump are introduced that yield the computational asymptotics necessary for stream compression. We demonstrate the performance of this streaming deplump variant relative to the batch compressor on a benchmark corpus and find that it performs equivalently well despite these approximations. We also explore the performance of the streaming variant on corpora that are too large to be compressed by batch deplump and demonstrate excellent compression performance.