Results 1  10
of
60
Using Fast Weights to Improve Persistent Contrastive Divergence
"... The most commonly used learning algorithm for restricted Boltzmann machines is contrastive divergence which starts a Markov chain at a data point and runs the chain for only a few iterations to get a cheap, low variance estimate of the sufficient statistics under the model. Tieleman (2008) showed th ..."
Abstract

Cited by 37 (13 self)
 Add to MetaCart
(Show Context)
The most commonly used learning algorithm for restricted Boltzmann machines is contrastive divergence which starts a Markov chain at a data point and runs the chain for only a few iterations to get a cheap, low variance estimate of the sufficient statistics under the model. Tieleman (2008) showed that better learning can be achieved by estimating the model’s statistics using a small set of persistent ”fantasy particles ” that are not reinitialized to data points after each weight update. With sufficiently small weight updates, the fantasy particles represent the equilibrium distribution accurately but to explain why the method works with much larger weight updates it is necessary to consider the interaction between the weight updates and the Markov chain. We show that the weight updates force the Markov chain to mix fast, and using this insight we develop an even faster mixing chain that uses an auxiliary set of ”fast weights ” to implement a temporary overlay on the energy landscape. The fast weights learn rapidly but also decay rapidly and do not contribute to the normal energy landscape that defines the model. 1.
3d object recognition with deep belief nets
 Advances in Neural Information Processing Systems 22
, 2009
"... We introduce a new type of toplevel model for Deep Belief Nets and evaluate it on a 3D object recognition task. The toplevel model is a thirdorder Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB d ..."
Abstract

Cited by 36 (9 self)
 Add to MetaCart
(Show Context)
We introduce a new type of toplevel model for Deep Belief Nets and evaluate it on a 3D object recognition task. The toplevel model is a thirdorder Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalizeduniform version), which contains stereopair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5 % error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has builtin knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semisupervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2 % error. 1
TaskDriven Dictionary Learning
"... Abstract—Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that ..."
Abstract

Cited by 27 (1 self)
 Add to MetaCart
(Show Context)
Abstract—Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a largescale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in largescale settings, and is well suited to supervised and semisupervised classification, as well as regression tasks for data that admit sparse representations. Index Terms—Basis pursuit, Lasso, dictionary learning, matrix factorization, semisupervised learning, compressed sensing. Ç 1
Stacks of Convolutional Restricted Boltzmann Machines for ShiftInvariant Feature Learning
"... In this paper we present a method for learning classspecific features for recognition. Recently a greedy layerwise procedure was proposed to initialize weights of deep belief networks, by viewing each layer as a separate Restricted Boltzmann Machine (RBM). We develop the Convolutional RBM (CRBM), a ..."
Abstract

Cited by 20 (0 self)
 Add to MetaCart
(Show Context)
In this paper we present a method for learning classspecific features for recognition. Recently a greedy layerwise procedure was proposed to initialize weights of deep belief networks, by viewing each layer as a separate Restricted Boltzmann Machine (RBM). We develop the Convolutional RBM (CRBM), a variant of the RBM model in which weights are shared to respect the spatial structure of images. This framework learns a set of features that can generate the images of a specific object class. Our feature extraction model is a four layer hierarchy of alternating filtering and maximum subsampling. We learn feature parameters of the first and third layers viewing them as separate CRBMs. The outputs of our feature extraction hierarchy are then fed as input to a discriminative classifier. It is experimentally demonstrated that the extracted features are effective for object detection, using them to obtain performance comparable to the stateoftheart on handwritten digit recognition and pedestrian detection. 1.
On Optimization Methods for Deep Learning
"... The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. ..."
Abstract

Cited by 19 (5 self)
 Add to MetaCart
(Show Context)
The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated offtheshelf optimization methods such as Limited memory BFGS (LBFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between LBFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of LBFGS with locally connected networks and convolutional neural networks. Using LBFGS, our convolutional network model achieves 0.69 % on the standard MNIST dataset. This is a stateoftheart result on MNIST among algorithms that do not use distortions or pretraining. 1.
Representation Learning: A Review and New Perspectives
, 2012
"... The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to ..."
Abstract

Cited by 15 (0 self)
 Add to MetaCart
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representationlearning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and joint training of deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep architectures. This motivates longerterm unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
On Deep Generative Models with Applications to Recognition
"... The most popular way to use probabilistic models in vision is first to extract some descriptors of small image patches or object parts using wellengineered features, and then to use statistical learning tools to model the dependencies among these features and eventual labels. Learning probabilistic ..."
Abstract

Cited by 13 (2 self)
 Add to MetaCart
(Show Context)
The most popular way to use probabilistic models in vision is first to extract some descriptors of small image patches or object parts using wellengineered features, and then to use statistical learning tools to model the dependencies among these features and eventual labels. Learning probabilistic models directly on the raw pixel values has proved to be much more difficult and is typically only used for regularizing discriminative methods. In this work, we use one of the best, pixellevel, generative models of natural images – a gated MRF – as the lowest level of a deep belief network (DBN) that has several hidden layers. We show that the resulting DBN is very good at coping with occlusion when predicting expression categories from face images, and it can produce features that perform comparably to SIFT descriptors for discriminating different types of scene. The generative ability of the model also makes it easy to see what information is captured and what is lost at each level of representation. 1. Introduction and Previous
Learning to combine foveal glimpses with a thirdorder Boltzmann machine
"... We describe a model based on a Boltzmann machine with thirdorder connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that only has enough high resolution pixels to cover a small area of the image, so it must decide on a sequence of ..."
Abstract

Cited by 12 (0 self)
 Add to MetaCart
(Show Context)
We describe a model based on a Boltzmann machine with thirdorder connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that only has enough high resolution pixels to cover a small area of the image, so it must decide on a sequence of fixations and it must combine the “glimpse ” at each fixation with the location of the fixation before integrating the information with information from other glimpses of the same object. We evaluate this model on a synthetic dataset and two image classification datasets, showing that it can perform at least as well as a model trained on whole images. 1
Conditional Restricted Boltzmann Machines for Structured Output Prediction
"... Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic models that have recently been applied to a wide range of problems, including collaborative filtering, classification, and modeling motion capture data. While much progress has been made in training nonconditional RBMs, these ..."
Abstract

Cited by 11 (2 self)
 Add to MetaCart
(Show Context)
Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic models that have recently been applied to a wide range of problems, including collaborative filtering, classification, and modeling motion capture data. While much progress has been made in training nonconditional RBMs, these algorithms are not applicable to conditional models and there has been almost no work on training and generating predictions from conditional RBMs for structured output problems. We first argue that standard Contrastive Divergencebased learning may not be suitable for training CRBMs. We then identify two distinct types of structured output prediction problems and propose an improved learning algorithm for each. The first problem type is one where the output space has arbitrary structure but the set of likely output configurations is relatively small, such as in multilabel classification. The second problem is one where the output space is arbitrarily structured but where the output space variability is much greater, such as in image denoising or pixel labeling. We show that the new learning algorithms can work much better than Contrastive Divergence on both types of problems. 1
The Neural Autoregressive Distribution Estimator
 In AISTATS’2011
, 2011
"... We describe a new approach for modeling the distribution of highdimensional vectors of discrete variables. This model is inspired by the restricted Boltzmann machine (RBM), which has been shown to be a powerful model of such distributions. However, an RBM typically does not provide a tractable dist ..."
Abstract

Cited by 9 (1 self)
 Add to MetaCart
(Show Context)
We describe a new approach for modeling the distribution of highdimensional vectors of discrete variables. This model is inspired by the restricted Boltzmann machine (RBM), which has been shown to be a powerful model of such distributions. However, an RBM typically does not provide a tractable distribution estimator, since evaluating the probability it assigns to some given observation requires the computation of the socalled partition function, which itself is intractable for RBMs of even moderate size. Our model circumvents this difficulty by decomposing the joint distribution of observations into tractable conditional distributions and modeling each conditional using a nonlinear function similar to a conditional of an RBM. Our model can also be interpreted as an autoencoder wired such that its output can be used to assign valid probabilities to observations. We show that this new model outperforms other multivariate binary distribution estimators on several datasets and performs similarly to a large (but intractable) RBM. 1