Results 1  10
of
99
Representation learning: A review and new perspectives.
 of IEEE Conf. Comp. Vision Pattern Recog. (CVPR),
, 2005
"... AbstractThe success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can b ..."
Abstract

Cited by 173 (4 self)
 Add to MetaCart
(Show Context)
AbstractThe success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representationlearning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
TaskDriven Dictionary Learning
"... Abstract—Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that ..."
Abstract

Cited by 86 (3 self)
 Add to MetaCart
(Show Context)
Abstract—Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience, and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a largescale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in largescale settings, and is well suited to supervised and semisupervised classification, as well as regression tasks for data that admit sparse representations. Index Terms—Basis pursuit, Lasso, dictionary learning, matrix factorization, semisupervised learning, compressed sensing. Ç 1
Using Fast Weights to Improve Persistent Contrastive Divergence
"... The most commonly used learning algorithm for restricted Boltzmann machines is contrastive divergence which starts a Markov chain at a data point and runs the chain for only a few iterations to get a cheap, low variance estimate of the sufficient statistics under the model. Tieleman (2008) showed th ..."
Abstract

Cited by 64 (14 self)
 Add to MetaCart
(Show Context)
The most commonly used learning algorithm for restricted Boltzmann machines is contrastive divergence which starts a Markov chain at a data point and runs the chain for only a few iterations to get a cheap, low variance estimate of the sufficient statistics under the model. Tieleman (2008) showed that better learning can be achieved by estimating the model’s statistics using a small set of persistent ”fantasy particles ” that are not reinitialized to data points after each weight update. With sufficiently small weight updates, the fantasy particles represent the equilibrium distribution accurately but to explain why the method works with much larger weight updates it is necessary to consider the interaction between the weight updates and the Markov chain. We show that the weight updates force the Markov chain to mix fast, and using this insight we develop an even faster mixing chain that uses an auxiliary set of ”fast weights ” to implement a temporary overlay on the energy landscape. The fast weights learn rapidly but also decay rapidly and do not contribute to the normal energy landscape that defines the model. 1.
3d object recognition with deep belief nets
 Advances in Neural Information Processing Systems 22
, 2009
"... We introduce a new type of toplevel model for Deep Belief Nets and evaluate it on a 3D object recognition task. The toplevel model is a thirdorder Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB d ..."
Abstract

Cited by 63 (8 self)
 Add to MetaCart
(Show Context)
We introduce a new type of toplevel model for Deep Belief Nets and evaluate it on a 3D object recognition task. The toplevel model is a thirdorder Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalizeduniform version), which contains stereopair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5 % error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has builtin knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semisupervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2 % error. 1
On Optimization Methods for Deep Learning
"... The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. ..."
Abstract

Cited by 54 (5 self)
 Add to MetaCart
(Show Context)
The predominant methodology in training deep learning advocates the use of stochastic gradient descent methods (SGDs). Despite its ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated offtheshelf optimization methods such as Limited memory BFGS (LBFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the difference between LBFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of LBFGS with locally connected networks and convolutional neural networks. Using LBFGS, our convolutional network model achieves 0.69 % on the standard MNIST dataset. This is a stateoftheart result on MNIST among algorithms that do not use distortions or pretraining. 1.
The Neural Autoregressive Distribution Estimator
 In AISTATS’2011
, 2011
"... We describe a new approach for modeling the distribution of highdimensional vectors of discrete variables. This model is inspired by the restricted Boltzmann machine (RBM), which has been shown to be a powerful model of such distributions. However, an RBM typically does not provide a tractable dist ..."
Abstract

Cited by 44 (6 self)
 Add to MetaCart
(Show Context)
We describe a new approach for modeling the distribution of highdimensional vectors of discrete variables. This model is inspired by the restricted Boltzmann machine (RBM), which has been shown to be a powerful model of such distributions. However, an RBM typically does not provide a tractable distribution estimator, since evaluating the probability it assigns to some given observation requires the computation of the socalled partition function, which itself is intractable for RBMs of even moderate size. Our model circumvents this difficulty by decomposing the joint distribution of observations into tractable conditional distributions and modeling each conditional using a nonlinear function similar to a conditional of an RBM. Our model can also be interpreted as an autoencoder wired such that its output can be used to assign valid probabilities to observations. We show that this new model outperforms other multivariate binary distribution estimators on several datasets and performs similarly to a large (but intractable) RBM. 1
On Deep Generative Models with Applications to Recognition
"... The most popular way to use probabilistic models in vision is first to extract some descriptors of small image patches or object parts using wellengineered features, and then to use statistical learning tools to model the dependencies among these features and eventual labels. Learning probabilistic ..."
Abstract

Cited by 37 (2 self)
 Add to MetaCart
(Show Context)
The most popular way to use probabilistic models in vision is first to extract some descriptors of small image patches or object parts using wellengineered features, and then to use statistical learning tools to model the dependencies among these features and eventual labels. Learning probabilistic models directly on the raw pixel values has proved to be much more difficult and is typically only used for regularizing discriminative methods. In this work, we use one of the best, pixellevel, generative models of natural images – a gated MRF – as the lowest level of a deep belief network (DBN) that has several hidden layers. We show that the resulting DBN is very good at coping with occlusion when predicting expression categories from face images, and it can produce features that perform comparably to SIFT descriptors for discriminating different types of scene. The generative ability of the model also makes it easy to see what information is captured and what is lost at each level of representation. 1. Introduction and Previous
Stacks of Convolutional Restricted Boltzmann Machines for ShiftInvariant Feature Learning
"... In this paper we present a method for learning classspecific features for recognition. Recently a greedy layerwise procedure was proposed to initialize weights of deep belief networks, by viewing each layer as a separate Restricted Boltzmann Machine (RBM). We develop the Convolutional RBM (CRBM), a ..."
Abstract

Cited by 36 (1 self)
 Add to MetaCart
(Show Context)
In this paper we present a method for learning classspecific features for recognition. Recently a greedy layerwise procedure was proposed to initialize weights of deep belief networks, by viewing each layer as a separate Restricted Boltzmann Machine (RBM). We develop the Convolutional RBM (CRBM), a variant of the RBM model in which weights are shared to respect the spatial structure of images. This framework learns a set of features that can generate the images of a specific object class. Our feature extraction model is a four layer hierarchy of alternating filtering and maximum subsampling. We learn feature parameters of the first and third layers viewing them as separate CRBMs. The outputs of our feature extraction hierarchy are then fed as input to a discriminative classifier. It is experimentally demonstrated that the extracted features are effective for object detection, using them to obtain performance comparable to the stateoftheart on handwritten digit recognition and pedestrian detection. 1.
Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring
"... In this paper we address the following question: “Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small minibatch of dataitems for every sample we generate?”. An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previ ..."
Abstract

Cited by 34 (5 self)
 Add to MetaCart
(Show Context)
In this paper we address the following question: “Can we approximately sample from a Bayesian posterior distribution if we are only allowed to touch a small minibatch of dataitems for every sample we generate?”. An algorithm based on the Langevin equation with stochastic gradients (SGLD) was previously proposed to solve this, but its mixing rate was slow. By leveraging the Bayesian Central Limit Theorem, we extend the SGLD algorithm so that at high mixing rates it will sample from a normal approximation of the posterior, while for slow mixing rates it will mimic the behavior of SGLD with a preconditioner matrix. As a bonus, the proposed algorithm is reminiscent of Fisher scoring (with stochastic gradients) and as such an efficient optimizer during burnin. 1.