Results 1 - 10 of 14
Modeling deep temporal dependencies with recurrent grammar cells
- In Advances in Neural Information Processing Systems 27
, 2014
"... We propose modeling time series by representing the transformations that take a frame at time t to a frame at time t+1. To this end we show how a bi-linear model of transformations, such as a gated autoencoder, can be turned into a recurrent net-work, by training it to predict future frames from the ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
We propose modeling time series by representing the transformations that take a frame at time t to a frame at time t+1. To this end we show how a bi-linear model of transformations, such as a gated autoencoder, can be turned into a recurrent network, by training it to predict future frames from the current one and the inferred transformation using backprop-through-time. We also show how stacking multiple layers of gating units in a recurrent pyramid makes it possible to represent the “syntax” of complicated time series, and that it can outperform standard recurrent neural networks in terms of prediction accuracy on a variety of tasks.
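To make the core mechanism concrete, here is a minimal sketch of one predictive step of a factored gated autoencoder used recurrently: mapping units are inferred from two frames and then reapplied to extrapolate further frames. The dimensions and weight names (U, V, W) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_fac, n_map = 64, 32, 16          # pixels, factors, mapping units

U = rng.standard_normal((n_fac, n_pix)) * 0.1   # factors for the "input" frame
V = rng.standard_normal((n_fac, n_pix)) * 0.1   # factors for the "output" frame
W = rng.standard_normal((n_map, n_fac)) * 0.1   # pools factor products into mappings

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def infer_mapping(x_prev, x_curr):
    """Encode the transformation taking x_prev to x_curr."""
    return sigmoid(W @ ((U @ x_prev) * (V @ x_curr)))

def apply_mapping(m, x):
    """Transform frame x by the encoded mapping m to predict the next frame."""
    return V.T @ ((W.T @ m) * (U @ x))

# Roll the model out, assuming the transformation stays constant across steps,
# as the predictive (backprop-through-time) training objective encourages.
frames = [rng.standard_normal(n_pix), rng.standard_normal(n_pix)]
m = infer_mapping(frames[0], frames[1])
for _ in range(3):                         # predict three steps ahead
    frames.append(apply_mapping(m, frames[-1]))
```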
Unsupervised Learning of Video Representations using LSTMs
"... We use Long Short Term Memory (LSTM) networks to learn representations of video se-quences. Our model uses an encoder LSTM to map an input sequence into a fixed length rep-resentation. This representation is decoded us-ing single or multiple decoder LSTMs to perform different tasks, such as reconstr ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
We use Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences – patches of image pixels and high-level representations (“percepts”) of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We further evaluate the representations by finetuning them for a supervised learning problem – human action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.
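A minimal PyTorch sketch of the encoder-decoder idea described above: an encoder LSTM compresses a clip into its final state, and a decoder LSTM unrolls from that state, conditioning on its own generated output. Layer sizes, the zero start frame, and the class name are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Seq2SeqVideo(nn.Module):
    def __init__(self, frame_dim=256, hidden=128, out_steps=5):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, frame_dim)
        self.out_steps = out_steps

    def forward(self, clip):                      # clip: (batch, time, frame_dim)
        _, state = self.encoder(clip)             # fixed-length representation
        frame = torch.zeros_like(clip[:, :1])     # "start" frame for the decoder
        outputs = []
        for _ in range(self.out_steps):
            out, state = self.decoder(frame, state)
            frame = self.readout(out)             # condition on generated output
            outputs.append(frame)
        return torch.cat(outputs, dim=1)          # (batch, out_steps, frame_dim)

model = Seq2SeqVideo()
future = model(torch.randn(2, 10, 256))           # predict 5 frames from 10 inputs
```

The same decoder could instead be trained to reconstruct the input clip (in reverse order), or two decoders could share the encoder state, one for reconstruction and one for future prediction.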
Who Do I Look Like? Determining Parent-Offspring Resemblance via Gated Autoencoders
"... Recent years have seen a major push for face recogni-tion technology due to the large expansion of image shar-ing on social networks. In this paper, we consider the diffi-cult task of determining parent-offspring resemblance using deep learning to answer the question “Who do I look like?” Although h ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
Recent years have seen a major push for face recognition technology due to the large expansion of image sharing on social networks. In this paper, we consider the difficult task of determining parent-offspring resemblance using deep learning to answer the question “Who do I look like?” Although humans can perform this job at a rate higher than chance, it is not clear how they do it [2]. However, recent studies in anthropology [24] have determined which features tend to be the most discriminative. In this study, we aim not only to create an accurate system for resemblance detection, but to bridge the gap between studies in anthropology and computer vision techniques. Further, we aim to answer two key questions: 1) Do offspring resemble their parents? and 2) Do offspring resemble one parent more than the other? We propose an algorithm that fuses the features and metrics discovered via gated autoencoders with a discriminative neural network layer that learns the optimal, or what we call genetic, features to delineate parent-offspring relationships. We further analyze the correlation between our automatically detected features and those found in anthropological studies. Meanwhile, our method outperforms the state-of-the-art in kinship verification by 3-10% depending on the relationship, using specific (father-son, mother-daughter, etc.) and generic models.
Action-Conditional Video Prediction using Deep Networks in Atari Games
"... Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Aracade Learning Environment (ALE), we consider spatio-temporal prediction problems where future image-frames de-pend on control variables or actions as well as previous frames. While ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Arcade Learning Environment (ALE), we consider spatio-temporal prediction problems where future image-frames depend on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional in size, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.
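The encode / action-conditional transform / decode layout can be sketched as below. The layer sizes, the 84x84 input, and the multiplicative frame-action interaction are assumptions for illustration; the paper's exact architecture and training details are not reproduced here.

```python
import torch
import torch.nn as nn

class ActionConditionalPredictor(nn.Module):
    def __init__(self, channels=3, n_actions=18, hidden=256):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(), nn.Flatten(),
            nn.LazyLinear(hidden))
        self.frame_factor = nn.Linear(hidden, hidden)
        self.action_factor = nn.Linear(n_actions, hidden)
        self.decode = nn.Sequential(
            nn.Linear(hidden, 64 * 9 * 9), nn.ReLU(),
            nn.Unflatten(1, (64, 9, 9)),
            nn.ConvTranspose2d(64, 32, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, channels, 8, stride=4))

    def forward(self, frame, action_onehot):
        h = self.encode(frame)
        # action-conditional transformation: multiply frame and action factors
        h = self.frame_factor(h) * self.action_factor(action_onehot)
        return self.decode(h)

model = ActionConditionalPredictor()
next_frame = model(torch.randn(1, 3, 84, 84), torch.eye(18)[:1])  # one game frame, one action
```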
Modeling sequential data using higher-order relational features and predictive training. arXiv preprint arXiv:1402.2333
, 2014
"... Bi-linear feature learning models, like the gated autoencoder, were proposed as a way to model relationships between frames in a video. By min-imizing reconstruction error of one frame, given the previous frame, these models learn “mapping units ” that encode the transformations inherent in a sequen ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Bi-linear feature learning models, like the gated autoencoder, were proposed as a way to model relationships between frames in a video. By minimizing reconstruction error of one frame, given the previous frame, these models learn “mapping units” that encode the transformations inherent in a sequence, and thereby learn to encode motion. In this work we extend bi-linear models by introducing “higher-order mapping units” that allow us to encode transformations between frames and transformations between transformations. We show that this makes it possible to encode temporal structure that is more complex and longer-range than the structure captured within standard bi-linear models. We also show that a natural way to train the model is by replacing the commonly used reconstruction objective with a prediction objective which forces the model to correctly predict the evolution of the input multiple steps into the future. Learning can be achieved by back-propagating the multi-step prediction through time. We test the model on various temporal prediction tasks, and show that higher-order mappings and predictive training both yield a significant improvement over bi-linear models in terms of prediction accuracy.
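A minimal numpy sketch of the "higher-order mapping" idea: a first bilinear layer encodes transformations between consecutive frames, and a second bilinear layer encodes transformations between those transformations, i.e. how the motion itself changes over time. All dimensions and weight names are assumptions; decoding and the multi-step predictive training loop are only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_map1, n_map2 = 64, 32, 16

Wx = rng.standard_normal((n_map1, n_pix)) * 0.1   # frame factors (layer 1)
Wy = rng.standard_normal((n_map1, n_pix)) * 0.1
Vm = rng.standard_normal((n_map2, n_map1)) * 0.1  # mapping factors (layer 2)
Vn = rng.standard_normal((n_map2, n_map1)) * 0.1

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def first_order(x_prev, x_curr):
    """Mapping units encoding the transformation between two frames."""
    return sigmoid((Wx @ x_prev) * (Wy @ x_curr))

def second_order(m_prev, m_curr):
    """Mapping units encoding the transformation between two mappings."""
    return sigmoid((Vm @ m_prev) * (Vn @ m_curr))

frames = [rng.standard_normal(n_pix) for _ in range(3)]
m1 = first_order(frames[0], frames[1])
m2 = first_order(frames[1], frames[2])
h = second_order(m1, m2)          # encodes the change in transformation
# Predictive training would decode h and m2 back into a prediction of frame 4,
# continue several steps ahead, and back-propagate the prediction error through time.
```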
Mental rotation by optimizing transforming distance
- In NIPS Deep Learning and Representation Learning Workshop
, 2014
"... The human visual system is able to recognize objects despite transformations that can drastically alter their appearance. To this end, much effort has been devoted to the invariance properties of recognition systems. Invariance can be engineered (e.g. convolutional nets), or learned from data explic ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
The human visual system is able to recognize objects despite transformations that can drastically alter their appearance. To this end, much effort has been devoted to the invariance properties of recognition systems. Invariance can be engineered (e.g. convolutional nets), or learned from data explicitly (e.g. temporal coherence) or implicitly (e.g. by data augmentation). One idea that has not, to date, been explored is the integration of latent variables which permit a search over a learned space of transformations. Motivated by evidence that people mentally simulate transformations in space while comparing examples, so-called “mental rotation”, we propose a transforming distance. Here, a trained relational model actively transforms pairs of examples so that they are maximally similar in some feature space yet respect the learned transformational constraints. We apply our method to nearest-neighbour problems on the Toronto Face Database and NORB.
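A hedged illustration of the "transforming distance" idea: rather than comparing two examples directly, search over a space of transformations for the one that makes them most similar, and report the remaining distance. Here the transformation space is a literal in-plane rotation optimized by gradient descent; the paper instead learns the transformation space with a relational (gated) model, which this sketch does not reproduce.

```python
import numpy as np

def transforming_distance(x, y, steps=200, lr=0.1):
    """Distance between point sets x, y of shape (n, 2) after fitting a rotation."""
    theta = 0.0
    for _ in range(steps):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        dR = np.array([[-s, -c], [c, -s]])          # derivative of R w.r.t. theta
        residual = x @ R.T - y
        grad = np.sum(residual * (x @ dR.T))        # gradient of 0.5 * ||residual||^2
        theta -= lr * grad / len(x)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return np.sqrt(np.mean((x @ R.T - y) ** 2)), theta

x = np.random.default_rng(0).standard_normal((50, 2))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle),  np.cos(angle)]])
dist, est = transforming_distance(x, x @ R_true.T)
print(dist, est)   # distance near zero, estimated angle near 0.7
```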
Zero-bias autoencoders and the benefits of co-adapting features
, 2014
"... We show that training common regularized autoencoders resembles clustering, because it amounts to fitting a density model whose mass is concentrated in the directions of the individ-ual weight vectors. We then propose a new ac-tivation function based on thresholding a linear function with zero bias ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
We show that training common regularized autoencoders resembles clustering, because it amounts to fitting a density model whose mass is concentrated in the directions of the individual weight vectors. We then propose a new activation function based on thresholding a linear function with zero bias (so it is truly linear, not affine), and argue that this allows hidden units to “collaborate” in order to define larger regions of uniform density. We show that the new activation function makes it possible to train autoencoders without an explicit regularization penalty, such as sparsification, contraction or denoising, by simply minimizing reconstruction error. Experiments in a variety of recognition tasks show that zero-bias autoencoders perform about on par with common regularized autoencoders on low-dimensional data and outperform these by an increasing margin as the dimensionality of the data increases.
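A minimal sketch of a zero-bias autoencoder with a thresholded linear hidden unit and tied weights, trained by plain reconstruction error. The layer sizes, the threshold value, and the tied-weight choice are assumptions for illustration; the gradient treats the threshold gate as constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, theta = 100, 50, 1.0
W = rng.standard_normal((n_hid, n_in)) * 0.05

def encode(x):
    a = W @ x
    return a * (a > theta)          # thresholded linear unit, zero bias

def reconstruct(x):
    return W.T @ encode(x)          # tied decoder weights

def train_step(x, lr=0.01):
    """One SGD step on 0.5 * ||reconstruct(x) - x||^2, no extra regularizer."""
    global W
    a = W @ x
    gate = (a > theta).astype(float)
    h = a * gate
    r = W.T @ h - x                                      # reconstruction error
    grad_W = np.outer(h, r) + np.outer(gate * (W @ r), x)  # decoder + encoder paths
    W -= lr * grad_W
    return 0.5 * np.sum(r ** 2)

for _ in range(100):
    loss = train_step(rng.standard_normal(n_in))
```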
Domain-Size Pooling in Local Descriptors: DSP-SIFT
"... We introduce a simple modification of local image de-scriptors, such as SIFT, based on pooling gradient orienta-tions across different domain sizes, in addition to spatial locations. The resulting descriptor, which we call DSP-SIFT, outperforms other methods in wide-baseline matching benchmarks, inc ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
We introduce a simple modification of local image descriptors, such as SIFT, based on pooling gradient orientations across different domain sizes, in addition to spatial locations. The resulting descriptor, which we call DSP-SIFT, outperforms other methods in wide-baseline matching benchmarks, including those based on convolutional neural networks, despite having the same dimension as SIFT and requiring no training.
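A hedged sketch of the domain-size pooling idea: describe the same keypoint location at several domain sizes (patch scales) and pool the descriptors, instead of using a single scale. OpenCV's stock SIFT is used as a stand-in; the paper's exact sampling of domain sizes and pooling weights are not reproduced, and the file name is hypothetical.

```python
import cv2
import numpy as np

def dsp_sift_descriptor(gray, x, y, base_size=16.0,
                        scales=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Average SIFT descriptors of one keypoint computed at several domain sizes."""
    sift = cv2.SIFT_create()
    descs = []
    for s in scales:
        kp = cv2.KeyPoint(float(x), float(y), base_size * s)
        _, d = sift.compute(gray, [kp])
        if d is not None:
            descs.append(d[0])
    pooled = np.mean(descs, axis=0)              # pool across domain sizes
    return pooled / (np.linalg.norm(pooled) + 1e-8)

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
if img is not None:
    desc = dsp_sift_descriptor(img, x=100, y=120)
```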
Multi-view feature engineering and learning
- In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015; arXiv preprint arXiv:1311.6048
"... We frame the problem of local representation of imaging data as the computation of minimal sufficient statistics that are invariant to nuisance variability induced by viewpoint and illumination. We show that, under very stringent condi-tions, these are related to “feature descriptors ” commonly used ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
We frame the problem of local representation of imaging data as the computation of minimal sufficient statistics that are invariant to nuisance variability induced by viewpoint and illumination. We show that, under very stringent conditions, these are related to “feature descriptors” commonly used in Computer Vision. Such conditions can be relaxed if multiple views of the same scene are available. We propose a sampling-based and a point-estimate based approximation of such a representation, compared empirically on image-to-(multiple)image matching, for which we introduce a multi-view wide-baseline matching benchmark, consisting of a mixture of real and synthetic objects with ground truth camera motion and dense three-dimensional geometry.
Unsupervised Learning of Visual Representations using Videos
, 2015
"... This is a review of unsupervised learning applied to videos with the aim of learning visual representations. We look at different realizations of the notion of temporal coherence across various models. We try to understand the challenges being faced, the strengths and weaknesses of different approac ..."
Abstract
- Add to MetaCart
This is a review of unsupervised learning applied to videos with the aim of learning visual representations. We look at different realizations of the notion of temporal coherence across various models. We try to understand the challenges being faced, the strengths and weaknesses of different approaches and identify directions for future work.