Data generation as sequential decision making. (2015)

by Philip Bachman
Venue: NIPS
Generating Images from Captions with Attention

by Elman Mansimov , Emilio Parisotto , Jimmy Lei Ba , Ruslan Salakhutdinov
Abstract: Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on MS COCO, we compare our models with several baseline generative models on image generation and retrieval tasks. We demonstrate our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset. For more details, visit http://arxiv.org/abs/1511.02793.

Citation Context

...ons. 2.2 Image Modelling: the Conditional alignDRAW Network. To generate images conditioned on the caption information, we extended the DRAW network [1] to include the caption representation h^lang at each step, as shown in Figure 1. The conditional DRAW network is a stochastic recurrent neural network that consists of a set of latent variables Z_t at each time step. Unlike the original DRAW network, where the latent variables are independent unit Gaussians N(0, I), the latent variables in the proposed alignDRAW model have their mean and variance depend on the previous recurrent hidden state h^dec_{t-1}, as in [4]. Formally, the image is generated by iteratively computing the following equations for t = 1, ..., T (see Figure 1):

    z_t ~ p(Z_t | Z_{1:t-1}) = N(mu_t(h^dec_{t-1}), sigma_t(h^dec_{t-1})),    (1)
    h^dec_t = LSTM^dec(h^dec_{t-1}, z_t, s_{t-1}),                            (2)
    s_t = align(h^dec_{t-1}, h^lang);  c_t = c_{t-1} + write(h^dec_t),        (3)

where write and read are the same attention operators as in [1]. The align function is used to compute the alignment between the input caption and the intermediate image generative steps [5]. Given the caption representation from the language model, h^lang = [h^lang_1, h^lang_2, ..., h^lang_N], the align operator outputs a dyn...
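The generative loop in equations (1)-(3) can be sketched in plain NumPy. This is a minimal illustration of the recurrence structure only, not the paper's implementation: all dimensions, the `mlp` stand-in for learned affine maps, the simplified softmax attention in `align`, and the toy 4x4 canvas are hypothetical; the real model uses trained LSTM weights and DRAW's attentive read/write operators.

```python
import numpy as np

# Hypothetical dimensions; all names here are illustrative, not from the paper.
rng = np.random.default_rng(0)
n_z, n_h, n_lang, N, T = 8, 16, 12, 5, 4

h_lang = rng.standard_normal((N, n_lang))  # caption word representations h^lang_1..N
h_dec = np.zeros(n_h)                      # decoder hidden state h^dec
s = np.zeros(n_lang)                       # alignment summary s_t
canvas = np.zeros((4, 4))                  # toy stand-in for the image canvas c_t

def mlp(x, out_dim, seed):
    # Stand-in for a learned affine map (fixed random weights per call site).
    W = np.random.default_rng(seed).standard_normal((out_dim, x.size)) * 0.1
    return W @ x

def align(h, h_lang):
    # Simplified soft attention over caption words: score, softmax, average.
    scores = h_lang @ mlp(h, n_lang, 1)
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ h_lang

def lstm_dec(h_prev, z, s_prev):
    # Stand-in for the decoder LSTM transition LSTM^dec(h^dec_{t-1}, z_t, s_{t-1}).
    return np.tanh(mlp(np.concatenate([h_prev, z, s_prev]), n_h, 2))

def write(h):
    # Stand-in for DRAW's attentive write: paint an additive patch on the canvas.
    return mlp(h, canvas.size, 3).reshape(canvas.shape)

for t in range(T):
    mu = mlp(h_dec, n_z, 4)                                  # prior mean from h^dec_{t-1}
    log_sigma = mlp(h_dec, n_z, 5)                           # prior (log) variance
    z = mu + np.exp(log_sigma) * rng.standard_normal(n_z)    # Eq. (1): sample z_t
    s_new = align(h_dec, h_lang)                             # Eq. (3): s_t from h^dec_{t-1}
    h_dec = lstm_dec(h_dec, z, s)                            # Eq. (2): uses s_{t-1}
    s = s_new
    canvas = canvas + write(h_dec)                           # Eq. (3): accumulate c_t

print(canvas.shape)  # -> (4, 4)
```

Note the update order: s_t is computed from h^dec_{t-1} before the LSTM step, while the LSTM consumes the previous summary s_{t-1}, matching the indices in equations (2) and (3).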
