Unifying visual-semantic embeddings with multimodal neural language models, TACL (2015)

by R. Kiros, R. Salakhutdinov, R. S. Zemel

Results 1 - 10 of 26

Deep visual-semantic alignments for generating image descriptions

by Andrej Karpathy, Li Fei-Fei, 2014
"... We present a model that generates natural language de-scriptions of images and their regions. Our approach lever-ages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between lan-guage and visual data. Our alignment model is based on a novel combinati ..."
Abstract - Cited by 47 (0 self) - Add to MetaCart
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
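The structured alignment objective described in this abstract can be illustrated with a short sketch: image and sentence vectors (random placeholders standing in for CNN region features and RNN sentence encodings) are projected into a shared space and scored with a bidirectional max-margin ranking loss. The dimensions, projection matrices, and margin value below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder features: 5 images and 5 matching sentences (row i pairs with row i).
image_feats = rng.normal(size=(5, 512))      # stand-in for CNN features (assumed dim)
sentence_feats = rng.normal(size=(5, 300))   # stand-in for RNN sentence encodings (assumed dim)

# Learned projections into a shared 128-d multimodal embedding (random here).
W_img = rng.normal(scale=0.01, size=(512, 128))
W_sen = rng.normal(scale=0.01, size=(300, 128))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img_emb = l2_normalize(image_feats @ W_img)
sen_emb = l2_normalize(sentence_feats @ W_sen)

# Similarity matrix: S[i, j] = score of image i with sentence j.
S = img_emb @ sen_emb.T

def ranking_loss(S, margin=0.1):
    """Bidirectional max-margin ranking loss over the score matrix S."""
    pos = np.diag(S)                                     # scores of matching pairs
    cost_img = np.maximum(0, margin + S - pos[:, None])  # wrong sentences for each image
    cost_sen = np.maximum(0, margin + S - pos[None, :])  # wrong images for each sentence
    np.fill_diagonal(cost_img, 0)
    np.fill_diagonal(cost_sen, 0)
    return cost_img.sum() + cost_sen.sum()

print("ranking loss:", ranking_loss(S))
```

In a real system the two projections would be trained by gradient descent so that matching image-sentence pairs score higher than all mismatched pairs by at least the margin.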

Long-term recurrent convolutional networks for visual recognition and description

by Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell , 2014
"... ..."
Abstract - Cited by 39 (1 self) - Add to MetaCart
Abstract not found

Show and tell: A neural image caption generator

by Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, 2014
"... Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep re-current architecture that combines recent advances in computer vision an ..."
Abstract - Cited by 32 (2 self) - Add to MetaCart
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU score improvements on Flickr30k, from 55 to 66, and on SBU, from 19 to 27.
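As a rough illustration of the training objective this abstract describes (maximizing the likelihood of the target caption given the image), the sketch below conditions a plain RNN decoder on a placeholder image vector and accumulates the log-probability of each caption word. The vocabulary size, dimensions, start-token index, and the use of a vanilla RNN instead of an LSTM are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab_size, embed_dim, hidden_dim, img_dim = 20, 16, 32, 64

# Toy parameters (randomly initialized; in practice learned by backpropagation).
W_img = rng.normal(scale=0.1, size=(img_dim, hidden_dim))   # image -> initial hidden state
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))     # word embeddings
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def caption_log_likelihood(image_feat, caption_ids):
    """log p(caption | image) under a plain RNN decoder initialized from the image."""
    h = np.tanh(image_feat @ W_img)        # condition the decoder on the image
    log_prob = 0.0
    prev = 0                               # assumed index of a <start> token
    for word in caption_ids:
        h = np.tanh(E[prev] @ W_xh + h @ W_hh)
        p = softmax(h @ W_hy)
        log_prob += np.log(p[word])        # accumulate log-likelihood of the next word
        prev = word
    return log_prob

image_feat = rng.normal(size=img_dim)      # placeholder for CNN image features
caption = [4, 7, 2, 9]                     # placeholder word indices
print("log p(caption | image):", caption_log_likelihood(image_feat, caption))
```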

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

by Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Alan Yuille - Under review as a conference paper at ICLR 2015, 2015
"... In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consis ..."
Abstract - Cited by 13 (1 self) - Add to MetaCart
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
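The multimodal layer the abstract describes, where the word embedding, the recurrent state, and the image representation interact to predict the next word, can be sketched roughly as follows. The layer sizes, the tanh activation, and the additive fusion are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

embed_dim, hidden_dim, img_dim, multi_dim, vocab_size = 16, 32, 64, 48, 20

# Projection matrices into the multimodal layer (random placeholders for learned weights).
V_w = rng.normal(scale=0.1, size=(embed_dim, multi_dim))   # word embedding -> multimodal
V_r = rng.normal(scale=0.1, size=(hidden_dim, multi_dim))  # recurrent state -> multimodal
V_i = rng.normal(scale=0.1, size=(img_dim, multi_dim))     # CNN image feature -> multimodal
V_o = rng.normal(scale=0.1, size=(multi_dim, vocab_size))  # multimodal -> word scores

def next_word_distribution(word_embedding, recurrent_state, image_feature):
    """p(next word | previous words, image) via a multimodal fusion layer."""
    m = np.tanh(word_embedding @ V_w + recurrent_state @ V_r + image_feature @ V_i)
    scores = m @ V_o
    e = np.exp(scores - scores.max())
    return e / e.sum()

p = next_word_distribution(rng.normal(size=embed_dim),
                           rng.normal(size=hidden_dim),
                           rng.normal(size=img_dim))
print("next-word distribution sums to", p.sum())
```

Captions would then be generated by repeatedly sampling from this distribution and feeding the sampled word back into the recurrent sub-network.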

Generative Moment Matching Networks

by Yujia Li, Kevin Swersky
"... We consider the problem of learning deep gener-ative models from data. We formulate a method that generates an independent sample via a sin-gle feedforward pass through a multilayer pre-ceptron, as in the recently proposed generative adversarial networks (Goodfellow et al., 2014). Training a generat ..."
Abstract - Cited by 6 (0 self) - Add to MetaCart
We consider the problem of learning deep generative models from data. We formulate a method that generates an independent sample via a single feedforward pass through a multilayer perceptron, as in the recently proposed generative adversarial networks (Goodfellow et al., 2014). Training a generative adversarial network, however, requires careful optimization of a difficult minimax program. Instead, we utilize a technique from statistical hypothesis testing known as maximum mean discrepancy (MMD), which leads to a simple objective that can be interpreted as matching all orders of statistics between a dataset and samples from the model, and can be trained by backpropagation. We further boost the performance of this approach by combining our generative network with an auto-encoder network, using MMD to learn to generate codes that can then be decoded to produce samples. We show that the combination of these techniques yields excellent generative models compared to baseline approaches as measured on MNIST and the Toronto Face Database.
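The MMD criterion mentioned here has a simple empirical estimate. The sketch below computes a biased squared-MMD between a data sample and a model sample with a Gaussian kernel; the kernel bandwidth and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of the squared maximum mean discrepancy."""
    k_xx = gaussian_kernel(X, X, sigma).mean()
    k_yy = gaussian_kernel(Y, Y, sigma).mean()
    k_xy = gaussian_kernel(X, Y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy

data_samples = rng.normal(loc=0.0, size=(100, 2))    # stand-in for real data
model_samples = rng.normal(loc=0.5, size=(100, 2))   # stand-in for generator output
print("MMD^2 estimate:", mmd2(data_samples, model_samples))
```

Because the estimate is differentiable in the model samples, it can serve directly as a training loss for a generator network, which is the basic idea behind the approach described above.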

Citation Context

... generation (Vinyals et al., 2014; Fang et al., 2014; Kiros et al., 2014), machine translation (Cho et al., 2014; Sutskever et al., 2014), and more. Despite their successes, one of the main bottlenecks of the supervised approach is the difficulty in obtaining enough data ...

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

by Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio, 2015
"... Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by ma ..."
Abstract - Cited by 4 (2 self) - Add to MetaCart
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
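The deterministic ("soft") variant of the attention mechanism referenced in this abstract amounts to a softmax-weighted average of spatial image features, with weights computed from the decoder state. A minimal sketch follows; every dimension and parameter name is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

num_regions, feat_dim, hidden_dim = 196, 512, 128   # e.g. a 14x14 grid of CNN features (assumed)

annotation_vectors = rng.normal(size=(num_regions, feat_dim))  # spatial image features
decoder_state = rng.normal(size=hidden_dim)                    # current RNN hidden state

# Attention MLP parameters (random placeholders for learned weights).
W_a = rng.normal(scale=0.1, size=(feat_dim, hidden_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
v = rng.normal(scale=0.1, size=hidden_dim)

def soft_attention(a, h):
    """Weighted average of region features; weights depend on the decoder state."""
    scores = np.tanh(a @ W_a + h @ W_h) @ v      # one scalar score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over regions
    return weights @ a                           # context vector: expected region feature

context = soft_attention(annotation_vectors, decoder_state)
print("context vector shape:", context.shape)    # (feat_dim,)
```

The stochastic ("hard") variant instead samples a single region from the softmax weights and is trained by maximizing a variational lower bound, as the abstract notes.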

A Dataset for Movie Description

by Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele - In CVPR, 2015
"... Descriptive video service (DVS) provides linguistic de-scriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an inter-esting data source for computer vision and computational linguistic ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full length HD movies. In addition we also collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.

Citation Context

...8, 32, 62, 16, 26, 64, 55] with natural language. While recent works on image description show impressive results by learning the relations between images and sentences and generating novel sentences [41, 19, 48, 56, 35, 31, 65, 13], the video description works typically rely on retrieval or templates [16, 63, 26, 27, 37, 39, 62] and frequently use a separate language corpus to model the linguistic statistics. A few exceptions e...

VQA: Visual Question Answering

by Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh
"... Abstract—We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring many real-world scenarios, such as helping the visually impaired, both the q ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring many real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing 100,000s of images and questions and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.

Citation Context

...ideo captioning that combines Computer Vision (CV), Natural Language Processing (NLP), and Knowledge Representation & Reasoning (KR) has dramatically increased in the past year [14], [7], [10], [31], [22], [20], [42]. Part of this excitement stems from a belief that multi-discipline tasks like image captioning are a step towards solving AI. However, the current state of the art demonstrates that a coa...

Jointly modeling embedding and translation to bridge video and language.

by Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui, 2016
"... Abstract Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs) ..."
Abstract - Cited by 2 (1 self) - Add to MetaCart
Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs)

Citation Context

...ere is a wide variety of video applications based on the description, ranging from editing, indexing, search, to sharing. However, the problem itself has been taken as a grand challenge for decades in the research communities, as the description generation model should be powerful enough not only to recognize key objects from visual content, but also discover their spatio-temporal relationships and the dynamics expressed in a natural language. Despite the difficulty of the problem, there have been a few attempts to address video description generation [5, 30, 34], and image caption generation [6, 13, 16, 31], which are mainly inspired by recent advances in machine translation using RNN [1]. Among these successful attempts, most of them use Long Short-Term Memory (LSTM) [9], a variant of RNN, which can capture long-term temporal information by mapping sequences to sequences. Thus, we follow this elegant recipe and use LSTM as our RNN model to generate the video sentence in this paper. However, existing approaches to video description generation mainly optimize the next word given the input video and previous words locally, while leaving the relationship between the semantics of the entire sentence...

Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

by Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, Sanja Fidler
"... Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich ..."
Abstract - Cited by 2 (1 self) - Add to MetaCart
Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
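The clip-to-sentence scoring step implied by this abstract, computing similarities between movie clips and book sentences in a joint embedding space, can be sketched as a cosine-similarity matrix followed by a greedy match. The embeddings below are random placeholders and the dimensionality is an assumption; the paper's full pipeline additionally uses a context-aware CNN over multiple similarity sources.

```python
import numpy as np

rng = np.random.default_rng(5)

# Placeholder embeddings: 4 movie clips and 6 book sentences in a shared 300-d space.
clip_emb = rng.normal(size=(4, 300))
sentence_emb = rng.normal(size=(6, 300))

def cosine_similarity_matrix(A, B):
    """S[i, j] = cosine similarity between row i of A and row j of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

S = cosine_similarity_matrix(clip_emb, sentence_emb)
best_sentence_per_clip = S.argmax(axis=1)   # greedy alignment: best-scoring sentence for each clip
print(best_sentence_per_clip)
```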

Citation Context

...otten significant attention in the past year, partly due to the creation of CoCo [18], Microsoft's large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multi-sentence queries [17]. ...
