Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks (2014)

by M. Oquab, L. Bottou, I. Laptev, J. Sivic
Venue: Proc. CVPR, 2014
Results 1 - 10 of 71

Visualizing and understanding convolutional networks

by Matthew D. Zeiler, Rob Fergus - In Computer Vision – ECCV 2014, 2014
"... Abstract. Large Convolutional Network models have recently demon-strated impressive classification performance on the ImageNet bench-mark Krizhevsky et al. [18]. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. ..."
Abstract - Cited by 133 (3 self)
Abstract. Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark Krizhevsky et al. [18]. However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.

Citation Context

...s a single exclusive prediction for each image. Table 5 shows the results on the test set, comparing to the leading methods: the top 2 entries in the competition and concurrent work from Oquab et al. [21] who use a convnet with a more appropriate classifier. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean 2 For Cal...
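
As a companion to the layer-inspection theme of this entry, here is a minimal sketch of capturing intermediate feature maps with PyTorch forward hooks. It is not the deconvnet visualization of Zeiler and Fergus, only a simple way to see what each convolutional layer produces; it assumes torchvision ≥ 0.13 for the `weights` API.

```python
# Hedged sketch: capture intermediate feature maps of a pretrained CNN
# with forward hooks, as a lightweight stand-in for richer visualization.
import torch
import torchvision.models as models

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on each conv layer of the feature extractor.
for idx, layer in enumerate(model.features):
    if isinstance(layer, torch.nn.Conv2d):
        layer.register_forward_hook(save_activation(f"conv_{idx}"))

image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
with torch.no_grad():
    model(image)

for name, feat in activations.items():
    print(name, tuple(feat.shape))     # e.g. conv_0 (1, 64, 55, 55)
```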

Return of the Devil in the Details: Delving Deep into Convolutional Nets

by Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
"... The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in chal-lenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compar ..."
Abstract - Cited by 71 (8 self)
The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost. Source code and models to reproduce the experiments in the paper are made publicly available.
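
A minimal sketch of the shared augmentation idea from this entry: pool descriptors computed from a few flipped and cropped views of one image. The `encode` callable is a hypothetical stand-in for any fixed-length descriptor pipeline (a CNN layer, Fisher vectors, or BoVW), not code from the paper.

```python
# Hedged sketch: average descriptors over simple augmented views.
import numpy as np

def augmented_descriptor(image: np.ndarray, encode, crop=0.9):
    """Average the descriptors of a few views of `image` (H x W x C).

    `encode` must map an image to a fixed-length 1-D descriptor.
    """
    h, w = image.shape[:2]
    ch, cw = int(h * crop), int(w * crop)
    views = [
        image,                      # full frame
        image[:, ::-1],             # horizontal flip
        image[:ch, :cw],            # top-left crop
        image[h - ch:, w - cw:],    # bottom-right crop
    ]
    descs = np.stack([encode(v) for v in views])
    d = descs.mean(axis=0)          # average pooling over views
    return d / (np.linalg.norm(d) + 1e-8)
```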

Neural codes for image retrieval

by Artem Babenko, Anton Slesarev, R. Chigorin, Victor Lempitsky - In ECCV, 2014
"... Abstract. It has been shown that the activations invoked by an image within the top layers of a large convolutional neural network provide a high-level descriptor of the visual content of the image. In this paper, we investigate the use of such descriptors (neural codes) within the image retrieval a ..."
Abstract - Cited by 17 (1 self)
Abstract. It has been shown that the activations invoked by an image within the top layers of a large convolutional neural network provide a high-level descriptor of the visual content of the image. In this paper, we investigate the use of such descriptors (neural codes) within the image retrieval application. In the experiments with several standard retrieval benchmarks, we establish that neural codes perform competitively even when the convolutional neural network has been trained for an unrelated classification task (e.g. ImageNet). We also evaluate the improvement in the retrieval performance of neural codes when the network is retrained on a dataset of images that are similar to images encountered at test time. We further evaluate the performance of the compressed neural codes and show that a simple PCA compression provides very good short codes that give state-of-the-art accuracy on a number of datasets. In general, neural codes turn out to be much more resilient to such compression in comparison to other state-of-the-art descriptors. Finally, we show that discriminative dimensionality reduction trained on a dataset of pairs of matched photographs improves the performance of PCA-compressed neural codes even further. Overall, our quantitative experiments demonstrate the promise of neural codes as visual descriptors for image retrieval.

Citation Context

... holistic descriptors. Furthermore, we investigate in detail how retraining of a CNN on different datasets impacts the retrieval performance of the corresponding neural codes. Another concurrent work [17] investigated how similar retraining can be used to adapt the ImageNet-derived networks to smaller classification datasets. 3 Using Pretrained Neural Codes Deep convolutional architecture. In this se...
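
A minimal sketch of PCA-compressed neural codes for retrieval, assuming `codes` holds one L2-normalised CNN activation vector per database image; the function names and the 128-dimensional target are illustrative, not the paper's exact setup.

```python
# Hedged sketch: compress neural codes with PCA, rank by cosine similarity.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def build_short_codes(codes: np.ndarray, dim: int = 128):
    """codes: (n_images, d) array of CNN activations."""
    pca = PCA(n_components=dim)
    short = normalize(pca.fit_transform(codes))   # renormalise after PCA
    return pca, short

def retrieve(query: np.ndarray, pca, short_codes, k: int = 5):
    q = normalize(pca.transform(query.reshape(1, -1)))
    sims = short_codes @ q.ravel()                # cosine similarity
    return np.argsort(-sims)[:k]                  # indices of top-k matches
```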

From Captions to Visual Concepts and Back

by Hao Fang, Li Deng, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, Geoffrey Zweig, et al., 2014
"... This paper presents a novel approach for automatically generating image descriptions: visual detectors and language models learn directly from a dataset of image captions. We use Multiple Instance Learning to train visual detectors for words that commonly occur in captions, including many different ..."
Abstract - Cited by 15 (1 self)
This paper presents a novel approach for automatically generating image descriptions: visual detectors and language models learn directly from a dataset of image captions. We use Multiple Instance Learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. When human judges compare the system captions to ones written by other people, the system captions have equal or better quality over 23% of the time.
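
The Multiple Instance Learning step can be illustrated with a noisy-OR combination, where a word is predicted for an image if any region fires. This is a hedged sketch of that combination rule only, with placeholder region scores rather than trained detectors.

```python
# Hedged sketch: noisy-OR pooling of per-region word probabilities.
import numpy as np

def noisy_or(region_probs: np.ndarray) -> float:
    """P(word | image) = 1 - prod_i (1 - P(word | region_i))."""
    return 1.0 - np.prod(1.0 - region_probs)

regions = np.array([0.05, 0.10, 0.85])   # e.g. detector scores for "dog"
print(noisy_or(regions))                  # ~0.872: one strong region dominates
```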

From generic to specific deep representations for visual recognition

by Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, Stefan Carlsson - CoRR
"... Evidence is mounting that CNNs are currently the most efficient and successful way to learn visual representations. This paper address the questions on why CNN representations are so effective and how to improve them if one wants to maximize performance for a single task or a range of tasks. We asse ..."
Abstract - Cited by 7 (2 self)
Evidence is mounting that CNNs are currently the most efficient and successful way to learn visual representations. This paper addresses the questions of why CNN representations are so effective and how to improve them if one wants to maximize performance for a single task or a range of tasks. We assess experimentally the importance of different aspects of learning and choosing a CNN representation for its performance on a diverse set of visual recognition tasks. In particular, we investigate how altering the parameters in a network's architecture and its training impacts the representation's ability to specialize and generalize. We also study the effect of fine-tuning a generic network towards a particular task. Extensive experiments indicate two trends: (a) increasing specialization increases performance on the target task but can hurt the ability to generalize to other tasks, and (b) the less specialized the original network, the more likely it is to benefit from fine-tuning. As by-products we have learnt several deep CNN image representations which, when combined with a simple linear SVM classifier or similarity measure, produce the best performance on 12 standard datasets measuring the ability to solve visual recognition tasks ranging from image classification to image retrieval.

Citation Context

...12, 29] trained with large scale datasets, such as ImageNet [1], to solve the hardest visual recognition tasks, see figure 1. Excitingly, deep CNNs can also learn powerful generic image representations [27, 8, 21]. These representations can be exploited very simply to solve a large range of recognition tasks [27]. In fact the performance of these representations is so good that at this juncture in computer vi...
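
A minimal sketch of the specialization trade-off in practice: freeze the generic early blocks of a pretrained network and fine-tune the rest towards the target task. The split point, class count, and hyperparameters are illustrative assumptions, not the paper's protocol; assumes torchvision ≥ 0.13.

```python
# Hedged sketch: fine-tune a generic ImageNet network towards a new task.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Linear(model.fc.in_features, 20)   # e.g. 20 target classes

# Freeze the generic early layers; keep the later, more task-specific ones.
for name, param in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1", "layer2")):
        param.requires_grad = False

# Small learning rate so fine-tuning adapts rather than overwrites.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3, momentum=0.9, weight_decay=1e-4,
)
```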

CNN: Single-label to multi-label

by Yunchao Wei, Wei Xia, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, Shuicheng Yan - CoRR
"... Abstract—Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the complex underlying object layouts and insufficient multi-label training ..."
Abstract - Cited by 6 (0 self)
Abstract—Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the complex underlying object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), where an arbitrary number of object segment hypotheses are taken as the inputs, then a shared CNN is connected with each hypothesis, and finally the CNN output results from different hypotheses are aggregated with max pooling to produce the ultimate multi-label predictions. Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training; 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses; 3) no explicit hypothesis label is required; 4) the shared CNN may be well pre-trained with a large-scale single-label image dataset, e.g. ImageNet; and 5) it may naturally output multi-label prediction results. Experimental results on the Pascal VOC2007 and VOC2012 multi-label image datasets clearly demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods. In particular, the mAP reaches 84.2% with HCP only and 90.3% after fusion with our complementary result in [47] based on hand-crafted features on the VOC2012 dataset, which outperforms the state of the art by a large margin of more than 7%.

Citation Context

...structure possesses the following characteristics: • No ground-truth bounding box information is required for training on the multi-label image dataset. Different from previous works [12], [5], [15], [35], which employ ground-truth bounding box information for training, the proposed HCP requires no bounding box annotation. Since bounding box annotation is much more costly than labelling, the annotatio...
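
A minimal sketch of the HCP aggregation step, assuming `shared_cnn` is a hypothetical stand-in for any pretrained network with a multi-label scoring head: per-hypothesis scores are fused by max pooling across hypotheses.

```python
# Hedged sketch: max-pool per-class scores over object segment hypotheses.
import torch

def hcp_predict(shared_cnn, hypotheses: torch.Tensor) -> torch.Tensor:
    """hypotheses: (n_hyp, 3, H, W) crops taken from one image."""
    with torch.no_grad():
        scores = shared_cnn(hypotheses)        # (n_hyp, n_classes)
    fused, _ = scores.max(dim=0)               # max pooling across hypotheses
    return torch.sigmoid(fused)                # multi-label probabilities
```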

Relaxing from vocabulary: Robust weakly-supervised deep learning for vocabulary-free image tagging.

by Jianlong Fu, Yue Wu, Tao Mei, Jinqiao Wang, Hanqing Lu, Yong Rui, 2015
"... Abstract The development of deep learning has empowered machines with comparable capability of recognizing limited image categories to human beings. However, most existing approaches heavily rely on human-curated training data, which hinders the scalability to large and unlabeled vocabularies in im ..."
Abstract - Cited by 3 (3 self)
Abstract. The development of deep learning has empowered machines with a capability of recognizing a limited set of image categories comparable to that of human beings. However, most existing approaches heavily rely on human-curated training data, which hinders the scalability to large and unlabeled vocabularies in image tagging. In this paper, we propose a weakly-supervised deep learning model which can be trained from readily available Web images to relax the dependence on human labor and scale up to arbitrary tags (categories). Specifically, based on the assumption that features of true samples in a category tend to be similar and noises tend to be variant, we embed the feature map of the last deep layer into a new affinity representation, and further minimize the discrepancy between the affinity representation and its low-rank approximation. The discrepancy is finally transformed into the objective function to give relevance feedback to back propagation. Experiments show that we can achieve a performance gain of 14.0% in terms of a semantic-based relevance metric in image tagging with 63,043 tags from WordNet, against the typical deep model trained on the ImageNet 1,000-category vocabulary set.

Citation Context

...e samples with large reconstruction error. The removal ratio is set the same as the noise percentage. • CAE+CNN: we pre-train the convolutional layers of the CNN by the convolutional autoencoder (CAE) in a layer-wise way and fine-tune the entire network, which is reported in [15] to reduce the noise effect. • NL+CNN: we reproduce the additional bottom-up noise-adaptation layer in [21], and combine this layer with the CNN network. We also compared with two other methods on VOC2007. • Best VOC: pre-training using ImageNet, and fine-tuning on VOC2007, which has achieved the state-of-the-art performance [18]. • Web HOG: training concept representations by the part-based model and human-crafted features with Web training images [22], which is the most recent work in this topic. Results: First of all, we adjusted the weight decay value of the basic CNN model, i.e., β in Eqn. (2), on the two datasets. For different noise percentages (from 10% to 90%), this value is 0.004 for 10%, 0.008 for 20%, and 0.04 for the rest. We found that the above parameters make the basic CNN model achieve the best result on both datasets. In addition, we empirically set γ to 0.1 in Eqn. (3) so that the similarity val...
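
A minimal sketch of the affinity/low-rank idea, assuming `features` are last-layer activations for one tag's Web images: samples whose affinity rows are poorly explained by a low-rank approximation are treated as likely noise. The rank and the per-sample discrepancy measure are illustrative choices, not the paper's exact objective.

```python
# Hedged sketch: flag noisy samples via low-rank affinity reconstruction.
import numpy as np

def affinity_discrepancy(features: np.ndarray, rank: int = 5) -> np.ndarray:
    """features: (n_samples, d). Returns per-sample discrepancy scores."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    a = f @ f.T                                   # cosine affinity matrix
    u, s, vt = np.linalg.svd(a)
    s[rank:] = 0                                  # keep the top-`rank` modes
    a_lowrank = (u * s) @ vt
    return np.linalg.norm(a - a_lowrank, axis=1)  # large = likely noise
```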

Material recognition in the wild with the materials in context database

by Sean Bell, Paul Upchurch, Noah Snavely, Kavita Bala - arXiv:1412.0623
"... Recognizing materials in real-world images is a challeng-ing task. Real-world materials have rich surface texture, geometry, lighting conditions, and clutter, which combine to make the problem particularly difficult. In this paper, we introduce a new, large-scale, open dataset of materials in the wi ..."
Abstract - Cited by 3 (1 self)
Recognizing materials in real-world images is a challenging task. Real-world materials have rich surface texture, geometry, lighting conditions, and clutter, which combine to make the problem particularly difficult. In this paper, we introduce a new, large-scale, open dataset of materials in the wild, the Materials in Context Database (MINC), and combine this dataset with deep learning to achieve material recognition and segmentation of images in the wild. MINC is an order of magnitude larger than previous material databases, while being more diverse and well-sampled across its 23 categories. Using MINC, we train convolutional neural networks (CNNs) for two tasks: classifying materials from patches, and simultaneous material recognition and segmentation in full images. For patch-based classification on MINC we found that the best performing CNN architectures can achieve 85.2% mean class accuracy. We convert these trained CNN classifiers into an efficient fully convolutional framework combined with a fully connected conditional random field (CRF) to predict the material at every pixel in an image, achieving 73.1% mean class accuracy. Our experiments demonstrate that having a large, well-sampled dataset such as MINC is crucial for real-world material recognition and segmentation.

Citation Context

...feat [24], and VGG [27]. Finally, relevant to our goal of per-pixel material segmentation, Farabet et al. [6] use a multi-scale CNN to predict the class at every pixel in a segmentation. Oquab et al. [18] employ a sliding window approach to localize patch classification of objects. We build on this body of work in deep learning to solve our problem of material recognition and segmentation...
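
The patch-to-dense conversion rests on the equivalence between a fully connected layer over a k×k feature map and a k×k convolution, so a patch classifier can be slid over a larger image in one pass. The sketch below illustrates this with assumed shapes (a 23-class head matching MINC's category count) and omits the CRF refinement.

```python
# Hedged sketch: turn a patch-classifier FC head into a convolution
# to produce dense per-location class scores.
import torch

fc = torch.nn.Linear(256 * 6 * 6, 23)            # patch classifier head
conv = torch.nn.Conv2d(256, 23, kernel_size=6)   # equivalent convolution
conv.weight.data = fc.weight.data.view(23, 256, 6, 6)
conv.bias.data = fc.bias.data

feat = torch.randn(1, 256, 30, 30)               # features of a larger image
dense_scores = conv(feat)                        # (1, 23, 25, 25) score map
```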

R-CNNs for Pose Estimation and Action Detection

by Georgia Gkioxari, Bharath Hariharan, Ross Girshick, Jitendra Malik
"... We present convolutional neural networks for the tasks of keypoint (pose) predic-tion and action classification of people in unconstrained images. Our approach involves training an R-CNN detector with loss functions depending on the task being tackled. We evaluate our method on the challenging PASCA ..."
Abstract - Cited by 3 (3 self)
We present convolutional neural networks for the tasks of keypoint (pose) prediction and action classification of people in unconstrained images. Our approach involves training an R-CNN detector with loss functions depending on the task being tackled. We evaluate our method on the challenging PASCAL VOC dataset and compare it to previous leading approaches. Our method gives state-of-the-art results for keypoint and action prediction. Additionally, we introduce a new dataset for action detection, the task of simultaneously localizing people and classifying their actions, and present results using our approach.

Citation Context

...aluation measures AP for each action independently. Our approach achieves 70.5% mAP on the PASCAL VOC action test set for action classification and is slightly better than the previous leading method [20] (70.2%), which also uses CNNs. The standard method for evaluating action classification (reported above) assumes that ground-truth object locations are given at test time and one only needs to output...
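
A minimal sketch of the task-dependent loss idea: the same pooled region features feed a classification head trained with cross-entropy or a keypoint head trained with a regression loss. Feature and head dimensions here are assumptions, not the paper's configuration, and the targets are random placeholders.

```python
# Hedged sketch: one feature extractor, two task-specific heads and losses.
import torch
import torch.nn.functional as F

feat = torch.randn(8, 4096)                      # pooled R-CNN region features
action_head = torch.nn.Linear(4096, 10)          # e.g. 10 action classes
kpt_head = torch.nn.Linear(4096, 14 * 2)         # e.g. 14 (x, y) keypoints

action_loss = F.cross_entropy(action_head(feat), torch.randint(0, 10, (8,)))
kpt_loss = F.smooth_l1_loss(kpt_head(feat), torch.randn(8, 28))
```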

Visual Sentiment Prediction with Deep Convolutional Neural Networks. arXiv preprint arXiv:1411.5731

by Suleyman Cetintas, Kuang-chih Lee, Li-jia Li, 2014
"... Images have become one of the most popular types of media through which users convey their emotions within online social networks. Although vast amount of research is devoted to sentiment analysis of textual data, there has been very limited work that focuses on analyz-ing sentiment of image data. I ..."
Abstract - Cited by 2 (0 self)
Images have become one of the most popular types of media through which users convey their emotions within online social networks. Although a vast amount of research is devoted to sentiment analysis of textual data, there has been very limited work that focuses on analyzing the sentiment of image data. In this work, we propose a novel visual sentiment prediction framework that performs image understanding with Convolutional Neural Networks (CNN). Specifically, the proposed sentiment prediction framework performs transfer learning from a CNN with millions of parameters, which is pre-trained on large-scale data for object recognition. Experiments conducted on two real-world datasets from Twitter and Tumblr demonstrate the effectiveness of the proposed visual sentiment analysis framework.

Citation Context

... that perform transfer learning across different domains. (Le 2013) reported success with transferring deep representations to small datasets such as CIFAR and MNIST. Recent studies (Donahue et al. 2014) (Oquab et al. 2014) show that the parameters of a CNN trained on a large-scale dataset such as ILSVRC can be transferred to object recognition and scene classification tasks when the data is limited, resulting in better perfo...
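
A minimal sketch of the transfer set-up described above, with the pretrained CNN treated as a frozen feature extractor and a simple classifier fitted on top; `features` and `labels` are random placeholders for penultimate-layer activations and positive/negative sentiment labels.

```python
# Hedged sketch: linear sentiment classifier over frozen CNN features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4096))     # stand-in CNN activations
labels = rng.integers(0, 2, size=200)       # stand-in sentiment labels

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.score(features, labels))          # training accuracy of the head
```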
