Results 1 - 10 of 251
Caffe: Convolutional architecture for fast feature embedding
, 2014
"... Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose conv ..."
Abstract
-
Cited by 192 (8 self)
- Add to MetaCart
(Show Context)
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
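As a rough sketch of the deployment workflow the abstract describes, the following pycaffe snippet loads a trained model and runs one image through it. The prototxt, caffemodel, and image file names are placeholders, not artifacts shipped with the paper:

```python
import caffe  # Python bindings to the BSD-licensed C++ library

caffe.set_mode_gpu()  # CUDA GPU computation; caffe.set_mode_cpu() also works

# Placeholder file names: a model definition and trained weights.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# Preprocess one image into the network's (N, C, H, W) input blob.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))  # HWC -> CHW
image = caffe.io.load_image('example.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)

out = net.forward()  # dict mapping output blob names to score arrays
```

This illustrates the separation of model representation (the prototxt) from actual implementation (the compiled library) that the abstract credits for easy switching between platforms.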
Return of the Devil in the Details: Delving Deep into Convolutional Nets
, 2014
"... The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in chal-lenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compar ..."
Abstract
-
Cited by 71 (8 self)
- Add to MetaCart
The latest generation of Convolutional Neural Networks (CNN) has achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost. Source code and models to reproduce the experiments in the paper are made publicly available.
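One concrete takeaway, that CNN output dimensionality can be cut sharply without hurting performance, amounts to a linear projection of the descriptors. A minimal NumPy sketch of such a PCA-style reduction; the 4096-D input and 128-D target are assumed sizes, not the paper's exact settings:

```python
import numpy as np

def pca_reduce(features, k):
    """Project row-vector descriptors onto their top-k principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

feats = np.random.randn(1000, 4096).astype(np.float32)  # toy CNN descriptors
codes = pca_reduce(feats, 128)
print(codes.shape)  # (1000, 128)
```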
Learning and transferring mid-level image representations using convolutional neural networks
- In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2014
"... Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The suc-cess of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level f ..."
Abstract
-
Cited by 71 (3 self)
- Add to MetaCart
(Show Context)
Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with a limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representations for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on the PASCAL VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
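The transfer recipe the abstract outlines, keeping the ImageNet-trained layers fixed, extracting their activations, and training only a small classifier on the target data, can be sketched as follows. The random arrays stand in for real mid-level features and PASCAL VOC labels; the 4096-D size and 20 classes are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins for mid-level features: in the paper these would be activations
# of ImageNet-pretrained layers computed on PASCAL VOC images.
rng = np.random.default_rng(0)
train_x, test_x = rng.normal(size=(200, 4096)), rng.normal(size=(50, 4096))
train_y, test_y = rng.integers(0, 20, 200), rng.integers(0, 20, 50)

# Only this small classifier is trained; the transferred layers stay fixed.
clf = LogisticRegression(max_iter=1000).fit(train_x, train_y)
print(clf.score(test_x, test_y))
```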
Large-scale Video Classification with Convolutional Neural Networks
"... Convolutional Neural Networks (CNNs) have been es-tablished as a powerful class of models for image recog-nition problems. Encouraged by these results, we pro-vide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging ..."
Abstract
-
Cited by 68 (5 self)
- Add to MetaCart
(Show Context)
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3%, up from 43.9%).
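Two of the abstract's ideas are easy to make concrete in NumPy: early fusion, which stacks a short clip's frames along the channel axis so the first layer sees motion directly, and the multiresolution, foveated input, which pairs a downsampled whole frame with a full-resolution center crop. The frame sizes below are illustrative, and the strided-slice downsample is a crude stand-in for proper resizing:

```python
import numpy as np

def early_fusion(clip):
    """Stack T RGB frames on the channel axis: (T, H, W, 3) -> (H, W, 3*T)."""
    t, h, w, c = clip.shape
    return clip.transpose(1, 2, 0, 3).reshape(h, w, t * c)

def fovea_and_context(frame):
    """Half-resolution view of the whole frame (context) plus a
    full-resolution center crop of the same size (fovea)."""
    h, w, _ = frame.shape
    context = frame[::2, ::2]                 # crude 2x downsample
    ch, cw = h // 2, w // 2
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    fovea = frame[y0:y0 + ch, x0:x0 + cw]     # center crop
    return fovea, context

clip = np.random.rand(4, 178, 178, 3)         # toy 4-frame clip
print(early_fusion(clip).shape)               # (178, 178, 12)
fovea, context = fovea_and_context(clip[0])
print(fovea.shape, context.shape)             # (89, 89, 3) twice
```

Each stream then feeds its own smaller convolutional tower, which is where the training speedup comes from.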
Spatial pyramid pooling in deep convolutional networks for visual recognition
- In ECCV
"... Abstract. Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled poo ..."
Abstract
-
Cited by 52 (5 self)
- Add to MetaCart
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, PASCAL VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and is 24-64× faster overall), while achieving better or comparable accuracy on PASCAL VOC 2007.
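The pooling strategy itself is compact enough to sketch directly: for each pyramid level l, the feature map is divided into an l × l grid and max-pooled per bin, so the concatenated output length depends only on the channel count and the chosen levels, never on H or W. The levels (1, 2, 4) below are an illustrative choice, not necessarily the paper's configuration:

```python
import math
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over an l x l grid per level and
    concatenate: output length C * sum(l*l), independent of H and W."""
    c, h, w = fmap.shape
    parts = []
    for l in levels:
        for i in range(l):
            for j in range(l):
                y0, y1 = i * h // l, math.ceil((i + 1) * h / l)
                x0, x1 = j * w // l, math.ceil((j + 1) * w / l)
                parts.append(fmap[:, y0:y1, x0:x1].max(axis=(1, 2)))
    return np.concatenate(parts)

# The same network body can feed a fixed-size classifier at any input size:
for h, w in [(13, 13), (9, 17)]:
    print(spatial_pyramid_pool(np.random.rand(256, h, w)).shape)
    # (5376,) = 256 * (1 + 4 + 16) both times
```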
Deep visual-semantic alignments for generating image descriptions
, 2014
"... We present a model that generates natural language de-scriptions of images and their regions. Our approach lever-ages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between lan-guage and visual data. Our alignment model is based on a novel combinati ..."
Abstract
-
Cited by 47 (0 self)
- Add to MetaCart
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K, and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
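The structured alignment objective rests on a simple image-sentence score: embed regions and words in a shared space, let each word pick its best-matching region, and sum the thresholded similarities. A toy NumPy version in the spirit of that score, with random vectors standing in for the CNN region embeddings and bidirectional-RNN word embeddings:

```python
import numpy as np

def alignment_score(regions, words):
    """Image-sentence score: each word attends to its best region.
    regions: (R, D) region embeddings; words: (T, D) word embeddings."""
    sims = words @ regions.T               # (T, R) inner-product similarities
    return np.maximum(sims, 0).max(axis=1).sum()

regions = np.random.randn(19, 300)         # toy shared 300-D embedding space
words = np.random.randn(7, 300)
print(alignment_score(regions, words))
```

Ranking these scores over a batch of image-sentence pairs is what drives the retrieval experiments reported above.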
Regionlets for generic object detection
, 2013
"... Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands for descriptive and flexible object representations that are also efficient to eval-uate for many locations. In view of this, we propose to mod ..."
Abstract
-
Cited by 45 (6 self)
- Add to MetaCart
Generic object detection must handle different degrees of variation in distinct object classes with tractable computation, which demands descriptive and flexible object representations that are also efficient to evaluate at many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. We then evaluate the object bounding box proposals obtained by selective search from segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves a detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.
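A toy sketch of the aggregation step: each regionlet's box is defined relative to the detection window, so the same definition applies at any window size or aspect ratio, and taking the max response within a group absorbs small deformations. The feature function here is a hypothetical stand-in for the paper's actual local features:

```python
def regionlet_group_feature(feature_fn, window, regionlets):
    """Max response over one group's regionlets.
    window: (x, y, w, h); regionlets: relative boxes (rx, ry, rw, rh) in [0, 1]."""
    x, y, w, h = window
    responses = []
    for rx, ry, rw, rh in regionlets:
        # Scale the relative box to the current detection window.
        box = (x + rx * w, y + ry * h, rw * w, rh * h)
        responses.append(feature_fn(box))
    return max(responses)

toy_feature = lambda box: box[2] * box[3]   # placeholder response
print(regionlet_group_feature(toy_feature, (10, 10, 64, 128),
                              [(0.1, 0.1, 0.2, 0.2), (0.5, 0.4, 0.2, 0.2)]))
```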
DeepPose: Human pose estimation via deep neural networks
- In CVPR
, 2014
"... Figure 1. Besides extreme variability in articulations, many of the joints are barely visible. We can guess the location of the right arm in the left image only because we see the rest of the pose and anticipate the motion or activity of the person. Similarly, the left body half of the person on the ..."
Abstract
-
Cited by 43 (1 self)
- Add to MetaCart
(Show Context)
Figure 1 caption: Besides extreme variability in articulations, many of the joints are barely visible. We can guess the location of the right arm in the left image only because we see the rest of the pose and anticipate the motion or activity of the person. Similarly, the left body half of the person on the right is not visible at all. These are examples of the need for holistic reasoning. We believe that DNNs can naturally provide such reasoning.

We propose a method for human pose estimation based on Deep Neural Networks (DNNs). The pose estimation is formulated as a DNN-based regression problem towards body joints. We present a cascade of such DNN regressors which results in high-precision pose estimates. The approach has the advantage of reasoning about pose in a holistic fashion and has a simple but powerful formulation which capitalizes on recent advances in Deep Learning. We present a detailed empirical analysis with state-of-the-art or better performance on four academic benchmarks of diverse real-world images.
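The cascade can be sketched as an initial whole-image regression followed by stages that re-crop around each current joint estimate and predict a refinement. The regressor and crop functions below are hypothetical stand-ins for trained DNNs and sub-image extraction:

```python
import numpy as np

def cascade_pose(image, regressors, crop_fn, init_pose):
    """Stage 0 predicts a full pose from the whole image; each later stage
    refines one joint at a time from a crop around its current estimate."""
    pose = init_pose + regressors[0](image)       # coarse full-body stage
    for stage in regressors[1:]:
        for j in range(len(pose)):
            patch = crop_fn(image, pose[j])       # re-crop around joint j
            pose[j] = pose[j] + stage(patch)      # predicted displacement
    return pose

# Toy demo with random "networks" producing small displacements.
image = np.zeros((220, 220, 3))
init = np.full((14, 2), 110.0)                    # 14 joints at the center
stage0 = lambda im: np.random.randn(14, 2)
refine = lambda patch: 0.5 * np.random.randn(2)
crop = lambda im, pt: im                          # placeholder crop
print(cascade_pose(image, [stage0, refine, refine], crop, init).shape)  # (14, 2)
```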
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
, 2015
"... Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU ..."
Abstract
-
Cited by 40 (0 self)
- Add to MetaCart
(Show Context)
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on the learnable activation and advanced initialization, we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [33]). To our knowledge, our result is the first to surpass the reported human-level performance (5.1% [26]) on this dataset.
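Both contributions are small enough to state in code. PReLU replaces ReLU's fixed zero slope for negative inputs with a learned slope a, and the initialization draws weights from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in), derived so activation variance stays stable through rectifier layers. A NumPy sketch with illustrative layer sizes:

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: x for x > 0, a * x otherwise. a = 0 recovers ReLU."""
    return np.where(x > 0, x, a * x)

def he_init(fan_in, fan_out, rng):
    """Gaussian init with std sqrt(2 / fan_in), derived for rectifiers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w = he_init(4096, 1000, rng)
x = rng.normal(size=(8, 4096)) @ w
print(prelu(x, 0.25).shape)  # (8, 1000); the paper initializes a to 0.25
```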
Fully convolutional networks for semantic segmentation
, 2014
"... Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolu-tional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmen-tation. Our key insight is to build “fully convolutional” networks that take ..."
Abstract
-
Cited by 37 (0 self)
- Add to MetaCart
(Show Context)
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [17], the VGG net [28], and GoogLeNet [29]) into fully convolutional networks and transfer their learned representations by fine-tuning [2] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (a 20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
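The "convolutionalization" step that enables arbitrary-sized inputs is essentially a reshape: a fully connected layer over a C × H × W feature volume is equivalent to a convolution with kernels of that same spatial extent, so the network emits a spatial map of class scores instead of a single vector. A minimal sketch with assumed layer sizes:

```python
import numpy as np

def fc_to_conv(fc_weights, c, h, w):
    """View a fully connected layer (out_dim, c*h*w) as a conv layer with
    out_dim kernels of shape (c, h, w)."""
    return fc_weights.reshape(fc_weights.shape[0], c, h, w)

# e.g. a hypothetical fc layer over a 256 x 6 x 6 pooled feature volume:
fc = np.random.randn(4096, 256 * 6 * 6)
print(fc_to_conv(fc, 256, 6, 6).shape)  # (4096, 256, 6, 6)
```

Run as a convolution over a larger input, these kernels produce the coarse score map that the paper then upsamples and fuses with shallower layers.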