Results 1 - 10 of 28
Rich feature hierarchies for accurate object detection and semantic segmentation
Cited by 251 (23 self)
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. Source code for the complete system is available at
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Cited by 203 (22 self)
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Learning and transferring mid-level image representations using convolutional neural networks
- In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014
Cited by 71 (3 self)
Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The success of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification methods. Learning CNNs, however, amounts to estimating millions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred representation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets. We also show promising results for object and action localization.
Unsupervised feature learning for 3d scene labeling
- In ICRA, 2014
Cited by 11 (0 self)
This paper presents an approach for labeling objects in 3D scenes. We introduce HMP3D, a hierarchical sparse coding technique for learning features from 3D point cloud data. HMP3D classifiers are trained using a synthetic dataset of virtual scenes generated using CAD models from an online database. Our scene labeling system combines features learned from raw RGB-D images and 3D point clouds directly, without any hand-designed features, to assign an object label to every 3D point in the scene. Experiments on the RGB-D Scenes Dataset v.2 demonstrate that the proposed approach can be used to label indoor scenes containing both small tabletop objects and large furniture pieces.
Modeling Image Patches with a Generic Dictionary of Mini-Epitomes
Cited by 7 (3 self)
The goal of this paper is to question the necessity of features like SIFT in categorical visual recognition tasks. As an alternative, we develop a generative model for the raw intensity of image patches and show that it can support image classification performance on par with optimized SIFT-based techniques in a bag-of-visual-words setting. Key ingredient of the proposed model is a compact dictionary of mini-epitomes, learned in an unsupervised fashion on a large collection of images. The use of epitomes allows us to explicitly account for photometric and position variability in image appearance. We show that this flexibility considerably increases the capacity of the dictionary to accurately approximate the appearance of image patches and support recognition tasks. For image classification, we develop histogram-based image encoding methods tailored to the epitomic representation, as well as an “epitomic footprint” encoding which is easy to visualize and highlights the generative nature of our model. We discuss in detail computational aspects and develop efficient algorithms to make the model scalable to large tasks. The proposed techniques are evaluated with experiments on the challenging PASCAL VOC 2007 image classification benchmark.
End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression
- arXiv, 2014
Cited by 6 (1 self)
Deformable Parts Models and Convolutional Networks each have achieved notable performance in object detection. Yet these two approaches find their strengths in complementary areas: DPMs are well-versed in object composition, modeling fine-grained spatial relationships between parts; likewise, ConvNets are adept at producing powerful image features, having been discriminatively trained directly on the pixels. In this paper, we propose a new model that combines these two approaches, obtaining the advantages of each. We train this model using a new structured loss function that considers all bounding boxes within an image, rather than isolated object instances. This enables the non-maximal suppression (NMS) operation, previously treated as a separate post-processing stage, to be integrated into the model. This allows for discriminative training of our combined Convnet + DPM + NMS model in end-to-end fashion. We evaluate our system on PASCAL VOC 2007 and 2011 datasets, achieving competitive results on both benchmarks.
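For readers unfamiliar with the NMS step mentioned in this abstract, the following is a minimal sketch of the conventional greedy variant, the kind usually run as a separate post-processing stage. It is not the integrated, structured-loss formulation the paper proposes. The function names, the (x1, y1, x2, y2) box format, and the 0.5 overlap threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the kept boxes, highest score first.

    A box is kept only if it overlaps every already-kept box by at most
    iou_threshold (hypothetical default, not taken from the paper).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections and one distant detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes survive.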
Filtered channel features for pedestrian detection
- CVPR, 2015
Cited by 6 (0 self)
This paper starts from the observation that multiple top performing pedestrian detectors can be modelled by using an intermediate layer filtering low-level features in combination with a boosted decision forest. Based on this observation we propose a unifying framework and experimentally explore different filter families. We report extensive results enabling a systematic analysis. Using filtered channel features we obtain top performance on the challenging Caltech and KITTI datasets, while using only HOG+LUV as low-level features. When adding optical flow features we further improve detection quality and report the best known results on the Caltech dataset, reaching 93% recall at 1 FPPI.
Deepid-net: Deformable deep convolutional neural networks for object detection
- In CVPR, 2015
Cited by 3 (3 self)
In this paper, we propose deformable deep convolutional neural networks for generic object detection. This new deep learning object detection framework has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric constraint and penalty. A new pre-training strategy is proposed to learn feature representations more suitable for the object detection task and with good generalization capability. By changing the net structures, training strategies, adding and removing some key components in the detection pipeline, a set of models with large diversity are obtained, which significantly improves the effectiveness of model averaging. The proposed approach improves the mean averaged precision obtained by RCNN [14], which was the state-of-the-art, from 31% to 50.3% on the ILSVRC2014 detection test set. It also outperforms the winner of ILSVRC2014, GoogLeNet, by 6.1%. Detailed component-wise analysis is also provided through extensive experimental evaluation, which provide a global view for people to understand the deep learning object detection pipeline.
Fast Template Evaluation with Vector Quantization
- In Advances in Neural Information Processing Systems (NIPS), 2013
Cited by 3 (1 self)
Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cascade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the original Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy.
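The general idea of approximating a linear template score through vector quantization can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the codebook, the per-cell template layout, and all function names are hypothetical, and the cascade and rescoring steps mentioned in the abstract are omitted.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def nearest_codeword(cell, codebook):
    """Index of the codeword closest to `cell` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - w) ** 2 for c, w in zip(cell, codebook[k])),
    )


def build_lookup(template, codebook):
    """table[p][k] = <template cell p, codeword k>, computed once per template."""
    return [[dot(weights, codeword) for codeword in codebook] for weights in template]


def vq_score(window_codes, table):
    """Approximate template score: one table lookup per cell, no dot products."""
    return sum(table[p][k] for p, k in enumerate(window_codes))


# Toy setup (all values are made up): 2 codewords, a template of 2 cells.
codebook = [[1.0, 0.0], [0.0, 1.0]]
template = [[2.0, 1.0], [0.5, 3.0]]
window = [[0.9, 0.1], [0.2, 0.8]]

codes = [nearest_codeword(cell, codebook) for cell in window]
table = build_lookup(template, codebook)
approx = vq_score(codes, table)
exact = sum(dot(w, c) for w, c in zip(template, window))
```

In this toy case the approximate score is 5.0 against an exact score of about 4.4. A larger codebook would narrow that gap at the cost of a bigger lookup table, which is the speed and accuracy trade-off the abstract describes in terms of the number of quantization levels.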