

Histograms of Sparse Codes for Object Detection. In CVPR, 2013

by X. Ren and D. Ramanan

Results 1 - 10 of 28 citing documents:

Rich feature hierarchies for accurate object detection and semantic segmentation

by Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
"... Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex en-semble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scala ..."
Abstract - Cited by 251 (23 self) - Add to MetaCart
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex en-semble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that im-proves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012—achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural net-works (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. Source code for the complete system is available at
Citation Context

...[excerpt of Table 2, detection average precision (%) on VOC 2007 test: DPM ST [26], mAP 29.1; DPM HSC [28], mAP 34.3; rows 1-3 show R-CNN performance]...
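The two-stage recipe in the R-CNN abstract above (class-agnostic region proposals, a fixed-size warp of each crop, a shared feature extractor, then per-class linear scoring) can be sketched in a few lines. This is a toy stand-in, not the authors' implementation: the histogram "features" and the nearest-neighbour warp below are assumptions replacing the real system's CNN activations and selective-search proposals.

```python
import numpy as np

def extract_features(crop):
    # Stand-in for a CNN feature extractor (hypothetical; R-CNN uses a
    # pretrained convnet's activations). Here: a flat intensity histogram.
    hist, _ = np.histogram(crop, bins=16, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def warp(image, box, size=8):
    # Crop a proposal box (x0, y0, x1, y1) and resize it to a fixed grid by
    # nearest-neighbour sampling, mimicking R-CNN's fixed-size warping.
    x0, y0, x1, y1 = box
    ys = np.linspace(y0, y1 - 1, size).round().astype(int)
    xs = np.linspace(x0, x1 - 1, size).round().astype(int)
    return image[np.ix_(ys, xs)]

def score_proposals(image, boxes, class_weights):
    # Score every proposal with per-class linear classifiers (SVM-like).
    feats = np.stack([extract_features(warp(image, b)) for b in boxes])
    return feats @ class_weights.T   # shape: (num_boxes, num_classes)
```

The key design point the abstract emphasizes survives even in this toy: the expensive feature computation is shared in form across all proposals, while the per-class scoring stays a cheap linear operation.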

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

by Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell
"... We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks an ..."
Abstract - Cited by 203 (22 self) - Add to MetaCart
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms. 1.
Citation Context

...ready been engineered (Le et al., 2011). Recent results have shown that moderately deep unsupervised models outperform the state-of-the-art gradient histogram features in part-based detection models (Ren & Ramanan, 2013). Deep models have recently been applied to large-scale visual recognition tasks, trained via back-propagation through layers of convolutional filters (LeCun et al., 1989). These models perform extre...
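The idea DeCAF operationalizes (freeze a network trained elsewhere and train only a cheap linear model on its activations) can be illustrated with a minimal numpy sketch. The frozen layer here is random rather than pretrained, and the closed-form ridge-regression head is an assumption standing in for the linear classifiers used in the paper; both names below are hypothetical.

```python
import numpy as np

def midlayer_features(x, W1):
    # "DeCAF-style" fixed feature: a frozen layer standing in for a
    # pretrained convnet's intermediate ReLU activations. Never trained.
    return np.maximum(x @ W1, 0.0)

def fit_linear_head(feats, labels, n_classes, reg=1e-2):
    # The only trained component: closed-form ridge regression
    # to one-hot targets on top of the frozen features.
    Y = np.eye(n_classes)[labels]
    A = feats.T @ feats + reg * np.eye(feats.shape[1])
    return np.linalg.solve(A, feats.T @ Y)   # (feat_dim, n_classes)

def predict(x, W1, W2):
    # Frozen features followed by the learned linear head.
    return midlayer_features(x, W1) @ W2
```

The point of the design is that only `fit_linear_head` touches labeled data from the new task, so very few examples are needed.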

Learning and transferring mid-level image representations using convolutional neural networks

by Maxime Oquab, Leon Bottou, Ivan Laptev, Josef Sivic - In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014
"... Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The suc-cess of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level f ..."
Abstract - Cited by 71 (3 self) - Add to MetaCart
Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large-scale visual recognition challenge (ILSVRC2012). The suc-cess of CNNs is attributed to their ability to learn rich mid-level image representations as opposed to hand-designed low-level features used in other image classification meth-ods. Learning CNNs, however, amounts to estimating mil-lions of parameters and requires a very large number of annotated image samples. This property currently prevents application of CNNs to problems with limited training data. In this work we show how image representations learned with CNNs on large-scale annotated datasets can be effi-ciently transferred to other visual recognition tasks with limited amount of training data. We design a method to reuse layers trained on the ImageNet dataset to compute mid-level image representation for images in the PASCAL VOC dataset. We show that despite differences in image statistics and tasks in the two datasets, the transferred rep-resentation leads to significantly improved results for object and action classification, outperforming the current state of the art on Pascal VOC 2007 and 2012 datasets. We also show promising results for object and action localization. 1.
Citation Context

... classification. Most of the recent image classification methods follow the bag-of-features pipeline [7]. Densely-sampled SIFT descriptors [33] are typically quantized using unsupervised clustering (k-means, GMM). Histogram encoding [7, 46], spatial pooling [27] and more recent Fisher Vector encoding [37] are common methods for feature aggregation. While such representations have been shown to work well in practice, it is unclear why they should be optimal for the task. This question raised considerable interest in the subject of mid-level features [5, 23, 45], and feature learning in general [29, 39, 48]. The goal of this work is to show that convolutional network layers provide generic mid-level image representations that can be transferred to new tasks. Deep Learning. The recent revival of interest in multilayer neural networks was triggered by a growing number of works on learning intermediate representations, either using unsupervised methods, as in [19, 28], or using more traditional supervised techniques, as in [12, 25]. 3. Transferring CNN weights The CNN architecture of [25] contains more than 60 million parameters. Directly learning so many parameters from only a few thousand trainin...
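The layer-transfer scheme this entry describes (keep the layers trained on the large dataset frozen, re-initialise only the task-specific head, and train that head on the small target dataset) can be sketched as plain softmax regression on frozen features. All function names, the ReLU-MLP stand-in for the transferred convolutional layers, and the learning-rate settings below are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def forward(x, layers):
    # Pass data through frozen "pretrained" layers (ReLU MLP stand-ins
    # for the convolutional layers transferred from ImageNet).
    for W in layers:
        x = np.maximum(x @ W, 0.0)
    return x

def finetune_head(X, y, layers, n_classes, lr=0.01, steps=500, seed=0):
    # Re-initialise only the task-specific head, then train it with
    # plain softmax-regression gradient descent on the frozen features.
    rng = np.random.default_rng(seed)
    F = forward(X, layers)                       # frozen features
    head = rng.normal(scale=0.01, size=(F.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(steps):
        Z = F @ head
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)        # softmax probabilities
        head -= lr * F.T @ (P - Y) / len(X)      # cross-entropy gradient
    return head
```

Because the transferred layers never receive gradients, the number of parameters actually estimated scales with the head alone, which is why a few thousand target-domain images can suffice.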

Unsupervised feature learning for 3d scene labeling

by Kevin Lai, Liefeng Bo, Dieter Fox - In ICRA, 2014
"... Abstract — This paper presents an approach for labeling objects in 3D scenes. We introduce HMP3D, a hierarchical sparse coding technique for learning features from 3D point cloud data. HMP3D classifiers are trained using a synthetic dataset of virtual scenes generated using CAD models from an online ..."
Abstract - Cited by 11 (0 self) - Add to MetaCart
Abstract — This paper presents an approach for labeling objects in 3D scenes. We introduce HMP3D, a hierarchical sparse coding technique for learning features from 3D point cloud data. HMP3D classifiers are trained using a synthetic dataset of virtual scenes generated using CAD models from an online database. Our scene labeling system combines features learned from raw RGB-D images and 3D point clouds directly, without any hand-designed features, to assign an object label to every 3D point in the scene. Experiments on the RGB-D Scenes Dataset v.2 demonstrate that the proposed approach can be used to label indoor scenes containing both small tabletop objects and large furniture pieces. I.
Citation Context

...the Intel Science and Technology Center on Pervasive Computing, Seattle, WA 98195, USA. liefeng.bo@gmail.com networks have shown promising results on image classification [14], [15], object detection [16], and scene understanding [17]. Unlike hand-designed features which are implicitly limited by the design choices of their creators, these algorithms are fully data-driven and can learn sophisticated r...
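HMP3D builds on hierarchical sparse coding, and the cited work is itself about histograms of sparse codes. The shared building block (encode a signal as a sparse combination of dictionary atoms, then pool the codes) can be sketched with a small orthogonal matching pursuit. This is a generic OMP under assumed unit-norm atoms, not either paper's implementation.

```python
import numpy as np

def omp(x, D, k):
    # Orthogonal matching pursuit: greedily select k atoms (columns of D,
    # assumed unit norm), refitting coefficients by least squares each round.
    residual, idx, coef = x.astype(float).copy(), [], np.zeros(0)
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        residual = x - D[:, idx] @ coef
    code = np.zeros(D.shape[1])
    code[idx] = coef           # sparse code: zeros except selected atoms
    return code

def max_pool_codes(codes):
    # Pool absolute sparse codes over a set of patches (HMP-style max pool).
    return np.abs(np.stack(codes)).max(axis=0)
```

Stacking pooled codes from progressively larger regions is what makes the scheme "hierarchical"; the sketch shows only one level.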

Deformable part models are convolutional neural networks

by Ross Girshick, Forrest Iandola, Trevor Darrell, Jitendra Malik - CoRR
"... ..."
Abstract - Cited by 10 (2 self) - Add to MetaCart
Abstract not found
Citation Context

...l regions of the sub-type. At test time, a DPM is run as a sliding-window detector over a feature pyramid, which is traditionally built using HOG features (alternatives have recently been explored in [29, 32]). A DPM score is assigned to each sliding-window location by optimizing a score function that trades off part deformation costs with image match scores. A global maximum of the score function is comp...

Modeling Image Patches with a Generic Dictionary of Mini-Epitomes

by George Papandreou, Liang-chieh Chen, Alan L. Yuille
"... The goal of this paper is to question the necessity of fea-tures like SIFT in categorical visual recognition tasks. As an alternative, we develop a generative model for the raw intensity of image patches and show that it can support im-age classification performance on par with optimized SIFT-based ..."
Abstract - Cited by 7 (3 self) - Add to MetaCart
The goal of this paper is to question the necessity of fea-tures like SIFT in categorical visual recognition tasks. As an alternative, we develop a generative model for the raw intensity of image patches and show that it can support im-age classification performance on par with optimized SIFT-based techniques in a bag-of-visual-words setting. Key in-gredient of the proposed model is a compact dictionary of mini-epitomes, learned in an unsupervised fashion on a large collection of images. The use of epitomes allows us to explicitly account for photometric and position vari-ability in image appearance. We show that this flexibility considerably increases the capacity of the dictionary to ac-curately approximate the appearance of image patches and support recognition tasks. For image classification, we de-velop histogram-based image encoding methods tailored to the epitomic representation, as well as an “epitomic foot-print ” encoding which is easy to visualize and highlights the generative nature of our model. We discuss in detail computational aspects and develop efficient algorithms to make the model scalable to large tasks. The proposed tech-niques are evaluated with experiments on the challenging PASCAL VOC 2007 image classification benchmark. 1.
Citation Context

...-based models such as those described in [6], despite the fact that they learn a multi-layered feature representation. The power of learned patch-level features has also been demonstrated recently in [5, 9, 24]. Using mini-epitomes instead of image patches could also prove beneficial in their setting. Sparsity provides a compelling framework for learning image patch dictionaries [22]. Sparsity coupled with ...

End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression. arXiv

by Li Wan, David Eigen, Rob Fergus, 2014
"... Deformable Parts Models and Convolutional Networks each have achieved notable performance in object detec-tion. Yet these two approaches find their strengths in com-plementary areas: DPMs are well-versed in object compo-sition, modeling fine-grained spatial relationships between parts; likewise, Con ..."
Abstract - Cited by 6 (1 self) - Add to MetaCart
Deformable Parts Models and Convolutional Networks each have achieved notable performance in object detec-tion. Yet these two approaches find their strengths in com-plementary areas: DPMs are well-versed in object compo-sition, modeling fine-grained spatial relationships between parts; likewise, ConvNets are adept at producing power-ful image features, having been discriminatively trained di-rectly on the pixels. In this paper, we propose a new model that combines these two approaches, obtaining the advan-tages of each. We train this model using a new structured loss function that considers all bounding boxes within an image, rather than isolated object instances. This enables the non-maximal suppression (NMS) operation, previously treated as a separate post-processing stage, to be integrated into the model. This allows for discriminative training of our combined Convnet + DPM + NMS model in end-to-end fashion. We evaluate our system on PASCAL VOC 2007 and 2011 datasets, achieving competitive results on both bench-marks. 1.
Citation Context

...[excerpt of per-class AP table, VOC 2007: HoG-dpm(v5) [5], mAP 33.7; HSC-dpm [12], mAP 34.3; Regionlets [17], ...]...
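The NMS step that this paper folds into end-to-end training is, in its usual stand-alone form, a short greedy procedure: keep the best-scoring box, discard boxes that overlap it too much, and repeat. A conventional sketch of that post-processing variant follows; the (x0, y0, x1, y1) box format is an assumption.

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two (x0, y0, x1, y1) boxes.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedy non-maximum suppression: repeatedly keep the highest-scoring
    # remaining box and drop boxes overlapping it by IoU >= thresh.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep
```

The paper's observation is that because this step is applied after scoring, detectors trained in isolation never see its effect; integrating it into the loss lets training account for suppression between nearby boxes.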

Filtered channel features for pedestrian detection

by Shanshan Zhang, Rodrigo Benenson, Bernt Schiele - CVPR, 2015
"... This paper starts from the observation that multiple top performing pedestrian detectors can be modelled by using an intermediate layer filtering low-level features in combin-ation with a boosted decision forest. Based on this observa-tion we propose a unifying framework and experimentally explore d ..."
Abstract - Cited by 6 (0 self) - Add to MetaCart
This paper starts from the observation that multiple top performing pedestrian detectors can be modelled by using an intermediate layer filtering low-level features in combin-ation with a boosted decision forest. Based on this observa-tion we propose a unifying framework and experimentally explore different filter families. We report extensive results enabling a systematic analysis. Using filtered channel features we obtain top perform-ance on the challenging Caltech and KITTI datasets, while using only HOG+LUV as low-level features. When adding optical flow features we further improve detection quality and report the best known results on the Caltech dataset, reaching 93 % recall at 1 FPPI. 1.
Citation Context

...ance amongst features (commonly colour, gradient, and oriented gradient) [36, 28]. Other features include bag-of-words over colour, HOG, or LBP features [4]; learning sparse dictionary encoders [32]; and training features via a convolutional neural network [34]. Additional features specific for stereo depth or optical flow have been proposed; however, we consider these beyond the focus of this pa...

Deepid-net: Deformable deep convolutional neural networks for object detection

by Wanli Ouyang, Xiaogang Wang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Chen-Change Loy, Xiaoou Tang - In CVPR, 2015
"... In this paper, we propose deformable deep convolutional neural networks for generic object detection. This new deep learning object detection framework has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the def ..."
Abstract - Cited by 3 (3 self) - Add to MetaCart
In this paper, we propose deformable deep convolutional neural networks for generic object detection. This new deep learning object detection framework has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric con-straint and penalty. A new pre-training strategy is proposed to learn feature representations more suitable for the object detection task and with good generalization capability. By changing the net structures, training strategies, adding and removing some key components in the detection pipeline, a set of models with large diversity are obtained, which significantly improves the effectiveness of model averag-ing. The proposed approach improves the mean averaged precision obtained by RCNN [14], which was the state-of-the-art, from 31 % to 50.3 % on the ILSVRC2014 detection test set. It also outperforms the winner of ILSVRC2014, GoogLeNet, by 6.1%. Detailed component-wise analysis is also provided through extensive experimental evaluation, which provide a global view for people to understand the deep learning object detection pipeline. 1.
Citation Context

...aset, we follow the approach in [14] for splitting the training and testing data. Table 2 shows the experimental results on VOC-2007 testing data, which include approaches using hand-crafted features [15, 33, 47, 46, 11], deep CNN features [14, 19], and CNN features with deformation learning [16]. Since all the state-of-the-art works reported single-model results on this dataset, we also report the single-model resul...
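The def-pooling idea in the abstract above (pool the best part score after charging a geometric penalty for drifting from an anchor position) mirrors the part term of a classic DPM and can be sketched directly. The quadratic penalty form and fixed weight below are assumptions standing in for the paper's learned deformation costs.

```python
import numpy as np

def def_pool(score_map, anchor, w):
    # Deformation-constrained pooling (sketch): every location's part score
    # competes after a quadratic penalty w * squared-distance from the
    # anchor; the pooled value is the best penalised score.
    h, wd = score_map.shape
    ys, xs = np.mgrid[0:h, 0:wd]
    penalty = w * ((ys - anchor[0]) ** 2 + (xs - anchor[1]) ** 2)
    return (score_map - penalty).max()
```

With w = 0 this degenerates to ordinary max pooling over the map; a large w pins the part to its anchor, so the single parameter trades off flexibility against geometric consistency.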

Fast Template Evaluation with Vector Quantization

by Mohammad Amin Sadeghi, David Forsyth - In Advances in Neural Information Processing Systems (NIPS), 2013
"... Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of appro ..."
Abstract - Cited by 3 (1 self) - Add to MetaCart
Applying linear templates is an integral part of many object detection systems and accounts for a significant portion of computation time. We describe a method that achieves a substantial end-to-end speedup over the best current methods, without loss of accuracy. Our method is a combination of approximating scores by vector quantizing feature windows and a number of speedup techniques including cas-cade. Our procedure allows speed and accuracy to be traded off in two ways: by choosing the number of Vector Quantization levels, and by choosing to rescore windows or not. Our method can be directly plugged into any recognition system that relies on linear templates. We demonstrate our method to speed up the orig-inal Exemplar SVM detector [1] by an order of magnitude and Deformable Part models [2] by two orders of magnitude with no loss of accuracy. 1
Citation Context

...ores. We parallelized [7] and the HOG feature extraction function for fair comparison. We evaluate all running times on a XEON E5-1650 Processor (6 Cores, 12MB Cache, 3.20 GHz). [Excerpt of results table (Method, mAP, time): HSC [20], 0.343, 180s*; WTA [10], 0.240, 26s*; DPM V5 [22], 0.330, 13.3s; DPM V4 [21], 0.301, 13.2s; DPM V3 [2], 0.268, 11.6s; Rigid templates [23], 0.31, 10s*; Vedaldi [12], 0.277, 7s*; DPM V4 -parts, 0.214, 2.8s; ...]
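The core trick in the abstract above, replacing per-window dot products with table lookups after vector-quantizing feature cells, can be sketched as follows. The cell/codeword dimensions and the brute-force nearest-neighbour assignment are generic assumptions for illustration, not the authors' exact cascade.

```python
import numpy as np

def build_lut(template, codebook):
    # Precompute dot products between every template cell and every
    # codeword, so scoring becomes table lookups instead of dot products.
    # template: (cells, dim), codebook: (k, dim) -> lut: (cells, k)
    return template @ codebook.T

def quantize(cells, codebook):
    # Assign each feature cell to its nearest codeword (the VQ step,
    # done once per image position regardless of how many templates run).
    d2 = ((cells[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def vq_score(cell_ids, lut):
    # Approximate template score: sum the looked-up per-cell products.
    return lut[np.arange(len(cell_ids)), cell_ids].sum()
```

The speedup comes from amortization: quantization is paid once per window, after which each additional template costs only one lookup per cell, and ambiguous windows can optionally be rescored exactly.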


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University