Results 1 - 10 of 20
Regionlets for generic object detection, 2013
"... Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands for descriptive and flexible object representations that are also efficient to eval-uate for many locations. In view of this, we propose to mod ..."
Abstract - Cited by 45 (6 self)
Generic object detection must cope with different degrees of variation across distinct object classes under tractable computation, which demands object representations that are descriptive, flexible, and efficient to evaluate at many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. We then evaluate the object bounding box proposals generated by selective search from segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without using any context. It achieves a detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.
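The core aggregation step, max-pooling a feature over a small group of regionlets defined relative to the detection window, is easy to sketch. The following is a minimal illustration assuming a scalar feature extractor as a stand-in for the boosted features in the paper; all names are illustrative, not the authors' code.

```python
# Minimal sketch of regionlet group aggregation: each regionlet is
# defined relative to the detection window, and features from one group
# are max-pooled so a discriminative pattern may appear in any regionlet
# of the group (tolerating deformation).
import numpy as np

def regionlet_feature(image, window, group, extract):
    """Max-pool a scalar feature over one group of regionlets.

    window : (x, y, w, h) detection window in image coordinates
    group  : list of (rx, ry, rw, rh) regionlets, relative in [0, 1]
    extract: function mapping an image patch to a scalar feature
    """
    x, y, w, h = window
    responses = []
    for rx, ry, rw, rh in group:
        # Scale the relative regionlet into absolute coordinates, so the
        # same group works at any window size and aspect ratio.
        x0, y0 = int(x + rx * w), int(y + ry * h)
        x1, y1 = int(x0 + rw * w), int(y0 + rh * h)
        responses.append(extract(image[y0:y1, x0:x1]))
    # Max over the group tolerates small local deformations.
    return max(responses)

# Usage: average intensity as a toy stand-in for HOG/LBP-style features.
img = np.random.rand(240, 320)
group = [(0.1, 0.1, 0.2, 0.2), (0.15, 0.12, 0.2, 0.2)]
print(regionlet_feature(img, (40, 30, 100, 160), group, np.mean))
```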
Multi-fold MIL Training for Weakly Supervised Object Localization, 2014
"... Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this case, the supervised information is restricted to ..."
Abstract - Cited by 13 (5 self)
Object category localization is a challenging problem in computer vision. Standard supervised training requires bounding box annotations of object instances. This time-consuming annotation process is sidestepped in weakly supervised learning. In this case, the supervised information is restricted to binary labels that indicate the absence/presence of object instances in the image, without their locations. We follow a multiple-instance learning approach that iteratively trains the detector and infers the object locations in the positive training images. Our main contribution is a multi-fold multiple-instance learning procedure, which prevents training from prematurely locking onto erroneous object locations. This procedure is particularly important when high-dimensional representations, such as Fisher vectors, are used. We present a detailed experimental evaluation using the PASCAL VOC 2007 dataset. Compared to state-of-the-art weakly supervised detectors, our approach better localizes objects in the training images, which translates into improved detection performance.
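The multi-fold procedure the abstract describes can be sketched as follows: positive images are split into folds, and each fold is re-localized by a detector trained on the other folds, so no image is scored by a model trained on it. The detector and scoring functions below are toy stand-ins, not the Fisher-vector pipeline of the paper.

```python
# Schematic multi-fold MIL loop: re-localize each held-out fold with a
# detector trained on the remaining folds, then retrain on everything.
import numpy as np

rng = np.random.default_rng(0)

def train(pos_feats, neg_feats):
    # Stand-in "detector": difference of class means, scored by dot product.
    return np.mean(pos_feats, axis=0) - np.mean(neg_feats, axis=0)

def best_window(w, cand_feats):
    # Index of the highest-scoring candidate window.
    return int(np.argmax(cand_feats @ w))

def multifold_mil(pos_windows, neg_feats, K=3, iters=5):
    """pos_windows: list of (n_i, d) arrays of candidate-window features."""
    n = len(pos_windows)
    folds = np.array_split(np.arange(n), K)
    sel = [0] * n                          # start from the first candidate
    for _ in range(iters):
        for fold in folds:
            held = set(fold.tolist())
            w = train([pos_windows[i][sel[i]] for i in range(n)
                       if i not in held], neg_feats)
            for i in held:                 # re-localize the held-out fold only
                sel[i] = best_window(w, pos_windows[i])
    return train([pos_windows[i][sel[i]] for i in range(n)], neg_feats)

pos = [rng.normal(size=(20, 8)) for _ in range(9)]   # 9 positive images
neg = rng.normal(size=(50, 8))                       # negative features
print(multifold_mil(pos, neg).shape)
```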
Spatio-Temporal Object Detection Proposals
"... Abstract. Spatio-temporal detection of actions and events in video is a challeng-ing problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the fra ..."
Abstract - Cited by 12 (1 self)
Spatio-temporal detection of actions and events in video is a challenging problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the frames. Recently, methods that generate unsupervised detection proposals have proven to be very effective for object detection in still images. These methods open the possibility to use strong but computationally expensive features, since only a relatively small number of detection hypotheses need to be assessed. In this paper we make two contributions towards exploiting detection proposals for spatio-temporal detection problems. First, we extend a recent 2D object proposal method to produce spatio-temporal proposals by a randomized supervoxel merging process. We introduce spatial, temporal, and spatio-temporal pairwise supervoxel features that are used to guide the merging process. Second, we propose a new efficient supervoxel method. We experimentally evaluate our detection proposals in combination with our new supervoxel method as well as existing ones. This evaluation shows that our supervoxels lead to more accurate proposals than existing state-of-the-art supervoxel methods.
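A toy sketch of a randomized merging process of the kind described: adjacent supervoxels are greedily merged under randomly weighted pairwise similarities, so repeated runs yield diverse proposals. The similarity values are placeholders for the paper's spatial/temporal/spatio-temporal features, and a full implementation would recompute similarities after each merge.

```python
# Randomized greedy merging over a supervoxel adjacency graph; each
# merged group is emitted as one (tube) proposal.
import random

def randomized_merge(sims, n_nodes, seed=0):
    """sims: dict {(i, j): [s_spatial, s_temporal, s_st]} on adjacent pairs.
    Returns the sequence of merged groups, each a proposal."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(3)]           # random feature weights
    group = {i: frozenset([i]) for i in range(n_nodes)}
    edges = {e: sum(wi * si for wi, si in zip(w, s)) for e, s in sims.items()}
    proposals = []
    while edges:
        (i, j), _ = max(edges.items(), key=lambda kv: kv[1])
        merged = group[i] | group[j]
        proposals.append(merged)
        for k in merged:
            group[k] = merged
        # Drop edges now internal to the merged group (a full version
        # would also recompute similarities for the merged region).
        edges = {(a, b): v for (a, b), v in edges.items()
                 if not (group[a] is merged and group[b] is merged)}
    return proposals

sims = {(0, 1): [0.9, 0.2, 0.5], (1, 2): [0.1, 0.8, 0.3], (0, 2): [0.4, 0.4, 0.4]}
print(randomized_merge(sims, 3))
```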
Fisher and VLAD with FLAIR
"... A major computational bottleneck in many current al-gorithms is the evaluation of arbitrary boxes. Dense lo-cal analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the represen ..."
Abstract - Cited by 7 (0 self)
A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-words encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification of the representation is tempting, we instead exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method applies to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLAD difference coding, even with ℓ2 and power norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The result is an 18x speedup, which enables us to set a new state-of-the-art on the challenging PASCAL VOC 2010 objects and the fine-grained categorization of the CUB-2011 200 bird species. Moreover, we rank number one in the official ImageNet 2013 detection challenge.
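The decomposition FLAIR exploits can be sketched for the plain bag-of-words case: with one integral image per codeword, the histogram of any box follows from four lookups per codeword, independent of box area. This sketch uses dense integral images; FLAIR itself stores them sparsely and extends the idea to VLAD and Fisher encodings.

```python
# One integral image per codeword -> O(K) box encoding via 4 lookups each.
import numpy as np

def build_integral_images(assign, K):
    """assign: (H, W) hard codeword assignment of dense local descriptors."""
    ints = np.zeros((K, assign.shape[0] + 1, assign.shape[1] + 1))
    for k in range(K):
        ints[k, 1:, 1:] = np.cumsum(np.cumsum(assign == k, axis=0), axis=1)
    return ints

def box_histogram(ints, x0, y0, x1, y1):
    # Four lookups per codeword, independent of the box's area.
    return (ints[:, y1, x1] - ints[:, y0, x1]
            - ints[:, y1, x0] + ints[:, y0, x0])

assign = np.random.randint(0, 16, size=(60, 80))   # toy assignment map
ints = build_integral_images(assign, K=16)
print(box_histogram(ints, 10, 5, 40, 30))          # histogram for one box
```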
Efficient Action Localization with Approximately Normalized Fisher Vectors
"... The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-word represen-tation. Transformation of the FV by power and ℓ2 normal-izations has shown to significantly improve its performance, and led to state-of-the-art results for a range of image and video classifica ..."
Abstract - Cited by 7 (1 self)
The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-words representation. Transformation of the FV by power and ℓ2 normalizations has been shown to significantly improve its performance, and has led to state-of-the-art results for a range of image and video classification and retrieval tasks. These normalizations, however, render the representation non-additive over local descriptors. Combined with its high dimensionality, this makes the FV computationally expensive for localization tasks. In this paper we present approximations to both of these normalizations, which yield significant improvements in the memory and computational costs of the FV when used for localization. Second, we show how these approximations can be used to define upper bounds on the score function that can be efficiently evaluated, which enables the use of branch-and-bound search as an alternative to exhaustive sliding window search. We present experimental evaluation results on classification and temporal localization of actions in videos. These show that our approximations lead to a speedup of at least one order of magnitude, while maintaining state-of-the-art action recognition and localization performance.
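The additivity problem the paper addresses is easy to illustrate: the unnormalized FV of any temporal window is a difference of two cumulative sums, but the power and ℓ2 normalizations must be applied per window. The sketch below shows this problem setup only, not the authors' approximations or bounds.

```python
# Unnormalized FVs are additive over frames (cumulative sums), while
# power/l2 normalization is a per-window, non-additive operation.
import numpy as np

def normalize(v):
    v = np.sign(v) * np.sqrt(np.abs(v))            # power normalization
    return v / max(np.linalg.norm(v), 1e-12)       # l2 normalization

T, D = 100, 32
frame_fvs = np.random.randn(T, D)                  # per-frame FV terms (toy)
csum = np.vstack([np.zeros(D), np.cumsum(frame_fvs, axis=0)])
w = np.random.randn(D)                             # linear classifier (toy)

def window_score(t0, t1):
    fv = csum[t1] - csum[t0]                       # additive: O(D) per window
    return w @ normalize(fv)                       # normalization breaks additivity

print(window_score(10, 50))
```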
A discriminative CNN video representation for event detection - In CVPR, 2015
"... In this paper, we propose a discriminative video rep-resentation for event detection over a large scale video dataset when only limited hardware resources are avail-able. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where on ..."
Abstract - Cited by 3 (1 self)
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame-level static descriptors can be extracted by the existing CNN toolkits. This paper makes two contributions to the inference of a CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame-level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the mean average precision (mAP) from 27.6% to 36.8% on the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% on the TRECVID MEDTest 13 dataset.
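The first contribution can be illustrated with VLAD, one common choice of "appropriate encoding method": frame-level descriptors are encoded as residuals against a small codebook instead of being averaged. The CNN features and codebook below are random stand-ins.

```python
# Average pooling vs. a VLAD-style encoding of frame-level descriptors.
import numpy as np

def vlad(frames, centers):
    """frames: (T, D) frame descriptors, centers: (K, D) codebook."""
    K, D = centers.shape
    # Assign each frame to its nearest codeword.
    d2 = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(1)
    enc = np.zeros((K, D))
    for k in range(K):
        if np.any(nearest == k):
            # Accumulate residuals to the assigned center.
            enc[k] = (frames[nearest == k] - centers[k]).sum(0)
    enc = np.sign(enc) * np.sqrt(np.abs(enc))      # power normalization
    return enc.ravel() / max(np.linalg.norm(enc), 1e-12)

frames = np.random.randn(300, 64)                  # stand-in CNN frame features
centers = np.random.randn(8, 64)                   # codebook, learned offline
avg_pool = frames.mean(0)                          # baseline aggregation
print(avg_pool.shape, vlad(frames, centers).shape)  # (64,) (512,)
```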
Accurate Object Detection with Location Relaxation and Regionlets Re-localization
"... Abstract. Standard sliding window based object detection requires dense clas-sifier evaluation on densely sampled locations in scale space in order to achieve an accurate localization. To avoid such dense evaluation, selective search based algorithms only evaluate the classifier on a small subset of ..."
Abstract - Cited by 3 (0 self)
Standard sliding window based object detection requires dense classifier evaluation on densely sampled locations in scale space in order to achieve accurate localization. To avoid such dense evaluation, selective search based algorithms only evaluate the classifier on a small subset of object proposals. Notwithstanding their demonstrated success, object proposals do not guarantee perfect overlap with the object, leading to suboptimal detection accuracy. To address this issue, we propose to first relax the dense sampling of the scale space with coarse object proposals generated from bottom-up segmentations. Based on detection results on these proposals, we then conduct a top-down search to more precisely localize the object using supervised descent. This two-stage detection strategy, dubbed location relaxation, is able to localize the object in a continuous parameter space. Furthermore, there is a conflict between accurate object detection and robust object detection, because achieving the latter requires accommodating inaccurate and perturbed object locations in the training phase. To address this conflict, we leverage the rich spatial information learned from the Regionlets detection framework to determine where the object is precisely localized. Our proposed approaches are extensively validated on the PASCAL VOC 2007 dataset and a self-collected large scale car dataset. Our method boosts the mean average precision of the current state-of-the-art from 41.7% to 44.1% on the PASCAL VOC 2007 dataset. To the best of our knowledge, this is the best performance reported without using outside data.
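The top-down refinement stage can be sketched as a supervised-descent style cascade: learned linear regressors map features extracted at the current box to a box update, x_{k+1} = x_k + R_k φ(x_k) + b_k. The regressors and feature function below are random placeholders, not the trained Regionlets models.

```python
# Cascaded regression refinement of a coarse box proposal.
import numpy as np

rng = np.random.default_rng(0)

def features(image, box):
    # Placeholder for appearance features extracted at `box`.
    return rng.standard_normal(16)

def refine(image, box, cascade):
    # x_{k+1} = x_k + R_k @ phi(x_k) + b_k, one step per cascade stage.
    box = np.asarray(box, dtype=float)
    for R, b in cascade:
        box = box + R @ features(image, box) + b
    return box

cascade = [(0.01 * rng.standard_normal((4, 16)), np.zeros(4)) for _ in range(3)]
print(refine(None, (40.0, 30.0, 100.0, 160.0), cascade))
```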
Bags of Spacetime Energies for Dynamic Scene Recognition
"... This paper presents a unified bag of visual word (BoW) framework for dynamic scene recognition. The approach builds on primitive features that uniformly capture spa-tial and temporal orientation structure of the imagery (e.g., video), as extracted via application of a bank of spatiotem-porally orien ..."
Abstract - Cited by 2 (0 self)
This paper presents a unified bag of visual words (BoW) framework for dynamic scene recognition. The approach builds on primitive features that uniformly capture spatial and temporal orientation structure of the imagery (e.g., video), as extracted via application of a bank of spatiotemporally oriented filters. Various feature encoding techniques are investigated to abstract the primitives to an intermediate representation that is best suited to dynamic scene representation. Further, a novel approach to adaptive pooling of the encoded features is presented that captures spatial layout of the scene even while being robust to situations where camera motion and scene dynamics are confounded. The resulting overall approach has been evaluated on two standard, publicly available dynamic scene datasets. The results show that, in comparison to a representative set of alternatives, the proposed approach outperforms the previous state-of-the-art in classification accuracy by 10%.
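A toy version of the primitive features: a small bank of 3D (x, y, t) derivative filters applied to a video volume, with rectified responses serving as local spacetime energy measurements. The filter bank here is deliberately tiny and illustrative, not the paper's oriented-filter design.

```python
# Per-voxel spacetime "energies" from a tiny 3D oriented filter bank.
import numpy as np
from scipy.ndimage import convolve

def oriented_energies(video, kernels):
    """video: (T, H, W) grayscale volume; returns (len(kernels), T, H, W)."""
    return np.stack([convolve(video, k) ** 2 for k in kernels])

# Simple derivative kernels along t, y, x plus one diagonal x-t motion filter.
d = np.array([-1.0, 0.0, 1.0])
kt = d.reshape(3, 1, 1)
ky = d.reshape(1, 3, 1)
kx = d.reshape(1, 1, 3)
kxt = np.zeros((3, 1, 3)); kxt[0, 0, 0] = -1.0; kxt[2, 0, 2] = 1.0

video = np.random.rand(16, 32, 32)
E = oriented_energies(video, [kt, ky, kx, kxt])
print(E.shape)   # (4, 16, 32, 32) per-voxel oriented energies
```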
Second-order constrained parametric proposals and sequential search-based structured prediction for semantic segmentation in RGB-D images - In Proceedings of Computer Vision and Pattern Recognition, IEEE, 2015
"... We focus on the problem of semantic segmentation based on RGB-D data, with emphasis on analyzing cluttered in-door scenes containing many visual categories and in-stances. Our approach is based on a parametric figure-ground intensity and depth-constrained proposal process that generates spatial layo ..."
Abstract - Cited by 1 (1 self)
We focus on the problem of semantic segmentation based on RGB-D data, with emphasis on analyzing cluttered indoor scenes containing many visual categories and instances. Our approach is based on a parametric figure-ground intensity and depth-constrained proposal process that generates spatial layout hypotheses at multiple locations and scales in the image, followed by a sequential inference algorithm that produces a complete scene estimate. Our contributions can be summarized as follows: (1) a generalization of the parametric max-flow figure-ground proposal methodology to take advantage of intensity and depth information, in order to systematically and efficiently generate the breakpoints of an underlying spatial model in polynomial time; (2) new region description methods based on second-order pooling over multiple features constructed using both intensity and depth channels; (3) a principled search-based structured prediction inference and learning process that resolves conflicts in overlapping spatial partitions and selects regions sequentially towards complete scene estimates; and (4) an extensive evaluation of the impact of depth, as well as of the effectiveness of a large number of descriptors, both pre-designed and automatically obtained using deep learning, on a difficult RGB-D semantic segmentation problem with 92 classes. We report state-of-the-art results on the challenging NYU Depth Dataset V2 [44], extended for the RMRC 2013 and RMRC 2014 Indoor Segmentation Challenges, where the proposed model currently ranks first. Moreover, we show that by combining second-order and deep learning features, over 15% relative accuracy improvements can additionally be achieved. On a scene classification benchmark, our methodology further improves the state of the art by 24%.
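Contribution (2), second-order pooling, admits a compact sketch: local descriptors inside a region are pooled by averaging their outer products, and the resulting covariance-like matrix is mapped by a matrix logarithm before linear classification. The descriptors below are random stand-ins for the intensity and depth features used in the paper.

```python
# Second-order pooling of a region's local descriptors with a
# log-Euclidean mapping, vectorized via the upper triangle.
import numpy as np

def second_order_pool(descriptors, eps=1e-6):
    """descriptors: (N, D) local features of one region -> (D*(D+1)/2,) vector."""
    G = descriptors.T @ descriptors / len(descriptors)  # average outer product
    G += eps * np.eye(G.shape[0])                       # keep it positive definite
    vals, vecs = np.linalg.eigh(G)
    logG = (vecs * np.log(vals)) @ vecs.T               # matrix logarithm of G
    iu = np.triu_indices(G.shape[0])
    return logG[iu]                                     # upper triangle suffices

region_descriptors = np.random.rand(500, 16)            # toy intensity/depth cues
print(second_order_pool(region_descriptors).shape)      # (136,)
```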
Self-learning camera: Autonomous adaptation of object detectors to unlabeled video streams - In ECCV, 2014
"... Learning object detectors requires massive amounts of labeled training samples from the specific data source of interest. This is impractical when dealing with many different sources (e.g., in camera networks), or constantly changing ones such as mobile cameras (e.g., in robotics or driving assistan ..."
Abstract - Cited by 1 (0 self)
Learning object detectors requires massive amounts of labeled training samples from the specific data source of interest. This is impractical when dealing with many different sources (e.g., in camera networks), or constantly changing ones such as mobile cameras (e.g., in robotics or driving assistance systems). In this paper, we address the problem of self-learning detectors in an autonomous manner, i.e. (i) detectors continuously updating themselves to efficiently adapt to streaming data sources (contrary to transductive algorithms), (ii) without any labeled data strongly related to the target data stream (contrary to self-paced learning), and (iii) without manual intervention to set and update hyper-parameters. To that end, we propose an unsupervised, on-line, and self-tuning learning algorithm to optimize a multi-task learning convex objective. Our method uses confident but laconic oracles (high-precision but low-recall off-the-shelf generic detectors), and exploits the structure of the problem to jointly learn on-line an ensemble of instance-level trackers, from which we derive an adapted category-level object detector. Our approach is validated on real-world publicly available video object datasets.
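A schematic sketch of the self-learning loop: a high-precision, low-recall oracle seeds instance-level trackers, and the tracked boxes become pseudo-labels for an online detector update. Every component here is a toy stub, not the paper's multi-task formulation.

```python
# Oracle -> trackers -> pseudo-labels -> online detector update.
import numpy as np

class Tracker:
    def __init__(self, box, ttl=20):
        self.box = np.array(box, dtype=float)
        self.ttl = ttl                        # toy lifespan stand-in
    def update(self, frame):
        self.box += np.random.randn(4) * 0.5  # stand-in motion model
        self.ttl -= 1
        return self.box.copy()

class OnlineDetector:
    def __init__(self, dim=4):
        self.w = np.zeros(dim)
    def partial_fit(self, boxes, lr=0.01):
        for b in boxes:                       # toy online update
            self.w += lr * (b - self.w)

def oracle(frame):
    # High-precision but low-recall: fires rarely.
    return [(10, 10, 30, 30)] if np.random.rand() < 0.1 else []

def self_learning_stream(stream, detector):
    trackers = []
    for frame in stream:
        trackers += [Tracker(b) for b in oracle(frame)]
        pseudo_labels = [t.update(frame) for t in trackers]
        trackers = [t for t in trackers if t.ttl > 0]
        detector.partial_fit(pseudo_labels)   # no manual labels anywhere
    return detector

print(self_learning_stream([None] * 100, OnlineDetector()).w)
```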