Results 11 - 20 of 91
What are you talking about? Text-to-image coreference. In: CVPR, 2014
Cited by 13 (3 self)
In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun is referring to in the image. This allows us to utilize visual information in order to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural language descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system [15].
Spatio-Temporal Object Detection Proposals
Cited by 12 (1 self)
Spatio-temporal detection of actions and events in video is a challenging problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the frames. Recently, methods that generate unsupervised detection proposals have proven to be very effective for object detection in still images. These methods open the possibility to use strong but computationally expensive features, since only a relatively small number of detection hypotheses need to be assessed. In this paper we make two contributions towards exploiting detection proposals for spatio-temporal detection problems. First, we extend a recent 2D object proposal method to produce spatio-temporal proposals by a randomized supervoxel merging process. We introduce spatial, temporal, and spatio-temporal pairwise supervoxel features that are used to guide the merging process. Second, we propose a new efficient supervoxel method. We experimentally evaluate our detection proposals in combination with our new supervoxel method as well as existing ones. This evaluation shows that our supervoxels lead to more accurate proposals when compared to using existing state-of-the-art supervoxel methods.
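The spatio-temporal tubes mentioned in the abstract are sequences of per-frame bounding boxes, and proposal quality is naturally measured by tube-level overlap. A minimal sketch of such a measure (mean per-frame IoU, with tubes stored as frame-to-box dicts) is shown below; the representation and function names are illustrative, not taken from the paper:

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def tube_overlap(tube_a, tube_b):
    """Mean per-frame IoU over the union of frames spanned by two tubes.

    A tube is a dict mapping frame index -> bounding box; frames where
    only one tube exists contribute zero overlap.
    """
    frames = set(tube_a) | set(tube_b)
    if not frames:
        return 0.0
    total = sum(box_iou(tube_a[f], tube_b[f])
                for f in frames if f in tube_a and f in tube_b)
    return total / len(frames)
```

Averaging over the union of frames penalizes proposals that are too short or too long in time, not just spatially misplaced.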
Generating object segmentation proposals using global and local search. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014
Cited by 10 (0 self)
We present a method for generating object segmentation proposals from groups of superpixels. The goal is to propose accurate segmentations for all objects of an image. The proposed object hypotheses can be used as input to object detection systems and thereby improve efficiency by replacing exhaustive search. The segmentations are generated in a class-independent manner and therefore the computational cost of the approach is independent of the number of object classes. Our approach combines both global and local search in the space of sets of superpixels. The local search is implemented by greedily merging adjacent pairs of superpixels to build a bottom-up segmentation hierarchy. The regions from such a hierarchy directly provide a part of our region proposals. The global search provides the other part by performing a set of graph cut segmentations on a superpixel graph obtained from an intermediate level of the hierarchy. The parameters of the graph cut problems are learnt in such a manner that they provide complementary sets of regions. Experiments with Pascal VOC images show that we reach state-of-the-art with greatly reduced computational cost.
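The local search described above is a greedy bottom-up merging of adjacent superpixels, where every region formed along the way becomes a proposal. A minimal sketch under that reading is below; the adjacency-graph representation and similarity interface are illustrative assumptions, and the paper's learned graph-cut component is omitted:

```python
def greedy_merge_proposals(adjacency, similarity):
    """Greedily merge the most similar pair of adjacent regions until no
    edges remain, keeping every region ever formed as a proposal.

    adjacency:  dict region_id -> set of neighbouring region ids
    similarity: function (members_a, members_b) -> float, higher = merge first
    """
    members = {r: frozenset([r]) for r in adjacency}
    neighbours = {r: set(n) for r, n in adjacency.items()}
    proposals = list(members.values())
    next_id = max(adjacency) + 1
    while any(neighbours.values()):
        # find the most similar adjacent pair of current regions
        best, best_sim = None, float("-inf")
        for a in neighbours:
            for b in neighbours[a]:
                s = similarity(members[a], members[b])
                if s > best_sim:
                    best, best_sim = (a, b), s
        a, b = best
        merged = members[a] | members[b]
        proposals.append(merged)
        # rewire the graph: replace a and b with the merged region
        new_nb = (neighbours[a] | neighbours[b]) - {a, b}
        for c in new_nb:
            neighbours[c] -= {a, b}
            neighbours[c].add(next_id)
        del members[a], members[b], neighbours[a], neighbours[b]
        members[next_id] = merged
        neighbours[next_id] = new_nb
        next_id += 1
    return proposals
```

For n connected superpixels this yields 2n - 1 proposals (n leaves plus n - 1 merges), which is why hierarchies of this kind keep the candidate pool small.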
Image segmentation by cascaded region agglomeration. In: CVPR, 2013
Cited by 9 (0 self)
We propose a hierarchical segmentation algorithm that starts with a very fine oversegmentation and gradually merges regions using a cascade of boundary classifiers. This approach allows the weights of region and boundary features to adapt to the segmentation scale at which they are applied. The stages of the cascade are trained sequentially, with an asymmetric loss to maximize boundary recall. On six segmentation data sets, our algorithm achieves the best performance under most region-quality measures, and does so with fewer segments than prior work. Our algorithm is also highly competitive in a dense oversegmentation (superpixel) regime under boundary-based measures.
The Role of Context for Object Detection and Semantic Segmentation in the Wild
"... In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of PASCAL VOC 2010 de-tection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it cont ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
(Show Context)
In this paper we study the role of context in existing state-of-the-art detection and segmentation approaches. Towards this goal, we label every pixel of PASCAL VOC 2010 de-tection challenge with a semantic category. We believe this data will provide plenty of challenges to the community, as it contains 520 additional classes for semantic segmenta-tion and object detection. Our analysis shows that near-est neighbor based approaches perform poorly on semantic segmentation of contextual classes, showing the variability of PASCAL imagery. Furthermore, improvements of exist-ing contextual models for detection is rather modest. In order to push forward the performance in this difficult sce-nario, we propose a novel deformable part-based model, which exploits both local context around each candidate de-tection as well as global context at the level of the scene. We show that this contextual reasoning significantly helps in detecting objects at all scales. 1.
Composite Statistical Inference for Semantic Segmentation
"... In this paper we present an inference procedure for the semantic segmentation of images. Different from many CRF approaches that rely on dependencies modeled with unary and pairwise pixel or superpixel potentials, our method is entirely based on estimates of the overlap between each of a set of mid- ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
(Show Context)
In this paper we present an inference procedure for the semantic segmentation of images. Different from many CRF approaches that rely on dependencies modeled with unary and pairwise pixel or superpixel potentials, our method is entirely based on estimates of the overlap between each of a set of mid-level object segmentation proposals and the objects present in the image. We define continuous latent variables on superpixels obtained by multiple intersections of segments, then output the optimal segments from the inferred superpixel statistics. The algorithm is capable of recombine and refine initial mid-level proposals, as well as handle multiple interacting objects, even from the same class, all in a consistent joint inference framework by maximizing the composite likelihood of the underlying statistical model using an EM algorithm. In the PASCAL VOC segmentation challenge, the proposed approach obtains high accuracy and successfully handles images of complex object interactions. 1.
Fisher and VLAD with FLAIR
"... A major computational bottleneck in many current al-gorithms is the evaluation of arbitrary boxes. Dense lo-cal analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the represen ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
A major computational bottleneck in many current al-gorithms is the evaluation of arbitrary boxes. Dense lo-cal analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLADs difference coding, even with `2 and power-norms. Finally, by multiple codeword assignments, we achieve ex-act and approximate Fisher vectors with FLAIR. The results are a 18x speedup, which enables us to set a new state-of-the-art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the CUB-2011 200 bird species. Plus, we rank number one in the official ImageNet 2013 detection challenge. 1.
Semi-intrinsic mean shift on riemannian manifolds
- Proc. European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science
, 2012
"... Abstract. The original mean shift algorithm [1] on Euclidean spaces (MS) was extended in [2] to operate on general Riemannian manifolds. This extension is extrinsic (Ext-MS) since the mode seeking is performed on the tangent spaces [3], where the underlying curvature is not fully con-sidered (tangen ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
(Show Context)
Abstract. The original mean shift algorithm [1] on Euclidean spaces (MS) was extended in [2] to operate on general Riemannian manifolds. This extension is extrinsic (Ext-MS) since the mode seeking is performed on the tangent spaces [3], where the underlying curvature is not fully con-sidered (tangent spaces are only valid in a small neighborhood). In [3] was proposed an intrinsic mean shift designed to operate on two par-ticular Riemannian manifolds (IntGS-MS), i.e. Grassmann and Stiefel manifolds (using manifold-dedicated density kernels). It is then natural to ask whether mean shift could be intrinsically extended to work on a large class of manifolds. We propose a novel paradigm to intrinsically reformulate the mean shift on general Riemannian manifolds. This is ac-complished by embedding the Riemannian manifold into a Reproducing Kernel Hilbert Space (RKHS) by using a general and mathematically well-founded Riemannian kernel function, i.e. heat kernel [4]. The key issue is that when the data is implicitly mapped to the Hilbert space, the curvature of the manifold is taken into account (i.e. exploits the underlying information of the data). The inherent optimization is then performed on the embedded space. Theoretic analysis and experimental results demonstrate the promise and effectiveness of this novel paradigm. 1
Efficient Multi-Cue Scene Segmentation
- In Lecture Notes in Computer Science (Proc. of the German Conf. on Pattern Recognition (GCPR
, 2013
"... Abstract. This paper presents a novel multi-cue framework for scene segmentation, involving a combination of appearance (grayscale images) and depth cues (dense stereo vision). An efficient 3D environment model is utilized to create a small set of meaningful free-form region hypothe-ses for object l ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
(Show Context)
Abstract. This paper presents a novel multi-cue framework for scene segmentation, involving a combination of appearance (grayscale images) and depth cues (dense stereo vision). An efficient 3D environment model is utilized to create a small set of meaningful free-form region hypothe-ses for object location and extent. Those regions are subsequently cat-egorized into several object classes using an extended multi-cue bag-of-features pipeline. For that, we augment grayscale bag-of-features by bag-of-depth-features operating on dense disparity maps, as well as height pooling to incorporate a 3D geometric ordering into our region descriptor. In experiments on a large real-world stereo vision data set, we obtain state-of-the-art segmentation results at significantly reduced computa-tional costs. Our dataset is made public for benchmarking purposes. 1
Discriminatively trained dense surface normal estimation
- In Proc. of ECCV (2014
"... Abstract. In this work we propose the method for a rather unexplored problem of computer vision- discriminatively trained dense surface nor-mal estimation from a single image. Our method combines contextual and segment-based cues and builds a regressor in a boosting framework by transforming the pro ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
Abstract. In this work we propose the method for a rather unexplored problem of computer vision- discriminatively trained dense surface nor-mal estimation from a single image. Our method combines contextual and segment-based cues and builds a regressor in a boosting framework by transforming the problem into the regression of coefficients of a local coding. We apply our method to two challenging data sets containing images of man-made environments, the indoor NYU2 data set and the outdoor KITTI data set. Our surface normal predictor achieves results better than initially expected, significantly outperforming state-of-the-art. 1