Understanding indoor scenes using 3d geometric phrases
- In CVPR, 2013
Abstract - Cited by 21 (5 self)
Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model, which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
People Watching -- Human Actions as a Cue for Single View Geometry
Abstract - Cited by 19 (4 self)
We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.
Box in the box: Joint 3D layout and object reasoning from single images
- 2013
Abstract - Cited by 11 (3 self)
In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking occlusion into account in order to not over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.
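As a rough illustration of the best-first branch-and-bound strategy this abstract describes, here is a minimal generic sketch. The `score` and `upper_bound` callables are hypothetical placeholders for the paper's layout/object scoring and its constant-time bounds; this is not the paper's implementation.

```python
import heapq

def branch_and_bound(candidates, score, upper_bound):
    """Best-first branch and bound over a list of hypotheses.
    score(h): exact quality of one hypothesis.
    upper_bound(S): admissible bound over a subset S, i.e.
    >= max(score(h) for h in S), and equal to score(h) on a singleton."""
    # Each heap entry holds (-bound, lo, hi) for the slice candidates[lo:hi];
    # negating the bound turns Python's min-heap into a max-heap on bounds.
    heap = [(-upper_bound(candidates), 0, len(candidates))]
    while heap:
        neg_bound, lo, hi = heapq.heappop(heap)
        if hi - lo == 1:
            # The first singleton popped dominates every remaining set's
            # bound, so it is the global optimum.
            return candidates[lo], score(candidates[lo])
        # Branch: split the set in two and bound each half.
        mid = (lo + hi) // 2
        for a, b in ((lo, mid), (mid, hi)):
            heapq.heappush(heap, (-upper_bound(candidates[a:b]), a, b))
```

With admissible bounds that are tight on singletons, the search provably returns the global maximum, which is the "certificate of optimality" property such methods advertise.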
Unfolding an Indoor Origami World
Abstract - Cited by 8 (1 self)
In this work, we present a method for single-view reasoning about 3D surfaces and their relationships. We propose the use of mid-level constraints for 3D scene understanding in the form of convex and concave edges and introduce a generic framework capable of incorporating these and other constraints. Our method takes a variety of cues and uses them to infer a consistent interpretation of the scene. We demonstrate improvements over the state of the art and produce interpretations of the scene that link large planar surfaces.
Efficient Structured Parsing of Façades Using Dynamic Programming
Abstract - Cited by 6 (0 self)
We propose a sequential optimization technique for segmenting a rectified image of a façade into semantic categories. Our method retrieves a parsing which respects common architectural constraints and also returns a certificate for global optimality. In contrast, the façade labeling problem is typically tackled as a classification task or as grammar parsing; neither approach is capable of fully exploiting the regularity of the problem. Our technique therefore very significantly improves accuracy compared to the state of the art while being an order of magnitude faster. In addition, in 85% of the test images we obtain a certificate of optimality.
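The sequential, dynamic-programming flavor of such a parser can be sketched as a Viterbi-style pass over façade positions (e.g., columns), where a labeling must respect allowed label transitions. Here `cost` and `allowed` are hypothetical stand-ins for the paper's data terms and architectural constraints, not its actual model.

```python
def parse_sequence(cost, allowed):
    """cost[i][l]: cost of assigning label l at position i.
    allowed[l]: labels permitted to follow label l.
    Returns (minimum total cost, optimal label sequence)."""
    n, L = len(cost), len(cost[0])
    INF = float("inf")
    dp = [cost[0][:]] + [[INF] * L for _ in range(n - 1)]
    back = [[0] * L for _ in range(n)]
    for i in range(1, n):
        for prev in range(L):
            if dp[i - 1][prev] == INF:
                continue
            for l in allowed[prev]:  # only architecturally valid transitions
                c = dp[i - 1][prev] + cost[i][l]
                if c < dp[i][l]:
                    dp[i][l] = c
                    back[i][l] = prev
    # Trace back from the cheapest final label.
    last = min(range(L), key=lambda x: dp[n - 1][x])
    total = dp[n - 1][last]
    labels = [last]
    for i in range(n - 1, 0, -1):
        labels.append(back[i][labels[-1]])
    return total, labels[::-1]
```

Because the DP enumerates all transition-feasible labelings exactly, the returned minimum is globally optimal for this restricted model, which is the sense in which such a parse carries an optimality certificate.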
3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
Abstract - Cited by 6 (0 self)
We present a new algorithm, 3DNN (3D Nearest-Neighbor), which is capable of matching an image with 3D data, independently of the viewpoint from which the image was captured. By leveraging rich annotations associated with each image, our algorithm can automatically produce precise and detailed 3D models of a scene from a single image. Moreover, we can transfer information across images to accurately label and segment objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation. By decoupling the viewpoint and the geometry of an image, we develop a scene matching approach which is truly 100% viewpoint invariant, yielding state-of-the-art performance on challenging data.
Discrete-Continuous Depth Estimation from a Single Image
Abstract - Cited by 5 (1 self)
In this paper, we tackle the problem of estimating the depth of a scene from a single image. This is a challenging task, since a single image on its own does not provide any depth cue. To address this, we exploit the availability of a pool of images for which the depth is known. More specifically, we formulate monocular depth estimation as a discrete-continuous optimization problem, where the continuous variables encode the depth of the superpixels in the input image, and the discrete ones represent relationships between neighboring superpixels. The solution to this discrete-continuous optimization problem is then obtained by performing inference in a graphical model using particle belief propagation. The unary potentials in this graphical model are computed by making use of the images with known depth. We demonstrate the effectiveness of our model in both indoor and outdoor scenarios. Our experimental evaluation shows that our depth estimates are more accurate than existing methods on standard datasets.
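A toy version of such a discrete-continuous energy can clarify the variable split: continuous per-superpixel depths plus a discrete relation per neighboring pair. All names here are hypothetical, and the paper's retrieval-based unaries and particle belief propagation inference are not reproduced; this only evaluates an energy of that general shape.

```python
def energy(depths, relations, unary, neighbors, tau=1.0):
    """depths[s]: continuous depth of superpixel s.
    relations[(s, t)]: discrete variable per neighboring pair, either
    'connected' (depths should agree) or 'occluding' (a depth gap is
    tolerated at a fixed cost tau).
    unary[s]: a data-driven depth estimate for s (e.g., from images with
    known depth)."""
    # Unary terms pull each superpixel toward its data-driven estimate.
    e = sum((depths[s] - unary[s]) ** 2 for s in depths)
    for s, t in neighbors:
        if relations[(s, t)] == "connected":
            e += (depths[s] - depths[t]) ** 2  # smoothness across the pair
        else:
            e += tau                           # constant occlusion penalty
    return e
```

Inference would then search jointly over the continuous depths and the discrete relations to minimize this energy; the discrete variables let the model switch off smoothing exactly at occlusion boundaries.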
PanoContext: A Whole-room 3D Context Model for Panoramic Scene Understanding
Abstract - Cited by 4 (1 self)
The field-of-view of standard cameras is very small, which is one of the main reasons that contextual information is not as useful as it should be for object detection. To overcome this limitation, we advocate the use of 360° full-view panoramas in scene understanding, and propose a whole-room context model in 3D. For an input panorama, our method outputs 3D bounding boxes of the room and all major objects inside, together with their semantic categories. Our method generates 3D hypotheses based on contextual constraints and ranks the hypotheses holistically, combining both bottom-up and top-down context information. To train our model, we construct an annotated panorama dataset and reconstruct the 3D model from single views using manual annotation. Experiments show that, based solely on 3D context without any image-based object detector, we can achieve performance comparable to the state-of-the-art object detector. This demonstrates that when the FOV is large, context is as powerful as object appearance. All data and source code are available online.
Designing deep networks for surface normal estimation
Abstract - Cited by 2 (0 self)
Figure 1: Given a single image, our algorithm estimates the surface normal at each pixel. Notice how our algorithm not only estimates the coarse structure but also captures fine local details. For example, on the left, the normals of the couch arm and side table legs are estimated accurately (see zoomed version). On the right, the chair surface and legs and even the top of the shopping bags are captured correctly. Normal legend: blue → X; green → Y; red → Z.

In the past few years, convolutional neural nets (CNNs) have shown incredible promise for learning visual representations. In this paper, we use CNNs for the task of predicting surface normals from a single image. But what is the right architecture we should use? We propose to build upon decades of hard work in 3D scene understanding to design a new CNN architecture for the task of surface normal estimation. We show that incorporating several constraints (man-made, Manhattan world) and meaningful intermediate representations (room layout, edge labels) in the architecture leads to state-of-the-art performance on surface normal estimation. We also show that our network is quite robust, achieving state-of-the-art results on other datasets without any fine-tuning.
Manhattan Junction Catalogue for Spatial Reasoning of Indoor Scenes
- 2013
Abstract - Cited by 1 (0 self)
Junctions are strong cues for understanding the geometry of a scene. In this paper, we consider the problem of detecting junctions and using them for recovering the spatial layout of an indoor scene. Junction detection has always been challenging due to missing and spurious lines. We work in a constrained Manhattan world setting where the junctions are formed by only line segments along the three principal orthogonal directions. Junctions can be classified into several categories based on the number and orientations of the incident line segments. We provide a simple and efficient voting scheme to detect and classify these junctions in real images. Indoor scenes are typically modeled as cuboids and we formulate the problem of the layout estimation as an inference problem in a conditional random field. Our formulation allows the incorporation of junction features and the training is done using structured prediction. We outperform other single view geometry estimation methods on standard datasets.
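The classification step described above, labeling a junction by the number and orientations of its incident Manhattan line segments, can be sketched as follows. This is one simplified taxonomy under assumed conventions; the paper's actual catalogue is finer-grained than these coarse labels.

```python
def classify_junction(branches):
    """branches: incident rays at a detected corner, each a pair
    (axis, sign) with axis in {0, 1, 2} (the three Manhattan vanishing
    directions) and sign in {+1, -1}.
    Returns a coarse junction label from the number and arrangement
    of distinct rays."""
    rays = set(branches)
    axes = {a for a, _ in rays}
    if len(rays) == 2:
        # Two rays on one axis form a straight line, not a corner.
        return "L" if len(axes) == 2 else "collinear"
    if len(rays) == 3:
        # T: two opposite rays share an axis; Y/W: three distinct axes.
        return "Y/W" if len(axes) == 3 else "T"
    if len(rays) == 4:
        return "X/K"
    return "other"
```

In a voting scheme, each detected line segment endpoint would contribute a ray vote at nearby corner candidates, and the accumulated ray set determines the junction category used as a feature in the layout CRF.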