Efficient Exact Inference for 3D Indoor Scene Understanding

by Alexander G. Schwing, Raquel Urtasun

Understanding indoor scenes using 3d geometric phrases

by Wongun Choi, Yu-wei Chao, Caroline Pantofaru, Silvio Savarese - In CVPR, 2013
Cited by 21 (5 self)
Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

Citation Context

...ut object supports. Bao et al. [2, 1] utilized geometric relationship to help object detection and scene structure estimation. Several methods attempted to specifically solve indoor layout estimation [12, 13, 27, 30, 22, 26, 25]. Hedau et al. proposed a formulation using a cubic room representation [12] and showed that layout estimation can improve object detection [13]. This initial attempt demonstrated promising results, h...

People Watching -- Human Actions as a Cue for Single View Geometry

by David F. Fouhey, Vincent Delaitre, Abhinav Gupta, Alexei A. Efros, Ivan Laptev, Josef Sivic
Cited by 19 (4 self)
We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.

Box in the box: Joint 3D layout and object reasoning from single images

by Alexander G. Schwing, Sanja Fidler, Marc Pollefeys, Raquel Urtasun, 2013
Cited by 11 (3 self)
In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking into account occlusion in order to not over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes, and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.

Citation Context

...e literature were shown to be decomposable into pairwise potentials. As a consequence denser parameterizations were used resulting in significant performance gains. More recently, Schwing and Urtasun [26] showed that the global optimum of typical layout scoring functions is obtained by employing a branch and bound approach. This resulted in provably optimal solutions that are computed in real time on ...
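
The abstract of this entry mentions bounding score terms in constant time via a decomposition that generalizes integral geometry to triangular shapes. As a rough illustration of the underlying idea only, the sketch below shows the classic rectangular case (a summed-area table), where any axis-aligned box sum costs four lookups after one linear pass. The function names `integral_image` and `box_sum` are hypothetical, and the triangular generalization described in the paper is not reproduced here.

```python
def integral_image(grid):
    """Build a (rows + 1) x (cols + 1) summed-area table for a 2D list of numbers.

    table[r][c] holds the sum of grid[0:r][0:c]; the extra zero row and
    column avoid special cases at the image border.
    """
    rows, cols = len(grid), len(grid[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0  # running sum of the current row up to column c
        for c in range(cols):
            row_sum += grid[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def box_sum(table, top, left, bottom, right):
    """Sum of grid[top:bottom][left:right] (half-open ranges) in constant time."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


if __name__ == "__main__":
    features = [[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]]
    table = integral_image(features)
    print(box_sum(table, 1, 1, 3, 3))  # sum of the lower-right 2x2 block
```

Precomputing the table once means each candidate box can be scored without revisiting its pixels, which is what makes evaluating many bounds inside a search loop affordable.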

Unfolding an Indoor Origami World

by David F. Fouhey, Abhinav Gupta, Martial Hebert
Cited by 8 (1 self)
In this work, we present a method for single-view reasoning about 3D surfaces and their relationships. We propose the use of mid-level constraints for 3D scene understanding in the form of convex and concave edges and introduce a generic framework capable of incorporating these and other constraints. Our method takes a variety of cues and uses them to infer a consistent interpretation of the scene. We demonstrate improvements over the state of the art and produce interpretations of the scene that link large planar surfaces.

Citation Context

...at deal of effort went into developing constrained models for the prediction of room layout [10] as well as features [6, 23, 27] and effective methods for inference [4, 22, 31, 32]. While these high-level constraints have been enormously successful in constrained domains (e.g., less cluttered scenes with visible floors such as the datasets of [10, 38]), they have not been succe...

Efficient Structured Parsing of Façades Using Dynamic Programming

by Andrea Cohen, Alexander G. Schwing, Marc Pollefeys
Cited by 6 (0 self)
We propose a sequential optimization technique for segmenting a rectified image of a façade into semantic categories. Our method retrieves a parsing which respects common architectural constraints and also returns a certificate for global optimality. Contrasting the suggested method, the considered façade labeling problem is typically tackled as a classification task or as grammar parsing. Both approaches are not capable of fully exploiting the regularity of the problem. Therefore, our technique very significantly improves the accuracy compared to the state-of-the-art while being an order of magnitude faster. In addition, in 85% of the test images we obtain a certificate for optimality.

Citation Context

...r its rigorous optimization has, according to our opinion, been forgotten in the past few years where many tasks are formulated as labeling problems. Some exceptions are the room layout estimation of [26] and finding the globally optimal bounding box given classifier scores [16, 2]. We argue that construction of applications on top of scene interpretations, e.g., by predicting affordances [8] or by in...
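
The context above refers to finding a globally optimal bounding box given classifier scores, and to exact layout inference by branch and bound. The following is a toy one-dimensional sketch of that style of search, not the algorithm of any cited paper: it looks for the contiguous interval with the highest total score by repeatedly splitting sets of candidate intervals and always expanding the set with the largest upper bound. The function name `best_interval` and the particular bound (all positive scores in the largest candidate interval plus all negative scores in the smallest one) are assumptions chosen for illustration.

```python
import heapq


def best_interval(scores):
    """Return (best_sum, (left, right)) over all inclusive intervals [left, right].

    A search state (lo_l, hi_l, lo_r, hi_r) stands for every interval whose
    left end lies in [lo_l, hi_l] and whose right end lies in [lo_r, hi_r].
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    # Prefix sums of the positive and negative parts give O(1) bounds.
    pos, neg = [0], [0]
    for s in scores:
        pos.append(pos[-1] + max(s, 0))
        neg.append(neg[-1] + min(s, 0))

    def bound(lo_l, hi_l, lo_r, hi_r):
        # Positives from the largest interval the state can contain ...
        upper = pos[hi_r + 1] - pos[lo_l]
        # ... plus negatives every interval in the state must contain.
        if hi_l <= lo_r:
            upper += neg[lo_r + 1] - neg[hi_l]
        return upper

    start = (0, n - 1, 0, n - 1)
    heap = [(-bound(*start), start)]  # max-heap via negated bounds
    while heap:
        neg_bound, (lo_l, hi_l, lo_r, hi_r) = heapq.heappop(heap)
        if lo_l == hi_l and lo_r == hi_r:
            # A single interval: its bound is its exact score, and no other
            # state on the heap can beat it, so it is globally optimal.
            return -neg_bound, (lo_l, lo_r)
        if hi_l - lo_l >= hi_r - lo_r:
            mid = (lo_l + hi_l) // 2
            children = [(lo_l, mid, lo_r, hi_r), (mid + 1, hi_l, lo_r, hi_r)]
        else:
            mid = (lo_r + hi_r) // 2
            children = [(lo_l, hi_l, lo_r, mid), (lo_l, hi_l, mid + 1, hi_r)]
        for child in children:
            if child[0] <= child[3]:  # keep states with at least one valid interval
                heapq.heappush(heap, (-bound(*child), child))
    raise RuntimeError("search space exhausted without a solution")


if __name__ == "__main__":
    print(best_interval([1, -3, 2, 4, -1, 3, -6]))
```

Because the bound is exact on single intervals and never underestimates on sets, the first single interval popped from the heap comes with a certificate of global optimality, which is the property the cited works emphasize.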

3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding

by Scott Satkin, Martial Hebert
Cited by 6 (0 self)
We present a new algorithm 3DNN (3D Nearest-Neighbor), which is capable of matching an image with 3D data, independently of the viewpoint from which the image was captured. By leveraging rich annotations associated with each image, our algorithm can automatically produce precise and detailed 3D models of a scene from a single image. Moreover, we can transfer information across images to accurately label and segment objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation. By decoupling the viewpoint and the geometry of an image, we develop a scene matching approach which is truly 100% viewpoint invariant, yielding state-of-the-art performance on challenging data.

Citation Context

...detailed and precise 3D models of a scene from a single image. The problem of monocular 3D scene understanding has recently been gaining tremendous attention from the computer vision community (e.g., [4, 13, 17, 18, 26, 28, 29, 30, 36, 38]). The common goal of this research is to estimate the full geometry of a scene from a single viewpoint. The ability to infer the geometry of a scene has enabled a variety of applications in both the ...

Discrete-Continuous Depth Estimation from a Single Image

by Miaomiao Liu, Mathieu Salzmann, Xuming He
Cited by 5 (1 self)
In this paper, we tackle the problem of estimating the depth of a scene from a single image. This is a challenging task, since a single image on its own does not provide any depth cue. To address this, we exploit the availability of a pool of images for which the depth is known. More specifically, we formulate monocular depth estimation as a discrete-continuous optimization problem, where the continuous variables encode the depth of the superpixels in the input image, and the discrete ones represent relationships between neighboring superpixels. The solution to this discrete-continuous optimization problem is then obtained by performing inference in a graphical model using particle belief propagation. The unary potentials in this graphical model are computed by making use of the images with known depth. We demonstrate the effectiveness of our model in both the indoor and outdoor scenarios. Our experimental evaluation shows that our depth estimates are more accurate than existing methods on standard datasets.

Citation Context

...rogress has been made towards accurate 3D scene reconstruction from single images. For instance, simple geometric assumptions (i.e., box models) have proven effective to estimate the layout of a room [9, 17, 27]. Similarly, for outdoor scenes, the Manhattan, or blocks world, assumption has been utilized to perform 3D scene layout estimation [7]. These box models, however, are limited to represent simple stru...

PanoContext: A Whole-room 3D Context Model for Panoramic Scene Understanding

by Yinda Zhang, Shuran Song, Ping Tan, Jianxiong Xiao
Cited by 4 (1 self)
The field-of-view of standard cameras is very small, which is one of the main reasons that contextual information is not as useful as it should be for object detection. To overcome this limitation, we advocate the use of 360° full-view panoramas in scene understanding, and propose a whole-room context model in 3D. For an input panorama, our method outputs 3D bounding boxes of the room and all major objects inside, together with their semantic categories. Our method generates 3D hypotheses based on contextual constraints and ranks the hypotheses holistically, combining both bottom-up and top-down context information. To train our model, we construct an annotated panorama dataset and reconstruct the 3D model from single-view using manual annotation. Experiments show that solely based on 3D context without any image-based object detector, we can achieve a comparable performance with the state-of-the-art object detector. This demonstrates that when the FOV is large, context is as powerful as object appearance. All data and source code are available online.

Designing deep networks for surface normal estimation

by Xiaolong Wang, David F. Fouhey, Abhinav Gupta
Cited by 2 (0 self)
[Figure 1: Given a single image, the algorithm estimates the surface normal at each pixel, capturing both coarse structure and fine local details. Normal legend: blue → X; green → Y; red → Z.]
In the past few years, convolutional neural nets (CNN) have shown incredible promise for learning visual representations. In this paper, we use CNNs for the task of predicting surface normals from a single image. But what is the right architecture we should use? We propose to build upon the decades of hard work in 3D scene understanding, to design new CNN architecture for the task of surface normal estimation. We show that incorporating several constraints (man-made, Manhattan world) and meaningful intermediate representations (room layout, edge labels) in the architecture leads to state of the art performance on surface normal estimation. We also show that our network is quite robust and show state of the art results on other datasets as well without any fine-tuning.

Citation Context

...would disregard over five decades of work in 3D scene understanding from the early blocks world [25] and line-labeling [15, 2, 17] work to recent investigations into similar ideas in a data-driven era [11, 22, 9, 28, 36, 7]. Instead in this paper, we want to ask a basic question: are there lessons we have learned from previous research that we can borrow and apply in designing deep networks for the task of surface norma...

Manhattan Junction Catalogue for Spatial Reasoning of Indoor Scenes

by Srikumar Ramalingam, Jaishanker K. Pillai, Arpit Jain, Yuichi Taguchi, 2013
Cited by 1 (0 self)
Junctions are strong cues for understanding the geometry of a scene. In this paper, we consider the problem of detecting junctions and using them for recovering the spatial layout of an indoor scene. Junction detection has always been challenging due to missing and spurious lines. We work in a constrained Manhattan world setting where the junctions are formed by only line segments along the three principal orthogonal directions. Junctions can be classified into several categories based on the number and orientations of the incident line segments. We provide a simple and efficient voting scheme to detect and classify these junctions in real images. Indoor scenes are typically modeled as cuboids and we formulate the problem of the layout estimation as an inference problem in a conditional random field. Our formulation allows the incorporation of junction features and the training is done using structured prediction. We outperform other single view geometry estimation methods on standard datasets.

Citation Context

... of conditional random fields (CRFs) and structured support vector machines (SVMs) to find the optimum solution efficiently. It was also recently shown that such an inference can be performed exactly [27]. Natural statistics priors typically involve long range interactions and such cues can also be incorporated as higher order potentials in CRF-based layout estimation techniques [23, 6]. Our work is m...
