Learning Rich Features from RGB-D Images for Object Detection and Segmentation. arXiv preprint arXiv:1407.5736 (2014)

by S Gupta, R Girshick, P Arbelaez, J Malik
Results 1 - 10 of 31 citing documents

Fully convolutional networks for semantic segmentation

by Jonathan Long, Evan Shelhamer, Trevor Darrell, 2014
"... Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolu-tional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmen-tation. Our key insight is to build “fully convolutional” networks that take ..."
Abstract - Cited by 37 (0 self) - Add to MetaCart
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [17], the VGG net [28], and GoogLeNet [29]) into fully convolutional networks and transfer their learned representations by fine-tuning [2] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
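
The skip architecture this abstract describes (semantic information from a deep, coarse layer fused with appearance information from a shallow, fine layer, with arbitrary-sized input and correspondingly-sized output) can be sketched in a few lines. The following is a minimal illustration assuming PyTorch; the two-stage backbone and layer widths are placeholders, not the authors' released model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))        # stride 2: "fine" features
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                                    nn.MaxPool2d(2))        # stride 4: "coarse" features
        self.score_coarse = nn.Conv2d(128, num_classes, 1)  # 1x1 score layer on deep features
        self.score_fine = nn.Conv2d(64, num_classes, 1)     # 1x1 score layer on shallow features

    def forward(self, x):
        fine = self.block1(x)
        coarse = self.block2(fine)
        s = F.interpolate(self.score_coarse(coarse), size=fine.shape[2:],
                          mode="bilinear", align_corners=False)
        s = s + self.score_fine(fine)                        # fuse coarse semantics with fine detail
        return F.interpolate(s, size=x.shape[2:], mode="bilinear", align_corners=False)

# Arbitrary input size in, correspondingly-sized per-pixel class scores out.
scores = TinyFCN()(torch.randn(1, 3, 240, 320))              # -> (1, 21, 240, 320)

Because every layer is convolutional, the same weights apply at any input resolution, which is what makes dense whole-image inference efficient.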

Citation Context

...al correspondence [22, 7]. The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [24, 1, 6, 25, 14, 12, 9], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses. ...

Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv preprint arXiv:1411.4734

by David Eigen, Rob Fergus, 2014
"... In this paper we address three different computer vision tasks using a single multiscale convolutional network archi-tecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifica-tions, regre ..."
Abstract - Cited by 7 (1 self) - Add to MetaCart
In this paper we address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.
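
The coarse-to-fine scheme the abstract describes (a sequence of scales that progressively refines the prediction without superpixels) can be sketched with two stages: a low-resolution prediction is upsampled and fed, together with the image, into a finer-scale stage. This is a minimal sketch assuming PyTorch and a single depth-output channel; the channel counts and the number of scales are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Scale 1: coarse prediction at 1/4 resolution.
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1))
        # Scale 2: refines using the image plus the upsampled coarse map.
        self.refine = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, img):
        coarse = self.coarse(img)                              # low-resolution prediction
        up = F.interpolate(coarse, size=img.shape[2:],
                           mode="bilinear", align_corners=False)
        return self.refine(torch.cat([img, up], dim=1))        # refined full-resolution map

depth = TwoScaleDepthNet()(torch.randn(1, 3, 240, 320))        # -> (1, 1, 240, 320)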

Citation Context

...g boxes for a few objects in each scene. However, ConvNets have recently been applied to a variety of other tasks, including pose estimation [36, 27], stereo depth [38, 25], and instance segmentation [14]. Most of these systems use ConvNets to find only local features, or generate descriptors of discrete proposal regions; by contrast, our network uses both local and global views to predict a variety o...

Robo brain: Large-scale knowledge engine for robots

by Ashutosh Saxena, Ashesh Jain, Ozan Sener, Aditya Jami, Dipendra K. Misra, Hema S. Koppula, 2014
"... Abstract-In this paper we introduce a knowledge engine, which learns and shares knowledge representations, for robots to carry out a variety of tasks. Building such an engine brings with it the challenge of dealing with multiple data modalities including symbols, natural language, haptic senses, ro ..."
Abstract - Cited by 6 (6 self) - Add to MetaCart
In this paper we introduce a knowledge engine, which learns and shares knowledge representations, for robots to carry out a variety of tasks. Building such an engine brings with it the challenge of dealing with multiple data modalities including symbols, natural language, haptic senses, robot trajectories, visual features and many others. The knowledge stored in the engine comes from multiple sources including physical interactions that robots have while performing tasks (perception, planning and control), knowledge bases from the WWW and learned representations from leading robotics research groups. We discuss various technical aspects and associated challenges such as modeling the correctness of knowledge, inferring latent information and formulating different robotic tasks as queries to the knowledge engine. We describe the system architecture and how it supports different mechanisms for users and robots to interact with the engine. Finally, we demonstrate its use in three important research areas: grounding natural language, perception, and planning, which are the key building blocks for many robotic tasks. This knowledge engine is a collaborative effort and we call it RoboBrain.

Citation Context

...otics problems where the knowledge needs to be about entities in the physical world and of various modalities. Robotics Works. For robots to operate autonomously they should perceive our environments, plan paths, manipulate objects and interact with humans. This is very challenging because the solution to each sub-problem varies with the task, human preference and the environment context. We now briefly describe previous work in each of these areas. Perceiving the environment. Perception is a key element of many robotic tasks. It has been applied to object labeling [37, 2, 60], scene segmentation [23], robot localization [43, 46], feature extraction for planning and manipulation [30], understanding environment constraints [28] and object affordances [12, 34]. Sharing representations from visual information not only improves the performance of each of the perception tasks, but also significantly helps various applications such as autonomous or assistive driving [14, 8], anticipation [31, 35, 59], planning sociable paths [36] and various household chores such as grasping and cutting [41]. Sharing representations from other modalities such as sound [52] and haptics [21] would also improve p...

Recognize complex events from static images by fusing deep channels

by Yuanjun Xiong, Kai Zhu, Dahua Lin, Xiaoou Tang - in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015
"... A considerable portion of web images capture events that occur in our personal lives or social activities. In this pa-per, we aim to develop an effective method for recogniz-ing events from such images. Despite the sheer amount of study on event recognition, most existing methods rely on videos and ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
A considerable portion of web images capture events that occur in our personal lives or social activities. In this paper, we aim to develop an effective method for recognizing events from such images. Despite the sheer amount of study on event recognition, most existing methods rely on videos and are not directly applicable to this task. Generally, events are complex phenomena that involve interactions among people and objects, and therefore analysis of event photos requires techniques that can go beyond recognizing individual objects and carry out joint reasoning based on evidence of multiple aspects. Inspired by the recent success of deep learning, we formulate a multi-layer framework to tackle this problem, which takes into account both visual appearance and the interactions among humans and objects, and combines them via semantic fusion. An important issue arising here is that humans and objects discovered by detectors are in the form of bounding boxes, and there is no straightforward way to represent their interactions and incorporate them with a deep network. We address this using a novel strategy that projects the detected instances onto multi-scale spatial maps. On a large dataset with 60,000 images, the proposed method achieved substantial improvement over the state-of-the-art, raising the accuracy of event recognition by over 10%.
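
The key representational step described above, projecting detector outputs onto spatial maps so that bounding boxes can be consumed by a CNN, can be sketched as a simple rasterization. The sketch below assumes NumPy; the map resolution, the (class, score, x1, y1, x2, y2) box format, and the max-over-scores overlap handling are illustrative assumptions (in practice one map per category and per scale would be stacked with the image channels).

import numpy as np

def boxes_to_spatial_maps(boxes, img_h, img_w, num_classes, map_size=32):
    """boxes: list of (class_id, score, x1, y1, x2, y2) in image coordinates."""
    maps = np.zeros((num_classes, map_size, map_size), dtype=np.float32)
    sy, sx = map_size / img_h, map_size / img_w
    for cls, score, x1, y1, x2, y2 in boxes:
        r1, r2 = int(y1 * sy), max(int(y2 * sy), int(y1 * sy) + 1)
        c1, c2 = int(x1 * sx), max(int(x2 * sx), int(x1 * sx) + 1)
        # Paint the box footprint with its detection score, keeping the max on overlaps.
        maps[cls, r1:r2, c1:c2] = np.maximum(maps[cls, r1:r2, c1:c2], score)
    return maps

# A "person" box and a "ball" box projected onto 32x32 maps for a 480x640 image.
m = boxes_to_spatial_maps([(0, 0.9, 100, 50, 300, 400), (1, 0.7, 350, 300, 420, 370)],
                          480, 640, num_classes=2)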

Citation Context

...nels. Following the recent success of deep models [6, 12, 20], attempts [23, 30] have been made to connect multiple modalities through deep networks. In recent work, auxiliary channels, such as depth [8] and optical flow [29], are captured using additional networks. It is worth emphasizing that depth maps or optical flows are both spatial maps by nature and thus it is relatively easy to construct CNN...

Aligning 3D models to RGB-D images of cluttered scenes

by Saurabh Gupta, Ross Girshick, Jitendra Malik - In: CVPR, 2015
"... eecs.berkeley.edu uniandes.edu.co microsoft.com eecs.berkeley.edu The goal of this work is to represent objects in an RGB-D scene with corresponding 3D models from a library. We ap-proach this problem by first detecting and segmenting object instances in the scene and then using a convolutional neur ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
The goal of this work is to represent objects in an RGB-D scene with corresponding 3D models from a library. We approach this problem by first detecting and segmenting object instances in the scene and then using a convolutional neural network (CNN) to predict the pose of the object. This CNN is trained using pixel surface normals in images containing renderings of synthetic objects. When tested on real data, our method outperforms alternative algorithms trained on real data. We then use this coarse pose estimate along with the inferred pixel support to align a small number of prototypical models to the data, and place into the scene the model that fits best. We observe a 48% relative improvement in performance at the task of 3D detection over the current state-of-the-art [34], while being an order of magnitude faster.
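
Since the pose CNN described above is trained on pixel surface normals rather than raw depth, a natural preprocessing step is converting a depth map into a normal image. The sketch below, assuming NumPy and a fixed pinhole focal length of 570 px (a rough Kinect-like value, not taken from the paper), back-projects depth to a point cloud and takes cross products of finite differences; the sign convention of the normals is left unresolved.

import numpy as np

def depth_to_normals(depth, fx=570.0, fy=570.0):
    """depth: (H, W) array in meters. Returns (H, W, 3) unit surface normals."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel to a 3D point in camera coordinates.
    z = depth
    x = (u - w / 2.0) * z / fx
    y = (v - h / 2.0) * z / fy
    pts = np.dstack([x, y, z])
    # Tangent vectors via finite differences; the normal is their cross product.
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)
    n /= (np.linalg.norm(n, axis=2, keepdims=True) + 1e-8)
    return n

normals = depth_to_normals(np.random.rand(480, 640) + 1.0)   # dummy depth image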

Citation Context

... We believe such an output representation will enable the use of perception in fields like robotics. Figure 2 describes our approach. We first use the output of the detection and segmentation system [13], and infer the pose of each detected object using a neural network. We train this CNN on synthetic data using surface normal images instead of depth images as input. We show that this CNN trained on ...

SUN RGB-D: A RGBD scene understanding benchmark suite

by Shuran Song, Samuel P. Lichtenberg, Jianxiong Xiao - In CVPR, 2015
"... Although RGB-D sensors have enabled major break-throughs for several vision tasks, such as 3D reconstruc-tion, we have not attained the same level of success in high-level scene understanding. Perhaps one of the main rea-sons is the lack of a large-scale benchmark with 3D anno-tations and 3D evaluat ..."
Abstract - Cited by 3 (1 self) - Add to MetaCart
Although RGB-D sensors have enabled major breakthroughs for several vision tasks, such as 3D reconstruction, we have not attained the same level of success in high-level scene understanding. Perhaps one of the main reasons is the lack of a large-scale benchmark with 3D annotations and 3D evaluation metrics. In this paper, we introduce an RGB-D benchmark suite for the goal of advancing the state of the art in all major scene understanding tasks. Our dataset is captured by four different sensors and contains 10,335 RGB-D images, at a similar scale as PASCAL VOC. The whole dataset is densely annotated and includes 146,617 2D polygons and 64,595 3D bounding boxes with accurate object orientations, as well as a 3D room layout and scene category for each image. This dataset enables us to train data-hungry algorithms for scene-understanding tasks, evaluate them using meaningful 3D metrics, avoid overfitting to a small testing set, and study cross-sensor bias.

Citation Context

...sks, such as body pose recognition [56, 58], intrinsic image estimation [4], 3D modeling [27] and SfM reconstruction [72]. RGB-D sensors have also enabled rapid progress for scene understanding (e.g. [20, 19, 53, 38, 30, 17, 32, 49]). However, while we can crawl color images from the Internet easily, it is not possible to obtain large-scale RGB-D data online. Consequently, the existing RGB-D recognition benchmarks, such as NYU D...

RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features

by Max Schwarz, Hannes Schulz, Sven Behnke - in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2015
"... Abstract — Object recognition and pose estimation from RGB-D images are important tasks for manipulation robots which can be learned from examples. Creating and annotating datasets for learning is expensive, however. We address this problem with transfer learning from deep convolutional neural netwo ..."
Abstract - Cited by 2 (0 self) - Add to MetaCart
Object recognition and pose estimation from RGB-D images are important tasks for manipulation robots which can be learned from examples. Creating and annotating datasets for learning is expensive, however. We address this problem with transfer learning from deep convolutional neural networks (CNN) that are pre-trained for image categorization and provide a rich, semantically meaningful feature set. We incorporate depth information, which the CNN was not trained with, by rendering objects from a canonical perspective and colorizing the depth channel according to distance from the object center. We evaluate our approach on the Washington RGB-D Objects dataset, where we find that the generated feature set naturally separates classes and instances well and retains pose manifolds. We outperform state-of-the-art on a number of subtasks and show that our approach can yield superior results when only little training data is available.

Citation Context

...n a robotic setting with few labeled instances. We use pre-trained CNN in conjunction with preprocessed depth images, which is not addressed by the works discussed so far. Very recently, Gupta et al. [11] proposed a similar technique, where “color” channels are given by horizontal disparity, height above ground, and angle with vertical. In contrast to their method, we propose an object-centered colori...
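
The citation context above refers to the three-channel depth encoding of Gupta et al. (horizontal disparity, height above ground, and angle with the gravity direction). The sketch below, assuming NumPy, shows the general shape of such an encoding; the per-channel min-max stretch to 8 bits is a simplification and not a faithful reimplementation of the published HHA encoding.

import numpy as np

def encode_hha_like(disparity, height, normals, gravity=(0.0, -1.0, 0.0)):
    """disparity, height: (H, W) maps; normals: (H, W, 3) unit vectors."""
    g = np.asarray(gravity) / np.linalg.norm(gravity)
    # Angle (degrees) between each surface normal and the gravity direction.
    angle = np.degrees(np.arccos(np.clip(normals @ g, -1.0, 1.0)))
    def to_u8(x):
        # Stretch each channel independently to 8 bits (simplified scaling).
        lo, hi = x.min(), x.max()
        return ((x - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return np.dstack([to_u8(disparity), to_u8(height), to_u8(angle)])

# Dummy inputs with the right shapes; in practice these come from the depth camera
# plus an estimate of the gravity direction / floor plane.
h, w = 480, 640
hha = encode_hha_like(np.random.rand(h, w), np.random.rand(h, w),
                      np.dstack([np.zeros((h, w)), -np.ones((h, w)), np.zeros((h, w))]))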

Region-based Convolutional Networks for Accurate Object Detection and Segmentation

by Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
"... Abstract—Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this pap ..."
Abstract - Cited by 2 (0 self) - Add to MetaCart
Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 50% relative to the previous best result on VOC 2012, achieving a mAP of 62.4%. Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available at
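
The two ideas the abstract combines, CNN features computed on bottom-up region proposals plus pre-training and fine-tuning, reduce at inference time to a crop-warp-classify loop. The sketch below assumes PyTorch/torchvision and stubs out both the proposal generator (e.g. selective search) and the fine-tuned weights; the 21-way output (20 VOC classes plus background) and the 224x224 warp size are illustrative choices, not the authors' released configuration.

import torch
from torchvision.models import alexnet

cnn = alexnet(weights=None)                     # in practice: ImageNet pre-trained, then fine-tuned
cnn.classifier[6] = torch.nn.Linear(4096, 21)   # 20 VOC classes + background
cnn.eval()

def classify_proposals(image, proposals):
    """image: (3, H, W) tensor in [0, 1]; proposals: list of (x1, y1, x2, y2) boxes."""
    scores = []
    for x1, y1, x2, y2 in proposals:
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        # Warp every proposal to a fixed size so the CNN can score it.
        warped = torch.nn.functional.interpolate(crop, size=(224, 224),
                                                 mode="bilinear", align_corners=False)
        with torch.no_grad():
            scores.append(cnn(warped).softmax(dim=1))
    return torch.cat(scores)                    # one class distribution per proposal

# Two hypothetical proposals on a dummy image.
out = classify_proposals(torch.rand(3, 480, 640), [(10, 10, 200, 200), (300, 100, 500, 400)])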

Citation Context

...n be used effectively for traditional bounding-box detection as well as semantic segmentation. Their PASCAL segmentation results improve significantly on the ones reported in this paper. Gupta et al. [46] extend R-CNNs to object detection in depth images. They show that a well-designed input signal, where the depth map is augmented with height above ground and local surface orientation with respect to...

Depth-based hand pose estimation: data, methods, and challenges

by James Steven, Grégory Rogez, Yi Yang, Jamie Shotton, Deva Ramanan
"... Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new ad-vances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and will release all software and evaluation code. We summarize important conclusions here: (1) Pose estimation appears roughly solved for scenes with isolated hands. However, methods still struggle to analyze cluttered scenes where hands may be interacting with nearby objects and surfaces. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.
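
Conclusion (3) above refers to a simple nearest-neighbor baseline. A minimal version, assuming NumPy and using flattened depth crops as features (the paper's actual feature choice may differ), simply copies the pose annotation of the closest training exemplar:

import numpy as np

def nn_pose_estimate(query_feat, train_feats, train_poses):
    """query_feat: (D,) feature; train_feats: (N, D); train_poses: (N, J, 3) joint positions."""
    dist = np.linalg.norm(train_feats - query_feat, axis=1)   # distance to every exemplar
    return train_poses[np.argmin(dist)]                       # copy the nearest exemplar's pose

# Toy setup: features are flattened 32x32 depth crops, poses have 21 joints.
train_feats = np.random.rand(100, 32 * 32).astype(np.float32)
train_poses = np.random.rand(100, 21, 3).astype(np.float32)
query = np.random.rand(32, 32).astype(np.float32).reshape(-1)
pred_pose = nn_pose_estimate(query, train_feats, train_poses)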

Citation Context

...n historically popular. Typically, the depth image is segmented with simple morphological operations [23] or the RGB image is segmented with skin classifiers [40]. While RGB features complement depth [11, 26], skin segmentation appears difficult to generalize across subjects and scenes with varying lighting [25]. We evaluate a depth-based segmentation system [12] for completeness. ...

Second-order constrained parametric proposals and sequential search-based structured prediction for semantic segmentation in RGB-D images

by Dan Banica, Cristian Sminchisescu - In Proceedings of Computer Vision and Pattern Recognition, IEEE, 2015
"... We focus on the problem of semantic segmentation based on RGB-D data, with emphasis on analyzing cluttered in-door scenes containing many visual categories and in-stances. Our approach is based on a parametric figure-ground intensity and depth-constrained proposal process that generates spatial layo ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
We focus on the problem of semantic segmentation based on RGB-D data, with emphasis on analyzing cluttered indoor scenes containing many visual categories and instances. Our approach is based on a parametric figure-ground intensity and depth-constrained proposal process that generates spatial layout hypotheses at multiple locations and scales in the image, followed by a sequential inference algorithm that produces a complete scene estimate. Our contributions can be summarized as follows: (1) a generalization of parametric max flow figure-ground proposal methodology to take advantage of intensity and depth information, in order to systematically and efficiently generate the breakpoints of an underlying spatial model in polynomial time, (2) new region description methods based on second-order pooling over multiple features constructed using both intensity and depth channels, (3) a principled search-based structured prediction inference and learning process that resolves conflicts in overlapping spatial partitions and selects regions sequentially towards complete scene estimates, and (4) extensive evaluation of the impact of depth, as well as the effectiveness of a large number of descriptors, both pre-designed and automatically obtained using deep learning, in a difficult RGB-D semantic segmentation problem with 92 classes. We report state-of-the-art results on the challenging NYU Depth Dataset V2 [44], extended for the RMRC 2013 and RMRC 2014 Indoor Segmentation Challenges, where the proposed model currently ranks first. Moreover, we show that by combining second-order and deep learning features, over 15% relative accuracy improvements can additionally be achieved. In a scene classification benchmark, our methodology further improves the state of the art by 24%.
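
Contribution (2) above relies on second-order pooling of region descriptors. A minimal sketch, assuming NumPy/SciPy: local descriptors inside a region are aggregated by averaging their outer products, giving a fixed-size, covariance-like descriptor regardless of region size; the optional matrix logarithm follows common log-Euclidean practice and may differ in detail from the paper's exact formulation.

import numpy as np
from scipy.linalg import logm

def second_order_pool(descriptors, log_euclidean=True, eps=1e-6):
    """descriptors: (N, D) local features from one region. Returns a (D*D,) vector."""
    x = np.asarray(descriptors, dtype=np.float64)
    m = x.T @ x / len(x)                          # average of outer products, shape (D, D)
    if log_euclidean:
        m = logm(m + eps * np.eye(m.shape[0]))    # matrix logarithm for a flatter metric
    return np.real(m).reshape(-1)

# 500 local descriptors of dimension 16 pooled into one 256-D region descriptor.
region_desc = second_order_pool(np.random.rand(500, 16))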