Results 1 - 10 of 31
Fully convolutional networks for semantic segmentation, 2014
"... Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolu-tional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmen-tation. Our key insight is to build “fully convolutional” networks that take ..."
Abstract
-
Cited by 37 (0 self)
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [17], the VGG net [28], and GoogLeNet [29]) into fully convolutional networks and transfer their learned representations by fine-tuning [2] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
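As a rough illustration of the core idea in this abstract (not the authors' released code), the sketch below converts a classification backbone into a fully convolutional network by replacing the fully connected head with convolutions and upsampling the coarse score map back to input resolution. The VGG16 backbone, layer sizes, and 21-class PASCAL setting are illustrative assumptions, and the skip connections of the full FCN-16s/8s variants are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ToyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.features = vgg.features              # convolutional layers accept any input size
        # "Convolutionalize" the classifier head: fully connected layers become convolutions.
        self.fc6 = nn.Conv2d(512, 4096, kernel_size=7, padding=3)
        self.fc7 = nn.Conv2d(4096, 4096, kernel_size=1)
        self.score = nn.Conv2d(4096, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        x = self.features(x)                       # coarse feature map (stride 32)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        x = self.score(x)                          # per-class score map
        # Upsample to input resolution for dense, pixelwise prediction.
        return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)

logits = ToyFCN()(torch.randn(1, 3, 320, 480))     # -> shape (1, 21, 320, 480)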
Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, arXiv preprint arXiv:1411.4734, 2014
"... In this paper we address three different computer vision tasks using a single multiscale convolutional network archi-tecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifica-tions, regre ..."
Abstract
-
Cited by 7 (1 self)
In this paper we address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.
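A minimal sketch of the coarse-to-fine refinement described above: a first, global stage predicts a low-resolution map, and a second stage refines it from the image concatenated with the upsampled coarse prediction. The layer sizes and the single-channel (depth-style) output are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFine(nn.Module):
    def __init__(self, out_channels=1):                       # e.g. one channel for depth
        super().__init__()
        self.coarse = nn.Sequential(                          # stand-in for a deep, global network
            nn.Conv2d(3, 32, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1))
        self.refine = nn.Sequential(                          # sees the image plus the coarse output
            nn.Conv2d(3 + out_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1))

    def forward(self, img):
        coarse = self.coarse(img)                             # low-resolution prediction
        up = F.interpolate(coarse, size=img.shape[2:],
                           mode="bilinear", align_corners=False)
        return self.refine(torch.cat([img, up], dim=1))       # refine with fine image detail

pred = CoarseToFine()(torch.randn(1, 3, 240, 320))            # -> shape (1, 1, 240, 320)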
Robo brain: Large-scale knowledge engine for robots, 2014
"... Abstract-In this paper we introduce a knowledge engine, which learns and shares knowledge representations, for robots to carry out a variety of tasks. Building such an engine brings with it the challenge of dealing with multiple data modalities including symbols, natural language, haptic senses, ro ..."
Abstract
-
Cited by 6 (6 self)
In this paper we introduce a knowledge engine, which learns and shares knowledge representations, for robots to carry out a variety of tasks. Building such an engine brings with it the challenge of dealing with multiple data modalities including symbols, natural language, haptic senses, robot trajectories, visual features and many others. The knowledge stored in the engine comes from multiple sources including physical interactions that robots have while performing tasks (perception, planning and control), knowledge bases from the WWW, and learned representations from leading robotics research groups. We discuss various technical aspects and associated challenges such as modeling the correctness of knowledge, inferring latent information and formulating different robotic tasks as queries to the knowledge engine. We describe the system architecture and how it supports different mechanisms for users and robots to interact with the engine. Finally, we demonstrate its use in three important research areas: grounding natural language, perception, and planning, which are the key building blocks for many robotic tasks. This knowledge engine is a collaborative effort and we call it RoboBrain.
Recognize complex events from static images by fusing deep channels, in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015
"... A considerable portion of web images capture events that occur in our personal lives or social activities. In this pa-per, we aim to develop an effective method for recogniz-ing events from such images. Despite the sheer amount of study on event recognition, most existing methods rely on videos and ..."
Abstract
-
Cited by 3 (0 self)
A considerable portion of web images capture events that occur in our personal lives or social activities. In this paper, we aim to develop an effective method for recognizing events from such images. Despite the sheer amount of study on event recognition, most existing methods rely on videos and are not directly applicable to this task. Generally, events are complex phenomena that involve interactions among people and objects, and therefore analysis of event photos requires techniques that can go beyond recognizing individual objects and carry out joint reasoning based on evidence of multiple aspects. Inspired by the recent success of deep learning, we formulate a multi-layer framework to tackle this problem, which takes into account both visual appearance and the interactions among humans and objects, and combines them via semantic fusion. An important issue arising here is that humans and objects discovered by detectors are in the form of bounding boxes, and there is no straightforward way to represent their interactions and incorporate them with a deep network. We address this using a novel strategy that projects the detected instances onto multi-scale spatial maps. On a large dataset with 60,000 images, the proposed method achieved substantial improvement over the state-of-the-art, raising the accuracy of event recognition by over 10%.
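The key mechanism in this abstract, projecting detected instances onto multi-scale spatial maps that a deep network can consume, can be sketched roughly as follows; the grid sizes, class count, and box format are illustrative assumptions rather than the paper's exact construction.

import numpy as np

def boxes_to_spatial_maps(boxes, num_classes, img_h, img_w, scales=(8, 16, 32)):
    # boxes: list of (class_id, x1, y1, x2, y2, score) in image coordinates.
    # Returns one (num_classes, s, s) map per scale, with detection scores rasterized
    # into the grid cells covered by each bounding box.
    maps = []
    for s in scales:
        m = np.zeros((num_classes, s, s), dtype=np.float32)
        for cls, x1, y1, x2, y2, score in boxes:
            gx1, gx2 = int(x1 / img_w * s), int(np.ceil(x2 / img_w * s))
            gy1, gy2 = int(y1 / img_h * s), int(np.ceil(y2 / img_h * s))
            m[cls, gy1:gy2, gx1:gx2] = np.maximum(m[cls, gy1:gy2, gx1:gx2], score)
        maps.append(m)
    return maps

maps = boxes_to_spatial_maps([(0, 40, 60, 200, 300, 0.9)],
                             num_classes=5, img_h=480, img_w=640)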
Aligning 3D models to RGB-D images of cluttered scenes, in CVPR, 2015
"... eecs.berkeley.edu uniandes.edu.co microsoft.com eecs.berkeley.edu The goal of this work is to represent objects in an RGB-D scene with corresponding 3D models from a library. We ap-proach this problem by first detecting and segmenting object instances in the scene and then using a convolutional neur ..."
Abstract
-
Cited by 3 (0 self)
The goal of this work is to represent objects in an RGB-D scene with corresponding 3D models from a library. We approach this problem by first detecting and segmenting object instances in the scene and then using a convolutional neural network (CNN) to predict the pose of the object. This CNN is trained using pixel surface normals in images containing renderings of synthetic objects. When tested on real data, our method outperforms alternative algorithms trained on real data. We then use this coarse pose estimate along with the inferred pixel support to align a small number of prototypical models to the data, and place into the scene the model that fits best. We observe a 48% relative improvement in performance at the task of 3D detection over the current state-of-the-art [34], while being an order of magnitude faster.
SUN RGB-D: A RGB-D scene understanding benchmark suite, in CVPR, 2015
"... Although RGB-D sensors have enabled major break-throughs for several vision tasks, such as 3D reconstruc-tion, we have not attained the same level of success in high-level scene understanding. Perhaps one of the main rea-sons is the lack of a large-scale benchmark with 3D anno-tations and 3D evaluat ..."
Abstract
-
Cited by 3 (1 self)
Although RGB-D sensors have enabled major breakthroughs for several vision tasks, such as 3D reconstruction, we have not attained the same level of success in high-level scene understanding. Perhaps one of the main reasons is the lack of a large-scale benchmark with 3D annotations and 3D evaluation metrics. In this paper, we introduce an RGB-D benchmark suite for the goal of advancing the state-of-the-art in all major scene understanding tasks. Our dataset is captured by four different sensors and contains 10,335 RGB-D images, at a similar scale as PASCAL VOC. The whole dataset is densely annotated and includes 146,617 2D polygons and 64,595 3D bounding boxes with accurate object orientations, as well as a 3D room layout and scene category for each image. This dataset enables us to train data-hungry algorithms for scene-understanding tasks, evaluate them using meaningful 3D metrics, avoid overfitting to a small testing set, and study cross-sensor bias.
RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features, in Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2015
"... Abstract — Object recognition and pose estimation from RGB-D images are important tasks for manipulation robots which can be learned from examples. Creating and annotating datasets for learning is expensive, however. We address this problem with transfer learning from deep convolutional neural netwo ..."
Abstract
-
Cited by 2 (0 self)
Object recognition and pose estimation from RGB-D images are important tasks for manipulation robots which can be learned from examples. Creating and annotating datasets for learning is expensive, however. We address this problem with transfer learning from deep convolutional neural networks (CNN) that are pre-trained for image categorization and provide a rich, semantically meaningful feature set. We incorporate depth information, which the CNN was not trained with, by rendering objects from a canonical perspective and colorizing the depth channel according to distance from the object center. We evaluate our approach on the Washington RGB-D Objects dataset, where we find that the generated feature set naturally separates classes and instances well and retains pose manifolds. We outperform the state-of-the-art on a number of subtasks and show that our approach can yield superior results when only little training data is available.
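A small sketch of the depth-colorization step described above: depth values on the object are normalized and mapped through a jet-style colormap so that a CNN pre-trained on RGB images can consume them. The normalization and the hand-rolled colormap here are assumptions rather than the paper's exact scheme (which colorizes rendered canonical views by distance from the object center).

import numpy as np

def colorize_depth(depth, mask):
    # depth: HxW array of depth values; mask: HxW boolean object mask.
    d = depth.astype(np.float32)
    vals = d[mask]
    # Normalize depth on the object to [0, 1]; background stays at 0.
    d = np.where(mask, (d - vals.min()) / max(float(vals.max() - vals.min()), 1e-6), 0.0)
    # Cheap jet-like colormap: small normalized depth maps toward blue, large toward red.
    r = np.clip(1.5 - np.abs(4 * d - 3), 0, 1)
    g = np.clip(1.5 - np.abs(4 * d - 2), 0, 1)
    b = np.clip(1.5 - np.abs(4 * d - 1), 0, 1)
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)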
Region-based Convolutional Networks for Accurate Object Detection and Segmentation
"... Abstract—Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this pap ..."
Abstract
-
Cited by 2 (0 self)
Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 50% relative to the previous best result on VOC 2012, achieving a mAP of 62.4%. Our approach combines two ideas: (1) one can apply high-capacity convolutional networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data are scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, boosts performance significantly. Since we combine region proposals with CNNs, we call the resulting model an R-CNN or Region-based Convolutional Network. Source code for the complete system is available at
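The two ideas in this abstract (CNN features computed on bottom-up region proposals, plus pre-training and fine-tuning) reduce to a short detection loop. The sketch below is not the released R-CNN code; the proposal source, feature extractor, and per-class classifiers are placeholder callables.

def rcnn_detect(image, propose_regions, extract_cnn_features, classifiers,
                score_threshold=0.5):
    # propose_regions(image)     -> list of (x1, y1, x2, y2) boxes (e.g. selective search)
    # extract_cnn_features(crop) -> feature vector from a (fine-tuned) CNN on the warped crop
    # classifiers                -> {class_name: scoring function over feature vectors}
    detections = []
    for (x1, y1, x2, y2) in propose_regions(image):
        crop = image[y1:y2, x1:x2]            # R-CNN warps each proposal to the CNN input size
        feats = extract_cnn_features(crop)
        for cls, score_fn in classifiers.items():
            score = score_fn(feats)
            if score > score_threshold:
                detections.append((cls, (x1, y1, x2, y2), score))
    # A real system would finish with per-class non-maximum suppression.
    return detections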
Depth-based hand pose estimation: data, methods, and challenges
"... Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new ad-vances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, ..."
Abstract
-
Cited by 1 (0 self)
Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and will release all software and evaluation code. We summarize important conclusions here: (1) Pose estimation appears roughly solved for scenes with isolated hands. However, methods still struggle to analyze cluttered scenes where hands may be interacting with nearby objects and surfaces. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.
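The nearest-neighbor baseline mentioned in point (3) amounts to a lookup in a database of annotated training frames: featurize the query depth image and copy the pose of the closest training example. A minimal sketch, with the feature representation left as an assumption:

import numpy as np

def nearest_neighbor_pose(query_feat, train_feats, train_poses):
    # train_feats: (N, D) features for N training depth images.
    # train_poses: length-N list of pose annotations (e.g. joint positions).
    dists = np.linalg.norm(train_feats - query_feat[None, :], axis=1)
    return train_poses[int(np.argmin(dists))]

# Toy usage: 100 training samples with 64-D features and 21 3D hand joints each.
feats = np.random.rand(100, 64).astype(np.float32)
poses = [np.random.rand(21, 3) for _ in range(100)]
pred = nearest_neighbor_pose(np.random.rand(64).astype(np.float32), feats, poses)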
Second-order constrained parametric proposals and sequential search-based structured prediction for semantic segmentation in RGB-D images, in Proceedings of Computer Vision and Pattern Recognition, IEEE, 2015
"... We focus on the problem of semantic segmentation based on RGB-D data, with emphasis on analyzing cluttered in-door scenes containing many visual categories and in-stances. Our approach is based on a parametric figure-ground intensity and depth-constrained proposal process that generates spatial layo ..."
Abstract
-
Cited by 1 (1 self)
We focus on the problem of semantic segmentation based on RGB-D data, with emphasis on analyzing cluttered indoor scenes containing many visual categories and instances. Our approach is based on a parametric figure-ground intensity and depth-constrained proposal process that generates spatial layout hypotheses at multiple locations and scales in the image, followed by a sequential inference algorithm that produces a complete scene estimate. Our contributions can be summarized as follows: (1) a generalization of the parametric max-flow figure-ground proposal methodology to take advantage of intensity and depth information, in order to systematically and efficiently generate the breakpoints of an underlying spatial model in polynomial time, (2) new region description methods based on second-order pooling over multiple features constructed using both intensity and depth channels, (3) a principled search-based structured prediction inference and learning process that resolves conflicts in overlapping spatial partitions and selects regions sequentially towards complete scene estimates, and (4) extensive evaluation of the impact of depth, as well as the effectiveness of a large number of descriptors, both pre-designed and automatically obtained using deep learning, in a difficult RGB-D semantic segmentation problem with 92 classes. We report state-of-the-art results on the challenging NYU Depth Dataset V2 [44], extended for the RMRC 2013 and RMRC 2014 Indoor Segmentation Challenges, where the proposed model currently ranks first. Moreover, we show that by combining second-order and deep learning features, over 15% relative accuracy improvements can additionally be achieved. In a scene classification benchmark, our methodology further improves the state of the art by 24%.
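Contribution (2), second-order pooling of region descriptors, can be sketched in a few lines: average the outer products of the local features inside a region and flatten the resulting matrix, optionally after a log-Euclidean map. The descriptor dimensionality below is an illustrative assumption.

import numpy as np

def second_order_pooling(local_feats, eps=1e-6):
    # local_feats: (N, D) array, one D-dimensional descriptor per pixel/point in the region.
    cov = local_feats.T @ local_feats / len(local_feats)   # average of outer products
    cov += eps * np.eye(cov.shape[0])                      # keep the matrix positive definite
    w, v = np.linalg.eigh(cov)                             # matrix logarithm via eigendecomposition
    log_cov = (v * np.log(w)) @ v.T
    return log_cov.ravel()                                 # D*D region descriptor

region_descriptor = second_order_pooling(np.random.rand(500, 32))   # -> 1024-D vector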