Results 1 - 10
of
26
Robotic Grasping of Novel Objects using Vision
"... We consider the problem of grasping novel objects, specifically ones that are being seen for the first time through vision. Grasping a previously unknown object, one for which a 3-d model is not available, is a challenging problem. Further, even if given a model, one still has to decide where to gra ..."
Abstract
-
Cited by 57 (9 self)
- Add to MetaCart
We consider the problem of grasping novel objects, specifically ones that are being seen for the first time through vision. Grasping a previously unknown object, one for which a 3-d model is not available, is a challenging problem. Further, even if given a model, one still has to decide where to grasp the object. We present a learning algorithm that neither requires, nor tries to build, a 3-d model of the object. Given two (or more) images of an object, our algorithm attempts to identify a few points in each image corresponding to good locations at which to grasp the object. This sparse set of points is then triangulated to obtain a 3-d location at which to attempt a grasp. This is in contrast to standard dense stereo, which tries to triangulate every single point in an image (and often fails to return a good 3-d model). Our algorithm for identifying grasp locations from an image is trained via supervised learning, using synthetic images for the training set. We demonstrate this approach on two robotic manipulation platforms. Our algorithm successfully grasps a wide variety of objects, such as plates, tape-rolls, jugs, cellphones, keys, screwdrivers, staplers, a thick coil of wire, a strangely shaped power horn, and others, none of which were seen in the training set. We also apply our method to the task of unloading items from dishwashers. 1 1
3-D depth reconstruction from a single still image
, 2006
"... We consider the task of 3-d depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments which include forests, sidewalks, trees, buildings, etc ..."
Abstract
-
Cited by 38 (12 self)
- Add to MetaCart
We consider the task of 3-d depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments which include forests, sidewalks, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the value of the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a hierarchical, multiscale Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models the depths and the relation between depths at different points in the image. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. We further propose a model that incorporates both monocular cues and stereo (triangulation) cues, to obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone.
Learning Spatial Context: Using Stuff to Find Things
"... Abstract. The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally cl ..."
Abstract
-
Cited by 35 (1 self)
- Add to MetaCart
Abstract. The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context ” toward improving detection. In particular, we cluster image regions based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We also present a method for learning the active set of relationships for a particular dataset. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images, demonstrating significant improvements over state-of-the-art detectors. 1
Make3D: Learning 3D Scene Structure from a Single Still Image
"... We consider the problem of estimating detailed 3-d structure from a single still image of an unstructured environment. Our goal is to create 3-d models which are both quantitatively accurate as well as visually pleasing. For each small homogeneous patch in the image, we use a Markov Random Field (M ..."
Abstract
-
Cited by 30 (8 self)
- Add to MetaCart
We consider the problem of estimating detailed 3-d structure from a single still image of an unstructured environment. Our goal is to create 3-d models which are both quantitatively accurate as well as visually pleasing. For each small homogeneous patch in the image, we use a Markov Random Field (MRF) to infer a set of “plane parameters” that capture both the 3-d location and 3-d orientation of the patch. The MRF, trained via supervised learning, models both image depth cues as well as the relationships between different parts of the image. Other than assuming that the environment is made up of a number of small planes, our model makes no explicit assumptions about the structure of the scene; this enables the algorithm to capture much more detailed 3-d structure than does prior art, and also give a much richer experience in the 3-d flythroughs created using image-based rendering, even for scenes with significant non-vertical structure. Using this approach, we have created qualitatively correct 3-d models for 64.9 % of 588 images downloaded from the internet. We have also extended our model to produce large scale 3d models from a few images.
Decomposing a Scene into Geometric and Semantically Consistent Regions
"... High-level, or holistic, scene understanding involves reasoning about objects, regions, and the 3D relationships between them. This requires a representation above the level of pixels that can be endowed with high-level attributes such as class of object/region, its orientation, and (rough 3D) locat ..."
Abstract
-
Cited by 27 (4 self)
- Add to MetaCart
High-level, or holistic, scene understanding involves reasoning about objects, regions, and the 3D relationships between them. This requires a representation above the level of pixels that can be endowed with high-level attributes such as class of object/region, its orientation, and (rough 3D) location within the scene. Towards this goal, we propose a region-based model which combines appearance and scene geometry to automatically decompose a scene into semantically meaningful regions. Our model is defined in terms of a unified energy function over scene appearance and structure. We show how this energy function can be learned from data and present an efficient inference technique that makes use of multiple over-segmentations of the image to propose moves in the energy-space. We show, experimentally, that our method achieves state-of-the-art performance on the tasks of both multi-class image segmentation and geometric reasoning. Finally, by understanding region classes and geometry, we show how our model can be used as the basis for 3D reconstruction of the scene. 1.
Cascaded Classification Models: Combining Models for Holistic Scene Understanding
"... One of the original goals of computer vision was to fully understand a natural scene. This requires solving several sub-problems simultaneously, including object detection, region labeling, and geometric reasoning. The last few decades have seen great progress in tackling each of these problems in i ..."
Abstract
-
Cited by 20 (10 self)
- Add to MetaCart
One of the original goals of computer vision was to fully understand a natural scene. This requires solving several sub-problems simultaneously, including object detection, region labeling, and geometric reasoning. The last few decades have seen great progress in tackling each of these problems in isolation. Only recently have researchers returned to the difficult task of considering them jointly. In this work, we consider learning a set of related models in such that they both solve their own problem and help each other. We develop a framework called Cascaded Classification Models (CCM), where repeated instantiations of these classifiers are coupled by their input/output variables in a cascade that improves performance at each level. Our method requires only a limited “black box ” interface with the models, allowing us to use very sophisticated, state-of-the-art classifiers without having to look under the hood. We demonstrate the effectiveness of our method on a large set of natural images by combining the subtasks of scene categorization, object detection, multiclass image segmentation, and 3d reconstruction. 1
Closing the Loop in Scene Interpretation
"... Image understanding involves analyzing many different aspects of the scene. In this paper, we are concerned with how these tasks can be combined in a way that improves the performance of each of them. Inspired by Barrow and Tenenbaum, we present a flexible framework for interfacing scene analysis pr ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
Image understanding involves analyzing many different aspects of the scene. In this paper, we are concerned with how these tasks can be combined in a way that improves the performance of each of them. Inspired by Barrow and Tenenbaum, we present a flexible framework for interfacing scene analysis processes using intrinsic images. Each intrinsic image is a registered map describing one characteristic of the scene. We apply this framework to develop an integrated 3D scene understanding system with estimates of surface orientations, occlusion boundaries, objects, camera viewpoint, and relative depth. Our experiments on a set of 300 outdoor images demonstrate that these tasks reinforce each other, and we illustrate a coherent scene understanding with automatically reconstructed 3D models. 1.
Learning to Open New Doors
"... Abstract — As robots enter novel, uncertain home and office environments, they are able to navigate these environments successfully. However, to be practically deployed, robots should be able to manipulate their environment to gain access to new spaces, such as by opening a door and operating an ele ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
Abstract — As robots enter novel, uncertain home and office environments, they are able to navigate these environments successfully. However, to be practically deployed, robots should be able to manipulate their environment to gain access to new spaces, such as by opening a door and operating an elevator. This, however, remains a challenging problem because a robot will likely encounter doors (and elevators) it has never seen before. Objects such as door handles are very different in appearance, yet similar function implies similar form. These general, shared visual features can be extracted to provide a robot with the necessary information to manipulate the specific object and carry out a task. For example, opening a door requires the robot to identify the following properties: (a) location of the door handle axis of rotation, (b) size of the handle, and (c) type of handle (leftturn or right-turn). Given these keypoints, the robot can plan the sequence of control actions required to successfully open the door. We identify these “visual keypoints ” using vision-based learning algorithms. Our system assumes no prior knowledge of the 3D location or shape of the door handle. By experimentally verifying our algorithms on doors not seen in the training set, we advance our work towards being the first to enable a robot to navigate to more spaces in a new building by opening doors and elevators, even ones it has not seen before. I.
Building a database of 3D scenes from user annotations
"... Inthispaper, wewishtobuildahighqualitydatabaseof images depicting scenes, along with their real-world threedimensional (3D) coordinates. Such a database is useful for a variety of applications, including training systems for objectdetectionandvalidationof3Doutput. We build such adatabasefromimagesth ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
Inthispaper, wewishtobuildahighqualitydatabaseof images depicting scenes, along with their real-world threedimensional (3D) coordinates. Such a database is useful for a variety of applications, including training systems for objectdetectionandvalidationof3Doutput. We build such adatabasefromimagesthathavebeenannotatedwithonly theidentityofobjectsandtheirspatialextentinimages. Importantforthistaskistherecoveryofgeometricinformation thatisimplicitintheobjectlabels,suchasqualtitative relationships between objects (attachment, support, occlusion) and quantitative ones (inferring camera parameters). We describeamodelthatintegratescuesextracted from the object labels to infer the implicit geometric information. We show that we are able to obtain high quality 3D information by evaluating the proposed approach on a database
Semi-automatic Stereo Extraction from Video Footage
"... We present a semi-automatic system that converts conventional video shots to stereoscopic video pairs. The system requires just a few user-scribbles in a sparse set of frames. The system combines a diffusion scheme, which takes into account the local saliency and the local motion at each video locat ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
We present a semi-automatic system that converts conventional video shots to stereoscopic video pairs. The system requires just a few user-scribbles in a sparse set of frames. The system combines a diffusion scheme, which takes into account the local saliency and the local motion at each video location, coupled with a classification scheme that assigns depth to image patches. The system tolerates both scene motion and camera motion. In typical shots, containing hundreds of frames, even in the face of significant motion, it is enough to mark scribbles on the first and last frames of the shot. Once marked, plausible stereo results are obtained in a matter of seconds, leading to a scalable video conversion system. Finally, we validate our results with ground truth stereo video. 1.

