Results 1 - 10 of 124
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories
- In CVPR
Cited by 495 (25 self)
This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting “spatial pyramid” is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba’s “gist” and Lowe’s SIFT descriptors.
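The partition-and-histogram step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes features have already been quantized into visual-word indices, that coordinates are normalized to [0, 1], and that level l uses a 2^l x 2^l grid. The function names, the `levels` parameter, and the unweighted histogram intersection used for comparison are all assumptions made for the example; any per-level weighting is omitted because the abstract does not describe it.

```python
import numpy as np

def spatial_pyramid_histogram(xs, ys, words, vocab_size, levels=2):
    """Concatenate per-cell visual-word histograms over a pyramid of grids.

    xs, ys: feature coordinates, assumed normalized to [0, 1].
    words: integer visual-word index of each feature (0 <= w < vocab_size).
    Level l splits the image into 2**l x 2**l cells (an assumption).
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    words = np.asarray(words, dtype=int)
    parts = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Grid cell of each feature; clip guards coordinates equal to 1.0.
        cx = np.clip((xs * cells).astype(int), 0, cells - 1)
        cy = np.clip((ys * cells).astype(int), 0, cells - 1)
        # Flatten (cell row, cell column, word) into a single bin index.
        flat = (cy * cells + cx) * vocab_size + words
        parts.append(np.bincount(flat, minlength=cells * cells * vocab_size))
    return np.concatenate(parts)

def histogram_intersection(h1, h2):
    """Unweighted overlap between two pyramid histograms (illustrative only)."""
    return int(np.minimum(h1, h2).sum())
```

With `levels=0` the result reduces to a single orderless bag-of-features histogram, which is the sense in which the pyramid extends that representation.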
Recovering human body configurations: Combining segmentation and recognition
- In CVPR, 2004
Cited by 112 (8 self)
The goal of this work is to take an image such as the one in Figure 1(a), detect a human figure, and localize his joints and limbs (b) along with their associated pixel masks (c). In this work we attempt to tackle this problem in a general setting. The dataset we use is a collection of sports news photographs of baseball players, varying dramatically in pose and clothing. The approach that we take is to use segmentation to guide our recognition algorithm to salient bits of the image. We use this segmentation approach to build limb and torso detectors, the outputs of which are assembled into human figures. We present quantitative results on torso localization, in addition to shortlisted full body configurations.
Scene completion using millions of photographs
- ACM Transactions on Graphics (SIGGRAPH
, 2007
Cited by 112 (7 self)
Figure 1: Given an input image with a missing region, we use matching scenes from a large collection of photographs to complete the image.
What can you do with a million images? In this paper we present a new image completion algorithm powered by a huge database of photographs gathered from the Web. The algorithm patches up holes in images by finding similar image regions in the database that are not only seamless but also semantically valid. Our chief insight is that while the space of images is effectively infinite, the space of semantically differentiable scenes is actually not that large. For many image completion tasks we are able to find similar scenes which contain image fragments that will convincingly complete the image. Our algorithm is entirely data-driven, requiring no annotations or labelling by the user. Unlike existing image completion methods, our algorithm can generate a diverse set of results for each input image and we allow users to select among them. We demonstrate the superiority of our algorithm over existing image completion approaches.
Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
- 2003
Cited by 105 (10 self)
Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification.
Modeling scenes with local descriptors and latent aspects
- In Proc. of IEEE Int. Conf. on Computer Vision, 2005
Cited by 60 (12 self)
We present a new approach to model visual scenes in image collections, based on local invariant features and probabilistic latent space models. Our formulation provides answers to three open questions: (1) whether the invariant local features are suitable for scene (rather than object) classification; (2) whether unsupervised latent space models can be used for feature extraction in the classification task; and (3) whether the latent space formulation can discover visual co-occurrence patterns, motivating novel approaches for image organization and segmentation. Using a 9500-image dataset, our approach is validated on each of these issues. First, we show with extensive experiments on binary and multi-class scene classification tasks that a bag-of-visterm representation, derived from local invariant descriptors, consistently outperforms state-of-the-art approaches. Second, we show that Probabilistic Latent Semantic Analysis (PLSA) generates a compact scene representation, discriminative for accurate classification, and significantly more robust when less training data are available. Third, we have exploited the ability of PLSA to automatically extract visually meaningful aspects, to propose new algorithms for aspect-based image ranking and context-sensitive image segmentation.
Learning Spatial Context: Using Stuff to Find Things
Cited by 35 (1 self)
The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context” toward improving detection. In particular, we cluster image regions based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We also present a method for learning the active set of relationships for a particular dataset. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images, demonstrating significant improvements over state-of-the-art detectors.
Semantic Place Classification of Indoor Environments With Mobile Robots using Boosting
- in Proc. of the National Conference on Artificial Intelligence (AAAI
, 2005
Cited by 22 (7 self)
Indoor environments can typically be divided into places with different functionalities like kitchens, offices, or seminar rooms. We believe that such semantic information enables a mobile robot to more efficiently accomplish a variety of tasks such as human-robot interaction, path-planning, or localization. This paper presents a supervised learning approach to label different locations using boosting. We train a classifier using features extracted from vision and laser range data. Furthermore, we apply a Hidden Markov Model to increase the robustness of the final classification. Our technique has been implemented and tested on real robots as well as in simulation. The experiments demonstrate that our approach can be utilized to robustly classify places into semantic categories. We also present an example of localization using semantic labeling.
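One way a Hidden Markov Model can make per-observation place labels more robust is forward filtering, sketched below. This is an illustrative assumption about how such smoothing might work, not the method of the paper: the function name `hmm_filter`, the "sticky" transition matrix, and the treatment of classifier confidences as observation likelihoods are all hypothetical choices made for the example.

```python
import numpy as np

def hmm_filter(likelihoods, transition, prior):
    """Forward filtering: belief over place labels after each observation.

    likelihoods: (T, K) array; likelihoods[t, k] is treated as
        p(observation_t | place k), e.g. a classifier confidence (assumption).
    transition: (K, K) array; transition[i, j] = p(place j next | place i now).
    prior: (K,) initial distribution over places.
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    transition = np.asarray(transition, dtype=float)
    belief = np.asarray(prior, dtype=float)
    beliefs = np.empty_like(likelihoods)
    for t, lik in enumerate(likelihoods):
        if t > 0:
            belief = belief @ transition  # predict: places rarely change
        belief = belief * lik             # update with the new observation
        belief = belief / belief.sum()    # renormalize to a distribution
        beliefs[t] = belief
    return beliefs

# Hypothetical two-place example: the third observation weakly favors place 1,
# but the filtered belief stays with place 0 because transitions are unlikely.
example = hmm_filter(
    [[0.9, 0.1], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1]],
    [[0.9, 0.1], [0.1, 0.9]],
    [0.5, 0.5],
)
```

In this toy run, taking the most likely place per observation independently would flip the label at the third step, while the filtered belief does not.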
Contextual Recognition of Head Gestures
- PROCEEDINGS OF THE SEVENTH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERFACES (ICMI'05
, 2005
Cited by 19 (3 self)
Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. We investigate how dialog context from an embodied conversational agent (ECA) can improve visual recognition of user gestures. We present a recognition framework which (1) extracts contextual features from an ECA's dialog manager, (2) computes a prediction of head nods and head shakes, and (3) integrates the contextual predictions with the visual observation of a vision-based head gesture recognizer. We found a subset of lexical, punctuation and timing features that are easily available in most ECA architectures and can be used to learn how to predict user feedback. Using a discriminative approach to contextual prediction and multi-modal integration, we were able to improve the performance of head gesture detection even when the topic of the test set was significantly different from that of the training set.
Supervised semantic labeling of places using information extracted from sensor data
- Robotics and Autonomous Systems, 2007
Cited by 19 (3 self)
Indoor environments can typically be divided into places with different functionalities like corridors, kitchens, offices, or seminar rooms. The ability to learn such semantic categories from sensor data enables a mobile robot to extend the representation of the environment, facilitating the interaction with humans. As an example, natural language terms like “corridor” or “room” can be used to communicate the position of the robot in a map in a more intuitive way. In this work, we first propose an approach based on supervised learning to classify the pose of a mobile robot into semantic classes. Our method uses AdaBoost to boost simple features extracted from range data and vision into a strong classifier. We present two main applications of this approach. Firstly, we show how our approach can be utilized by a moving robot for an online classification of the poses traversed along its path using a hidden Markov model. Secondly, we introduce an approach to learn topological maps from geometric maps by applying our semantic classification procedure in combination with a probabilistic relaxation procedure. We finally show how to apply associative Markov networks (AMNs) together with AdaBoost for classifying complete geometric maps. Experimental results obtained in simulation and with real robots demonstrate the effectiveness of our approach in various indoor environments.
Reduced SIFT features for image retrieval and indoor localisation
- In Australian Conference on Robotics and Automation, 2004
Cited by 15 (0 self)
SIFT features are distinctive invariant features used to robustly describe and match digital image content between different views of a scene. While invariant to scale and rotation, and robust to other image transforms, the SIFT feature description of an image is typically large and slow to compute. This paper presents a method to reduce the size, complexity and matching time of SIFT feature sets for use in indoor image retrieval and robot localisation. Our method takes advantage of the structure of typical indoor environments to reduce the complexity of each SIFT feature and the number of SIFT features required to describe a scene.

