Recognition using visual phrases
- in Proceedings of CVPR, 2011
- Cited by 14 (1 self)
In this paper we introduce visual phrases, complex visual composites like “a person riding a horse”. Visual phrases often display significantly reduced visual complexity compared to their component objects, because the appearance of those objects can change profoundly when they participate in relations. We introduce a dataset suitable for phrasal recognition that uses familiar PASCAL object categories, and demonstrate significant experimental gains resulting from exploiting visual phrases. We show that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects. We argue that any multi-class detection system must decode detector outputs to produce final results; this is usually done with non-maximum suppression. We describe a novel decoding procedure that can account accurately for local context without solving difficult inference problems. We show this decoding procedure outperforms the state of the art. Finally, we show that decoding a combination of phrasal and object detectors produces real improvements in detector results.
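The abstract above names non-maximum suppression as the usual way of decoding detector outputs. The following is a minimal sketch of the common greedy variant only; it is not the novel decoding procedure the paper proposes. The box format `(x1, y1, x2, y2)`, the function names, and the 0.5 overlap threshold are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x2 > x1, y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it overlaps every already-kept box
    by at most iou_threshold (hypothetical default of 0.5).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```

Greedy suppression treats each detection in isolation from its neighbours apart from overlap, which is the limitation the abstract's context-aware decoding is meant to address.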
A Discriminative Latent Model of Object Classes and Attributes
- Cited by 11 (4 self)
We present a discriminatively trained model for joint modelling of object class labels (e.g. “person”, “dog”, “chair”, etc.) and their visual attributes (e.g. “has head”, “furry”, “metal”, etc.). We treat attributes of an object as latent variables in our model and capture the correlations among attributes using an undirected graphical model built from training data. The advantage of our model is that it allows us to infer object class labels using the information of both the test image itself and its (latent) attributes. Our model unifies object class prediction and attribute prediction in a principled framework. It is also flexible enough to deal with different performance measurements. Our experimental results provide quantitative evidence that attributes can improve object naming.
Unbiased look at dataset bias
- in CVPR, 2011
- Cited by 10 (3 self)
Datasets are an integral part of contemporary object recognition research. They have been the chief reason for the considerable progress in the field, not just as a source of large amounts of training data, but also as a means of measuring and comparing the performance of competing algorithms. At the same time, datasets have often been blamed for narrowing the focus of object recognition research, reducing it to a single benchmark performance number. Indeed, some datasets that started out as data capture efforts aimed at representing the visual world have become closed worlds unto themselves (e.g. the Corel world, the Caltech-101 world, the PASCAL VOC world). With the focus on beating the latest benchmark numbers on the latest dataset, have we perhaps lost sight of the original purpose? The goal of this paper is to take stock of the current state of recognition datasets. We present a comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of closed-world assumption, and sample value. The experimental results, some rather surprising, suggest directions that can improve dataset collection as well as algorithm evaluation protocols. But more broadly, the hope is to stimulate discussion in the community regarding this very important, but largely neglected issue.
Action Recognition from a Distributed Representation of Pose and Appearance
- Cited by 8 (1 self)
We present a distributed representation of pose and appearance of people called the “poselet activation vector”. First we show that this representation can be used to estimate the pose of people defined by the 3D orientations of the head and torso in the challenging PASCAL VOC 2010 person detection dataset. Our method is robust to clutter, aspect and viewpoint variation and works even when body parts like faces and limbs are occluded or hard to localize. We combine this representation with other sources of information like interaction with objects and other people in the image and use it for action recognition. We report competitive results on the PASCAL VOC 2010 static image action classification challenge.
Ensemble of Exemplar-SVMs for Object Detection and Beyond
- Cited by 6 (4 self)
This paper proposes a conceptually simple but surprisingly powerful method which combines the effectiveness of a discriminative object detector with the explicit correspondence offered by a nearest-neighbor approach. The method is based on training a separate linear SVM classifier for every exemplar in the training set. Each of these Exemplar-SVMs is thus defined by a single positive instance and millions of negatives. While each detector is quite specific to its exemplar, we empirically observe that an ensemble of such Exemplar-SVMs offers surprisingly good generalization. Our performance on the PASCAL VOC detection task is on par with the much more complex latent part-based model of Felzenszwalb et al., at only a modest computational cost increase. But the central benefit of our approach is that it creates an explicit association between each detection and a single training exemplar. Because most detections show good alignment to their associated exemplar, it is possible to transfer any available exemplar meta-data (segmentation, geometric structure, 3D model, etc.) directly onto the detections, which can then be used as part of overall scene understanding.
Shared Parts for Deformable Part-based Models
- Cited by 5 (0 self)
The deformable part-based model (DPM) proposed by Felzenszwalb et al. has demonstrated state-of-the-art results in object localization. The model offers a high degree of learnt invariance by utilizing viewpoint-dependent mixture components and movable parts in each mixture component. One might hope to increase the accuracy of the DPM by increasing the number of mixture components and parts to give a more faithful model, but limited training data prevents this from being effective. We propose an extension to the DPM which allows for sharing of object part models among multiple mixture components as well as object classes. This results in more compact models and allows training examples to be shared by multiple components, ameliorating the effect of a limited-size training set. We (i) reformulate the DPM to incorporate part sharing, and (ii) propose a novel energy function allowing for coupled training of mixture components and object classes. We report state-of-the-art results on the PASCAL VOC dataset.
Empowering Visual Categorization With the GPU
- 2011
- Cited by 4 (3 self)
- Add to MetaCart
Visual categorization is important for managing large collections of digital images and video, where textual metadata is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. Because newer CPU and GPU architectures gain computational power mainly by increasing their level of parallelism, exploiting this parallelism becomes an important direction for handling the computational cost of the bag-of-words approach. When optimizing a system based on the bag-of-words approach, the goal is to minimize the time it takes to process batches of images. In this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to 1) keep categorization accuracy intact, 2) decompose the problem, and 3) give the same numerical results. In the experiments on large-scale datasets, it is shown that, by using a parallel implementation on the GeForce GTX 260 GPU, classifying unseen images is 4.8 times faster than a quad-core CPU version on the Core i7 920, while giving the exact same numerical results. In addition, we show how the algorithms can be generalized to other applications, such as text retrieval and video retrieval. Moreover, when the obtained speedup is used to process extra video frames in a video retrieval benchmark, the accuracy of visual categorization is improved by 29%.
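To make the quantization bottleneck named in the abstract above concrete, here is a minimal sequential sketch of bag-of-words quantization. It assumes hard assignment of each descriptor to its nearest codeword under squared Euclidean distance, which is a common formulation but is not confirmed by the abstract, and it does not reproduce the paper's GPU algorithm. The function name and the toy codebook are hypothetical.

```python
from typing import List, Sequence


def bow_histogram(descriptors: Sequence[Sequence[float]],
                  codebook: Sequence[Sequence[float]]) -> List[int]:
    """Count how many descriptors fall nearest to each codeword.

    Assumption for this sketch: hard assignment with squared
    Euclidean distance. Cost is O(N * K * D) for N descriptors,
    K codewords, and D dimensions.
    """
    hist = [0] * len(codebook)
    for d in descriptors:
        # Each descriptor is independent of the others, which is
        # why this loop is a natural candidate for parallel execution.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((x - c) ** 2 for x, c in zip(d, codebook[k])),
        )
        hist[nearest] += 1
    return hist


if __name__ == "__main__":
    codebook = [(0.0, 0.0), (10.0, 10.0)]          # toy 2-word vocabulary
    descriptors = [(1.0, 0.0), (9.0, 9.0), (0.0, 2.0), (11.0, 10.0)]
    print(bow_histogram(descriptors, codebook))
```

The nested distance computation dominates the running time, and because no descriptor depends on another, the outer loop can be distributed across parallel threads without changing the resulting histogram.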
Back to the Future: Learning Shape Models from 3D CAD Data
- Cited by 3 (0 self)
Recognizing 3D objects from arbitrary viewpoints is one of the most fundamental problems in computer vision. A major challenge lies in the transition between the 3D geometry of objects and 2D representations that can be robustly matched to natural images. Most approaches thus rely on 2D natural images, either as the sole source of training data for building an implicit 3D representation or as a means of enriching 3D models with natural image features. In this paper, we go back to the ideas from the early days of computer vision, by using 3D object models as the only source of information for building a multi-view object class detector. In particular, we use these models for learning 2D shape that can be robustly matched to 2D natural images. Our experiments confirm the validity of our approach, which outperforms current state-of-the-art techniques on a multi-view detection data set.

