Results 1-10 of 45
Using Multiple Segmentations to Discover Objects and their Extent in Image Collections
- CVPR
Cited by 119 (19 self)
Given a large dataset of images, we seek to automatically determine the visually similar object and scene classes together with their image segmentation. To achieve this we combine two ideas: (i) that a set of segmented objects can be partitioned into visual object classes using topic discovery models from statistical text analysis; and (ii) that visual object classes can be used to assess the accuracy of a segmentation. To tie these ideas together we compute multiple segmentations of each image and then: (i) learn the object classes; and (ii) choose the correct segmentations. We demonstrate that such an algorithm succeeds in automatically discovering many familiar objects in a variety of image datasets, including those from Caltech, MSRC and LabelMe.
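The segment-selection idea (for a discovered object class, pick the candidate segment that fits it best) can be illustrated with a small sketch. It assumes each candidate segment is a histogram of visual-word counts and each learned topic is a probability distribution over the same vocabulary. The mean per-word log-likelihood score and the helper names `score_segment` and `choose_segment` are hypothetical choices for illustration, not necessarily the criterion used in the paper.

```python
import numpy as np

# Illustrative sketch: the scoring rule is an assumption, not the paper's criterion.
def score_segment(word_counts, topic_word_probs, eps=1e-12):
    """Mean log-likelihood per visual word of one segment under one topic."""
    counts = np.asarray(word_counts, dtype=float)
    probs = np.asarray(topic_word_probs, dtype=float)
    total = counts.sum()
    if total == 0:
        return float("-inf")  # an empty segment cannot support any topic
    return float(counts @ np.log(probs + eps) / total)

def choose_segment(segments, topic_word_probs):
    """Index of the candidate segment that best fits the topic."""
    scores = [score_segment(s, topic_word_probs) for s in segments]
    return int(np.argmax(scores))

# Hypothetical 4-word vocabulary; the topic favours words 0 and 1.
topic = [0.45, 0.45, 0.05, 0.05]
candidates = [
    [1, 1, 5, 5],  # mostly background words
    [6, 4, 0, 1],  # mostly object words
    [2, 2, 2, 2],  # an even mixture
]
print(choose_segment(candidates, topic))  # 1
```

Normalising by the word count keeps large segments from being penalised simply for containing more words.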
Learning the discriminative power-invariance trade-off
- In ICCV, 2007
Cited by 80 (3 self)
We investigate the problem of learning optimal descriptors for a given classification task. Many hand-crafted descriptors have been proposed in the literature for measuring visual similarity. Looking past initial differences, what really distinguishes one descriptor from another is the trade-off that it achieves between discriminative power and invariance. Since this trade-off must vary from task to task, no single descriptor can be optimal in all situations. Our focus, in this paper, is on learning the optimal trade-off for classification given a particular training set and prior constraints. The problem is posed in the kernel learning framework. We learn the optimal, domain-specific kernel as a combination of base kernels corresponding to base features which achieve different levels of trade-off (such as no invariance, rotation invariance, scale invariance and affine invariance). This leads to a convex optimisation problem with a unique global optimum that can be solved efficiently. The method is shown to achieve state-of-the-art performance on the UIUC textures, Oxford flowers and Caltech 101 datasets.
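The kernel-combination step can be sketched as a non-negative weighted sum of base Gram matrices, which remains positive semi-definite. In this sketch the two random feature sets stand in for hypothetical descriptors with different invariance levels, and the weights are fixed by hand; learning them, which the abstract poses as a convex problem, is not shown. `rbf_kernel` and `combine_kernels` are illustrative helpers, not the paper's code.

```python
import numpy as np

# Illustrative sketch: weights are hand-picked here; the paper learns them.
def rbf_kernel(X, gamma):
    """Gaussian RBF Gram matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clamp tiny negative round-off

def combine_kernels(base_kernels, weights):
    """Non-negative weighted sum of base Gram matrices (stays PSD)."""
    w = np.asarray(weights, dtype=float)
    if len(w) != len(base_kernels):
        raise ValueError("one weight per base kernel is required")
    if np.any(w < 0):
        raise ValueError("kernel weights must be non-negative")
    return sum(wk * K for wk, K in zip(w, base_kernels))

rng = np.random.default_rng(0)
feats_plain = rng.normal(size=(20, 5))      # hypothetical descriptor, no invariance
feats_invariant = rng.normal(size=(20, 3))  # hypothetical invariant descriptor
K = combine_kernels(
    [rbf_kernel(feats_plain, 0.5), rbf_kernel(feats_invariant, 0.5)],
    [0.7, 0.3],
)
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-9)  # True True
```

Requiring non-negative weights is what keeps the combined matrix a valid kernel, so any standard kernel classifier can consume it.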
Hello! My name is... Buffy – Automatic naming of characters in TV video
- In BMVC, 2006
Cited by 62 (9 self)
We investigate the problem of automatically labelling appearances of characters in TV or film material. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time-stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking; (iii) using complementary cues of face matching and clothing matching to propose common annotations for face tracks. Results are presented on episodes of the TV series “Buffy the Vampire Slayer”.
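A minimal sketch of the subtitle-transcript alignment idea, assuming subtitles arrive as `(start, end, text)` triples with timing but no speaker names, and the transcript as `(speaker, text)` pairs with names but no timing. It substitutes Python's `difflib.SequenceMatcher` word alignment for whatever alignment method the paper uses; the dialogue, timings and helper names are hypothetical.

```python
from difflib import SequenceMatcher

# Illustrative sketch: difflib word matching stands in for the paper's alignment.
def tokens(text):
    """Lower-case words with surrounding punctuation removed."""
    words = (w.strip(".,!?;:\"'").lower() for w in text.split())
    return [w for w in words if w]

def align(subtitles, transcript):
    """Return (start, end, speaker or None) for each subtitle by majority vote."""
    sub_words, sub_idx = [], []
    for i, (_, _, text) in enumerate(subtitles):
        for w in tokens(text):
            sub_words.append(w)
            sub_idx.append(i)
    tr_words, tr_speaker = [], []
    for speaker, text in transcript:
        for w in tokens(text):
            tr_words.append(w)
            tr_speaker.append(speaker)

    votes = [dict() for _ in subtitles]
    matcher = SequenceMatcher(a=sub_words, b=tr_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i = sub_idx[block.a + k]          # which subtitle the word came from
            spk = tr_speaker[block.b + k]     # who says it in the transcript
            votes[i][spk] = votes[i].get(spk, 0) + 1

    return [
        (start, end, max(v, key=v.get) if v else None)
        for (start, end, _), v in zip(subtitles, votes)
    ]

subtitles = [(1.0, 2.5, "Where have you been?"),
             (2.6, 4.0, "Patrolling. The usual."),
             (4.1, 5.0, "Mmm.")]
transcript = [("GILES", "Where have you been?"),
              ("BUFFY", "Patrolling. Just the usual.")]
print(align(subtitles, transcript))
# [(1.0, 2.5, 'GILES'), (2.6, 4.0, 'BUFFY'), (4.1, 5.0, None)]
```

Voting per subtitle tolerates small wording differences between the two sources, as with "Just" above; a subtitle with no matched words stays unlabelled.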
Semi-Local Affine Parts for Object Recognition
- In BMVC, 2004
Cited by 50 (5 self)
This paper proposes a new approach for finding expressive and geometrically invariant parts for modeling 3D objects. The approach relies on identifying groups of local affine regions (image features having a characteristic appearance and elliptical shape) that remain approximately affinely rigid across a range of views of an object, and across multiple instances of the same object class. These groups, termed semi-local affine parts, are learned using correspondence search between pairs of unsegmented and cluttered input images, followed by validation against additional training images. The proposed approach is applied to the recognition of butterflies in natural imagery.
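Whether a group of matched regions is approximately affinely rigid between two images can be approximated by fitting a single 2-D affine map to the region centers and thresholding the residuals. This is a simplification under stated assumptions: it uses only point centers, not the elliptical shapes of the affine regions, and `fit_affine`, `is_affinely_rigid` and the 2-pixel tolerance are hypothetical.

```python
import numpy as np

# Illustrative sketch: uses region centers only; the tolerance is an assumption.
def fit_affine(src, dst):
    """Least-squares 2-D affine map with dst ≈ src @ A.T + t; returns (A, t, residuals)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.hstack([src, np.ones((len(src), 1))])          # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)      # shape (3, 2)
    A, t = params[:2].T, params[2]
    residuals = np.linalg.norm(X @ params - dst, axis=1)  # per-point error in pixels
    return A, t, residuals

def is_affinely_rigid(src, dst, tol=2.0):
    """True if one affine map explains all correspondences to within tol pixels."""
    if len(src) < 4:
        raise ValueError("need at least 4 correspondences; 3 always fit exactly")
    _, _, residuals = fit_affine(src, dst)
    return bool(residuals.max() <= tol)

src = np.array([[0, 0], [10, 0], [0, 10], [10, 10], [5, 3]], dtype=float)
A_true = np.array([[1.2, 0.1], [-0.2, 0.9]])
dst = src @ A_true.T + np.array([5.0, -3.0])
dst_bad = dst.copy()
dst_bad[3, 0] += 20.0  # one region drifts: no single affine map explains the group
print(is_affinely_rigid(src, dst), is_affinely_rigid(src, dst_bad))  # True False
```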
Video data mining using configurations of viewpoint invariant regions
- In CVPR’04
Cited by 46 (4 self)
We describe a method for obtaining the principal objects, characters and scenes in a video by measuring the recurrence of spatial configurations of viewpoint invariant features. We investigate two aspects of the problem: the scale of the configurations, and the similarity requirements for clustering configurations. The problem is challenging firstly because an object can undergo substantial changes in imaged appearance throughout a video (due to viewpoint and illumination change, and partial occlusion), and secondly because configurations are detected imperfectly, so that inexact patterns must be matched. The novelty of the method is that viewpoint invariant features are used to form the configurations, and that efficient methods from the text analysis literature are employed to reduce the matching complexity. Examples of ‘mined’ objects are shown for a feature-length film and a sitcom.
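The abstract credits efficient text-analysis methods for reducing matching cost. One standard structure of that kind is an inverted file, sketched here under the assumption that each frame is reduced to a set of visual-word ids; only frames sharing enough words with a query are passed on for detailed matching. The threshold `min_shared` and the function names are hypothetical, and the paper's exact scheme may differ.

```python
from collections import Counter, defaultdict

# Illustrative sketch of an inverted file over visual words (not the paper's code).
def build_inverted_index(frames):
    """frames: dict frame_id -> iterable of visual-word ids."""
    index = defaultdict(set)
    for frame_id, words in frames.items():
        for w in set(words):
            index[w].add(frame_id)
    return index

def candidate_matches(index, query_words, min_shared=2):
    """Frames sharing at least min_shared distinct words with the query, most-shared first."""
    counts = Counter()
    for w in set(query_words):
        for frame_id in index.get(w, ()):  # .get avoids inserting empty keys
            counts[frame_id] += 1
    return [f for f, c in counts.most_common() if c >= min_shared]

frames = {"f1": [3, 7, 9, 12], "f2": [7, 9, 40], "f3": [1, 2, 5]}
index = build_inverted_index(frames)
print(candidate_matches(index, [7, 9, 12]))  # ['f1', 'f2']
```

The cost of a query grows with the lengths of the posting lists it touches rather than with the number of frames, which is where the saving comes from.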
Names and faces in the news
- In Proc. CVPR, 2004
Cited by 43 (1 self)
We show that quite good face clustering is possible for a dataset of inaccurately and ambiguously labelled face images. Our dataset is 44,773 face images, obtained by applying a face finder to approximately half a million captioned news images. This dataset is more realistic than usual face recognition datasets, because it contains faces captured “in the wild” in a variety of configurations with respect to the camera, with a variety of expressions, and under illumination of widely varying color. Each face image is associated with a set of names, automatically extracted from the associated caption. Many, but not all, such sets contain the correct name. We cluster face images in appropriate discriminant coordinates. We use a clustering procedure to break ambiguities in labelling and identify incorrectly labelled faces. A merging procedure then identifies variants of names that refer to the same individual. The resulting representation can be used to label faces in news images or to organize news pictures by individuals present. An alternative view of our procedure is as a process that cleans up noisy supervised data. We demonstrate how to use entropy measures to evaluate such procedures.
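One entropy-based score for such a clean-up procedure is the conditional entropy of the true identity given the assigned cluster, which is 0 bits when every cluster contains a single person. Whether the paper uses exactly this measure is an assumption, and `conditional_entropy` is an illustrative helper.

```python
import math
from collections import Counter, defaultdict

# Illustrative sketch: H(true identity | cluster) in bits; the paper's measure may differ.
def conditional_entropy(true_labels, cluster_ids):
    """0 means every cluster is pure; higher means identities are mixed."""
    n = len(true_labels)
    by_cluster = defaultdict(list)
    for t, c in zip(true_labels, cluster_ids):
        by_cluster[c].append(t)
    h = 0.0
    for members in by_cluster.values():
        p_c = len(members) / n                 # probability of the cluster
        for count in Counter(members).values():
            p = count / len(members)           # identity probability within it
            h -= p_c * p * math.log2(p)
    return h

truth = ["A", "A", "B", "B"]
print(conditional_entropy(truth, [0, 0, 1, 1]), conditional_entropy(truth, [0, 1, 0, 1]))  # 0.0 1.0
```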
Digital tapestry
- In Proc. Computer Vision and Pattern Recognition (CVPR), 2005
Cited by 42 (7 self)
This paper addresses the novel problem of automatically synthesizing an output image from a large collection of different input images. The synthesized image, called a digital tapestry, can be viewed as a visual summary or a virtual ‘thumbnail’ of all the images in the input collection. The problem of creating the tapestry is cast as a multi-class labeling problem such that each region in the tapestry is constructed from input image blocks that are salient and such that neighboring blocks satisfy spatial compatibility. This is formulated using a Markov Random Field and optimized via the graph-cut-based expansion move algorithm. The standard expansion move algorithm can only handle energies with metric terms, while our energy contains non-metric (soft and hard) constraints. We therefore make two novel contributions. First, we extend the expansion move algorithm to energy functions with non-metric hard constraints. Second, we modify it for functions with “almost” metric soft terms, and show that it gives good results in practice. The proposed framework was tested on several consumer photograph collections, and the results are presented.
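The metric requirement on pairwise terms can be made concrete: a pairwise cost V is a metric when V(a, b) = 0 exactly when a = b, V is symmetric and non-negative, and V(a, c) ≤ V(a, b) + V(b, c). The checker and the example costs are hypothetical helpers; they show that the Potts model and a truncated linear cost pass, while a truncated quadratic cost and an infinite-cost hard constraint fail, which is why such energies fall outside the standard expansion move.

```python
from itertools import product

# Illustrative sketch: checks the metric conditions for a pairwise cost V over labels.
def is_metric(V, labels, tol=1e-9):
    for a, b in product(labels, repeat=2):
        if abs(V(a, b) - V(b, a)) > tol:        # symmetry (inf - inf is nan, so skipped)
            return False
        if a == b and abs(V(a, b)) > tol:       # zero cost for equal labels
            return False
        if a != b and V(a, b) <= tol:           # positive cost for different labels
            return False
    for a, b, c in product(labels, repeat=3):
        if V(a, c) > V(a, b) + V(b, c) + tol:   # triangle inequality
            return False
    return True

def potts(a, b):
    return 0.0 if a == b else 1.0

def truncated_linear(a, b):
    return float(min(abs(a - b), 2))

def truncated_quadratic(a, b):
    return float(min((a - b) ** 2, 4))

def hard(a, b):
    if {a, b} == {0, 2}:
        return float("inf")  # hypothetical hard constraint: 0 and 2 may never be neighbours
    return potts(a, b)

print(is_metric(potts, range(4)), is_metric(truncated_linear, range(4)),
      is_metric(truncated_quadratic, range(4)), is_metric(hard, range(3)))
# True True False False
```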
Clustering appearances of objects under varying illumination conditions
- In CVPR, 2003
Cited by 32 (2 self)
We introduce two appearance-based methods for clustering a set of images of 3-D objects, acquired under varying illumination conditions, into disjoint subsets corresponding to individual objects. The first algorithm is based on the concept of illumination cones. According to the theory, the clustering problem is equivalent to finding convex polyhedral cones in the high-dimensional image space. To efficiently determine the conic structures hidden in the image data, we introduce the concept of conic affinity which measures the likelihood of a pair of images belonging to the same underlying polyhedral cone. For the second method, we introduce another affinity measure based on image gradient comparisons. The algorithm operates directly on the image gradients by comparing the magnitudes and orientations of the image gradient at each pixel. Both methods have clear geometric motivations, and they operate directly on the images without the need for feature extraction or computation of pixel statistics. We demonstrate experimentally that both algorithms are surprisingly effective in clustering images acquired under varying illumination conditions with two large, well-known image data sets.
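An affinity of the second kind (comparing gradient magnitudes and orientations pixel by pixel) might look like the following sketch. The formula, which multiplies a clipped cosine of the angle between gradient vectors by a ratio of per-image-normalised magnitudes, is an illustrative assumption rather than the measure defined in the paper, and `gradient_affinity` is a hypothetical name.

```python
import numpy as np

# Illustrative sketch: the combination rule is an assumption, not the paper's measure.
def gradient_affinity(img1, img2, eps=1e-8):
    """Pixelwise gradient agreement in [0, 1], averaged over the image."""
    gy1, gx1 = np.gradient(np.asarray(img1, dtype=float))
    gy2, gx2 = np.gradient(np.asarray(img2, dtype=float))
    m1 = np.hypot(gx1, gy1)
    m2 = np.hypot(gx2, gy2)
    # Orientation agreement: cosine of the angle between the two gradient vectors.
    cos = (gx1 * gx2 + gy1 * gy2) / (m1 * m2 + eps)
    # Magnitude agreement after dividing by each image's mean magnitude,
    # so a global brightness scaling does not change the score.
    n1 = m1 / (m1.mean() + eps)
    n2 = m2 / (m2.mean() + eps)
    mag = np.minimum(n1, n2) / (np.maximum(n1, n2) + eps)
    return float(np.mean(np.clip(cos, 0.0, 1.0) * mag))

ramp = np.add.outer(2.0 * np.arange(6), np.arange(8))  # intensity 2*row + col
print(round(gradient_affinity(ramp, 3 * ramp + 10), 3), gradient_affinity(ramp, -ramp))  # 1.0 0.0
```

A brightened and offset copy of the ramp scores about 1, while the contrast-reversed copy scores 0 because its gradients point the opposite way.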
Joint Manifold Distance: a new approach to appearance based clustering
- Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003
Cited by 29 (1 self)
We wish to match sets of images to sets of images where both sets are undergoing various distortions such as viewpoint and lighting changes.
Segmenting, Modeling, and Matching Video Clips Containing Multiple Moving Objects
- 2005
Cited by 25 (4 self)
This paper presents a novel representation for dynamic scenes composed of multiple rigid objects that may undergo different motions and are observed by a moving camera. Multi-view constraints associated with groups of affine-covariant scene patches and a normalized description of their appearance are used to segment a scene into its rigid components, construct three-dimensional models of these components, and match instances of models recovered from different image sequences. The proposed approach has been implemented, and it is applied to the detection and matching of moving objects in video sequences and to shot matching, i.e., the identification of shots that depict the same scene in a video clip.
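One standard multi-view constraint under an affine camera model is the rank condition from Tomasi-Kanade factorization: the row-centered 2F x P matrix of point tracks from a single rigid body has rank at most 3, while independently moving bodies raise the rank. The sketch tests this on synthetic tracks. It works on points rather than affine-covariant patches, and the helper names are hypothetical, so it illustrates the kind of constraint involved rather than the paper's method.

```python
import numpy as np

# Illustrative sketch of the affine rank-3 constraint (not the paper's formulation).
def measurement_matrix(tracks):
    """Stack (F, P, 2) tracks into a 2F x P matrix and center each row."""
    tracks = np.asarray(tracks, dtype=float)
    F, P, _ = tracks.shape
    W = tracks.transpose(0, 2, 1).reshape(2 * F, P)  # rows: x and y for each frame
    return W - W.mean(axis=1, keepdims=True)

def rank3_residual(tracks):
    """Ratio of the 4th to the 1st singular value; near 0 for one rigid body."""
    s = np.linalg.svd(measurement_matrix(tracks), compute_uv=False)
    return float(s[3] / s[0]) if len(s) > 3 else 0.0

def project(points3d, cameras):
    """Affine projection of (P, 3) points by a list of (M, t), M 2x3, t length 2."""
    return np.stack([points3d @ M.T + t for M, t in cameras])

rng = np.random.default_rng(1)

def random_cameras(F):
    return [(rng.normal(size=(2, 3)), rng.normal(size=2)) for _ in range(F)]

obj_a = rng.normal(size=(10, 3))
obj_b = rng.normal(size=(10, 3))
tracks_a = project(obj_a, random_cameras(5))  # one rigid body
tracks_b = project(obj_b, random_cameras(5))  # second body, independent motion
print(rank3_residual(tracks_a))
print(rank3_residual(np.concatenate([tracks_a, tracks_b], axis=1)))
```

The first residual is at round-off level, while mixing the two independently moving bodies produces a clearly non-zero fourth singular value, which is the cue a segmentation step can exploit.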

