Results 1 - 10
of
79
LOCUS: Learning Object Classes with Unsupervised Segmentation
- in ICCV
, 2005
"... We address the problem of learning object class models and object segmentations from unannotated images. We introduce LOCUS (Learning Object Classes with Unsupervised Segmentation) which uses a generative probabilistic model to combine bottom-up cues of color and edge with top-down cues of shape and ..."
Abstract
-
Cited by 90 (5 self)
- Add to MetaCart
We address the problem of learning object class models and object segmentations from unannotated images. We introduce LOCUS (Learning Object Classes with Unsupervised Segmentation) which uses a generative probabilistic model to combine bottom-up cues of color and edge with top-down cues of shape and pose. A key aspect of this model is that the object appearance is allowed to vary from image to image, allowing for significant within-class variation. By iteratively updating the belief in the object’s position, size, segmentation and pose, LOCUS avoids making hard decisions about any of these quantities and so allows for each to be refined at any stage. We show that LOCUS successfully learns an object class model from unlabeled images, whilst also giving segmentation accuracies that rival existing supervised methods. Finally, we demonstrate simultaneous recognition and segmentation in novel images using the learned models for a number of object classes, as well as unsupervised object discovery and tracking in video. 1.
Bilayer segmentation of live video
- In: IEEE Conference on Computer Vision and Pattern Recognition
, 2006
"... a input sequence b automatic layer separation and background substitution in three different frames Figure 1: An example of automatic foreground/background segmentation in monocular image sequences. Despite the challenging foreground motion the person is accurately extracted from the sequence and th ..."
Abstract
-
Cited by 48 (3 self)
- Add to MetaCart
a input sequence b automatic layer separation and background substitution in three different frames Figure 1: An example of automatic foreground/background segmentation in monocular image sequences. Despite the challenging foreground motion the person is accurately extracted from the sequence and then composited free of aliasing upon a different background; a useful tool in video-conferencing applications. The sequences and ground truth data used throughout this paper are available from [1]. This paper presents an algorithm capable of real-time separation of foreground from background in monocular video sequences. Automatic segmentation of layers from colour/contrast or from motion alone is known to be error-prone. Here motion, colour and contrast cues are probabilistically fused together with spatial and temporal priors to infer layers accurately and efficiently. Central to our algorithm is the fact that pixel velocities are not needed, thus removing the need for optical flow estimation, with its tendency to error and computational expense. Instead, an efficient motion vs nonmotion classifier is trained to operate directly and jointly on intensity-change and contrast. Its output is then fused with colour information. The prior on segmentation is represented by a second order, temporal, Hidden Markov Model, together with a spatial MRF favouring coherence except where contrast is high. Finally, accurate layer segmentation and explicit occlusion detection are efficiently achieved by binary graph cut. The segmentation accuracy of the proposed algorithm is quantitatively evaluated with respect to existing groundtruth data and found to be comparable to the accuracy of a state of the art stereo segmentation algorithm. Foreground/background segmentation is demonstrated in the application of live background substitution and shown to generate convincingly good quality composite video. 1 1.
Describing visual scenes using transformed dirichlet processes
- Advances in Neural Information Processing Systems 18
, 2005
"... Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach captures the intrinsic uncertainty in the number and identity o ..."
Abstract
-
Cited by 47 (6 self)
- Add to MetaCart
Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach captures the intrinsic uncertainty in the number and identity of objects depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, and allows unsupervised discovery of object categories. 1
A Graphical Model for Audiovisual Object Tracking
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2003
"... We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our mo ..."
Abstract
-
Cited by 36 (0 self)
- Add to MetaCart
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.
A comparison of algorithms for inference and learning in probabilistic graphical models
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
"... Computer vision is currently one of the most exciting areas of artificial intelligence re-search, largely because it has recently become possible to record, store and process large amounts of visual data. While impressive achievements have been made in pattern clas-sification problems such as handwr ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
Computer vision is currently one of the most exciting areas of artificial intelligence re-search, largely because it has recently become possible to record, store and process large amounts of visual data. While impressive achievements have been made in pattern clas-sification problems such as handwritten character recognition and face detection, it is even more exciting that researchers may be on the verge of introducing computer vision systems that perform scene analysis, decomposing image input into its constituent objects, lighting conditions, motion patterns, and so on. Two of the main challenges in computer vision are finding efficient models of the physics of visual scenes and finding efficient algorithms for inference and learning in these models. In this paper, we advocate the use of graph-based probability models and their associated inference and learning algorithms for computer vision and scene analysis. We review exact techniques and various approximate, computationally efficient techniques, including iterative conditional modes, the expectation maximization (EM) algorithm, the mean field method, variational techniques, structured variational techniques, Gibbs sampling, the sum-product algorithm and “loopy ” belief propagation. We describe how each technique can be applied in a model of multiple, occluding objects, and contrast the behaviors and performances of the techniques using a unifying cost function, free energy.
Automatic Non-Rigid 3D Modeling from Video
- IN ECCV
, 2004
"... We present a robust framework for estimating non-rigid 3D shape and motion in video sequences. Given an input video sequence, and a user-specified region to reconstruct, the algorithm automatically solves for the 3D time-varying shape and motion of the object, and estimates which pixels are outl ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
We present a robust framework for estimating non-rigid 3D shape and motion in video sequences. Given an input video sequence, and a user-specified region to reconstruct, the algorithm automatically solves for the 3D time-varying shape and motion of the object, and estimates which pixels are outliers, while learning all system parameters, including a PDF over non-rigid deformations. There are no user-tuned parameters (other than initialization); all parameters are learned by maximizing the likelihood of the entire image stream. We apply our method to both rigid and non-rigid shape reconstruction, and demonstrate it in challenging cases of occlusion and variable illumination.
Depth from familiar objects: A hierarchical model for 3D scenes
- Ian Horrocks, Ralf Möller and Peter Patel-Schneider. http://kogs-www.informatik.uni-hamburg.de/~moeller/dl-benchmark-suite.html
, 2006
"... We develop an integrated, probabilistic model for the appearance and three-dimensional geometry of cluttered scenes. Object categories are modeled via distributions over the 3D location and appearance of visual features. Uncertainty in the number of object instances depicted in a particular image is ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
We develop an integrated, probabilistic model for the appearance and three-dimensional geometry of cluttered scenes. Object categories are modeled via distributions over the 3D location and appearance of visual features. Uncertainty in the number of object instances depicted in a particular image is then achieved via a transformed Dirichlet process. In contrast with image-based approaches to object recognition, we model scale variations as the perspective projection of objects in different 3D poses. To calibrate the underlying geometry, we incorporate binocular stereo images into the training process. A robust likelihood model accounts for outliers in matched stereo features, allowing effective learning of 3D object structure from partial 2D segmentations. Applied to a dataset of office scenes, our model detects objects at multiple scales via a coarse reconstruction of the corresponding 3D geometry. 1.
Describing Visual Scenes Using Transformed Objects and Parts
- INT J COMPUT VIS
, 2005
"... We develop hierarchical, probabilistic models for objects, the parts composing them, and the visual scenes surrounding them. Our approach couples topic models originally developed for text analysis with spatial transformations, and thus consistently accounts for geometric constraints. By building i ..."
Abstract
-
Cited by 24 (2 self)
- Add to MetaCart
We develop hierarchical, probabilistic models for objects, the parts composing them, and the visual scenes surrounding them. Our approach couples topic models originally developed for text analysis with spatial transformations, and thus consistently accounts for geometric constraints. By building integrated scene models, we may discover contextual relationships, and better exploit partially labeled training images. We first consider images of isolated objects, and show that sharing parts among object categories improves detection accuracy when learning from few examples. Turning to multiple object scenes, we propose nonparametric models which use Dirichlet processes to automatically learn the number of parts underlying each object category, and objects composing each scene. The resulting transformed Dirichlet process (TDP) leads to Monte Carlo algorithms which simultaneously segment and recognize objects in street and office scenes.
Audio-Video Sensor Fusion with Probabilistic Graphical Models
- in Proc. ECCV
, 2002
"... We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our mo ..."
Abstract
-
Cited by 23 (1 self)
- Add to MetaCart
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using o#-the-shelf equipment.
Learning Joint Top-down and Bottom-up Processes for 3D Visual Inference
- In IEEE International Conference on Computer Vision and Pattern Recognition
, 2006
"... We present an algorithm for jointly learning a consistent bidirectional generative-recognition model that combines top-down and bottom-up processing for monocular 3d human motion reconstruction. Learning progresses in alternative stages of self-training that optimize the probability of the image evi ..."
Abstract
-
Cited by 21 (4 self)
- Add to MetaCart
We present an algorithm for jointly learning a consistent bidirectional generative-recognition model that combines top-down and bottom-up processing for monocular 3d human motion reconstruction. Learning progresses in alternative stages of self-training that optimize the probability of the image evidence: the recognition model is tunned using samples from the generative model and the generative model is optimized to produce inferences close to the ones predicted by the current recognition model. At equilibrium, the two models are consistent. During on-line inference, we scan the image at multiple locations and predict 3d human poses using the recognition model. But this implicitly includes one-shot generative consistency feedback. The framework provides a uniform treatment of human detection, 3d initialization and 3d recovery from transient failure. Our experimental results show that this procedure is promising for the automatic reconstruction of human motion in more natural scene settings with background clutter and occlusion. 1.

