Learning from one example through shared densities on transforms
- In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, 2000
Cited by 73 (7 self)
We define a process called congealing in which elements of a dataset (images) are brought into correspondence with each other jointly, producing a data-defined model. It is based upon minimizing the summed component-wise (pixelwise) entropies over a continuous set of transforms on the data. One of the byproducts of this minimization is a set of transforms, one associated with each original training sample. We then demonstrate a procedure for effectively bringing test data into correspondence with the data-defined model produced in the congealing process. Subsequently, we develop a probability density over the set of transforms that arose from the congealing process. We suggest that this density over transforms may be shared by many classes, and demonstrate how this density can be used as “prior knowledge” to develop a classifier based on only a single training example for each class.
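The entropy-minimization idea in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes 1-D binary "images", cyclic integer shifts in place of a continuous transform set, and simple coordinate descent. All function names are hypothetical.

```python
import math

def pixel_entropy(column):
    """Entropy (bits) of the binary values observed at one pixel location."""
    p = sum(column) / len(column)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def shift(image, dx):
    """Cyclically shift a 1-D image by dx pixels (assumed transform family)."""
    n = len(image)
    return [image[(i - dx) % n] for i in range(n)]

def summed_entropy(images, shifts):
    """Sum of pixelwise entropies of the stack after applying each shift."""
    moved = [shift(img, dx) for img, dx in zip(images, shifts)]
    return sum(pixel_entropy(col) for col in zip(*moved))

def congeal(images, max_iters=20):
    """Coordinate descent: nudge one image at a time if the entropy drops."""
    shifts = [0] * len(images)
    for _ in range(max_iters):
        improved = False
        for k in range(len(images)):
            best = summed_entropy(images, shifts)
            for step in (-1, 1):
                trial = shifts[:]
                trial[k] += step
                e = summed_entropy(images, trial)
                if e < best - 1e-12:
                    best, shifts, improved = e, trial, True
        if not improved:
            break
    return shifts

# Three copies of the same bar pattern at different offsets.
base = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
images = [shift(base, d) for d in (0, 2, -1)]
shifts = congeal(images)
```

The returned `shifts` play the role of the per-sample transforms mentioned in the abstract; a density over them could then be estimated, which this sketch does not attempt.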
Transformation-invariant clustering using the EM algorithm
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2003
Cited by 47 (11 self)
Clustering is a simple, effective way to derive useful representations of data, such as images and videos. Clustering explains the input as one of several prototypes, plus noise. In situations where each input has been randomly transformed (e.g., by translation, rotation, and shearing in images and videos), clustering techniques tend to extract cluster centers that account for variations in the input due to transformations, instead of more interesting and potentially useful structure. For example, if images from a video sequence of a person walking across a cluttered background are clustered, it would be more useful for the different clusters to represent different poses and expressions, instead of different positions of the person and different configurations of the background clutter. We describe a way to add transformation invariance to mixture models, by approximating the nonlinear transformation manifold by a discrete set of points. We show how the expectation maximization algorithm can be used to jointly learn clusters, while at the same time inferring the transformation associated with each input. We compare this technique with other methods for filtering noisy images obtained from a scanning electron microscope, clustering images from videos of faces into different categories of identification and pose, and removing foreground obstructions from video. We also demonstrate that the new technique is quite insensitive to initial conditions and works better than standard techniques, even when the standard techniques are provided with extra data.
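A rough sketch of one EM step with a discrete transformation set follows. It is only an illustration under strong assumptions that are not taken from the paper: inputs are 1-D vectors, the transformations are cyclic shifts, priors over clusters and shifts are uniform, and the noise variance is fixed and isotropic. The names `shift` and `em_step` are hypothetical.

```python
import math

def shift(x, t):
    """Cyclically shift vector x by t positions (assumed transformation set)."""
    n = len(x)
    return [x[(i - t) % n] for i in range(n)]

def em_step(data, means, sigma2):
    """One EM step: infer (cluster, shift) per input, then re-estimate means."""
    n = len(data[0])
    sums = [[0.0] * n for _ in means]
    weights = [0.0] * len(means)
    for x in data:
        # E-step: log-likelihood of x under every (cluster, shift) pair.
        logp = {}
        for c, mu in enumerate(means):
            for t in range(n):
                pred = shift(mu, t)
                sq = sum((a - b) ** 2 for a, b in zip(x, pred))
                logp[(c, t)] = -0.5 * sq / sigma2
        top = max(logp.values())
        z = sum(math.exp(v - top) for v in logp.values())
        # M-step accumulation: undo each shift, weight by responsibility.
        for (c, t), v in logp.items():
            r = math.exp(v - top) / z
            back = shift(x, -t)
            weights[c] += r
            for i in range(n):
                sums[c][i] += r * back[i]
    return [[s / max(w, 1e-12) for s in row] for row, w in zip(sums, weights)]

# The same prototype observed at three different shifts, without noise.
proto = [0, 0, 1, 2, 1, 0, 0, 0]
data = [shift(proto, t) for t in (1, 4, 6)]
learned = em_step(data, [data[0][:]], sigma2=0.01)[0]
plain_average = [sum(col) / len(data) for col in zip(*data)]
```

In this toy case `learned` stays a sharp copy of the pattern, because each input is aligned before averaging, whereas `plain_average` smears the pattern across positions.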
Stochastic rigidity: Image registration for nowhere-static scenes.
, 2001
Cited by 43 (0 self)
We consider the registration of sequences of images where the observed scene is entirely non-rigid; for example a camera flying over water, a panning shot of a field of sunflowers in the wind, or footage of a crowd applauding at a sports event. In these cases, it is not possible to impose the constraint that world points have similar colour in successive views, so existing registration techniques [1, 5, 9, 11] cannot be applied. Indeed the relationship between a point's colours in successive frames is essentially a random process. However, by treating the sequence of images as a set of samples from a multidimensional stochastic time-series, we can learn a stochastic model (e.g. an AR model [16, 23]) of the random process which generated the sequence of images. With a static camera, this stochastic model can be used to extend the sequence arbitrarily in time: driving the model with random noise results in an infinitely varying sequence of images which always looks like the short input sequence. In this way, we can create "videotextures" [21, 24] which can play forever without repetition. With a moving camera, the image generation process comprises two components: a stochastic component generated by the videotexture, and a parametric component due to the camera motion. For example, a camera rotation induces a relationship between successive images which is modelled by a 4-point perspective transformation, or homography. Human observers can easily separate the camera motion from the stochastic element. The key observation for an automatic implementation is that without image registration, the time-series analysis must work harder to model the combined stochastic and parametric image generation. Specifically, the learned model will require more components, or more coeffi...
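The static-camera case (learn an AR model, then drive it with noise to extend the sequence) can be sketched on a scalar series. This is a deliberately reduced illustration: it assumes a first-order, one-dimensional AR model fitted by least squares, whereas the abstract refers to multidimensional image sequences. The function names are hypothetical.

```python
import random

def fit_ar1(series):
    """Least-squares AR(1) coefficient and residual standard deviation."""
    prev, curr = series[:-1], series[1:]
    a = sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)
    resid = [c - a * p for p, c in zip(prev, curr)]
    sigma = (sum(r * r for r in resid) / len(resid)) ** 0.5
    return a, sigma

def extend(series, a, sigma, steps, rng):
    """Continue the series by driving the fitted model with Gaussian noise."""
    out = list(series)
    for _ in range(steps):
        out.append(a * out[-1] + rng.gauss(0.0, sigma))
    return out

# Synthetic "observed" sequence from a known AR(1) process (assumed data).
rng = random.Random(0)
true_a = 0.8
observed = [1.0]
for _ in range(2000):
    observed.append(true_a * observed[-1] + rng.gauss(0.0, 1.0))

a_hat, sigma_hat = fit_ar1(observed)
longer = extend(observed, a_hat, sigma_hat, 500, rng)
```

The extended series never repeats the input exactly but has similar statistics, which is the property the abstract relies on for videotextures.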
Automatic Construction of Active Appearance Models as an Image Coding Problem
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2004
Cited by 39 (1 self)
The automatic construction of Active Appearance Models (AAMs) is usually posed as finding the location of the base mesh vertices in the input training images. In this paper, we re-pose the problem as an energy-minimizing image coding problem and propose an efficient gradient-descent algorithm to solve it.
Transformed Hidden Markov Models: Estimating Mixture Models of Images and Inferring Spatial Transformations in Video Sequences
- In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, 2000
Cited by 38 (10 self)
Submitted to the IEEE Conference on Computer Vision and Pattern Recognition, 2000. In this paper we describe a novel generative model for video analysis called the transformed hidden Markov model (THMM). The video sequence is modeled as a set of frames generated by transforming a small number of class images that summarize the sequence. For each frame, the transformation and the class are discrete latent variables that depend on the previous class and transformation in the sequence. The set of possible transformations is defined in advance, and it can include a variety of transformations such as translation, rotation, and shearing. In each stage of such a Markov model, a new frame is generated from a transformed Gaussian distribution based on the class/transformation combination generated by the Markov chain. This model can be viewed as an extension of a transformed mixture of Gaussians [1] through time. We use this model to cluster unlabeled video segments and form a video summary in ...
A comparison of algorithms for inference and learning in probabilistic graphical models
- IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2005
Cited by 33 (2 self)
Computer vision is currently one of the most exciting areas of artificial intelligence research, largely because it has recently become possible to record, store and process large amounts of visual data. While impressive achievements have been made in pattern classification problems such as handwritten character recognition and face detection, it is even more exciting that researchers may be on the verge of introducing computer vision systems that perform scene analysis, decomposing image input into its constituent objects, lighting conditions, motion patterns, and so on. Two of the main challenges in computer vision are finding efficient models of the physics of visual scenes and finding efficient algorithms for inference and learning in these models. In this paper, we advocate the use of graph-based probability models and their associated inference and learning algorithms for computer vision and scene analysis. We review exact techniques and various approximate, computationally efficient techniques, including iterative conditional modes, the expectation maximization (EM) algorithm, the mean field method, variational techniques, structured variational techniques, Gibbs sampling, the sum-product algorithm and “loopy” belief propagation. We describe how each technique can be applied in a model of multiple, occluding objects, and contrast the behaviors and performances of the techniques using a unifying cost function, free energy.
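For orientation, the free energy used as a cost function in variational inference is commonly written as below. The notation is assumed here rather than taken from the paper: v denotes the visible variables, h the hidden variables, P the model, and Q the approximating distribution over h.

```latex
F(Q) \;=\; \sum_{h} Q(h)\,\log\frac{Q(h)}{P(h, v)}
     \;=\; \mathrm{KL}\!\left(Q(h)\,\middle\|\,P(h \mid v)\right) \;-\; \log P(v)
     \;\ge\; -\log P(v)
```

Because the KL term is non-negative, minimizing F over Q tightens a bound on the negative log-likelihood of the data, with equality when Q(h) equals the exact posterior P(h | v).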
Data driven image models through continuous joint alignment
- PAMI
, 2006
Cited by 32 (4 self)
This paper presents a family of techniques that we call congealing for modeling image classes from data. The idea is to start with a set of images and make them appear as similar as possible by removing variability along the known axes of variation. This technique can be used to eliminate “nuisance” variables such as affine deformations from handwritten digits or unwanted bias fields from magnetic resonance images. In addition to separating and modeling the latent images—i.e., the images without the nuisance variables—we can model the nuisance variables themselves, leading to factorized generative image models. When nuisance variable distributions are shared between classes, one can share the knowledge learned in one task with another task, leading to efficient learning. We demonstrate this process by building a handwritten digit classifier from just a single example of each class. In addition to applications in handwritten character recognition, we describe in detail the application of bias removal from magnetic resonance images. Unlike previous methods, we use a separate, nonparametric model for the intensity values at each pixel. This allows us to leverage the data from the MR images of different patients to remove bias from each other. Only very weak assumptions are made about the distributions of intensity values in the images. In addition to the digit and MR applications, we discuss a number of other uses of congealing and describe experiments about the robustness and consistency of the method.
Modeling, clustering, and segmenting video with mixtures of dynamic textures
- PAMI
, 2008
Cited by 30 (12 self)
A dynamic texture is a spatio-temporal generative model for video, which represents video sequences as observations from a linear dynamical system. This work studies the mixture of dynamic textures, a statistical model for an ensemble of video sequences that is sampled from a finite collection of visual processes, each of which is a dynamic texture. An expectation-maximization (EM) algorithm is derived for learning the parameters of the model, and the model is related to previous works in linear systems, machine learning, time-series clustering, control theory, and computer vision. Through experimentation, it is shown that the mixture of dynamic textures is a suitable representation for both the appearance and dynamics of a variety of visual processes that have traditionally been challenging for computer vision (for example, fire, steam, water, vehicle and pedestrian traffic, and so forth). When compared with state-of-the-art methods in motion segmentation, including both temporal texture methods and traditional representations (for example, optical flow or other localized motion representations), the mixture of dynamic textures achieves superior performance in the problems of clustering and segmenting video of such processes.
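The generative side of this model can be sketched as follows: pick one component, then simulate a linear dynamical system to produce frames. This covers sampling only, not the EM learning described in the abstract. The matrices, noise level, and function names are illustrative assumptions.

```python
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sample_dynamic_texture(A, C, steps, noise, rng):
    """Simulate x[t+1] = A x[t] + state noise, y[t] = C x[t] + pixel noise."""
    x = [rng.gauss(0.0, 1.0) for _ in A]
    frames = []
    for _ in range(steps):
        frames.append([v + rng.gauss(0.0, noise) for v in matvec(C, x)])
        x = [v + rng.gauss(0.0, noise) for v in matvec(A, x)]
    return frames

def sample_mixture(components, priors, steps, noise, rng):
    """Pick one component by its prior, then sample a sequence from it."""
    k = rng.choices(range(len(components)), weights=priors)[0]
    A, C = components[k]
    return k, sample_dynamic_texture(A, C, steps, noise, rng)

# Two hypothetical components: 2-D hidden state, 3-pixel frames.
slow = ([[0.99, 0.0], [0.0, 0.95]], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
fast = ([[0.0, -0.9], [0.9, 0.0]], [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])
rng = random.Random(1)
label, frames = sample_mixture([slow, fast], [0.5, 0.5], 50, 0.1, rng)
```

Learning would go the other way: given many such sequences without labels, EM would estimate each component's parameters and the assignment of sequences to components.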
Audio-Video Sensor Fusion with Probabilistic Graphical Models
- in Proc. ECCV
, 2002
Cited by 23 (1 self)
We present a new approach to modeling and processing multimedia data. This approach is based on graphical models that combine audio and video variables. We demonstrate it by developing a new algorithm for tracking a moving object in a cluttered, noisy scene using two microphones and a camera. Our model uses unobserved variables to describe the data in terms of the process that generates them. It is therefore able to capture and exploit the statistical structure of the audio and video data separately, as well as their mutual dependencies. Model parameters are learned from data via an EM algorithm, and automatic calibration is performed as part of this procedure. Tracking is done by Bayesian inference of the object location from data. We demonstrate successful performance on multimedia clips captured in real world scenarios using off-the-shelf equipment.

