Results 1–10 of 39
On affine invariant clustering and automatic cast listing in movies
In Proc. ECCV, 2002
Cited by 73 (15 self)
We develop a distance metric for clustering and classification algorithms which is invariant to affine transformations and includes priors on the transformation parameters. Such clustering requirements are generic to a number of problems in computer vision. We extend existing techniques for affine-invariant clustering, and show that the new distance metric outperforms existing approximations to affine-invariant distance computation, particularly under large transformations. In addition, we incorporate prior probabilities on the transformation parameters. This further regularizes the solution, mitigating a rare but serious tendency of the existing solutions to diverge. For the special case of corresponding point sets we demonstrate that the affine-invariant measure we introduce may be obtained in closed form. As an application of these ideas we demonstrate that the faces of the principal cast of a feature film can be generated automatically using clustering with appropriate invariance. This is a very demanding test as it involves detecting and clustering over tens of thousands of images, with variations including changes in viewpoint, lighting, scale and expression.
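For the corresponding-point-set case the abstract mentions, the closed-form computation reduces to a linear least-squares fit of an affine map. A minimal sketch (the function name and the plain least-squares objective are ours; the paper's full measure additionally places priors on the affine parameters):

```python
import numpy as np

def affine_invariant_distance(X, Y):
    """Residual of Y under the best affine map of X, in closed form.

    X, Y: (n, 2) arrays of corresponding 2-D points. Solves the
    least-squares problem min_A ||Y - Xh A||_F, where Xh holds the
    points of X in homogeneous coordinates. Illustrative sketch only:
    it omits the paper's priors on the transformation parameters.
    """
    Xh = np.hstack([X, np.ones((len(X), 1))])    # homogeneous coordinates
    A, *_ = np.linalg.lstsq(Xh, Y, rcond=None)   # closed-form affine fit
    return float(np.linalg.norm(Y - Xh @ A))

# Any affine image of a point set is at (near-)zero distance from it:
pts = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
R = np.array([[0.8, -0.6], [0.6, 0.8]])          # a rotation
d_affine = affine_invariant_distance(pts, 2.0 * pts @ R.T + 3.0)
```

A configuration that no affine map can reach (three square corners fixed, one moved) yields a strictly positive distance, which is what makes the measure usable as a clustering metric.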
Data driven image models through continuous joint alignment
PAMI, 2006
Cited by 59 (4 self)
This paper presents a family of techniques that we call congealing for modeling image classes from data. The idea is to start with a set of images and make them appear as similar as possible by removing variability along the known axes of variation. This technique can be used to eliminate “nuisance” variables such as affine deformations from handwritten digits or unwanted bias fields from magnetic resonance images. In addition to separating and modeling the latent images—i.e., the images without the nuisance variables—we can model the nuisance variables themselves, leading to factorized generative image models. When nuisance variable distributions are shared between classes, one can share the knowledge learned in one task with another task, leading to efficient learning. We demonstrate this process by building a handwritten digit classifier from just a single example of each class. In addition to applications in handwritten character recognition, we describe in detail the application of bias removal from magnetic resonance images. Unlike previous methods, we use a separate, nonparametric model for the intensity values at each pixel. This allows us to leverage the data from the MR images of different patients to remove bias from each other. Only very weak assumptions are made about the distributions of intensity values in the images. In addition to the digit and MR applications, we discuss a number of other uses of congealing and describe experiments about the robustness and consistency of the method.
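The congealing idea — iteratively transform each example so the whole stack becomes as similar as possible — can be sketched in one dimension. A toy version (names and the variance criterion are ours; congealing proper minimizes per-pixel entropy over richer transformation sets such as affine deformations):

```python
import numpy as np

def congeal_shifts(signals, max_shift=3, n_sweeps=5):
    """Toy 1-D 'congealing': align signals by integer circular shifts.

    Coordinate descent: for each signal in turn, pick the shift that
    minimizes the summed per-position variance of the aligned stack.
    Variance is a simple stand-in for the entropy criterion used by
    congealing proper; this is an illustrative sketch, not the
    authors' code.
    """
    signals = [np.asarray(s, dtype=float) for s in signals]
    shifts = [0] * len(signals)

    def stack_cost(cand):
        return np.var(np.stack([np.roll(s, k)
                                for s, k in zip(signals, cand)]),
                      axis=0).sum()

    for _ in range(n_sweeps):
        for i in range(len(signals)):
            shifts[i] = min(range(-max_shift, max_shift + 1),
                            key=lambda k: stack_cost(
                                shifts[:i] + [k] + shifts[i + 1:]))
    return shifts
```

Running it on shifted copies of the same bump drives the stack variance to zero, the 1-D analogue of removing a "nuisance" translation before modeling the latent images.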
Transformation-invariant clustering using the EM algorithm
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003
Cited by 58 (11 self)
Clustering is a simple, effective way to derive useful representations of data, such as images and videos. Clustering explains the input as one of several prototypes, plus noise. In situations where each input has been randomly transformed (e.g., by translation, rotation, and shearing in images and videos), clustering techniques tend to extract cluster centers that account for variations in the input due to transformations, instead of more interesting and potentially useful structure. For example, if images from a video sequence of a person walking across a cluttered background are clustered, it would be more useful for the different clusters to represent different poses and expressions, instead of different positions of the person and different configurations of the background clutter. We describe a way to add transformation invariance to mixture models, by approximating the nonlinear transformation manifold by a discrete set of points. We show how the expectation maximization algorithm can be used to jointly learn clusters, while at the same time inferring the transformation associated with each input. We compare this technique with other methods for filtering noisy images obtained from a scanning electron microscope, clustering images from videos of faces into different categories of identity and pose and removing foreground obstructions from video. We also demonstrate that the new technique is quite insensitive to initial conditions and works better than standard techniques, even when the standard techniques are provided with extra data.
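The key move above — approximate the transformation manifold by a discrete set and infer a (cluster, transformation) pair per input — can be sketched with hard assignments and circular shifts as the discrete transformation set. This hypothetical helper is a k-means-style caricature; the paper uses full EM with soft posteriors over (cluster, transformation) pairs and Gaussian noise models:

```python
import numpy as np

def transformed_kmeans(X, mu_init, shifts, n_iters=5):
    """Hard-assignment sketch of transformation-invariant clustering.

    Each datum is explained as a circular shift of one of K prototypes;
    the shift set stands in for the paper's discrete approximation to
    the transformation manifold. Illustrative only.
    """
    mu = np.array(mu_init, dtype=float)
    assign = []
    for _ in range(n_iters):
        # E-step (hard): best (cluster, shift) for each datum
        assign = [min(((k, t) for k in range(len(mu)) for t in shifts),
                      key=lambda kt: np.sum(
                          (x - np.roll(mu[kt[0]], kt[1])) ** 2))
                  for x in X]
        # M-step: average the shift-normalized members of each cluster
        for k in range(len(mu)):
            members = [np.roll(x, -t)
                       for x, (c, t) in zip(X, assign) if c == k]
            if members:
                mu[k] = np.mean(members, axis=0)
    return mu, assign
```

On shifted copies of two distinct prototypes, the recovered clusters separate by identity rather than by position, which is exactly the failure mode of plain clustering the abstract describes.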
Estimating Mixture Models of Images and Inferring Spatial Transformations Using the EM Algorithm
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1999
Cited by 55 (16 self)
Mixture modeling and clustering algorithms are effective, simple ways to represent images using a set of data centers. However, in situations where the images include background clutter and transformations such as translation, rotation, shearing and warping, these methods extract data centers that include clutter and represent different transformations of essentially the same data. Taking face images as an example, it would be more useful for the different clusters to represent different poses and expressions, instead of cluttered versions of different translations, scales and rotations. By including clutter and transformation as unobserved, latent variables in a mixture model, we obtain a new "transformed mixture of Gaussians", which is invariant to a specified set of transformations. We show how a linear-time EM algorithm can be used to fit this model by jointly estimating a mixture mo...
What are textons?
International Journal of Computer Vision, 2002
Cited by 55 (16 self)
Textons refer to fundamental microstructures in generic natural images and thus constitute the basic elements in early (pre-attentive) visual perception. However, the word “texton” remains a vague concept in the literature of computer vision and visual perception, and a precise mathematical definition has yet to be found. In this article, we argue that the definition of texton should be governed by a sound mathematical model of images, and the set of textons must be learned from, or best tuned to, an image ensemble. We adopt a generative image model in which an image is a superposition of bases from an overcomplete dictionary; a texton is then defined as a mini-template that consists of a varying number of image bases with some geometric and photometric configurations. By analogy to physics, if image bases are like protons, neutrons and electrons, then textons are like atoms. A small number of textons can then be learned from training images as repeating microstructures. We report four experiments for comparison. The first experiment computes clusters in the feature space of filter responses. The second uses transformed component analysis in both feature space and image patches. The third adopts a two-layer generative model where an image is generated by image bases and image bases are generated by textons. The fourth experiment shows textons from motion image sequences, which we call movetons.
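The "image = superposition of bases from an overcomplete dictionary" model above is often made concrete with a greedy base-selection loop. A matching-pursuit sketch (our illustrative stand-in for the paper's base-selection step, not its actual inference procedure; assumes unit-norm dictionary atoms):

```python
import numpy as np

def matching_pursuit(x, dictionary, n_atoms=3):
    """Greedy sketch of decomposing a signal into dictionary bases.

    Repeatedly picks the dictionary column most correlated with the
    residual and subtracts its contribution. Columns of `dictionary`
    are assumed to have unit norm.
    """
    residual = x.astype(float).copy()
    code = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        corr = dictionary.T @ residual        # correlation with each atom
        k = int(np.argmax(np.abs(corr)))      # best-matching atom
        code[k] += corr[k]
        residual -= corr[k] * dictionary[:, k]
    return code, residual
```

In the paper's terms, a texton would then be a recurring *configuration* of such selected bases across images, one level above the individual atoms chosen here.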
Automatic Construction of Active Appearance Models as an Image Coding Problem
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004
Cited by 53 (3 self)
The automatic construction of Active Appearance Models (AAMs) is usually posed as finding the location of the base mesh vertices in the input training images. In this paper, we repose the problem as an energy-minimizing image coding problem and propose an efficient gradient-descent algorithm to solve it.
A comparison of algorithms for inference and learning in probabilistic graphical models
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
Cited by 49 (4 self)
Computer vision is currently one of the most exciting areas of artificial intelligence research, largely because it has recently become possible to record, store and process large amounts of visual data. While impressive achievements have been made in pattern classification problems such as handwritten character recognition and face detection, it is even more exciting that researchers may be on the verge of introducing computer vision systems that perform scene analysis, decomposing image input into its constituent objects, lighting conditions, motion patterns, and so on. Two of the main challenges in computer vision are finding efficient models of the physics of visual scenes and finding efficient algorithms for inference and learning in these models. In this paper, we advocate the use of graph-based probability models and their associated inference and learning algorithms for computer vision and scene analysis. We review exact techniques and various approximate, computationally efficient techniques, including iterative conditional modes, the expectation maximization (EM) algorithm, the mean field method, variational techniques, structured variational techniques, Gibbs sampling, the sum-product algorithm and “loopy” belief propagation. We describe how each technique can be applied in a model of multiple, occluding objects, and contrast the behaviors and performances of the techniques using a unifying cost function, free energy.
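The simplest technique on the survey's list, iterated conditional modes, fits in a few lines. A toy sketch on a binary Ising-style denoising model (the function name and the weights `beta`/`eta` are made up for illustration; the survey applies these techniques to a richer occluding-objects model):

```python
import numpy as np

def icm_denoise(noisy, beta=1.0, eta=2.0, n_sweeps=5):
    """Iterated conditional modes on a tiny Ising-style denoising model.

    Pixels take values +/-1. Each pixel is greedily set to the sign
    that maximizes agreement with its 4-neighbours (weight beta) and
    with the noisy observation (weight eta) -- coordinate-wise MAP.
    """
    x = noisy.copy()
    H, W = x.shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                nb = sum(x[a, b]
                         for a, b in [(i - 1, j), (i + 1, j),
                                      (i, j - 1), (i, j + 1)]
                         if 0 <= a < H and 0 <= b < W)
                x[i, j] = 1 if beta * nb + eta * noisy[i, j] > 0 else -1
    return x
```

ICM is fast but only finds a local optimum; the survey's unifying free-energy view is precisely what lets it compare such greedy updates against mean-field, variational, sampling, and belief-propagation alternatives on one footing.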
RASL: Robust Alignment by Sparse and Low-rank Decomposition for Linearly Correlated Images
Cited by 44 (1 self)
This paper studies the problem of simultaneously aligning a batch of linearly correlated images despite gross corruption (such as occlusion). Our method seeks an optimal set of image domain transformations such that the matrix of transformed images can be decomposed as the sum of a sparse matrix of errors and a low-rank matrix of recovered aligned images. We reduce this extremely challenging optimization problem to a sequence of convex programs that minimize the sum of the ℓ1-norm and nuclear norm of the two component matrices, which can be efficiently solved by scalable convex optimization techniques with guaranteed fast convergence. We verify the efficacy of the proposed robust alignment algorithm with extensive experiments on both controlled and uncontrolled real data, demonstrating higher accuracy and efficiency than existing methods over a wide range of realistic misalignments and corruptions.
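The convex program at the core of each RASL step is a sparse-plus-low-rank decomposition. A sketch using a plain inexact augmented-Lagrangian loop (one standard solver for this objective; illustrative only — RASL wraps such an inner solve in an outer loop that repeatedly re-linearizes the image transformations, and the parameter choices here are ours):

```python
import numpy as np

def rpca_inexact_alm(D, lam=None, n_iters=50, rho=1.5):
    """Split D into low-rank A and sparse E.

    Minimizes ||A||_* + lam * ||E||_1 subject to D = A + E via
    alternating singular-value thresholding (low-rank part) and
    soft thresholding (sparse part), with a growing penalty mu.
    """
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(D, 2)
    Y = np.zeros_like(D)          # Lagrange multipliers
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(n_iters):
        # singular-value thresholding -> low-rank component
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # element-wise soft thresholding -> sparse component
        R = D - A + Y / mu
        E = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y = Y + mu * (D - A - E)
        mu *= rho
    return A, E
```

On an easy synthetic instance (rank-1 matrix plus a few large spikes), the loop recovers both components; the sparse term E is what absorbs the "gross corruption such as occlusion" in the aligned image matrix.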
Statistical Modeling and Conceptualization of Visual Patterns
2003
Cited by 29 (3 self)
Natural images contain an overwhelming number of visual patterns generated by diverse stochastic processes. Defining and modeling these patterns is of fundamental importance for generic vision tasks, such as perceptual organization, segmentation, and recognition. The objective of this epistemological paper is to summarize various threads of research in the literature and to pursue a unified framework for conceptualization, modeling, learning, and computing visual patterns. This paper starts with reviewing four research streams: 1) the study of image statistics, 2) the analysis of image components, 3) the grouping of image elements, and 4) the modeling of visual patterns. The models from these research streams are then divided into four categories according to their semantic structures: 1) descriptive models, i.e., Markov random fields (MRF) or Gibbs, 2) variants of descriptive models (causal MRF and "pseudo-descriptive" models), 3) generative models, and 4) discriminative models. The objectives, principles, theories, and typical models are reviewed in each category and the relationships between the four types of models are studied. Two central themes emerge from the relationship studies.
Learning appearance and transparency manifolds of occluded objects in layers
In Proc. CVPR, I:45–52, 2003
Cited by 22 (5 self)
Videos and software available at www.psi.toronto.edu/layers.html. By mapping a set of input images to points in a low-dimensional manifold or subspace, it is possible to efficiently account for a small number of degrees of freedom. For example, images of a person walking can be mapped to a 1-dimensional manifold that measures the phase of the person’s gait. However, when the object is moving around the frame and being occluded by other objects, standard manifold modeling techniques (e.g., principal components analysis, factor analysis, locally linear embedding) try to account for global motion and occlusion. We show how factor analysis can be incorporated into a generative model of layered, 2.5-dimensional vision, to jointly locate objects, resolve occlusion ambiguities, and learn models of the appearance manifolds of objects. We demonstrate the algorithm on a video consisting of four occluding objects, two of which are people who are walking, and occlude each other for most of the duration of the video. Whereas standard manifold modeling techniques fail to extract information about the gaits, the layered model successfully extracts a periodic representation of the gait of each person.