On affine invariant clustering and automatic cast listing in movies
- In Proc. ECCV, 2002
- Cited by 57 (13 self)
Abstract. We develop a distance metric for clustering and classification algorithms which is invariant to affine transformations and includes priors on the transformation parameters. Such clustering requirements are generic to a number of problems in computer vision. We extend existing techniques for affine-invariant clustering, and show that the new distance metric outperforms existing approximations to affine invariant distance computation, particularly under large transformations. In addition, we incorporate prior probabilities on the transformation parameters. This further regularizes the solution, mitigating a rare but serious tendency of the existing solutions to diverge. For the particular special case of corresponding point sets we demonstrate that the affine invariant measure we introduced may be obtained in closed form. As an application of these ideas we demonstrate that the faces of the principal cast of a feature film can be generated automatically using clustering with appropriate invariance. This is a very demanding test as it involves detecting and clustering over tens of thousands of images with the variances including changes in viewpoint, lighting, scale and expression.
Transformation-invariant clustering using the EM algorithm
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003
- Cited by 47 (11 self)
Abstract—Clustering is a simple, effective way to derive useful representations of data, such as images and videos. Clustering explains the input as one of several prototypes, plus noise. In situations where each input has been randomly transformed (e.g., by translation, rotation, and shearing in images and videos), clustering techniques tend to extract cluster centers that account for variations in the input due to transformations, instead of more interesting and potentially useful structure. For example, if images from a video sequence of a person walking across a cluttered background are clustered, it would be more useful for the different clusters to represent different poses and expressions, instead of different positions of the person and different configurations of the background clutter. We describe a way to add transformation invariance to mixture models, by approximating the nonlinear transformation manifold by a discrete set of points. We show how the expectation maximization algorithm can be used to jointly learn clusters, while at the same time inferring the transformation associated with each input. We compare this technique with other methods for filtering noisy images obtained from a scanning electron microscope, clustering images from videos of faces into different categories of identification and pose and removing foreground obstructions from video. We also demonstrate that the new technique is quite insensitive to initial conditions and works better than standard techniques, even when the standard techniques are provided with extra data.
What are textons
- International Journal of Computer Vision, 2002
- Cited by 42 (15 self)
Abstract. Textons refer to fundamental micro-structures in generic natural images and thus constitute the basic elements in early (preattentive) visual perception. However, the word “texton” remains a vague concept in the literature of computer vision and visual perception, and a precise mathematical definition has yet to be found. In this article, we argue that the definition of texton should be governed by a sound mathematical model of images, and the set of textons must be learned from, or best tuned to, an image ensemble. We adopt a generative image model in which an image is a superposition of bases from an over-complete dictionary; a texton is then defined as a mini-template that consists of a varying number of image bases with some geometric and photometric configurations. By analogy to physics, if image bases are like protons, neutrons and electrons, then textons are like atoms. Then a small number of textons can be learned from training images as repeating micro-structures. We report four experiments for comparison. The first experiment computes clusters in feature space of filter responses. The second uses transformed component analysis in both feature space and image patches. The third adopts a two-layer generative model where an image is generated by image bases and image bases are generated by textons. The fourth experiment shows textons from motion image sequences, which we call movetons.
Automatic Construction of Active Appearance Models as an Image Coding Problem
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004
- Cited by 39 (1 self)
The automatic construction of Active Appearance Models (AAMs) is usually posed as finding the location of the base mesh vertices in the input training images. In this paper, we re-pose the problem as an energy-minimizing image coding problem and propose an efficient gradient-descent algorithm to solve it.
A comparison of algorithms for inference and learning in probabilistic graphical models
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
- Cited by 33 (2 self)
Computer vision is currently one of the most exciting areas of artificial intelligence research, largely because it has recently become possible to record, store and process large amounts of visual data. While impressive achievements have been made in pattern classification problems such as handwritten character recognition and face detection, it is even more exciting that researchers may be on the verge of introducing computer vision systems that perform scene analysis, decomposing image input into its constituent objects, lighting conditions, motion patterns, and so on. Two of the main challenges in computer vision are finding efficient models of the physics of visual scenes and finding efficient algorithms for inference and learning in these models. In this paper, we advocate the use of graph-based probability models and their associated inference and learning algorithms for computer vision and scene analysis. We review exact techniques and various approximate, computationally efficient techniques, including iterative conditional modes, the expectation maximization (EM) algorithm, the mean field method, variational techniques, structured variational techniques, Gibbs sampling, the sum-product algorithm and “loopy” belief propagation. We describe how each technique can be applied in a model of multiple, occluding objects, and contrast the behaviors and performances of the techniques using a unifying cost function, free energy.
Data driven image models through continuous joint alignment
- PAMI, 2006
- Cited by 32 (4 self)
This paper presents a family of techniques that we call congealing for modeling image classes from data. The idea is to start with a set of images and make them appear as similar as possible by removing variability along the known axes of variation. This technique can be used to eliminate “nuisance” variables such as affine deformations from handwritten digits or unwanted bias fields from magnetic resonance images. In addition to separating and modeling the latent images—i.e., the images without the nuisance variables—we can model the nuisance variables themselves, leading to factorized generative image models. When nuisance variable distributions are shared between classes, one can share the knowledge learned in one task with another task, leading to efficient learning. We demonstrate this process by building a handwritten digit classifier from just a single example of each class. In addition to applications in handwritten character recognition, we describe in detail the application of bias removal from magnetic resonance images. Unlike previous methods, we use a separate, nonparametric model for the intensity values at each pixel. This allows us to leverage the data from the MR images of different patients to remove bias from each other. Only very weak assumptions are made about the distributions of intensity values in the images. In addition to the digit and MR applications, we discuss a number of other uses of congealing and describe experiments about the robustness and consistency of the method.
Statistical Modeling and Conceptualization of Visual Patterns
- 2003
- Cited by 27 (3 self)
Natural images contain an overwhelming number of visual patterns generated by diverse stochastic processes. Defining and modeling these patterns is of fundamental importance for generic vision tasks, such as perceptual organization, segmentation, and recognition. The objective of this epistemological paper is to summarize various threads of research in the literature and to pursue a unified framework for conceptualization, modeling, learning, and computing visual patterns. This paper starts with reviewing four research streams: 1) the study of image statistics, 2) the analysis of image components, 3) the grouping of image elements, and 4) the modeling of visual patterns. The models from these research streams are then divided into four categories according to their semantic structures: 1) descriptive models, i.e., Markov random fields (MRF) or Gibbs, 2) variants of descriptive models (causal MRF and "pseudodescriptive" models), 3) generative models, and 4) discriminative models. The objectives, principles, theories, and typical models are reviewed in each category and the relationships between the four types of models are studied. Two central themes emerge from the relationship studies.
Learning appearance and transparency manifolds of occluded objects in layers
- CVPR03, I:45–52, 2003
- Cited by 19 (5 self)
Videos and software available at www.psi.toronto.edu/layers.html. By mapping a set of input images to points in a low-dimensional manifold or subspace, it is possible to efficiently account for a small number of degrees of freedom. For example, images of a person walking can be mapped to a 1-dimensional manifold that measures the phase of the person’s gait. However, when the object is moving around the frame and being occluded by other objects, standard manifold modeling techniques (e.g., principal components analysis, factor analysis, locally linear embedding) try to account for global motion and occlusion. We show how factor analysis can be incorporated into a generative model of layered, 2.5-dimensional vision, to jointly locate objects, resolve occlusion ambiguities, and learn models of the appearance manifolds of objects. We demonstrate the algorithm on a video consisting of four occluding objects, two of which are people who are walking, and occlude each other for most of the duration of the video. Whereas standard manifold modeling techniques fail to extract information about the gaits, the layered model successfully extracts a periodic representation of the gait of each person.
Primal Sketch: Integrating Texture and Structure
- Computer Vision and Image Understanding, 2006
- Cited by 16 (9 self)
Following Marr’s insight, we propose a generative image representation called primal sketch, which integrates two modeling components. The first component explains the structural part of an image, such as object boundaries, by a hidden layer of image primitives. The second component models the remaining textural part without distinguishable elements by Markov random fields that interpolate the structural part of the image. We adopt an artist’s notion by calling the two components “sketchable” and “non-sketchable” parts respectively. A dictionary of image primitives is used for modeling structures in natural images, and each primitive is specified by variables for its photometric, geometric, and topological attributes. The primitives in the image representation are not independent but organized in a sketch graph. This sketch graph is modeled by a spatial Markov model that enforces Gestalt organizations. The inference of the sketch graph consists of two phases. Phase I sequentially adds the most prominent image primitives in a procedure similar to matching pursuit. Phase II edits the sketch graph by a number of graph operators to achieve good Gestalt organizations. Experiments show that the primal sketch model produces satisfactory results for a large number of generic images. The primal sketch model is not only a parsimonious image representation for lossy image coding, but also provides a meaningful mid-level generic representation for other vision tasks.
Visual Learning By Integrating Descriptive and Generative Methods
- Proc. of Int’l Conf. on Computer Vision, 2001
- Cited by 15 (7 self)
This paper presents a mathematical framework for visual learning that integrates two popular statistical learning paradigms in the literature: (I) descriptive learning, such as Markov random fields and minimax entropy learning, and (II) generative learning, such as PCA, ICA, TCA, image coding and HMM. We apply this integrated learning framework to texton modeling, and we assume that an observed texture image is generated by multiple layers of hidden stochastic “texton processes”, with each texton being a window function, like a mini-template or a wavelet, under affine transformations. The spatial arrangements of the textons are characterized by minimax entropy models. The texton processes generate images by occlusion or linear addition. Thus, given a raw input image, the learning framework achieves four goals: (i) computing the appearance of the textons, (ii) inferring the hidden stochastic texton processes, (iii) learning Gibbs models for each texton process, and (iv) verifying the learnt textons and Gibbs models through random sampling and texture synthesis. The integrated framework subsumes the minimax entropy learning paradigm and creates a richer class of probability models for visual patterns, which are suited for middle level vision representations. Furthermore we show that the integration of descriptive and generative methods yields a natural and general framework of visual learning. We demonstrate the proposed framework and algorithms on many real images.

