## A Statistical Model for General Contextual Object Recognition (2004)

### Download Links

- [www.cs.ubc.ca]
- DBLP

### Other Repositories/Bibliography

Venue: ECCV

Citations: 106 (7 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Carbonetto04astatistical,
  author    = {Peter Carbonetto and Nando de Freitas and Kobus Barnard},
  title     = {A Statistical Model for General Contextual Object Recognition},
  booktitle = {ECCV},
  year      = {2004},
  pages     = {350--362}
}
```

### Abstract

We consider object recognition as the process of attaching meaningful labels to specific regions of an image, and propose a model that learns spatial relationships between objects. Given a set of images and their associated text (e.g. keywords, captions, descriptions), the objective is to segment an image, in either a crude or sophisticated fashion, then to find the proper associations between words and regions. Previous models are limited by the scope of the representation. In particular, they fail to exploit spatial context in the images and words. We develop a more expressive model that takes this into account. We formulate a spatially consistent probabilistic mapping between continuous image feature vectors and the supplied word tokens. By learning both word-to-region associations and object relations, the proposed model augments scene segmentations due to smoothing implicit in spatial consistency. Context introduces cycles to the undirected graph, so we cannot rely on a straightforward implementation of the EM algorithm for estimating the model parameters and densities of the unknown alignment variables. Instead, we develop an approximate EM algorithm that uses loopy belief propagation in the inference step and iterative scaling on the pseudo-likelihood approximation in the parameter update step. The experiments indicate that our approximate inference and learning algorithm converges to good local solutions. Experiments on a diverse array of images show that spatial context considerably improves the accuracy of object recognition.
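The inference step described in the abstract, loopy belief propagation over the alignment variables, can be illustrated with a generic sum-product sketch on a pairwise MRF. This is not the authors' implementation: the names `phi` (blob-to-word potential), `psi` (word-to-word spatial affinity), and the toy graph are assumptions for illustration only.

```python
def loopy_bp(phi, edges, psi, n_iters=20):
    """Sum-product loopy belief propagation on a pairwise MRF.

    phi   : {site: [unary potential per label]}
    edges : list of undirected (u, v) site pairs
    psi   : psi[wu][wv], pairwise compatibility of labels wu, wv
    Returns approximate marginals {site: [p(label)]}.
    """
    n_labels = len(next(iter(phi.values())))
    nbrs = {u: [] for u in phi}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    # msgs[(u, v)][w] = message from site u to site v about v taking label w
    msgs = {(u, v): [1.0] * n_labels for u, v in edges}
    msgs.update({(v, u): [1.0] * n_labels for u, v in edges})
    for _ in range(n_iters):
        new_msgs = {}
        for (u, v) in msgs:
            out = []
            for wv in range(n_labels):
                total = 0.0
                for wu in range(n_labels):
                    m = phi[u][wu] * psi[wu][wv]
                    # product of incoming messages, excluding the target v
                    for t in nbrs[u]:
                        if t != v:
                            m *= msgs[(t, u)][wu]
                    total += m
                out.append(total)
            z = sum(out)
            new_msgs[(u, v)] = [x / z for x in out]  # normalize for stability
        msgs = new_msgs
    marginals = {}
    for u in phi:
        belief = list(phi[u])
        for t in nbrs[u]:
            for w in range(n_labels):
                belief[w] *= msgs[(t, u)][w]
        z = sum(belief)
        marginals[u] = [x / z for x in belief]
    return marginals
```

On graphs with cycles the marginals are only approximate, which is exactly why the paper pairs this with a pseudo-likelihood parameter update rather than exact EM.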

### Citations

2590 | Normalized cuts and image segmentation
- Shi, Malik
- 1997
Citation Context: ...ning set has a total of 55 distinct concepts. The frequencies of words in the CorelB labels and manual annotations are shown in Fig. 4. We consider two scenarios. In the first, we use Normalized Cuts [22] to segment the images. In the second scenario, we take on the object recognition task without the aid of a sophisticated segmentation algorithm, and instead construct a uniform grid of patches over t...

1585 | Object recognition from local scale-invariant features - Lowe - 1999

1173 | The mathematics of statistical machine translation: Parameter estimation
- Brown, Pietra, et al.
- 1994
Citation Context: ...s a result, the learning problem is unsupervised (or semi-supervised). We adapt the work in another unsupervised problem — learning a lexicon from an aligned bitext in statistical machine translation [9] — to general object recognition, as first proposed in [13]. The data consists of images paired with associated text. Each image consists of a set of blobs that identify the objects in the scene. A bl...

1155 | A performance evaluation of local descriptors - Mikolajczyk, Schmid

922 | On the statistical analysis of dirty pictures
- Besag
- 1986
Citation Context: ...scaling (IS) works on arbitrary exponential models, but it is not a saving grace because convergence is exponentially slow. An alternative to the maximum likelihood estimator is the pseudo-likelihood [6], which maximises local neighbourhood conditional probabilities at sites in the MRF, independent of other sites. The conditionals over the neighbourhoods of the vertices allow the partition function t...
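The pseudo-likelihood idea referenced here, replacing the intractable global partition function with per-site local normalizers, can be sketched minimally. The names `phi`, `psi`, and `nbrs` and the dictionary layout are illustrative assumptions, not the paper's code.

```python
import math

def log_pseudo_likelihood(labels, nbrs, phi, psi, n_labels):
    """Log pseudo-likelihood of an MRF labeling:
    sum over sites u of log p(x_u | x_{N(u)}).

    Each conditional is normalized only over site u's own labels
    (a local Z_u), so the global partition function never appears.
    labels : {site: current label}
    nbrs   : {site: [neighbour sites]} (symmetric)
    phi    : {site: [unary potential per label]}
    psi    : psi[w][w'], pairwise potential
    """
    total = 0.0
    for u, xu in labels.items():
        def score(w, u=u):
            # unnormalized log-potential of site u taking label w,
            # with all neighbours clamped to their current labels
            s = math.log(phi[u][w])
            for v in nbrs.get(u, []):
                s += math.log(psi[w][labels[v]])
            return s
        scores = [score(w) for w in range(n_labels)]
        m = max(scores)  # log-sum-exp for numerical stability
        log_zu = m + math.log(sum(math.exp(s - m) for s in scores))
        total += score(xu) - log_zu
    return total
```

As the context notes, this neglects long-range interactions: each site is conditioned only on its Markov blanket, which is what makes the normalizer tractable.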

871 | Object class recognition by unsupervised scale-invariant learning
- Fergus, Perona, et al.
- 2003
Citation Context: ...features that describes an object. Note that this does not imply that the scene is necessarily segmented, and one could easily implement scale-invariant descriptors to represent object classes, as in [14, 12]. Abstractly, a caption consists of a bag of semantic concepts that describes the objects contained in the image scene. For the time being, we restrict the set of concepts to English nouns (e.g. “bear...

511 | Matching words and pictures
- Barnard, Duygulu, et al.
Citation Context: ...ges with meta data, news photos with captions, and Internet photo stock agencies). Previous work shows that it is reasonable to use such loosely labeled data for problems in vision and image retrieval [1, 4, 13, 11, 2, 7]. We stress that throughout this paper we use annotations solely for testing — training data includes only the text associated with entire images. We do so at a cost since we are no longer blessed wit...

467 | E.: Learning low-level vision
- Freeman, Pasztor
Citation Context: ...signments of its neighbouring blobs. Due to the Markov assumption, we still retain some structure. One could further introduce relations at different scales using a hierarchical representation, as in [15]. Dependence between neighbouring objects introduces spatial context to the classification. Spatial context increases expressiveness; two words may be indistinguishable using low-level features such a...

465 | Loopy belief propagation for approximate inference: An empirical study
- Murphy, Weiss, et al.
- 1999
Citation Context: ...dom field with 6 blob sites. We have omitted the n subscript. The Φ potentials are defined on the vertical lines, and Ψ on the horizontal lines. (b) The corresponding pseudo-likelihood approximation. [19] on the complete likelihood (1) to compute the marginals p̂(anu = i) and p̂(anu = i, anv = j). Since the partition function is intractable and the potentials over the cliques are not complete, paramet...

443 | Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary
- Duygulu, Barnard, et al.
- 2002
Citation Context: ...ges with meta data, news photos with captions, and Internet photo stock agencies). Previous work shows that it is reasonable to use such loosely labeled data for problems in vision and image retrieval [1, 4, 13, 11, 2, 7]. We stress that throughout this paper we use annotations solely for testing — training data includes only the text associated with entire images. We do so at a cost since we are no longer blessed wit...

331 | Modeling annotated data
- Blei, Jordan
- 2003
Citation Context: ...ges with meta data, news photos with captions, and Internet photo stock agencies). Previous work shows that it is reasonable to use such loosely labeled data for problems in vision and image retrieval [1, 4, 13, 11, 2, 7]. We stress that throughout this paper we use annotations solely for testing — training data includes only the text associated with entire images. We do so at a cost since we are no longer blessed wit...

222 | Learning the semantics of words and pictures
- Barnard, Forsyth
- 2001

182 | Discriminative Random Fields: A Discriminative Framework for Contextual Interaction
- Kumar, Hebert
- 2003
Citation Context: ...roximate algorithm for estimating the parameters when the model is not completely observed and the partition function is intractable. Like previous work on detection of man-made structures using MRFs [16, 17], we use pseudo-likelihood for parameter estimation, although we go further and consider the unsupervised setting in which we learn both the potentials and the labels. As with most algorithms based on...

108 | Discriminative fields for modeling spatial dependencies in natural images
- Kumar, Hebert
Citation Context: ...roximate algorithm for estimating the parameters when the model is not completely observed and the partition function is intractable. Like previous work on detection of man-made structures using MRFs [16, 17], we use pseudo-likelihood for parameter estimation, although we go further and consider the unsupervised setting in which we learn both the potentials and the labels. As with most algorithms based on...

77 | Clustering art
- Barnard, Duygulu, et al.
- 2001

77 | The mathematics of statistical machine translation: Parameter estimation
- Brown, Pietra, et al.
- 1993
Citation Context: ...s a result, the learning problem is unsupervised (or semi-supervised). We adapt the work in another unsupervised problem — learning a lexicon from an aligned bitext in statistical machine translation [9] — to general object recognition, as first proposed in [13]. The data consists of images paired with associated text. Each image consists of a set of blobs that identify the objects in the scene. A bl...

42 | The improved iterative scaling algorithm: A gentle introduction
- Berger
- 1997
Citation Context: ...djacent to node u and Znu(θ) is the partition function for the neighbourhood at site u in document n. Iterative scaling allows for a tractable update step by bounding the log pseudo-likelihood. As in [5], we take the partial derivative of a tractable lower bound, Λ(θ), with respect to the model parameters, resulting in the update equations for ∂Λ/∂t(b⋆, w⋆)...

42 | The unified propagation and scaling algorithm
- Teh, Welling
Citation Context: ...be the estimate of alignment anu = i conditional on the empirical distribution p̂(anv = j) and the current parameters. To find the conditionals for (4), we run universal propagation and scaling (UPS) [23] at each pseudo-likelihood site nu with the neighbours v ∈ Nnu clamped to the current marginals p̂(anv). UPS is exact because the undirected graph at each neighbourhood is a tree. Also note that (3) r...

36 | The effects of segmentation and feature choice in a translation model of object recognition
- Barnard, Duygulu, et al.
- 2003
Citation Context: ...objects. The object recognition data has semantic information in the form of captions, so it is reasonable to expect that additional high-level information could improve segmentations. Barnard et al. [3] show that translation models can suggest appropriate blob merges based on word predictions. For instance, high-level groupings can link the black and white halves of a penguin. Spatial consistency le...

29 | A framework for performance characterization of intermediate-level grouping modules
- Borra, Sarkar
- 1997
Citation Context: ...blobs — not objects — is a reasonable performance metric as it matches the objective functions of the translation models. We have yet to compare our models using the evaluation procedures proposed in [8, 2]. The prediction error is given by (1/N) Σ_{n=1..N} (1/M_n) Σ_{u=1..M_n} [1 − δ(ã_nu = a_nu^(max))], where a_nu^(max) is the model alignment with the highest probability and ã_nu is the ground-truth annotation...
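The prediction error quoted in this context is straightforward to compute: average the per-blob error within each image, then average over images. A minimal sketch, where the function name and list-of-lists layout are assumptions:

```python
def prediction_error(pred, truth):
    """Blob prediction error, averaged per image then over images:
    (1/N) * sum_n (1/M_n) * sum_u [pred_nu != truth_nu].

    pred, truth: lists of per-image label lists (one label per blob).
    """
    per_image = []
    for p_img, t_img in zip(pred, truth):
        wrong = sum(1 for p, t in zip(p_img, t_img) if p != t)
        per_image.append(wrong / len(p_img))  # 1/M_n normalization
    return sum(per_image) / len(per_image)    # 1/N normalization
```

Note the double normalization: images with many blobs do not dominate the score, which matches the quoted formula rather than a pooled per-blob average.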

24 | Selection of Scale Invariant Neighborhoods for Object Class Recognition
- Dorkó, Schmid
- 2003
Citation Context: ...features that describes an object. Note that this does not imply that the scene is necessarily segmented, and one could easily implement scale-invariant descriptors to represent object classes, as in [14, 12]. Abstractly, a caption consists of a bag of semantic concepts that describes the objects contained in the image scene. For the time being, we restrict the set of concepts to English nouns (e.g. “bear...

22 | Bayesian feature weighting for unsupervised learning with application to object recognition
- Carbonetto, Freitas, et al.
- 2003

10 | Parameter estimation and model selection in image analysis using Gibbs-Markov random fields
- Seymour
- 1993
Citation Context: ...allow the partition function to decouple and render parameter estimation tractable. The pseudo-likelihood neglects long-range interactions, but empirical trials show reasonable and consistent results [20]. Essentially, the pseudo-likelihood is a product of undirected models, where each undirected model is a single latent variable anu and its observed partner bnu conditioned on the variables in its Mar...

2 | Parameter estimation for inhomogeneous Markov random fields using Pseudo-Likelihood
- Cadez, Smyth
- 1998
Citation Context: ...) are polynomial expressions where each term has degree |Nnu| + 1, we can find new parameter estimates by plugging the solution for (3) or (4) into the IS update θ_i^(new) = θ_i × Δθ_i. Cadez and Smyth [10] prove that the gradient of the pseudo-likelihood with respect to a global parameter is indeed well-conditioned since it has a unique positive root. On large data sets, the IS updates are slow. Option...