Supervised learning of semantic classes for image annotation and retrieval
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007
Cited by 74 (10 self)
Abstract: A probabilistic formulation for semantic image annotation and retrieval is proposed. Annotation and retrieval are posed as classification problems where each class is defined as the group of database images labeled with a common semantic label. It is shown that, by establishing this one-to-one correspondence between semantic labels and semantic classes, a minimum probability of error annotation and retrieval are feasible with algorithms that are 1) conceptually simple, 2) computationally efficient, and 3) do not require prior semantic segmentation of training images. In particular, images are represented as bags of localized feature vectors, a mixture density estimated for each image, and the mixtures associated with all images annotated with a common semantic label pooled into a density estimate for the corresponding semantic class. This pooling is justified by a multiple instance learning argument and performed efficiently with a hierarchical extension of expectation-maximization. The benefits of the supervised formulation over the more complex, and currently popular, joint modeling of semantic label and visual feature distributions are illustrated through theoretical arguments and extensive experiments. The supervised formulation is shown to achieve higher accuracy than various previously published methods at a fraction of their computational cost. Finally, the proposed method is shown to be fairly robust to parameter tuning. Index Terms: Content-based image retrieval, semantic image annotation and retrieval, weakly supervised learning, multiple instance learning, Gaussian mixtures, expectation-maximization, image segmentation, object recognition.
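The pooling and classification steps described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes one-dimensional features, and it pools image-level mixtures by simple averaging rather than by the hierarchical extension of expectation-maximization the abstract mentions. All function names and the toy interface are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian (1-D is an assumption made for brevity)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, mixture):
    """mixture is a list of (weight, mean, variance) components."""
    return sum(w * gauss_pdf(x, m, v) for w, m, v in mixture)

def pool_mixtures(image_mixtures):
    """Naive pooling (an assumption, not hierarchical EM): average the
    image-level mixtures by concatenating components and rescaling weights."""
    n = len(image_mixtures)
    return [(w / n, m, v) for mix in image_mixtures for (w, m, v) in mix]

def annotate(features, class_mixtures, priors, top_k=1):
    """Rank labels by log prior plus log-likelihood of the bag of features,
    i.e. a Bayes decision rule over the semantic classes."""
    scores = {}
    for label, mix in class_mixtures.items():
        loglik = sum(math.log(mixture_pdf(x, mix)) for x in features)
        scores[label] = math.log(priors[label]) + loglik
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Each class density is built only from images carrying that label, so no image needs to be segmented or have its regions labeled individually.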
Semantic annotation and retrieval of music and sound effects
- IEEE TASLP, 2008
Cited by 40 (16 self)
Abstract: We present a computer audition system that can both annotate novel audio tracks with semantically meaningful words and retrieve relevant tracks from a database of unlabeled audio content given a text-based query. We consider the related tasks of content-based audio annotation and retrieval as one supervised multiclass, multilabel problem in which we model the joint probability of acoustic features and words. We collect a data set of 1700 human-generated annotations that describe 500 Western popular music tracks. For each word in a vocabulary, we use this data to train a Gaussian mixture model (GMM) over an audio feature space. We estimate the parameters of the model using the weighted mixture hierarchies expectation maximization algorithm. This algorithm is more scalable to large data sets and produces better density estimates than standard parameter estimation techniques. The quality of the music annotations produced by our system is comparable with the performance of humans on the same task. Our "query-by-text" system can retrieve appropriate songs for a large number of musically relevant words. We also show that our audition system is general by learning a model that can annotate and retrieve sound effects. Index Terms: Audio annotation and retrieval, music information retrieval, semantic music analysis.
A Discriminative Kernel-based Model to Rank Images from Text Queries
- 2007
Cited by 26 (6 self)
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g., our model yields 26.3% average precision over the Corel dataset, compared with 22.0% for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.
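To make "retrieval as a ranking problem" concrete, here is a minimal sketch of an online learner driven by a pairwise hinge loss. It is an illustration under stated assumptions, not the paper's model: it uses a linear scorer instead of kernels, a fixed learning rate, and hypothetical feature vectors that already encode a query paired with a relevant or an irrelevant image.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise_hinge(w, relevant, irrelevant, margin=1.0):
    """Loss is zero once the relevant item outscores the irrelevant one by the margin."""
    return max(0.0, margin - dot(w, relevant) + dot(w, irrelevant))

def train_ranker(pairs, dim, lr=0.1, epochs=20):
    """Online updates: visit one (relevant, irrelevant) pair at a time and
    take a subgradient step whenever the pair violates the margin."""
    w = [0.0] * dim
    for _ in range(epochs):
        for relevant, irrelevant in pairs:
            if pairwise_hinge(w, relevant, irrelevant) > 0.0:
                w = [wi + lr * (r - i) for wi, r, i in zip(w, relevant, irrelevant)]
    return w
```

The loss depends only on the relative order of two images for a query, which is why no intermediate annotation step is needed in this kind of formulation.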
Bridging the gap: Query by semantic example
- IEEE TRANS. MULTIMEDIA, 2007
Cited by 23 (4 self)
A combination of query-by-visual-example (QBVE) and semantic retrieval (SR), denoted as query-by-semantic-example (QBSE), is proposed. Images are labeled with respect to a vocabulary of visual concepts, as is usual in SR. Each image is then represented by a vector, referred to as a semantic multinomial, of posterior concept probabilities. Retrieval is based on the query-by-example paradigm: the user provides a query image, for which 1) a semantic multinomial is computed and 2) matched to those in the database. QBSE is shown to have two main properties of interest, one mostly practical and the other philosophical. From a practical standpoint, because it inherits the generalization ability of SR inside the space of known visual concepts (referred to as the semantic space) but performs much better outside of it, QBSE produces retrieval systems that are more accurate than what was previously possible. Philosophically, because it allows a direct comparison of visual and semantic representations under a common query paradigm, QBSE enables the design of experiments that explicitly test the value of semantic representations for image retrieval. An implementation of QBSE under the minimum probability of error (MPE) retrieval framework, previously applied with success to both QBVE and SR, is proposed, and used to demonstrate the two properties. In particular, an extensive objective comparison of QBSE with QBVE is presented, showing that the former significantly outperforms the latter both inside and outside the semantic space. By carefully controlling the structure of the semantic space, it is also shown that this improvement can only be attributed to the semantic nature of the representation on which QBSE is based.
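The two retrieval steps named above (compute a semantic multinomial for the query, then match it against those in the database) might look like the following sketch. The smoothing constant, the use of Kullback-Leibler divergence as the matching function, and all names are assumptions made for illustration; the abstract itself does not specify them.

```python
import math

def semantic_multinomial(concept_scores, eps=1e-6):
    """Turn non-negative per-concept scores into a probability vector.
    The eps smoothing (an assumption) keeps every entry strictly positive."""
    smoothed = [s + eps for s in concept_scores]
    total = sum(smoothed)
    return [s / total for s in smoothed]

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def retrieve(query_sm, database):
    """Return database keys ordered from most to least similar to the query."""
    return sorted(database, key=lambda name: kl_divergence(query_sm, database[name]))
```

Matching happens entirely in the space of concept probabilities, so two images can be close even when their low-level visual features differ.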
A New Baseline for Image Annotation
Cited by 23 (0 self)
Abstract. Automatically assigning keywords to images is of great interest as it allows one to index, retrieve, and understand large collections of image data. Many techniques have been proposed for image annotation in the last decade that give reasonable performance on standard datasets. However, most of these works fail to compare their methods with simple baseline techniques to justify the need for complex models and subsequent training. In this work, we introduce a new baseline technique for image annotation that treats annotation as a retrieval problem. The proposed technique utilizes low-level image features and a simple combination of basic distances to find nearest neighbors of a given image. The keywords are then assigned using a greedy label transfer mechanism. The proposed baseline outperforms the current state-of-the-art methods on two standard and one large Web dataset. We believe that such a baseline measure will provide a strong platform to compare and better understand future annotation techniques.
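A rough sketch of annotation by nearest-neighbor label transfer follows. It simplifies what the abstract describes: the distance combination here is a plain average of per-feature L1 distances, and the transfer rule is a rank-weighted vote rather than the paper's greedy mechanism, whose details are not given in this listing. Feature names and the data layout are hypothetical.

```python
from collections import Counter

def combined_distance(a, b):
    """Average of per-feature L1 distances; a and b map feature name -> list of floats."""
    dists = [sum(abs(x - y) for x, y in zip(a[name], b[name])) for name in a]
    return sum(dists) / len(dists)

def transfer_labels(query, training, k=2, n_labels=2):
    """training is a list of (features, labels) pairs. Labels of the k nearest
    images are voted on, with nearer neighbors given larger weight."""
    neighbours = sorted(training, key=lambda item: combined_distance(query, item[0]))[:k]
    votes = Counter()
    for rank, (_, labels) in enumerate(neighbours):
        for label in labels:
            votes[label] += k - rank
    return [label for label, _ in votes.most_common(n_labels)]
```

No model is trained: annotation reduces to a distance computation and a lookup, which is what makes the approach a baseline.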
Information-theoretic semantic multimedia indexing
- in ACM Conference on Image and Video Retrieval, 2007
Cited by 20 (10 self)
To solve the problem of indexing collections with diverse text documents, image documents, or documents with both text and images, one needs to develop a model that supports heterogeneous types of documents. In this paper, we show how information theory supplies us with the tools necessary to develop a unique model for text, image, and text/image retrieval. In our approach, for each possible query keyword we estimate a maximum entropy model based on exclusively continuous features that were preprocessed. The unique continuous feature-space of text and visual data is constructed by using a minimum description length criterion to find the optimal feature-space representation (optimal from an information theory point of view). We evaluate our approach in three experiments: only text retrieval, only image retrieval, and text combined with image retrieval.
Modeling music and words using a multi-class naive bayes approach
- In Proceedings of the International Symposium on Music Information Retrieval, 2006
Cited by 13 (9 self)
We propose a query-by-text system for modeling a heterogeneous data set of music and words. We quantitatively show that our system can both annotate a novel song with semantically meaningful words and retrieve relevant unlabeled songs from a database given a text-based query. We explain two feature extraction methods useful for summarizing the audio content of a song. We describe a supervised multi-class naïve Bayes model and compare two parameter estimation techniques. Our approach is influenced by recent computer vision research on the related tasks of image annotation and retrieval.
Audio information retrieval using semantic similarity
- In IEEE ICASSP, 2007
Cited by 13 (9 self)
We improve upon query-by-example for content-based audio information retrieval by ranking items in a database based on semantic similarity, rather than acoustic similarity, to a query example. The retrieval system is based on semantic concept models that are learned from a training data set containing both audio examples and their text captions. Using the concept models, the audio tracks are mapped into a semantic feature space, where each dimension indicates the strength of the semantic concept. Audio retrieval is then based on ranking the database tracks by their similarity to the query in the semantic space. We experiment with both semantic- and acoustic-based retrieval systems on a sound effects database and show that the semantic-based system improves retrieval both quantitatively and qualitatively. Index Terms: computer audition, audio retrieval, semantic similarity.
Region-based image annotation using asymmetrical support vector machine-based multi-instance learning
- In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (June 17 - 22, 2006)
Cited by 11 (0 self)
In region-based image annotation, keywords are usually associated with images instead of individual regions in the training data set. This poses a major challenge for any learning strategy. In this paper, we formulate image annotation as a supervised learning problem under the Multiple-Instance Learning (MIL) framework. We present a novel Asymmetrical Support Vector Machine-based MIL algorithm (ASVM-MIL), which extends the conventional Support Vector Machine (SVM) to the MIL setting by introducing asymmetrical loss functions for false positives and false negatives. The proposed ASVM-MIL algorithm is evaluated on both image annotation data sets and the benchmark MUSK data sets.
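The idea of penalizing false positives and false negatives differently can be sketched with a cost-weighted hinge loss over bags of regions. This is only an illustration: the abstract does not give the form of the ASVM-MIL loss, so the hinge shape, the max-over-instances bag score (a common MIL assumption), and the default cost values are all assumptions.

```python
def bag_score(instance_scores):
    """Common MIL assumption: a bag (image) is as positive as its most positive instance (region)."""
    return max(instance_scores)

def asymmetric_hinge(score, label, cost_fn=2.0, cost_fp=1.0):
    """Hinge loss with separate costs. label is +1 or -1.
    cost_fn scales violations on positive bags (potential false negatives),
    cost_fp scales violations on negative bags (potential false positives).
    The default costs are arbitrary placeholders."""
    violation = max(0.0, 1.0 - label * score)
    return (cost_fn if label > 0 else cost_fp) * violation
```

For example, a bag with region scores [-0.5, 0.3, 0.1] has bag score 0.3; if it is a positive bag it incurs a loss of 2.0 * 0.7, while the same score on a negative bag incurs 1.0 * 1.3.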
Using Language to Drive the Perceptual Grouping of Local Image Features
- IN PROC. IEEE CONF. ON COMPUTER VISION AND PATTERN RECOGNITION, 2006
Cited by 6 (4 self)
We address the problem of learning both the semantics (names) and the visual features (SIFT collections) of objects appearing in a training set of unstructured, captioned images of cluttered scenes. Prior work in applying machine translation models to learn the associations between image features and caption nouns has assumed a one-to-one correspondence between features and nouns. However, each training image may contain thousands of SIFT features belonging to multiple objects. Our challenge is two-fold: 1) grouping the SIFT features into meaningful collections, and 2) learning the object names associated with those collections. Since better collections tend to have stronger associations with object names, we offer an integrated solution that uses the caption words to drive the feature grouping process. The result is a more general model acquisition framework that does not assume words correspond to individual features and does not require training images with isolated objects or unambiguous labels. The model that is learned performs well at labeling cluttered scenes in a set of test images.

