Results 1 - 10
of
70
Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation
- In ICCV
, 2009
"... Image auto-annotation is an important open problem in computer vision. For this task we propose TagProp, a discriminatively trained nearest neighbor model. Tags of test images are predicted using a weighted nearest-neighbor model to exploit labeled training images. Neighbor weights are based on neig ..."
Abstract
-
Cited by 23 (8 self)
- Add to MetaCart
Image auto-annotation is an important open problem in computer vision. For this task we propose TagProp, a discriminatively trained nearest neighbor model. Tags of test images are predicted using a weighted nearest-neighbor model to exploit labeled training images. Neighbor weights are based on neighbor rank or distance. TagProp allows the integration of metric learning by directly maximizing the log-likelihood of the tag predictions in the training set. In this manner, we can optimally combine a collection of image similarity metrics that cover different aspects of image content, such as local shape descriptors, or global color histograms. We also introduce a word specific sigmoidal modulation of the weighted neighbor tag predictions to boost the recall of rare words. We investigate the performance of different variants of our model and compare to existing work. We present experimental results for three challenging data sets. On all three, TagProp makes a marked improvement as compared to the current state-of-the-art. 1.
Bridging the gap: Query by semantic example
- IEEE TRANS. MULTIMEDIA
, 2007
"... A combination of query-by-visual-example (QBVE) and semantic retrieval (SR), denoted as query-by-semantic-example (QBSE), is proposed. Images are labeled with respect to a vocabulary of visual concepts, as is usual in SR. Each image is then represented by a vector, referred to as a semantic multinom ..."
Abstract
-
Cited by 23 (4 self)
- Add to MetaCart
A combination of query-by-visual-example (QBVE) and semantic retrieval (SR), denoted as query-by-semantic-example (QBSE), is proposed. Images are labeled with respect to a vocabulary of visual concepts, as is usual in SR. Each image is then represented by a vector, referred to as a semantic multinomial, of posterior concept probabilities. Retrieval is based on the query-by-example paradigm: the user provides a query image, for which 1) a semantic multinomial is computed and 2) matched to those in the database. QBSE is shown to have two main properties of interest, one mostly practical and the other philosophical. From a practical standpoint, because it inherits the generalization ability of SR inside the space of known visual concepts (referred to as the semantic space) but performs much better outside of it, QBSE produces retrieval systems that are more accurate than what was previously possible. Philosophically, because it allows a direct comparison of visual and semantic representations under a common query paradigm, QBSE enables the design of experiments that explicitly test the value of semantic representations for image retrieval. An implementation of QBSE under the minimum probability of error (MPE) retrieval framework, previously applied with success to both QBVE and SR, is proposed, and used to demonstrate the two properties. In particular, an extensive objective comparison of QBSE with QBVE is presented, showing that the former significantly outperforms the latter both inside and outside the semantic space. By carefully controlling the structure of the semantic space, it is also shown that this improvement can only be attributed to the semantic nature of the representation on which QBSE is based.
A New Baseline for Image Annotation
"... Abstract. Automatically assigning keywords to images is of great interest as it allows one to index, retrieve, and understand large collections of image data. Many techniques have been proposed for image annotation in the last decade that give reasonable performance on standard datasets. However, mo ..."
Abstract
-
Cited by 23 (0 self)
- Add to MetaCart
Abstract. Automatically assigning keywords to images is of great interest as it allows one to index, retrieve, and understand large collections of image data. Many techniques have been proposed for image annotation in the last decade that give reasonable performance on standard datasets. However, most of these works fail to compare their methods with simple baseline techniques to justify the need for complex models and subsequent training. In this work, we introduce a new baseline technique for image annotation that treats annotation as a retrieval problem. The proposed technique utilizes low-level image features and a simple combination of basic distances to find nearest neighbors of a given image. The keywords are then assigned using a greedy label transfer mechanism. The proposed baseline outperforms the current state-of-the-art methods on two standard and one large Web dataset. We believe that such a baseline measure will provide a strong platform to compare and better understand future annotation techniques. 1
A game-based approach for collecting semantic annotations of music
- In 8th International Conference on Music Information Retrieval (ISMIR
, 2007
"... Games based on human computation are a valuable tool for collecting semantic information about images. We show how to transfer this idea into the music domain in order to collect high-quality semantic information about songs. We present Listen Game, a online, multiplayer game that measures the seman ..."
Abstract
-
Cited by 21 (1 self)
- Add to MetaCart
Games based on human computation are a valuable tool for collecting semantic information about images. We show how to transfer this idea into the music domain in order to collect high-quality semantic information about songs. We present Listen Game, a online, multiplayer game that measures the semantic relationship between music and words. In the normal mode, a player sees a list of semantically related words (e.g., instruments, emotions, usages, genres) and is asked to pick the best and worst word to describe a song. In the freestyle mode, a user is asked to suggest a new word that describes the music. Each player receives realtime feedback about the agreement amongst all players. We show that we can use the data collected during a two-week pilot study of Listen Game to learn a supervised multiclass labeling (SML) model. We show that this SML model can annotate a novel song with meaningful words and retrieve relevant songs from a database of audio content. 1
FIVE APPROACHES TO COLLECTING TAGS FOR MUSIC
"... We compare five approaches to collecting tags for music: conducting a survey, harvesting social tags, deploying annotation games, mining web documents, and autotagging audio content. The comparison includes a discussion of both scalability (financial cost, human involvement, and computational resour ..."
Abstract
-
Cited by 16 (3 self)
- Add to MetaCart
We compare five approaches to collecting tags for music: conducting a survey, harvesting social tags, deploying annotation games, mining web documents, and autotagging audio content. The comparison includes a discussion of both scalability (financial cost, human involvement, and computational resources) and quality (the cold start problem & popularity bias, strong vs. weak labeling, vocabulary structure & size, and annotation accuracy). We then describe one state-ofthe-art system for each approach. The performance of each system is evaluated using a tag-based music information retrieval task. Using this task, we are able to quantify the effect of popularity bias on each approach by making use of a subset of more popular (short-head) songs and a set of less popular (long-tail) songs. Lastly, we propose a simple hybrid context-content system that combines our individual approaches and produces superior retrieval results. 1
Towards musical query-by-semantic-description using the CAL500 data set
- Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
, 2007
"... Query-by-semantic-description (QBSD) is a natural paradigm for retrieving content from large databases of music. A major impediment to the development of good QBSD systems for music information retrieval has been the lack of a cleanlylabeled, publicly-available, heterogeneous data set of songs and a ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
Query-by-semantic-description (QBSD) is a natural paradigm for retrieving content from large databases of music. A major impediment to the development of good QBSD systems for music information retrieval has been the lack of a cleanlylabeled, publicly-available, heterogeneous data set of songs and associated annotations. We have collected the Computer Audition Lab 500-song (CAL500) data set by having humans listen to and annotate songs using a survey designed to capture ‘semantic associations ’ between music and words. We adapt the supervised multi-class labeling (SML) model, which has shown good performance on the task of image retrieval, and use the CAL500 data to learn a model for music retrieval. The model parameters are estimated using the weighted mixture hierarchies expectation-maximization algorithm which has been specifically designed to handle realvalued semantic association between words and songs, rather than binary class labels. The output of the SML model, a vector of class-conditional probabilities, can be interpreted as a semantic multinomial distribution over a vocabulary. By also representing a semantic query as a query multinomial distribution, we can quickly rank order the songs in a database based on the Kullback-Leibler divergence between the query multinomial and each song’s semantic multinomial. Qualitative and quantitative results demonstrate that our SML model can both annotate a novel song with meaningful words and retrieve relevant songs given a multi-word, text-based query.
Scene Classification with Low-dimensional Semantic Spaces and Weak Supervision
"... A novel approach to scene categorization is proposed. Similar to previous works of [11, 15, 3, 12], we introduce an intermediate space, based on a low dimensional semantic “theme ” image representation. However, instead of learning the themes in an unsupervised manner, they are learned with weak sup ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
A novel approach to scene categorization is proposed. Similar to previous works of [11, 15, 3, 12], we introduce an intermediate space, based on a low dimensional semantic “theme ” image representation. However, instead of learning the themes in an unsupervised manner, they are learned with weak supervision, from casual image annotations. Each theme induces a probability density on the space of low-level features, and images are represented as vectors of posterior theme probabilities. This enables an image to be associated with multiple themes, even when there are no multiple associations in the training labels. An implementation is presented and compared to various existing algorithms, on benchmark datasets. It is shown that the proposed low dimensional representation correlates well with human scene understanding, and is able to learn theme co-occurrences without explicit training. It is also shown to outperform unsupervised latent-space methods, with much smaller training complexity, and to achieve performance close to the state of the art methods, which rely on much higher-dimensional image representations. Finally a study of the effect of dimensionality on the classification performance is presented, indicating that the dimensionality of theme space grows sub-linearly with the number of scene categories. 1.
A New Approach to Cross-Modal Multimedia Retrieval
"... The problem of joint modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investig ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
The problem of joint modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for crossmodal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.
Holistic context modeling using semantic co-occurences
- In Proceedings of IEEE CVPR
, 2009
"... We present a simple framework to model contextual relationships between visual concepts. The new framework combines ideas from previous object-centric methods (which model contextual relationships between objects in an image, such as their co-occurrence patterns) and scenecentric methods (which lear ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
We present a simple framework to model contextual relationships between visual concepts. The new framework combines ideas from previous object-centric methods (which model contextual relationships between objects in an image, such as their co-occurrence patterns) and scenecentric methods (which learn a holistic context model from the entire image, known as its “gist”). This is accomplished without demarcating individual concepts or regions in the image. First, using the output of a generic appearance based concept detection system, a semantic space is formulated, where each axis represents a semantic feature. Next, context models are learned for each of the concepts in the semantic space, using mixtures of Dirichlet distributions. Finally, an image is represented as a vector of posterior concept probabilities under these contextual concept models. It is shown that these posterior probabilities are remarkably noise-free, and an effective model of the contextual relationships between semantic concepts in natural images. This is further demonstrated through an experimental evaluation with respect to two vision tasks, viz. scene classification and image annotation, on benchmark datasets. The results show that, besides quite simple to compute, the proposed context models attain superior performance than state of the art systems in both tasks. 1.

