Real-time bag of words, approximately
In Proc. ACM Int’l Conf. Image and Video Retrieval, 2009
Abstract - Cited by 7 (2 self)
We start from the state-of-the-art Bag of Words pipeline that in the 2008 benchmarks of TRECVID and PASCAL yielded the best performance scores. We have contributed to that pipeline, which now forms the basis to compare various fast alternatives for all of its components: (i) For descriptor extraction we propose a fast algorithm to densely sample SIFT and SURF, and we compare several variants of these descriptors. (ii) For descriptor projection we compare a k-means visual vocabulary with a Random Forest. As a preprojection step we experiment with PCA on the descriptors to decrease projection time. (iii) For classification we use Support Vector Machines and compare the χ² kernel with the RBF kernel. Our results lead to a 10-fold speed increase without any loss of accuracy and to a 30-fold speed increase with 17% loss of accuracy, where the latter system does real-time classification at 26 images per second.
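The descriptor projection step in (ii) can be illustrated with a minimal sketch. It assumes a k-means vocabulary that has already been learned, and it uses plain exhaustive nearest-centre search rather than the accelerated variants the abstract evaluates. The function names and the toy 2-D data are illustrative assumptions, not taken from the paper.

```python
from math import dist

def assign_visual_words(descriptors, vocabulary):
    """Map each descriptor to the index of its nearest vocabulary centre."""
    return [
        min(range(len(vocabulary)), key=lambda k: dist(d, vocabulary[k]))
        for d in descriptors
    ]

def bag_of_words(descriptors, vocabulary):
    """Build an L1-normalised histogram of visual-word counts for one image."""
    counts = [0] * len(vocabulary)
    for word in assign_visual_words(descriptors, vocabulary):
        counts[word] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)

# Toy 2-D "descriptors" (assumption); real SIFT descriptors have 128 dimensions.
vocabulary = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (9.0, 9.0), (0.0, 1.0), (10.0, 11.0)]
histogram = bag_of_words(descriptors, vocabulary)
```

The resulting histogram is the fixed-length image representation that a kernel SVM would then classify.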
Real-time visual concept classification
IEEE TRANSACTIONS ON MULTIMEDIA, 2010
Abstract - Cited by 5 (4 self)
As datasets grow increasingly large in content-based image and video retrieval, computational efficiency of concept classification is important. This paper reviews techniques to accelerate concept classification, where we show the trade-off between computational efficiency and accuracy. As a basis, we use the Bag-of-Words algorithm that in the 2008 benchmarks of TRECVID and PASCAL led to the best performance scores. We divide the evaluation into three steps: 1) Descriptor Extraction, where we evaluate SIFT, SURF, DAISY, and Semantic Textons. 2) Visual Word Assignment, where we compare a k-means visual vocabulary with a Random Forest and evaluate subsampling, dimension reduction with PCA, and division strategies of the Spatial Pyramid. 3) Classification, where we evaluate the χ², RBF, and Fast Histogram Intersection kernel for the SVM. Apart from the evaluation, we accelerate the calculation of densely sampled SIFT and SURF, accelerate nearest neighbor assignment, and improve accuracy of the Histogram Intersection kernel. We conclude by discussing whether further acceleration of the Bag-of-Words pipeline is possible. Our results lead to a 7-fold speed increase without accuracy loss, and a 70-fold speed increase with 3% accuracy loss. The latter system does classification in real-time, which opens up new applications for automatic concept classification. For example, this system permits five standard desktop PCs to automatically tag for 20 classes all images that are currently uploaded to Flickr.
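For reference, the three kernels named in step 3 can be written down in a few lines. This is a minimal sketch under stated assumptions: the χ² kernel is taken in the common exponential form exp(-γ Σ (x_i - y_i)² / (x_i + y_i)), the histogram intersection kernel is the plain (not the accelerated "Fast") version, and `gamma` is a free parameter. The paper's exact formulation may differ.

```python
from math import exp

def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return total

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel (assumed form)."""
    return exp(-gamma * chi2_distance(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel on the squared Euclidean distance."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def histogram_intersection_kernel(x, y):
    """Sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(x, y))
```

The histogram intersection kernel needs no exponential and no division, which makes it the cheapest of the three to evaluate per pair of histograms.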
Learning TRECVID’08 High-Level Features from YouTube™
Abstract - Cited by 2 (0 self)
Run No.  Run ID          Run Description                                                              infMAP (%)
Training on TV08 data:
1        IUPR-TV-M       SIFT visual words with maximum entropy                                       6.1
2        IUPR-TV-MF      SIFT with maximum entropy, fused with color+texture and motion (NN matching) 5.9
3        IUPR-TV-S       SIFT visual words with SVMs                                                  5.3
4        IUPR-TV-SF      SIFT with SVMs, fused with color+texture and motion (NN matching)            6.3
Training on YouTube data (no use of standard training sets):
5        IUPR-YOUTUBE-S  SIFT visual words with SVMs                                                  2.2
6        IUPR-YOUTUBE-M  SIFT visual words with maximum entropy                                       2.1
We participated in TRECVID’s High-level Features task [17] to investigate online video as an alternative data source for concept detector training. Such video material is publicly available in large quantities from portals like YouTube. In our setup, tags provided by users during video upload serve as weak ground truth labels, such that thousands of concepts can be learned without manual annotation effort. On the downside, online video as a domain is complex, and the labels associated with it are coarse and unreliable, such that performance loss can be expected compared to high-quality standard training sets. To find out if it is possible to train concept detectors on web video, our TRECVID experiments compare state-of-the-art (visual only) concept detection systems when (1) training on the standard TRECVID development data and (2) training on clips downloaded from YouTube. Our key observation is that YouTube-based detectors work well for some concepts, but are overall significantly outperformed by the “specialized” systems trained on standard TRECVID’08 data (giving an infMAP of 2.2% and 2.1% compared to 5.3% and 6.1%). An in-depth analysis shows that a major reason for this seems to be redundancy in the TV08 dataset
Jointly optimising relevance and diversity in image retrieval
In CIVR ’09: Proc. 8th ACM Int. Conf. on Image and Video Ret, 2009
Abstract - Cited by 2 (0 self)
In this paper we present a method to jointly optimise the relevance and the diversity of the results in image retrieval. Without considering diversity, image retrieval systems often mainly find a set of very similar results, so-called near duplicates, which is often not the desired behaviour. From the user perspective, the ideal result consists of documents which are not only relevant but also diverse. Most approaches addressing diversity in image or information retrieval use a two-step approach where in a first step a set of potentially relevant images is determined and in a second step these images are reranked to be diverse among the first positions. In contrast to these approaches, our method addresses the problem directly and jointly optimises the diversity and the relevance of the images in the retrieval ranking using techniques inspired by dynamic programming algorithms. We quantitatively evaluate our method on the ImageCLEF 2008 photo retrieval data and obtain results which outperform the state of the art. Additionally, we perform a qualitative evaluation on a new product search task and observe that the diverse results are more attractive to an average user.
P.: Augmenting Bag-of-Words - Category Specific Features and Concept Reasoning
In: Working Notes of CLEF 2010, 2010
Abstract - Cited by 1 (0 self)
In this paper we present our approach to the 2010 ImageCLEF PhotoAnnotation task. Based on the well-known bag-of-words approach we suggest two extensions. First, we analyzed the impact of category specific features and classifiers. In order to classify quality-related image categories we implemented a sharpness measure and use this as an additional feature in the classification process. Second, we propose a postclassification step, which is based on the observation that many of the categories should be considered as being related to each other: Some categories exclude or allow for inference to others. We incorporate inference and exclusion rules by refining the classification results. The results we obtain show that both extensions can provide a classification performance increase when compared to the standard BoW approach.
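The postclassification idea can be sketched as follows. The abstract does not say how the rules are applied, so this is only one plausible reading: an inference rule raises the score of the implied category to at least that of its premise, and an exclusion rule zeroes the weaker of two mutually exclusive categories. The category names, scores and the function `refine_scores` are hypothetical.

```python
def refine_scores(scores, implications, exclusions):
    """Refine per-category scores with inference and exclusion rules.

    implications: (premise, consequence) pairs; the consequence gets at
                  least the premise's score.
    exclusions:   (a, b) pairs of mutually exclusive categories; the
                  lower-scoring one is set to 0.
    """
    refined = dict(scores)
    for premise, consequence in implications:
        refined[consequence] = max(refined[consequence], refined[premise])
    for a, b in exclusions:
        if refined[a] >= refined[b]:
            refined[b] = 0.0
        else:
            refined[a] = 0.0
    return refined

# Hypothetical classifier outputs for one image.
scores = {"river": 0.8, "water": 0.3, "day": 0.6, "night": 0.4}
refined = refine_scores(scores, [("river", "water")], [("day", "night")])
```

Here "water" is lifted to 0.8 because "river" implies it, and "night" is suppressed because "day" scores higher.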
S.: Detection of Visual Concepts and Annotation of Images using Predictive Clustering Trees. In: Working
Abstract - Cited by 1 (0 self)
In this paper, we present a multiple targets classification system for visual concept detection and image annotation. Multiple targets classification (MTC) is a variant of classification where an instance may belong to multiple classes at the same time. The system is composed of two parts: feature extraction and classification/annotation. The feature extraction part provides global and local descriptions of the images. These descriptions are then used to learn a classifier and to annotate an image with the corresponding concepts. To this end, we use predictive clustering trees (PCTs), which are capable of classifying an instance into multiple classes at once, thus exploiting the interactions that may occur among the different visual concepts (classes). Moreover, we constructed ensembles (random forests) of PCTs to improve the predictive performance. We tested our system on the image database from the visual concept detection and annotation task of ImageCLEF 2010. The extensive experiments conducted on the benchmark database show that our system has very high predictive performance and can be easily scaled to a large number of images and visual concepts.
University of Marburg at TRECVID 2010: Semantic Indexing
Abstract - Cited by 1 (1 self)
In this paper, we summarize our results for the semantic indexing task at TRECVID 2010. Last year, we showed that the use of object detection results as an additional input for SVM-based concept classifiers improved the overall performance. This year, we investigated whether a state-of-the-art bag-of-visual-words (BoW) approach can also be improved by adding object-based features. In this context, Multiple Kernel Learning (MKL) was applied to find the best feature weighting. The experiments revealed that the supplementation of BoW-based features with object-based features significantly improved the concept detection performance. Furthermore, we showed that a more uniform distribution of kernel weights using l2-norm MKL gained better results. Altogether, our best run achieved a mean inferred average precision of 6.96% and we submitted the best results for the concepts “vehicle” and “ground_vehicle”. The results of our participation in the semantic indexing task (also known as high-level feature extraction task) are presented in this section in the form of the requested structured abstract. In the following sections, we describe our system for semantic indexing along with the experimental results. In Section 2, the different feature types are explained. The Multiple Kernel Learning framework is discussed in Section 3, while the experimental results are presented in Section 4. Section 5 concludes the paper. “What approach or combination of approaches did you test in each of your submitted runs?”
Dense Interest Points
Abstract
Local features or image patches have become a standard tool in computer vision, with numerous application domains. Roughly speaking, two different types of patch-based image representations can be distinguished: interest points, such as corners or blobs, whose position, scale and shape are computed by a feature detector algorithm, and dense sampling, where patches of fixed size and shape are placed on a regular grid (possibly repeated over multiple scales). Interest points focus on ‘interesting’ locations in the image and include various degrees of viewpoint and illumination invariance, resulting in better repeatability scores. Dense sampling, on the other hand, gives a better coverage of the image, a constant amount of features per image area, and simple spatial relations between features. In this paper, we propose a hybrid scheme, which we call dense interest points, where we start from densely sampled patches yet optimize their position and scale parameters locally. We investigate whether, by doing so, it is possible to get the best of both worlds.
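The dense sampling baseline described above (fixed-size patches on a regular grid, repeated over multiple scales) can be sketched in a few lines. This covers only the starting point of the hybrid scheme; the local optimisation of position and scale that the paper proposes is not shown. The function name and parameters are illustrative assumptions.

```python
def dense_patches(width, height, patch_sizes, stride):
    """List (x, y, size) for square patches on a regular grid.

    (x, y) is the top-left corner of a patch; only patches that fit
    entirely inside the image are kept. One grid is built per patch size.
    """
    patches = []
    for size in patch_sizes:
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                patches.append((x, y, size))
    return patches

# An 8x8 image sampled at two scales with a stride of 4 pixels.
patches = dense_patches(8, 8, patch_sizes=[4, 8], stride=4)
```

Because the grid depends only on the image size, the number of patches per image area is constant, which is the coverage property the abstract attributes to dense sampling.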
What is the Spatial Extent of an Object? J.R.R. Uijlings, A.W.M. Smeulders, Intelligent Systems Lab Amsterdam,
Abstract
This paper discusses the question: Can we improve the recognition of objects by using their spatial context? We start from Bag-of-Words models and use the Pascal 2007 dataset. We use the rough object bounding boxes that come with this dataset to investigate the fundamental gain context can bring. Our main contributions are: (I) The result of Zhang et al. in CVPR07 that context is superfluous, derived from the Pascal 2005 data set of 4 classes, does not generalize to this dataset. For our larger and more realistic dataset, context is important indeed. (II) Using the rough bounding box to limit or extend the scope of an object during both training and testing, we find that the spatial extent
CONCEPT LEARNING FOR IMAGE AND VIDEO RETRIEVAL: THE INVERSE RANDOM UNDER SAMPLING APPROACH
Abstract
A typical concept-detection problem is characterised by greatly disproportionate sizes of the populations of training samples in the concept and anti-concept classes. In many cases, the population of anti-concept (negative) examples outnumbers the concept examples. In this paper, an inverse random under sampling method is proposed to solve this imbalance problem. By the proposed method of inverse under sampling of the anti-concept class we can construct a large number of concept detectors which in the fusion stage facilitate a fine control of both false negative rates and false positive rates. In this method the main emphasis in learning the discriminant functions is on the concept class, leading to an almost perfect separation of the two classes for each detector. The proposed methodology is applied to commonly-used video and image collection benchmarks: Mediamill and Scene datasets. The results indicate significant performance gains. For some concepts, the improvement in the average precision is by several orders of magnitude, and the mean average precision is 12% and 17% better for Mediamill and Scene datasets respectively when compared with a conventionally trained logistic regression classifier.
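A rough sketch of the sampling and fusion stages may help. It rests on two assumptions the abstract does not spell out: "inverse" is read as drawing negative subsets that are smaller than the positive set (so each detector sees the concept class as the majority), and fusion is done by averaging detector scores. The function names and the `ratio` parameter are hypothetical, and the training of each detector is left out.

```python
import random

def inverse_undersample(positives, negatives, n_detectors, ratio=0.5, seed=0):
    """Build one training set per detector.

    Each set keeps all positives and a random subset of negatives whose
    size is ratio * len(positives) (assumption: ratio < 1, so negatives
    become the minority class in every set).
    """
    rng = random.Random(seed)
    n_neg = max(1, min(len(negatives), int(ratio * len(positives))))
    return [
        (list(positives), rng.sample(negatives, n_neg))
        for _ in range(n_detectors)
    ]

def fuse(detector_scores):
    """Combine the scores of all detectors by averaging (assumed rule)."""
    return sum(detector_scores) / len(detector_scores)

positives = list(range(10))         # 10 concept examples (toy ids)
negatives = list(range(100, 200))   # 100 anti-concept examples (toy ids)
training_sets = inverse_undersample(positives, negatives, n_detectors=5)
```

Each of the five training sets would be passed to a base learner, and the fused score of the resulting detectors could then be thresholded to trade false negatives against false positives.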

