Efficient Additive Kernels via Explicit Feature Maps

by Andrea Vedaldi, Andrew Zisserman
Results 1 - 10 of 245

VLFeat -- An open and portable library of computer vision algorithms

by Andrea Vedaldi, et al. , 2010
"... ..."
Abstract - Cited by 526 (10 self) - Add to MetaCart
Abstract not found
(Show Context)

Citation Context

...compare descriptors. Much better results can be obtained by pre-transforming the data through vl_homkermap, which computes an explicit feature map that “emulates” a non-linear χ²-kernel as a linear one [16]. Then, training with 15 images for each of the 101 Caltech101 categories can be done by using vl_pegasos, an implementation of stochastic gradient SVM training. Results. The computation and quantiza...
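The explicit feature map mentioned above has a simple closed form in the cited paper: a 1-homogeneous kernel such as χ²(a, b) = 2ab/(a + b) is approximated by sampling its spectrum κ(λ) (sech(πλ) for χ²) at 2n+1 points with period L, yielding a (2n+1)-fold expansion of each input dimension. A minimal NumPy sketch of that map follows; the function name, the guard against log(0), and the default L are our assumptions, and VLFeat's vl_homkermap is the reference implementation.

import numpy as np

def chi2_feature_map(x, n=1, L=0.5):
    """Order-n homogeneous kernel map for the chi2 kernel
    k(a, b) = 2ab/(a+b), applied elementwise to a non-negative
    vector x. Output dimension is (2n+1)*len(x)."""
    x = np.asarray(x, dtype=float)
    logx = np.log(np.maximum(x, 1e-12))              # guard against log(0)
    kappa = lambda lam: 1.0 / np.cosh(np.pi * lam)   # chi2 spectrum: sech(pi*lambda)
    out = np.zeros((2 * n + 1, x.size))
    out[0] = np.sqrt(L * x * kappa(0.0))
    for i in range(1, n + 1):
        coeff = np.sqrt(2.0 * L * x * kappa(i * L))
        out[2 * i - 1] = coeff * np.cos(i * L * logx)
        out[2 * i] = coeff * np.sin(i * L * logx)
    return out.T.reshape(-1)                         # 2n+1 features per input dimension

# Sanity check: dot products of mapped scalars approximate 2ab/(a+b).
a, b = 0.3, 0.7
approx = chi2_feature_map([a], n=3) @ chi2_feature_map([b], n=3)
exact = 2 * a * b / (a + b)                          # both close to 0.42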

Improving the Fisher kernel for large-scale image classification.

by Florent Perronnin , Jorge Sánchez , Thomas Mensink - In ECCV, , 2010
"... Abstract. The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enric ..."
Abstract - Cited by 362 (20 self) - Add to MetaCart
Abstract. The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
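Two of the paper's "well-motivated modifications" are simple enough to state exactly: power normalization (the signed square root) followed by L2 normalization of the Fisher vector. A minimal NumPy sketch, assuming the unnormalized Fisher vector has already been computed; the function name is ours, and α = 0.5 is the exponent the paper highlights.

import numpy as np

def improve_fisher_vector(fv, alpha=0.5):
    """Power-normalize (signed square root when alpha=0.5) and then
    L2-normalize a Fisher vector, as proposed in the paper."""
    fv = np.sign(fv) * np.abs(fv) ** alpha
    return fv / max(np.linalg.norm(fv), 1e-12)       # avoid division by zero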

Citation Context

...ort a faster training time and a small accuracy improvement over the SVM. However SR-KDA still scales in O(N³). Wang et al. [14], Maji and Berg [15], Perronnin et al. [16] and Vedaldi and Zisserman [17] proposed different approximations for additive kernels. These algorithms scale linearly with the number of training samples while providing the same accuracy as the original non-linear SVM classifier...

Iterative quantization: A procrustean approach to learning binary codes

by Yunchao Gong, Svetlana Lazebnik - In Proc. of the IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), 2011
"... This paper addresses the problem of learning similaritypreserving binary codes for efficient retrieval in large-scale image collections. We propose a simple and efficient alternating minimization scheme for finding a rotation of zerocentered data so as to minimize the quantization error of mapping t ..."
Abstract - Cited by 157 (6 self) - Add to MetaCart
This paper addresses the problem of learning similarity-preserving binary codes for efficient retrieval in large-scale image collections. We propose a simple and efficient alternating minimization scheme for finding a rotation of zero-centered data so as to minimize the quantization error of mapping this data to the vertices of a zero-centered binary hypercube. This method, dubbed iterative quantization (ITQ), has connections to multi-class spectral clustering and to the orthogonal Procrustes problem, and it can be used both with unsupervised data embeddings such as PCA and supervised embeddings such as canonical correlation analysis (CCA). Our experiments show that the resulting binary coding schemes decisively outperform several other state-of-the-art methods.
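The alternating scheme described above reduces to two closed-form updates. A minimal NumPy sketch, assuming the data has already been zero-centered and PCA-projected down to the code length; the iteration count, seed, and random orthogonal initialization are illustrative choices.

import numpy as np

def itq(V, n_iter=50, seed=0):
    """Iterative quantization: V is n_samples x n_bits (zero-centered,
    PCA-projected). Alternates between assigning binary codes and
    solving the orthogonal Procrustes problem for the rotation R."""
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((V.shape[1], V.shape[1])))
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R, update the codes
        U, _, Wt = np.linalg.svd(V.T @ B)  # fix B, update the rotation:
        R = U @ Wt                         # argmin_R ||B - VR||_F over orthogonal R
    return np.sign(V @ R), R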

Aggregating local image descriptors into compact codes

by Hervé Jégou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Pérez, Cordelia Schmid, 2011
"... ..."
Abstract - Cited by 127 (14 self) - Add to MetaCart
Abstract not found

Learning latent temporal structure for complex event detection

by Kevin Tang, Li Fei-Fei, Daphne Koller - In CVPR, 2012
"... In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interestin ..."
Abstract - Cited by 75 (2 self) - Add to MetaCart
In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is extremely important when we set our sights on being able to process a very large number of videos quickly and efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.
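The exact dynamic-programming inference the abstract refers to can be sketched as a segment-level Viterbi recursion in which state durations are scored explicitly. The NumPy sketch below is a generic variable-duration Viterbi under assumed log-domain potentials; the paper's actual max-margin potentials differ, and the uniform start score is our assumption. Complexity is O(T * max_dur * K^2).

import numpy as np

def viterbi_duration(log_emit, log_trans, log_dur, max_dur):
    """Exact DP for a variable-duration HMM (log domain).
    log_emit:  T x K       per-frame state scores
    log_trans: K x K       transition scores between states
    log_dur:   K x max_dur log-score of a state lasting d frames
    Returns the best segmentation score (backpointers omitted)."""
    T, K = log_emit.shape
    cum = np.vstack([np.zeros(K), np.cumsum(log_emit, axis=0)])  # O(1) segment sums
    score = np.full((T + 1, K), -np.inf)
    score[0] = 0.0
    for t in range(1, T + 1):
        for d in range(1, min(max_dur, t) + 1):
            seg = cum[t] - cum[t - d]                     # emissions for frames [t-d, t)
            if t - d == 0:
                best_prev = np.zeros(K)                   # uniform start score (assumption)
            else:
                prev = score[t - d][:, None] + log_trans  # enter state j from best i
                best_prev = prev.max(axis=0)
            score[t] = np.maximum(score[t], best_prev + log_dur[:, d - 1] + seg)
    return score[T].max()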

Citation Context

...interest point detector [11] and concatenated Histogram of Gradient (HOG) and Histogram of Flow (HOF) descriptors [12]. In addition, because [16] uses a χ²-SVM, we use the method of additive kernels [25] to approximate a χ² kernel for our BoW features to maintain efficient processing while increasing discriminative power. Because the public release of this dataset is not the full dataset used in the...

Return of the Devil in the Details: Delving Deep into Convolutional Nets

by Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman , 2014
"... The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in chal-lenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compar ..."
Abstract - Cited by 71 (8 self) - Add to MetaCart
The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost. Source code and models to reproduce the experiments in the paper are made publicly available.

Detecting Activities of Daily Living in First-Person Camera Views

by Hamed Pirsiavash, Deva Ramanan
"... We present a novel dataset and novel algorithms for the problem of detecting activities of daily living (ADL) in firstperson camera views. We have collected a dataset of 1 million frames of dozens of people performing unscripted, everyday activities. The dataset is annotated with activities, object ..."
Abstract - Cited by 65 (3 self) - Add to MetaCart
We present a novel dataset and novel algorithms for the problem of detecting activities of daily living (ADL) in first-person camera views. We have collected a dataset of 1 million frames of dozens of people performing unscripted, everyday activities. The dataset is annotated with activities, object tracks, hand positions, and interaction events. ADLs differ from typical actions in that they can involve long-scale temporal structure (making tea can take a few minutes) and complex object interactions (a fridge looks different when its door is open). We develop novel representations including (1) temporal pyramids, which generalize the well-known spatial pyramid to approximate temporal correspondence when scoring a model and (2) composite object models that exploit the fact that objects look different when being interacted with. We perform an extensive empirical evaluation and demonstrate that our novel representations produce a two-fold improvement over traditional approaches. Our analysis suggests that real-world ADL recognition is “all about the objects,” and in particular, “all about the objects being interacted with.”
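The temporal pyramid in (1) generalizes spatial pyramid pooling along the time axis: pool frame-level features over the whole clip, then over progressively finer temporal segments, and concatenate. A minimal NumPy sketch; mean pooling and the level count are illustrative assumptions, as the paper's exact pooling choices may differ.

import numpy as np

def temporal_pyramid(frame_feats, levels=3):
    """Pool T x D frame features over 1, 2, 4, ... temporal segments
    and concatenate, giving a D * (2**levels - 1) dimensional vector."""
    T, _ = frame_feats.shape
    blocks = []
    for level in range(levels):
        bounds = np.linspace(0, T, 2 ** level + 1).astype(int)
        for s, e in zip(bounds[:-1], bounds[1:]):
            blocks.append(frame_feats[s:max(e, s + 1)].mean(axis=0))
    return np.concatenate(blocks)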

Citation Context

...approximately “binarize” x, so that it softly encodes the presence or lack thereof of object i (inspired by the clipping post-processing step in SIFT [26]). We experimented with various histogram kernels [41], but found a simple linear kernel defined on an L1-normalized feature to work well. 4. Active object models Recognizing objects undergoing hand manipulations is a crucial aspect of wearable ADL recog...

Object Recognition as Ranking Holistic Figure-Ground Hypotheses

by Fuxin Li, João Carreira, Cristian Sminchisescu - In CVPR, 2010
"... We present an approach to visual object-class recognition and segmentation based on a pipeline that combines multiple, holistic figure-ground hypotheses generated in a bottom-up, object independent process. Decisions are performed based on continuous estimates of the spatial overlap between image se ..."
Abstract - Cited by 55 (13 self) - Add to MetaCart
We present an approach to visual object-class recognition and segmentation based on a pipeline that combines multiple, holistic figure-ground hypotheses generated in a bottom-up, object independent process. Decisions are performed based on continuous estimates of the spatial overlap between image segment hypotheses and each putative class. We differ from existing approaches not only in our seemingly unreasonable assumption that good object-level segments can be obtained in a feed-forward fashion, but also in framing recognition as a regression problem. Instead of focusing on a one-vs-all winning margin that can scramble ordering inside the non-maximum (non-winning) set, learning produces a globally consistent ranking with close ties to segment quality, hence to the extent to which entire object or part hypotheses spatially overlap with the ground truth. We demonstrate results beyond the current state of the art for image classification, object detection and semantic segmentation, in a number of challenging datasets including Caltech-101, ETHZ-Shape and PASCAL VOC 2009.

Citation Context

...appear to be the only practical choice for learning. However, random Fourier approximations can be used to transform the features linearly, but still accurately approximate non-linear similarity measures [48, 4, 60]. The random Fourier methodology takes the initial kernel and generates a new set of features based on randomly sampling multiple components from its Fourier transform. A linear regressor on the trans...
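In its standard form (due to Rahimi and Recht), the random Fourier recipe described above is a two-line construction for shift-invariant kernels: draw frequencies from the kernel's Fourier transform and random phases, then apply a cosine. A minimal NumPy sketch for the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2), whose Fourier transform is itself Gaussian; parameter names and defaults are ours.

import numpy as np

def rff_gaussian(X, n_features=512, gamma=1.0, seed=0):
    """Map n x d data X to n x n_features random Fourier features so
    that z(x) . z(y) approximates exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

A linear classifier or regressor trained on these features then approximates its kernelized counterpart at linear cost, which is the point made in the citation context above.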

Good Practice in Large-Scale Learning for Image Classification

by Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid - IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013
"... We benchmark several SVM objective functions for large-scale image classification. We consider one-vs-rest, multi-class, ranking, and weighted approximate ranking SVMs. A comparison of online and batch methods for optimizing the objectives shows that online methods perform as well as batch methods i ..."
Abstract - Cited by 53 (6 self) - Add to MetaCart
We benchmark several SVM objective functions for large-scale image classification. We consider one-vs-rest, multi-class, ranking, and weighted approximate ranking SVMs. A comparison of online and batch methods for optimizing the objectives shows that online methods perform as well as batch methods in terms of classification accuracy, but with a significant gain in training speed. Using stochastic gradient descent, we can scale the training to millions of images and thousands of classes. Our experimental evaluation shows that ranking-based algorithms do not outperform the one-vs-rest strategy when a large number of training examples are used. Furthermore, the gap in accuracy between the different algorithms shrinks as the dimension of the features increases. We also show that learning through cross-validation the optimal rebalancing of positive and negative examples can result in a significant improvement for the one-vs-rest strategy. Finally, early stopping can be used as an effective regularization strategy when training with online algorithms. Following these “good practices”, we were able to improve the state-of-the-art on a large subset of 10K classes and 9M images of ImageNet from 16.7% Top-1 accuracy to 19.1%.
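The one-vs-rest setup the abstract benchmarks can be sketched compactly with a Pegasos-style stochastic hinge-loss update; the pos_weight knob mirrors the positive/negative rebalancing the paper tunes by cross-validation. The schedule and objective details below are illustrative, not the paper's exact setup.

import numpy as np

def sgd_ovr_svm(X, y, n_classes, lam=1e-5, epochs=10, pos_weight=1.0, seed=0):
    """Train one linear SVM per class with SGD on the weighted hinge
    loss; step size 1/(lam*t) as in Pegasos. y holds integer labels."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    rng = np.random.default_rng(seed)
    for c in range(n_classes):
        t = 0
        for _ in range(epochs):                   # early stopping would truncate this
            for i in rng.permutation(n):
                t += 1
                eta = 1.0 / (lam * t)
                label = 1.0 if y[i] == c else -1.0
                weight = pos_weight if label > 0 else 1.0
                margin = label * (W[c] @ X[i])
                W[c] *= 1.0 - eta * lam           # shrink from the L2 regularizer
                if margin < 1:                    # subgradient step on violated margin
                    W[c] += eta * weight * label * X[i]
    return W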

Citation Context

...A fair amount of work has been devoted to scaling the learning algorithms to large datasets. An explicit mapping of the image descriptors to efficiently deal with non-linear kernels was proposed in [21, 25, 36]. Torresani et al. [33] used compact binary attribute descriptors to handle a large number of images. Sánchez and Perronnin [29] argued that high-dimensional image descriptors are necessary to obtain...

Blocks that shout: Distinctive parts for scene classification

by Mayank Juneja, Andrea Vedaldi, C. V. Jawahar, Andrew Zisserman , 2013
"... The automatic discovery of distinctive parts for an ob-ject or scene class is challenging since it requires simulta-neously to learn the part appearance and also to identify the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We ad-dress this ..."
Abstract - Cited by 52 (1 self) - Add to MetaCart
The automatic discovery of distinctive parts for an object or scene class is challenging since it requires simultaneously learning the part appearance and identifying the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We address this problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM. In this manner, additional part instances are discovered and aligned reliably before being considered as training examples. We also propose entropy-rank curves as a means of evaluating the distinctiveness of parts shareable between categories and use them to select useful parts out of a set of candidates. We apply the new representation to the task of scene categorisation on the MIT Scene 67 benchmark. We show that our method can learn parts which are significantly more informative and for a fraction of the cost, compared to previous part-learning methods such as Singh et al. [28]. We also show that a well constructed bag of words or Fisher vector model can substantially outperform the previous state-of-the-art classification performance on this data.

Citation Context

...3. Learning and classification Learning uses the PEGASOS SVM [27] algorithm, a linear SVM solver. In order to use non-linear additive kernels instead of the linear one, the χ² explicit feature map of [33] is used (the bag of parts and bag of words histograms are L1-normalized). Using the feature map increases the dimension of the input feature vector threefold. For the IFV encoding, we use square-roo...
