Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel
Cited by 7 (0 self)
Common visual codebook generation methods used in a Bag of Visual words model, e.g. k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks with histogram features. In this paper, we demonstrate that HIK can also be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as k-means. The HIK codebook has consistently higher recognition accuracy over k-means codebooks by 2-4%. In addition, we propose a one-class SVM formulation to create more effective visual code words which can achieve even higher accuracy. The proposed method has established new state-of-the-art performance numbers for 3 popular benchmark datasets on object and scene recognition. In addition, we show that the standard k-median clustering method can be used for visual codebook generation and can act as a compromise between HIK and k-means approaches.
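As a rough illustration of the idea in this abstract, the sketch below computes the histogram intersection kernel and uses it in the assignment step of generic kernel k-means, where the distance from a point to a cluster mean is evaluated in the kernel-induced feature space. This is textbook kernel k-means, not the paper's fast algorithm; the function names and the toy histograms are assumptions made for illustration.

```python
def histogram_intersection(h1, h2):
    """HIK: sum of the bin-wise minima of two non-negative histograms."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


def kernel_distance_to_cluster(x, cluster, kernel=histogram_intersection):
    """Squared distance from x to the mean of `cluster` in kernel feature space.

    Uses the identity ||phi(x) - m||^2 = k(x, x) - 2 * mean_y k(x, y)
    + mean_{y, z} k(y, z), with y and z ranging over the cluster members.
    """
    n = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, y) for y in cluster) / n
    within_term = sum(kernel(y, z) for y in cluster for z in cluster) / (n * n)
    return self_term - 2.0 * cross_term + within_term


def assign(x, clusters, kernel=histogram_intersection):
    """Index of the cluster whose feature-space mean is closest to x."""
    distances = [kernel_distance_to_cluster(x, c, kernel) for c in clusters]
    return distances.index(min(distances))


# Toy L1-normalised histograms (hypothetical data).
cluster_a = [[0.6, 0.4, 0.0], [0.4, 0.6, 0.0]]
cluster_b = [[0.0, 0.1, 0.9]]
query = [0.5, 0.5, 0.0]
print(assign(query, [cluster_a, cluster_b]))
```

The naive within-cluster term costs quadratic time in the cluster size; an efficient implementation would need to avoid recomputing it for every query.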
Towards Semantic Embedding in Visual Vocabulary
Cited by 2 (0 self)
Visual vocabulary serves as a fundamental component in many computer vision tasks, such as object recognition, visual search, and scene modeling. While state-of-the-art approaches build visual vocabulary based solely on visual statistics of local image patches, the correlative image labels are left unexploited in generating visual words. In this work, we present a semantic embedding framework to integrate semantic information from Flickr labels for supervised vocabulary construction. Our main contribution is a Hidden Markov Random Field modeling to supervise feature space quantization, with specialized considerations to label correlations: Local visual features are modeled as an Observed Field, which follows visual metrics to partition feature space. Semantic labels are modeled as a Hidden Field, which imposes generative supervision to the Observed Field with WordNet-based correlation constraints as Gibbs distribution. By simplifying the Markov property in the Hidden Field, both unsupervised and supervised (label independent) vocabularies can be derived from our framework. We validate our approach on two challenging computer vision tasks, with comparisons to the state of the art: (1) Large-scale image search on a Flickr 60,000 database; (2) Object recognition on the PASCAL VOC database.
Sparse Dictionary-based Representation and Recognition of Action Attributes
Cited by 2 (2 self)
We present an approach for dictionary learning of action attributes via information maximization. We unify the class distribution and appearance information into an objective function for learning a sparse dictionary of action attributes. The objective function maximizes the mutual information between what has been learned and what remains to be learned in terms of appearance information and class distribution for each dictionary item. We propose a Gaussian Process (GP) model for sparse representation to optimize the dictionary objective function. The sparse coding property allows a kernel with a compact support in GP to realize a very efficient dictionary learning process. Hence we can describe an action video by a set of compact and discriminative action attributes. More importantly, we can recognize modeled action categories in a sparse feature space, which can be generalized to unseen and unmodeled action categories. Experimental results demonstrate the effectiveness of our approach in action recognition applications.
Optimizing Visual Vocabularies Using Soft Assignment Entropies
Cited by 1 (1 self)
The state of the art for large database object retrieval in images is based on quantizing descriptors of interest points into visual words. High similarity between matching image representations (as bags of words) is based upon the assumption that matched points in the two images end up in similar words in hard assignment or in similar representations in soft assignment techniques. In this paper we study how ground truth correspondences can be used to generate better visual vocabularies. Matching of image patches can be done e.g. using deformable models or from estimating 3D geometry. For optimization of the vocabulary, we propose minimizing the entropies of soft assignment of points. We base our clustering on hierarchical k-splits. The results from our entropy based clustering are compared with hierarchical k-means. The vocabularies have been tested on real data with decreased entropy and increased true positive rate, as well as better retrieval performance.
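To make the quantity being minimized more concrete, the sketch below computes a soft assignment of one descriptor to a set of visual words and the entropy of that assignment. The Gaussian weighting, the `sigma` parameter, and the function names are assumptions for illustration only; the abstract does not state which soft-assignment scheme or optimization procedure the authors use.

```python
import math


def soft_assignment(descriptor, centers, sigma=1.0):
    """Normalised Gaussian weights of a descriptor over the visual words.

    The Gaussian form is an assumed, commonly used soft-assignment scheme.
    """
    sq_dists = [sum((a - b) ** 2 for a, b in zip(descriptor, c)) for c in centers]
    # Subtract the smallest distance before exponentiating for numerical stability.
    smallest = min(sq_dists)
    weights = [math.exp(-(d - smallest) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


centers = [[1.0, 0.0], [-1.0, 0.0]]  # hypothetical two-word vocabulary
ambiguous = soft_assignment([0.0, 0.0], centers)  # equidistant from both words
confident = soft_assignment([1.0, 0.0], centers)  # lies exactly on the first word
print(entropy(ambiguous), entropy(confident))
```

A descriptor that sits between words yields a high-entropy assignment, while one close to a single word yields a low-entropy assignment, which is the behavior a vocabulary optimized for low entropy would favor.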
Image Matching with Distinctive Visual Vocabulary
Cited by 1 (0 self)
In this paper we propose an image indexing and matching algorithm that relies on selecting distinctive high dimensional features. In contrast with conventional techniques that treated all features equally, we claim that one can benefit significantly from focusing on distinctive features. We propose a bag-of-words algorithm that combines the feature distinctiveness in visual vocabulary generation. Our approach compares favorably with the state of the art in image matching tasks on the University of Kentucky Recognition Benchmark dataset and on an indoor localization dataset. We also show that our approach scales up more gracefully on a large scale Flickr dataset.
Fall Term
Visual recognition (e.g., object, scene and action recognition) is an active area of research in computer vision due to its increasing number of real-world applications such as video (image) indexing and search, intelligent surveillance, human-machine interaction, robot navigation, etc. Effective modeling of the objects, scenes and actions is critical for visual recognition. Recently, bag of visual words (BoVW) representation, in which the image patches or video cuboids are quantized into visual words (i.e., mid-level features) based on their appearance similarity using clustering, has been widely and successfully explored. The advantages of this representation are: no explicit detection of objects or object parts and their tracking are required; the representation is somewhat tolerant to within-class deformations, and it is efficient for matching. However, the performance of the BoVW is sensitive to the size of the visual vocabulary. Therefore, computationally expensive cross-validation is needed to find the appropriate quantization granularity. This limitation is partially due to the fact that the visual words are not semantically meaningful. This limits the effectiveness and compactness of the representation.
A Discriminative Key Pose Sequence Model for Recognizing Human Interactions
In this paper we develop a model for recognizing human interactions – activity recognition with multiple actors. An activity is modeled with a sequence of key poses, important atomic-level actions performed by the actors. Spatial arrangements between the actors are included in the model, as is a strict temporal ordering of the key poses. An exemplar representation is used to model the variability in the instantiation of key poses. Quantitative results that form a new state-of-the-art on the benchmark UT-Interaction dataset are presented, along with results on a subset of the TRECVID dataset.
Supervised Feature Quantization with Entropy Optimization
Feature quantization is a crucial component for efficient large scale image retrieval and object recognition. By quantizing local features into visual words, one hopes that features that match each other obtain the same word ID. Then, similarities between images can be measured with respect to the corresponding histograms of visual words. Given the appearance variations of local features, traditional quantization methods do not take into account the distribution of matched features. In this paper, we investigate how to encode additional prior information on the feature distribution via entropy optimization by leveraging ground truth correspondence data. We propose a computationally efficient optimization scheme for large scale vocabulary training. The results from our experiments suggest that entropy-optimized vocabulary performs better than unsupervised quantization methods in terms of recall and precision for feature matching. We also demonstrate the advantage of the optimized vocabulary for image retrieval.
Submodular Dictionary Learning for Sparse Coding
A greedy-based approach to learn a compact and discriminative dictionary for sparse representation is presented. We propose an objective function consisting of two components: entropy rate of a random walk on a graph and a discriminative term. Dictionary learning is achieved by finding a graph topology which maximizes the objective function. By exploiting the monotonicity and submodularity properties of the objective function and the matroid constraint, we present a highly efficient greedy-based optimization algorithm. It is more than an order of magnitude faster than several recently proposed dictionary learning approaches. Moreover, the greedy algorithm gives a near-optimal solution with a (1/2)-approximation bound. Our approach yields dictionaries having the property that feature points from the same class have very similar sparse codes. Experimental results demonstrate that our approach outperforms several recently proposed dictionary learning
Learning Mid-Level features . . .
Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
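The two-step pipeline described in this abstract can be sketched in a few lines. The example below pairs the simplest coding module it mentions, hard vector quantization, with average and max pooling; the codebook, the descriptors, and the function names are hypothetical, and the sparse coding and supervised dictionary variants studied in the paper are not shown.

```python
def hard_code(descriptor, codebook):
    """Coding step: one-hot vector marking the nearest codeword (hard VQ)."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(descriptor, c)) for c in codebook]
    nearest = sq_dists.index(min(sq_dists))
    return [1.0 if i == nearest else 0.0 for i in range(len(codebook))]


def average_pool(codes):
    """Pooling step: per-codeword mean over all coded descriptors in a region."""
    count = len(codes)
    return [sum(column) / count for column in zip(*codes)]


def max_pool(codes):
    """Pooling step: per-codeword maximum over all coded descriptors in a region."""
    return [max(column) for column in zip(*codes)]


codebook = [[0.0, 0.0], [1.0, 1.0]]                   # hypothetical 2-word codebook
descriptors = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]]    # hypothetical local descriptors
codes = [hard_code(d, codebook) for d in descriptors]
print(average_pool(codes))  # normalised histogram of word counts
print(max_pool(codes))      # which words occur at least once
```

With hard assignment, average pooling reduces to the familiar normalized bag-of-words histogram, while max pooling records only whether each word is present in the region.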

