Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation
- In ICCV, 2009
Cited by 23 (8 self)
Image auto-annotation is an important open problem in computer vision. For this task we propose TagProp, a discriminatively trained nearest neighbor model. Tags of test images are predicted using a weighted nearest-neighbor model to exploit labeled training images. Neighbor weights are based on neighbor rank or distance. TagProp allows the integration of metric learning by directly maximizing the log-likelihood of the tag predictions in the training set. In this manner, we can optimally combine a collection of image similarity metrics that cover different aspects of image content, such as local shape descriptors, or global color histograms. We also introduce a word specific sigmoidal modulation of the weighted neighbor tag predictions to boost the recall of rare words. We investigate the performance of different variants of our model and compare to existing work. We present experimental results for three challenging data sets. On all three, TagProp makes a marked improvement as compared to the current state-of-the-art.
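The weighted nearest-neighbor prediction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `predict_tags`, the exponential decay of weights with distance, and the toy data are all assumptions made for illustration, and the learned metric and the word-specific sigmoid are left out.

```python
import math

def predict_tags(distances, neighbor_tags, vocabulary):
    """Score each tag as a weighted vote over labeled neighbors.

    distances: distance from the query image to each neighbor
    neighbor_tags: one set of tags per neighbor, in the same order
    The weights decay exponentially with distance (an assumed form)
    and are normalized to sum to one.
    """
    raw = [math.exp(-d) for d in distances]
    total = sum(raw)
    weights = [r / total for r in raw]
    return {
        tag: sum(w for w, tags in zip(weights, neighbor_tags) if tag in tags)
        for tag in vocabulary
    }

# Hypothetical toy data: two close neighbors and one distant neighbor.
scores = predict_tags(
    distances=[0.0, 0.0, 5.0],
    neighbor_tags=[{"sky", "sea"}, {"sky"}, {"car"}],
    vocabulary=["sky", "sea", "car"],
)
```

In this toy case the tag shared by both close neighbors gets the highest score, and the tag carried only by the distant neighbor gets the lowest.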
Descriptor Learning for Efficient Retrieval
Cited by 12 (0 self)
Many visual search and matching systems represent images using sparse sets of “visual words”: descriptors that have been quantized by assignment to the best-matching symbol in a discrete vocabulary. Errors in this quantization procedure propagate throughout the rest of the system, either harming performance or requiring correction using additional storage or processing. This paper aims to reduce these quantization errors at source, by learning a projection from descriptor space to a new Euclidean space in which standard clustering techniques are more likely to assign matching descriptors to the same cluster, and non-matching descriptors to different clusters. To achieve this, we learn a non-linear transformation model by minimizing a novel margin-based cost function, which aims to separate matching descriptors from two classes of non-matching descriptors. Training data is generated automatically by leveraging geometric consistency. Scalable, stochastic gradient methods are used for the optimization. For the case of particular object retrieval, we demonstrate impressive gains in performance on a ground truth dataset: our learnt 32-D descriptor without spatial re-ranking outperforms a baseline method using 128-D SIFT descriptors with spatial re-ranking.
Beyond Simple Features: A Large-Scale Feature Search Approach to Unconstrained Face Recognition
Cited by 6 (1 self)
Many modern computer vision algorithms are built atop a set of low-level feature operators (such as SIFT [1], [2]; HOG [3], [4]; or LBP [5], [6]) that transform raw pixel values into a representation better suited to subsequent processing and classification. While the choice of feature representation is often not central to the logic of a given algorithm, the quality of the feature representation can have critically important implications for performance. Here, we demonstrate a large-scale feature search approach to generating new, more powerful feature representations in which a multitude of complex, nonlinear, multilayer neuromorphic feature representations are randomly generated and screened to find those best suited for the task at hand. In particular, we show that a brute-force search can generate representations that, in combination with standard machine learning blending techniques, achieve state-of-the-art performance on the Labeled Faces in the Wild (LFW) [7] unconstrained face recognition challenge set. These representations outperform previous state-of-the-art approaches, in spite of requiring less training data and using a conceptually simpler machine learning backend. We argue that such large-scale-search-derived feature sets can play a synergistic role with other computer vision approaches by providing a richer base of features with which to work.
Multiple instance metric learning from automatically labeled bags of faces
- In Proc. ECCV, 2010
Cited by 4 (1 self)
Metric learning aims at finding a distance that approximates a task-specific notion of semantic similarity. Typically, a Mahalanobis distance is learned from pairs of data labeled as being semantically similar or not. In this paper, we learn such metrics in a weakly supervised setting where “bags” of instances are labeled with “bags” of labels. We formulate the problem as a multiple instance learning (MIL) problem over pairs of bags. If two bags share at least one label, we label the pair positive, and negative otherwise. We propose to learn a metric using those labeled pairs of bags, leading to MildML, for multiple instance logistic discriminant metric learning. MildML iterates between updates of the metric and selection of putative positive pairs of examples from positive pairs of bags. To evaluate our approach, we introduce a large and challenging data set, Labeled Yahoo! News, which we have manually annotated and which contains 31147 detected faces of 5873 different people in 20071 images. We group the faces detected in an image into a bag, and group the names detected in the caption into a corresponding set of labels. When the labels come from manual annotation, we find that MildML using the bag-level annotation performs as well as fully supervised metric learning using instance-level annotation. We also consider performance in the case of automatically extracted labels for the bags, where some of the bag labels do not correspond to any example in the bag. In this case MildML works substantially better than relying on noisy instance-level annotations derived from the bag-level annotation by resolving face-name associations in images with their captions.
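The bag-pair labeling rule stated in this abstract (a pair is positive when the two bags share at least one label) can be sketched as follows. This covers only that rule, not MildML itself; the function name `label_bag_pairs` and the example names are hypothetical.

```python
from itertools import combinations

def label_bag_pairs(bag_labels):
    """Label every unordered pair of bags.

    bag_labels: one set of labels per bag (e.g. names found in a caption)
    Returns a dict mapping (i, j) with i < j to True when the two bags
    share at least one label, and to False otherwise.
    """
    return {
        (i, j): bool(bag_labels[i] & bag_labels[j])
        for i, j in combinations(range(len(bag_labels)), 2)
    }

# Hypothetical caption names for four images; the last has no detected names.
pairs = label_bag_pairs([{"Alice", "Bob"}, {"Bob"}, {"Carol"}, set()])
```

Under this rule a bag with no labels never forms a positive pair, since its intersection with any other bag is empty.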
Automatically Identifying Join Candidates in the Cairo Genizah
Cited by 3 (3 self)
A join is a set of manuscript-fragments that are known to originate from the same original work. The Cairo Genizah is a collection containing approximately 250,000 fragments of mainly Jewish texts discovered in the late 19th century. The fragments are today spread out in libraries and private collections worldwide, and there is an ongoing effort to document and catalogue all extant fragments. The task of finding joins is currently conducted manually by experts, and presumably only a small fraction of the existing joins have been discovered. In this work, we study the problem of automatically finding candidate joins, so as to streamline the task. The proposed method is based on a combination of local descriptors and learning techniques. To evaluate the performance of various join-finding methods, without relying on the availability of human experts, we construct a benchmark dataset that is modeled on the Labeled Faces in the Wild benchmark for face recognition. Using this benchmark, we evaluate several alternative image representations and learning techniques. Finally, a set of newly-discovered join-candidates has been identified using our method and validated by a human expert.
INRIA-LEAR's participation to ImageCLEF 2009
- In CLEF working notes, 2009
Cited by 2 (1 self)
We participated in the Photo Annotation and Photo Retrieval tasks of ImageCLEF 2009. For the Photo Annotation task we compared TagProp, SVMs, and logistic discriminant (LD) models. TagProp is a nearest-neighbor based system that learns a distance measure between images to define the neighbors. In the second system a separate SVM is trained for each annotation word. The third system treats mutually exclusive terms more naturally by assigning probabilities to the mutually exclusive terms that sum to one. The experiments show that (i) both TagProp and SVMs benefit from a distance combination learned with TagProp, (ii) the TagProp system, which has very few trainable parameters, performs somewhat worse than SVM in terms of EEC and AUC but better than the SVM runs in terms of the hierarchical image annotation score (HS), and (iii) LD is best in terms of HS and close to the SVM run in terms of EEC and AUC. In our experiments for the Photo Retrieval task we compare a system using only visual search with systems that include a simple form of text matching and/or duplicate removal to increase the diversity in the search results. For the visual search we use our image matching system, which is efficient and yields state-of-the-art image retrieval results. From the evaluation of the results we find that adding some form of text matching is crucial for retrieval, and that (unexpectedly) the duplicate removal step did not improve results.
Learning Class-Specific Image Transformations with Higher-Order Boltzmann Machines
Cited by 2 (0 self)
In this paper, we examine the problem of learning a representation of image transformations specific to a complex object class, such as faces. Learning such a representation for a specific object class would allow us to perform improved, pose-invariant visual verification, such as unconstrained face verification. We build on the method of using factored higher-order Boltzmann machines to model such image transformations. Using this approach will potentially enable us to use the model as one component of a larger deep architecture. This will allow us to use the feature information in an ordinary deep network to perform better modeling of transformations, and to infer pose estimates from the hidden representation. We focus on applying these higher-order Boltzmann machines to the NORB 3D objects data set and the Labeled Faces in the Wild face data set. We first show two different approaches to using this method on these object classes, demonstrating that while some useful transformation information can be extracted, ultimately the simple direct application of these models to higher-resolution, complex object classes is insufficient to achieve improved visual verification performance. Instead, we believe that this method should be integrated into a larger deep architecture, and show initial results using the higher-order Boltzmann machine as the second layer of a deep architecture, above a first layer convolutional RBM.
Enforcing Similarity Constraints with Integer Programming for Better Scene Text Recognition
Cited by 2 (1 self)
The recognition of text in everyday scenes is made difficult by viewing conditions, unusual fonts, and lack of linguistic context. Most methods integrate a priori appearance information and some sort of hard or soft constraint on the allowable strings. Weinman and Learned-Miller [14] showed that the similarity among characters, as a supplement to the appearance of the characters with respect to a model, could be used to improve scene text recognition. In this work, we make further improvements to scene text recognition by taking a novel approach to the incorporation of similarity. In particular, we train a similarity expert that learns to classify each pair of characters as equivalent or not. After removing logical inconsistencies in an equivalence graph, we formulate the search for the maximum likelihood interpretation of a sign as an integer program. We incorporate the equivalence information as constraints in the integer program and build an optimization criterion out of appearance features and character bigrams. Finally, we take the optimal solution from the integer program, and compare all “nearby” solutions using a probability model for strings derived from search engine queries. We demonstrate word error reductions of more than 30% relative to previous methods on the same data set.
Face Verification Using the LARK Representation
Cited by 2 (0 self)
We present a novel face representation based on locally adaptive regression kernel (LARK) descriptors [1]. Our LARK descriptor measures a self-similarity based on “signal-induced distance” between a center pixel and surrounding pixels in a local neighborhood. By applying principal component analysis (PCA) and a logistic function to LARK consecutively, we develop a new binary-like face representation which achieves state-of-the-art face verification performance on the challenging benchmark “Labeled Faces in the Wild” (LFW) dataset [2]. In the case where training data are available, we employ one-shot similarity (OSS) [3], [4] based on linear discriminant analysis (LDA) [5]. The proposed approach achieves state-of-the-art performance on both the unsupervised setting and the image restrictive training setting (72.23% and 78.90% verification rates, respectively) as a single descriptor representation, with no preprocessing step. As opposed to [4], which combined 30 distances to achieve 85.13%, we achieve comparable performance (85.1%) with only 14 distances while significantly reducing computational complexity.
Learning Hierarchical Representations for Face Verification with Convolutional Deep Belief Networks
Cited by 1 (1 self)
Most modern face recognition systems rely on a feature representation given by a hand-crafted image descriptor, such as Local Binary Patterns (LBP), and achieve improved performance by combining several such representations. In this paper, we propose deep learning as a natural source for obtaining additional, complementary representations. To learn features in high-resolution images, we make use of convolutional deep belief networks. Moreover, to take advantage of global structure in an object class, we develop local convolutional restricted Boltzmann machines, a novel convolutional learning model that exploits the global structure by not assuming stationarity of features across the image, while maintaining scalability and robustness to small misalignments. We also present a novel application of deep learning to descriptors other than pixel intensity values, such as LBP. In addition, we compare performance of networks trained using unsupervised learning against networks with random filters, and empirically show that learning weights not only is necessary for obtaining good multilayer representations, but also provides robustness to the choice of the network architecture parameters. Finally, we show that a recognition system using only representations obtained from deep learning can achieve comparable accuracy with a system using a combination of hand-crafted image descriptors. Moreover, by combining these representations, we achieve state-of-the-art results on a real-world face verification database.

