## Deep Learning with Kernel Regularization for Visual Recognition

Citations: 9 (3 self)

### BibTeX

@MISC{Yu_deeplearning,
  author = {Kai Yu and Wei Xu and Yihong Gong},
  title  = {Deep Learning with Kernel Regularization for Visual Recognition},
  year   = {}
}

### Abstract

In this paper we aim to train deep neural networks for rapid visual recognition. The task is highly challenging, largely due to the lack of a meaningful regularizer on the functions realized by the networks. We propose a novel regularization method that takes advantage of kernel methods, where a given kernel function represents prior knowledge about the recognition task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate encouraging results on a wide range of recognition tasks, in terms of both accuracy and speed.

### Citations

944 | Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories
- Lazebnik, Schmid, et al.
- 2006
Citation Context: ...bject databases available today, and is probably the most popular benchmark for object recognition. We follow the common setting to train on 15 and 30 images per class and test on the rest. Following [10], we limit the number of test images to 30 per class. The recognition accuracy was normalized by class sizes and evaluated over 5 random data splits. The CNN has the same architecture as the one used ...

732 | Gradient-based learning applied to document recognition
- LeCun, Bottou, Bengio, Haffner
- 1998
Citation Context: ... 4 Visual Recognition by Deep Learning with Kernel Regularization: In the following, we apply the proposed strategy to train a class of deep models, convolutional neural networks (CNNs, [11]), for a range of visual recognition tasks including digit recognition on the MNIST dataset, gender and ethnicity classification on the FRGC face dataset, and object recognition ... Table 1: Percentage error ...

463 | Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories
- Fei-Fei, Fergus, et al.
- 2004
Citation Context: ...

| Method | … | … | 20% | 50% | All |
|---|---|---|---|---|---|
| SVM (RBF) | 22.9 | 16.9 | 14.1 | 11.3 | 10.2 |
| SVM (RBF, Nyström) | 24.7 | 20.6 | 15.8 | 11.9 | 11.1 |
| CNN | 30.0 | 13.9 | 10.0 | 8.2 | 6.3 |
| kCNN | 15.6 | 8.7 | 7.3 | 6.2 | 5.8 |

4.3 Object Recognition on Caltech101 Dataset: Caltech101 [7] contains 9144 images from 101 object categories and a background category. It is considered one of the most diverse object databases available today, and is probably the most popular benchmark for ob...

337 | Reducing the dimensionality of data with neural networks
- Hinton, Salakhutdinov
- 2006
Citation Context: ...ble. Several authors have recently proposed training methods by using unlabeled data. These methods perform a greedy layer-wise pre-training using unlabeled data, followed by a supervised fine-tuning [9, 3, 15]. Even though the strategy notably improves the performance, to date, the best reported recognition accuracy on popular benchmarks such as Caltech101 by deep models is still largely behind the results...

319 | A framework for learning predictive structures from multiple tasks and unlabeled data
- Ando, Zhang
- 2005
Citation Context: ...= Σ:,1 V D^(−1/2), where Σ:,1 is the m × m1 kernel matrix between all the m examples and the subset of size m1.

Algorithm 1 Stochastic Gradient Descent
repeat
  Generate a number a from the uniform distribution on [0, 1]
  if a < n/(m + n) then
    Randomly pick a sample i ∈ {1, · · · , n} for L1, and update the parameters by [β, θ] ← [β, θ] − ε ∂L1(xi, β, θ)/∂[β, θ]
  else
    Randomly pick a sample i ∈ {1, · · · , m} for L2, and update ...
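The alternating update in Algorithm 1 can be sketched in a few lines. The names `alternating_sgd`, `grad_L1`, and `grad_L2` are assumptions for illustration, and the quadratic losses in the usage note are toy stand-ins, not the paper's actual objectives:

```python
import numpy as np

def alternating_sgd(grad_L1, grad_L2, params, n, m, eps=0.01, steps=1000, seed=None):
    """Sketch of Algorithm 1: with probability n/(m+n) take a stochastic
    gradient step on the supervised loss L1 (n labeled examples),
    otherwise on the kernel regularizer L2 (m examples)."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        if rng.uniform() < n / (m + n):
            i = int(rng.integers(n))          # pick a sample for L1
            params = params - eps * grad_L1(i, params)
        else:
            i = int(rng.integers(m))          # pick a sample for L2
            params = params - eps * grad_L2(i, params)
    return params
```

For instance, if both gradients are ∂p²/∂p = 2p, each step contracts the iterate toward zero regardless of which branch fires.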

283 | Using the Nyström method to speed up kernel machines
- Williams, Seeger
- 2001
Citation Context: ...g incomplete Cholesky decomposition on an m × m kernel matrix Σ. In the third case, when m is so large that the matrix decomposition cannot be computed in main memory, we apply the Nyström method [19]: we first randomly sample m1 examples, p < m1 < m, such that the computed kernel matrix Σ1 can be decomposed in memory. Let V D V⊤ be the p-rank eigenvalue decomposition of Σ1; then the p-rank deco...
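A minimal sketch of the Nyström construction described above: from the cross-kernel Σ:,1 and the subset kernel Σ1 ≈ V D V⊤, the factor U = Σ:,1 V D^(−1/2) gives U Uᵀ ≈ Σ. The function name and the RBF kernel in the test are assumptions for illustration:

```python
import numpy as np

def nystrom_features(K_cross, K_sub, p):
    """Nystrom sketch: K_cross is the m x m1 kernel matrix between all m
    examples and the m1 sampled ones; K_sub is the m1 x m1 kernel on the
    sample.  With the p-rank eigendecomposition K_sub ~ V D V^T, the
    factor U = K_cross @ V @ D^(-1/2) satisfies U @ U.T ~ full kernel."""
    d, V = np.linalg.eigh(K_sub)             # eigenvalues in ascending order
    top = np.argsort(d)[::-1][:p]            # keep the p largest components
    d, V = d[top], V[:, top]
    return K_cross @ V @ np.diag(1.0 / np.sqrt(d))
```

When the sample is the whole set (m1 = m) and p = m, this recovers the exact factorization U Uᵀ = Σ.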

208 | Neighbourhood components analysis
- Goldberger, Roweis, et al.
- 2005
Citation Context: ...h using unlabeled data too, our work differs in its emphasis on leveraging prior knowledge, which suggests that it can be combined with those approaches, including neighborhood component analysis [8], to further enhance deep learning. This work is also related to transfer learning [1], which uses auxiliary tasks to learn a linear feature mapping. The work here is motivated differently and dea...

193 | Object recognition with features inspired by visual cortex
- Serre, Wolf, et al.
Citation Context: ...ernel regularization are visualized in Fig. 1, which helps to understand the difference made by kCNN. 5 Related Work, Discussion, and Conclusion: Recent work on deep visual recognition models includes [17, 12, 15]. In [17] and [12] the first layer consists of hard-wired Gabor filters, and then a large number of patches are sampled from the second layer and used as the basis of the representation which is ...

184 | Greedy layer-wise training of deep networks
- Bengio, Lamblin, et al.
Citation Context: ...ble. Several authors have recently proposed training methods by using unlabeled data. These methods perform a greedy layer-wise pre-training using unlabeled data, followed by a supervised fine-tuning [9, 3, 15]. Even though the strategy notably improves the performance, to date, the best reported recognition accuracy on popular benchmarks such as Caltech101 by deep models is still largely behind the results...

184 | Self-taught learning: Transfer learning from unlabeled data
- Raina, Battle, et al.
- 2007
Citation Context: ...ng some hand-crafted features computed from the input data, which corresponds to the case of a linear kernel function; (ii) U can be the result of some unsupervised learning (e.g. the self-taught learning [14] based on sparse coding) applied on a large set of unlabeled data; (iii) if a nonlinear kernel function is available, U can be obtained by applying incomplete Cholesky decomposition on an m × m kerne...
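Case (iii) amounts to factoring the kernel matrix as Σ = U Uᵀ. A minimal sketch using a full Cholesky factorization (the paper uses the incomplete variant to obtain a low-rank m × p factor); the function name and the small jitter term are assumptions added for illustration and numerical stability:

```python
import numpy as np

def kernel_factor(Sigma, jitter=1e-10):
    """Return a lower-triangular U with U @ U.T == Sigma (up to jitter).
    Full Cholesky shown for simplicity; incomplete Cholesky would stop
    after p columns to produce an m x p factor instead."""
    m = Sigma.shape[0]
    return np.linalg.cholesky(Sigma + jitter * np.eye(m))
```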

152 | Cluster kernels for semi-supervised learning
- Chapelle, Weston, et al.
- 2002
Citation Context: ...en 200 filters of size 5×5, giving rise to 200-dimensional features that are fed to the output layer. Two nonlinear kernels are used: (1) an RBF kernel, and (2) a graph kernel on a 10-nearest-neighbor graph [5]. SVM with an RBF kernel reported very good results on MNIST [11], while the graph kernel has shown excellent performance on the USPS digit data [5]. We perform a 600-dimension Cholesky decomposition on the who...

140 | Multiclass Object Recognition with Sparse, Localized Features
- Mutch, Lowe
Citation Context: ...same architecture as the one used in the FRGC experiment. The nonlinear kernel is the spatial pyramid matching (SPM) kernel developed in [10]. Tab. 4 shows our results together with those reported in [12, 15] using deep hierarchical architectures. The task is much more challenging than the previous three tasks for CNNs, because in each category the data size is very small while the visual patterns are hig...

108 | Unsupervised learning of invariant feature hierarchies with applications to object recognition
- Ranzato, Huang, et al.
- 2007
Citation Context: ...ble. Several authors have recently proposed training methods by using unlabeled data. These methods perform a greedy layer-wise pre-training using unlabeled data, followed by a supervised fine-tuning [9, 3, 15]. Even though the strategy notably improves the performance, to date, the best reported recognition accuracy on popular benchmarks such as Caltech101 by deep models is still largely behind the results...

66 | Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices
- Fazel, Hindi, et al.
- 2003
Citation Context: ...mented by SGD. In addition, due to the equivalence between minimizing L3(ψ, θ) and minimizing ∑_{k=1}^{q} log(∑_{i=1}^{m} φ_{i,k}² + δ) + const, the model encourages the latent representations φi to be low-rank [6]. The SGD is described in Algorithm 1. In practice, the kernel matrix Σ = U U⊤ that represents domain knowledge can be obtained in three different ways: (i) in the easiest case, U is directly availabl...
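The low-rank surrogate above is cheap to evaluate directly. A sketch, where `Phi` stands for the m × q matrix of latent representations (the function and variable names are assumptions):

```python
import numpy as np

def logdet_lowrank_penalty(Phi, delta=1e-3):
    """Smooth low-rank surrogate: sum over columns k of
    log(sum_i Phi[i, k]**2 + delta).  Because log is concave, the
    penalty rewards concentrating energy in few columns, i.e.
    (approximately) low-rank representations."""
    col_energy = (Phi ** 2).sum(axis=0)   # squared norm of each column
    return float(np.log(col_energy + delta).sum())
```

For the same total energy, a matrix whose energy is concentrated in one column gets a lower penalty than one with energy spread evenly, which is the low-rank bias the excerpt describes.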

47 | Preliminary face recognition grand challenge results
- Phillips, Flynn, et al.
Citation Context: ...he same kernels. The results are competitive with the state-of-the-art results by [15], and by [18], which uses a different architecture. 4.2 Gender and Ethnicity Recognition on FRGC Dataset: The FRGC 2.0 dataset [13] contains 14714 face images of 568 individuals under various lighting conditions and backgrounds. Besides person identities, each image is annotated with gender and ethnicity, which we put into 3 classes...

34 | Boosting sex identification performance
- Baluja, Rowley
- 2007
Citation Context: ...atures are fed to the output layer. The nonlinear kernel used in this experiment is the RBF kernel computed directly on images, which has demonstrated state-of-the-art accuracy for gender recognition [2]. The results shown in Tab. 2 and Tab. 3 demonstrate that kCNNs significantly boost the recognition accuracy of CNNs for both gender and ethnicity recognition. The difference is prominent when small t...

34 | Deep learning via semi-supervised embedding
- Weston, Ratle, et al.
- 2008
Citation Context: ...

| Method | … | … | … | … | … |
|---|---|---|---|---|---|
| SVM (Graph, Cholesky) | 7.17 | 6.47 | 5.75 | 4.28 | 2.87 |
| CNN | 19.40 | 6.40 | 5.50 | 2.75 | 0.82 |
| kCNN (RBF) | 14.49 | 3.85 | 3.40 | 1.88 | 0.73 |
| kCNN (Graph) | 4.28 | 2.36 | 2.05 | 1.75 | 0.64 |
| CNN (Pretrain) [15] | − | 3.21 | − | − | 0.64 |
| EmbedO CNN [18] | 11.73 | 3.42 | 3.34 | 2.28 | − |
| EmbedI5 CNN [18] | 7.75 | 3.82 | 2.73 | 1.83 | − |
| EmbedA1 CNN [18] | 7.87 | 3.82 | 2.76 | 2.07 | − |

... on the Caltech101 dataset. In each of these tasks, we choose a kernel function that has been repor...

23 | Semi-supervised learning of compact document representations with deep networks
- Ranzato, Szummer
- 2008
Citation Context: ...h layer-wise unsupervised pretraining, followed by supervised fine-tuning [9]. The strategy was subsequently studied for other deep models like CNNs [15], autoassociators [3], and for document coding [16]. In recent work [18], the authors proposed training a deep model jointly with an unsupervised embedding task, which also leads to improved results. Though using unlabeled data too, our work differs a...

21 | Training Hierarchical Feed-Forward Visual Recognition Models using Transfer Learning from Pseudo-Tasks
- Ahmed, Yu, et al.
- 2008

12 | Image classification using ROIs and multiple kernel learning
- Bosch, Zisserman, et al.
Citation Context: ... needs 6 seconds to process a new image on a PC with a 3.0 GHz processor, while in the same amount of time kCNN can process about 240 images. The latest record on Caltech101 combined multiple kernels [4]. We conjecture that kCNN could be further improved by using multiple kernels without sacrificing recognition speed. Conclusion: We proposed using kernels to improve the training of deep models. The a...