Results 1 - 10 of 1,010
Rich feature hierarchies for accurate object detection and semantic segmentation
"... Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex en-semble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scala ..."
Abstract
-
Cited by 251 (23 self)
- Add to MetaCart
(Show Context)
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. Source code for the complete system is available at
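The two-stage design described above (proposals first, CNN features second) can be sketched in a few lines. The sketch below is illustrative only, not the authors' released system: propose_regions and the per-class SVMs are hypothetical stand-ins, and torchvision's AlexNet substitutes for the fine-tuned network.

    import torch
    import torchvision.models as models
    import torchvision.transforms.functional as TF

    # Stand-in for the fine-tuned detection network described in the paper.
    cnn = models.alexnet(weights="IMAGENET1K_V1").eval()

    def cnn_features(crop):
        # Warp the region crop to the network's fixed input size, then
        # read off the flattened convolutional features.
        x = TF.to_tensor(TF.resize(crop, [224, 224])).unsqueeze(0)
        with torch.no_grad():
            return cnn.avgpool(cnn.features(x)).flatten(1).numpy()

    def detect(image, propose_regions, svms):
        # propose_regions: hypothetical bottom-up proposal generator
        # (e.g. selective search); svms: dict of per-class linear SVMs.
        detections = []
        for box in propose_regions(image):
            f = cnn_features(image.crop(box))   # CNN features per proposal
            for cls, svm in svms.items():
                score = svm.decision_function(f)[0]
                if score > 0:
                    detections.append((cls, box, score))
        return detections  # non-maximum suppression would follow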
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
"... We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks an ..."
Abstract
-
Cited by 203 (22 self)
- Add to MetaCart
(Show Context)
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks, and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state of the art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters, to enable vision researchers to conduct experiments with deep representations across a range of visual concept learning paradigms.
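The transfer recipe the abstract describes can be sketched as follows, assuming torchvision's ImageNet AlexNet as a stand-in for the paper's network and scikit-learn for the downstream classifier: read activations from the penultimate fully connected layer and fit a simple classifier for the new task.

    import torch
    import torchvision.models as models
    from sklearn.linear_model import LogisticRegression

    net = models.alexnet(weights="IMAGENET1K_V1").eval()

    def activation_features(batch):
        # Activations of the penultimate fully connected layer (a
        # DeCAF7-like feature): run the conv stack, then every classifier
        # layer except the final class-score layer.
        with torch.no_grad():
            x = net.avgpool(net.features(batch)).flatten(1)
            for layer in net.classifier[:-1]:
                x = layer(x)
            return x.numpy()

    # images: (N, 3, 224, 224) float tensor for the new task; labels: (N,)
    # clf = LogisticRegression(max_iter=1000).fit(activation_features(images), labels)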
Caffe: Convolutional architecture for fast feature embedding
, 2014
"... Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose conv ..."
Abstract
-
Cited by 192 (8 self)
- Add to MetaCart
(Show Context)
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment, from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
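The separation of model representation from implementation mentioned above is visible in the Python binding: the network is defined in a plain-text prototxt and only loaded and run from code. A minimal sketch follows; the file names are placeholders, and the BVLC reference models ship real definitions and weights.

    import numpy as np
    import caffe

    caffe.set_mode_gpu()                      # or caffe.set_mode_cpu()
    net = caffe.Net("deploy.prototxt",        # model representation (text)
                    "weights.caffemodel",     # learned parameters (binary)
                    caffe.TEST)

    image = np.random.rand(1, 3, 227, 227).astype(np.float32)  # stand-in input
    net.blobs["data"].data[...] = image
    out = net.forward()                       # runs on the device selected above
    print(out["prob"].argmax())               # top class for the reference nets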
Representation learning: A review and new perspectives.
- IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI)
, 2013
"... Abstract-The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can b ..."
Abstract
-
Cited by 173 (4 self)
- Add to MetaCart
(Show Context)
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation, and manifold learning.
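Of the model families the review covers, the autoencoder is the easiest to show concretely: the hidden code is the learned representation, trained so that the input can be reconstructed from it. A minimal sketch, with illustrative sizes and a single training step:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, n_in=784, n_hidden=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
            self.decoder = nn.Linear(n_hidden, n_in)

        def forward(self, x):
            h = self.encoder(x)               # the learned representation
            return self.decoder(h), h         # reconstruction and code

    model = Autoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(32, 784)                   # stand-in minibatch (e.g. MNIST)
    recon, h = model(x)
    loss = nn.functional.mse_loss(recon, x)   # reconstruction objective
    loss.backward()
    opt.step()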
Overfeat: Integrated recognition, localization and detection using convolutional networks
- http://arxiv.org/abs/1312.6229
"... ar ..."
(Show Context)
Very deep convolutional networks for large-scale image recognition
, 2014
"... ar ..."
(Show Context)
Visualizing and understanding convolutional networks
- In Computer Vision–ECCV 2014
, 2014
"... Abstract. Large Convolutional Network models have recently demon-strated impressive classification performance on the ImageNet bench-mark Krizhevsky et al. [18]. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. ..."
Abstract
-
Cited by 133 (3 self)
- Add to MetaCart
(Show Context)
Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al. [18]). However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.
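The paper's deconvnet projections need the full method, but the basic move of inspecting an intermediate feature layer can be sketched with a forward hook. Here torchvision's AlexNet stands in for the paper's model, and the layer index and input are illustrative:

    import torch
    import torchvision.models as models

    net = models.alexnet(weights="IMAGENET1K_V1").eval()
    captured = {}

    def save_activations(module, inputs, output):
        captured["conv5"] = output.detach()

    # features[10] is the last conv layer in torchvision's AlexNet.
    net.features[10].register_forward_hook(save_activations)

    image = torch.rand(1, 3, 224, 224)         # stand-in input image
    net(image)
    maps = captured["conv5"][0]                # (256, 13, 13) feature maps
    strongest = maps.sum(dim=(1, 2)).argmax()  # most active channel
    print("channel", int(strongest), "max response", float(maps[strongest].max()))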
DeepFace: Closing the gap to human-level performance in face verification
- In: IEEE CVPR
, 2014
"... In modern face recognition, the conventional pipeline consists of four stages: detect ⇒ align ⇒ represent ⇒ clas-sify. We revisit both the alignment step and the representa-tion step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face represe ..."
Abstract
-
Cited by 103 (4 self)
- Add to MetaCart
(Show Context)
In modern face recognition, the conventional pipeline consists of four stages: detect ⇒ align ⇒ represent ⇒ classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to date, an identity-labeled dataset of four million facial images belonging to more than 4,000 identities. The learned representations, coupling the accurate model-based alignment with the large facial database, generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.25% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 25% and closely approaching human-level performance.
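The four-stage pipeline named above can be written down as a skeleton. Every helper below is hypothetical, since the paper's 3D alignment model, nine-layer network, and training set are not public:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def verify(img_a, img_b, detect, align_3d, represent, threshold=0.5):
        # detect, align_3d, represent are hypothetical stand-ins for the
        # paper's detector, piecewise-affine 3D alignment, and deep network.
        reps = []
        for img in (img_a, img_b):
            face = detect(img)                 # stage 1: detect
            face = align_3d(face)              # stage 2: align
            reps.append(represent(face))       # stage 3: represent
        return cosine(reps[0], reps[1]) > threshold  # stage 4: classify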
Sequence to sequence learning with neural networks
- in Advances in Neural Information Processing Systems, 2014
"... Deep Neural Networks (DNNs) are powerful models that have achieved excel-lent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approac ..."
Abstract
-
Cited by 76 (7 self)
- Add to MetaCart
(Show Context)
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English-to-French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
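The encoder-decoder structure and the source-reversal trick described above can be sketched in PyTorch. Dimensions, vocabulary sizes, and the teacher-forced forward pass are illustrative, not the paper's settings:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, dim=256, layers=2):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.LSTM(dim, dim, layers, batch_first=True)
            self.decoder = nn.LSTM(dim, dim, layers, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, src, tgt):
            src = src.flip(dims=[1])           # reverse source word order
            _, state = self.encoder(self.src_emb(src))  # fixed-size summary
            dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
            return self.out(dec_out)           # next-token logits

    model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)
    src = torch.randint(0, 10000, (4, 12))     # stand-in token batches
    tgt = torch.randint(0, 10000, (4, 9))
    logits = model(src, tgt)                   # shape (4, 9, 10000)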
Return of the Devil in the Details: Delving Deep into Convolutional Nets
, 2014
"... The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in chal-lenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compar ..."
Abstract
-
Cited by 71 (8 self)
- Add to MetaCart
The latest generation of Convolutional Neural Networks (CNNs) have achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost. Source code and models to reproduce the experiments in the paper are made publicly available.
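The augmentation transfer the paper reports can be sketched independently of whether the feature extractor is deep or shallow: compute features for several views of the image and average them. A minimal sketch, where the extractor is any batch-to-features function:

    import torch

    def augmented_features(image, extract):
        # image: (3, H, W) tensor; extract: model mapping a batch of images
        # to a batch of feature vectors (deep or shallow).
        views = [image, torch.flip(image, dims=[2])]   # add a horizontal flip
        batch = torch.stack(views)                     # crops could be added too
        with torch.no_grad():
            feats = extract(batch)
        return feats.mean(dim=0)                       # average over views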