## An Analysis of Single-Layer Networks in Unsupervised Feature Learning

Citations: 83 (16 self)

### BibTeX

@MISC{Coates_ananalysis,
  author = {Adam Coates and Honglak Lee and Andrew Y. Ng},
  title = {An Analysis of Single-Layer Networks in Unsupervised Feature Learning},
  year = {}
}

### Abstract

A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several very simple factors, such as the number of hidden nodes in the model, may be as important to achieving high performance as the choice of learning algorithm or the depth of the model. Specifically, we apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to the NORB and CIFAR datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, the number of hidden nodes (features), the step-size (“stride”) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are as critical to achieving high performance as the choice of algorithm itself—so critical, in fact, that when these parameters are pushed to their limits, we are able to achieve state-of-the-art performance on both CIFAR and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve performance beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.0% accuracy, respectively).
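The abstract's best-performing encoder, K-means with the “triangle” activation, is simple enough to sketch. The following is an illustrative reconstruction, not the authors' code: given learned centroids, each input is mapped to f_k = max(0, μ(z) − z_k), where z_k is the Euclidean distance to centroid k and μ(z) is the mean of those distances.

```python
import numpy as np

def triangle_features(X, centroids):
    """K-means 'triangle' encoding: f_k = max(0, mean(z) - z_k), where z_k is
    the Euclidean distance from the input to centroid k.  Distances above the
    mean are zeroed, giving a sparse representation."""
    # pairwise distances, shape (n_samples, n_centroids)
    z = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    mu = z.mean(axis=1, keepdims=True)  # mean distance for each sample
    return np.maximum(0.0, mu - z)

# toy example: two centroids in 2-D, one input closer to the first
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])
X = np.array([[1.0, 0.0]])
F = triangle_features(X, centroids)  # distances are [1, 9], mean 5 -> [[4, 0]]
```

The far centroid contributes zero, so the code is sparse without any explicit sparsity penalty.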

### Citations

1033 | Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories
- Lazebnik, Schmid, et al.
Citation Context ...thm as an alternative unsupervised learning module. K-means has been used less widely in “deep learning” work but has enjoyed wide adoption in computer vision for building codebooks of “visual words” [4, 5, 13, 29], which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features [1]. The effects of pooling and choice of activation function...

975 | Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images
- Olshausen, Field
- 1996
Citation Context ...d. Most have focused on creating new training algorithms to build single-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many...

626 | Visual categorization with bags of keypoints
- Csurka, Dance, et al.
Citation Context ...thm as an alternative unsupervised learning module. K-means has been used less widely in “deep learning” work but has enjoyed wide adoption in computer vision for building codebooks of “visual words” [4, 5, 13, 29], which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features [1]. The effects of pooling and choice of activation function...

582 | A bayesian hierarchical model for learning natural scene categories
- Li, Pietro
- 2005
Citation Context ...thm as an alternative unsupervised learning module. K-means has been used less widely in “deep learning” work but has enjoyed wide adoption in computer vision for building codebooks of “visual words” [4, 5, 13, 29], which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features [1]. The effects of pooling and choice of activation function...

539 | Training products of experts by minimizing contrastive divergence
- Hinton
Citation Context ...ed Boltzmann machine: The restricted Boltzmann machine (RBM) is an undirected graphical model with K binary hidden variables. Sparse RBMs can be trained using the contrastive divergence approximation [7] with the same type of sparsity penalty as the auto-encoders. The training also produces weights W and biases b, and we can use the same feature mapping as the auto-encoder (as in Equation (1))—thus, ...
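The snippet above says the sparse RBM reuses the auto-encoder's feature mapping, f = g(Wx + b) with a logistic sigmoid g (Equation (1) in the paper). A minimal sketch, with toy weights standing in for trained parameters:

```python
import numpy as np

def encode(x, W, b):
    """Feature mapping f = g(Wx + b) with the logistic sigmoid g, the form
    shared (per the snippet) by the sparse auto-encoder and sparse RBM.
    W and b are stand-ins for learned weights and biases."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

W = np.array([[1.0, -1.0], [0.5, 0.5]])  # toy weights (illustrative only)
b = np.zeros(2)
f = encode(np.array([2.0, 2.0]), W, b)   # pre-activations are [0, 2]
```

Because the mapping is identical, features learned by either model can be swapped into the same downstream pipeline.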

513 | Independent component analysis: Algorithms and applications
- Hyvärinen, Oja
- 2000
Citation Context ...ndard deviation of its elements. For visual data, this corresponds to local brightness and contrast normalization. After normalizing each input vector, the entire dataset X may optionally be whitened [9]. While this process is commonly used in deep learning work (e.g., [23]) it is less frequently employed in computer vision. We will present experimental results obtained both with and without whitenin...

499 | A fast learning algorithm for deep belief nets
- Hinton, Osindero, et al.
Citation Context ...h as classification. Current solutions typically learn multi-level representations by greedily “pre-training” several layers of features, one layer at a time, using an unsupervised learning algorithm [10, 8, 16]. For each of these layers a number of design parameters are chosen: the number of features to learn, the locations where these features will be computed, and how to encode the inputs and outputs of t...

499 | The independent components of natural scenes are edge filters
- Bell, Sejnowski
- 1997
Citation Context ...ing algorithms, however, we see that whitening is a crucial pre-process since the clustering algorithms cannot handle the correlations in the data. In our experiments, we use Zero-phase whitening [2]; our GMM implementation uses diagonal covariances. [Figure: (a) K-means (with and without whitening), (b) GMM (with and wit...]

219 | Efficient sparse coding algorithms
- Lee, Battle, et al.
- 2006
Citation Context ...d. Most have focused on creating new training algorithms to build single-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many...

212 | Object categorization by learned universal visual dictionary
- Winn, Criminisi, et al.

194 | Linear spatial pyramid matching using sparse coding for image classification
- Yang, Yu, et al.
- 2009
Citation Context ...one hand, these results are somewhat unsurprising. For instance, it is widely held that highly over-complete feature representations tend to give better performance than smaller-sized representations [30], and similarly with small strides between features [19]. However, the main contribution of our work is demonstrating that these considerations may, in fact, be critical to the success of feature lear...

179 | Sampling strategies for bag-of-features image classification
- Nowak, Jurie, et al.
- 2006
Citation Context ...nstance, it is widely held that highly over-complete feature representations tend to give better performance than smaller-sized representations [30], and similarly with small strides between features [19]. However, the main contribution of our work is demonstrating that these considerations may, in fact, be critical to the success of feature learning algorithms—potentially more important even than the...

173 | Learning Methods for Generic Object Recognition with Invariance to Pose and
- LeCun, Huang, et al.
- 2004
Citation Context ...chosen through cross-validation, thus increasing running times dramatically. Though it is true that recently introduced algorithms have consistently shown improvements on benchmark datasets like NORB [14] and CIFAR [11], there are several other factors that affect the final performance of a feature learning system. Specifically, there are many “meta-parameters” defining the network architecture, suc...

170 | Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations
- Lee, Grosse, et al.
- 2009
Citation Context ...ature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized. Some work, however, has considered the impact of oth...

132 | Efficient Learning of Sparse Representations with an Energy-Based Model
- Ranzato, Poultney, et al.
Citation Context ...ingle-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervise...

120 | What is the best multistage architecture for object recognition
- Jarrett, Kavukcuoglu, et al.
- 2009
Citation Context ...h as classification. Current solutions typically learn multi-level representations by greedily “pre-training” several layers of features, one layer at a time, using an unsupervised learning algorithm [10, 8, 16]. For each of these layers a number of design parameters are chosen: the number of features to learn, the locations where these features will be computed, and how to encode the inputs and outputs of t...

114 | Learning multiple layers of features from tiny images
- Krizhevsky
- 2009
Citation Context ...cross-validation, thus increasing running times dramatically. Though it is true that recently introduced algorithms have consistently shown improvements on benchmark datasets like NORB [14] and CIFAR [11], there are several other factors that affect the final performance of a feature learning system. Specifically, there are many “meta-parameters” defining the network architecture, such as the recept...

101 | Learning mid-level features for recognition
- Boureau, Bach, et al.
- 2010
Citation Context ...s well as different forms of normalization and rectification between layers. Similarly, Boureau et al. have considered the impact of coding strategies and different types of pooling, both in practice [3] and in theory [4]. Our work follows in this vein, but considers instead the structure of single-layer networks—before pooling, and orthogonal to the choice of algorithm or coding scheme. Many common ...

85 | Extracting and composing robust features with denoising autoencoders
- Vincent, Larochelle, et al.
- 2008
Citation Context ...sed to build deeper structures. Among the algorithms considered in the literature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to b...

83 | Sparse deep belief net model for visual area V2
- Lee, Ekanadham, et al.
- 2008
Citation Context ...h as classification. Current solutions typically learn multi-level representations by greedily “pre-training” several layers of features, one layer at a time, using an unsupervised learning algorithm [10, 8, 16]. For each of these layers a number of design parameters are chosen: the number of features to learn, the locations where these features will be computed, and how to encode the inputs and outputs of t...

79 | Kernel codebooks for scene categorization
- Gemert, Geusebroek, et al.
Citation Context ...thod has also been applied recursively to build multiple layers of features [1]. The effects of pooling and choice of activation function or coding scheme have similarly been studied for these models [13, 26, 19]. Van Gemert et al., for instance, demonstrate that “soft” activation functions (“kernels”) tend to work better than the hard assignment typically used with visual words models. This paper will compar...
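The hard-versus-soft assignment contrast in the snippet above can be made concrete. A hedged sketch follows; the exponential kernel is one simple choice of soft kernel, not necessarily van Gemert et al.'s exact formulation:

```python
import numpy as np

def hard_assign(z):
    """Hard visual-word coding: a 1-of-K indicator for the nearest codeword."""
    f = np.zeros_like(z)
    f[np.arange(z.shape[0]), z.argmin(axis=1)] = 1.0
    return f

def soft_assign(z, beta=1.0):
    """'Soft' kernel coding: closer codewords get larger normalized weights,
    so quantization error near codeword boundaries is smoothed out."""
    k = np.exp(-beta * z)
    return k / k.sum(axis=1, keepdims=True)

z = np.array([[1.0, 3.0, 5.0]])  # toy distances to three codewords
```

Hard assignment discards how close the runner-up codewords were; the soft code preserves that information, which is one intuition for why soft kernels tend to work better.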

76 | Sparse feature learning for deep belief networks
- Ranzato, Boureau, et al.
- 2007
Citation Context ...ature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized. Some work, however, has considered the impact of oth...

59 | Hyperfeatures - multilevel local coding for visual recognition
- Agarwal, Triggs
- 2006
Citation Context ...ision for building codebooks of “visual words” [4, 5, 13, 29], which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features [1]. The effects of pooling and choice of activation function or coding scheme have similarly been studied for these models [13, 26, 19]. Van Gemert et al., for instance, demonstrate that “soft” activati...

56 | Nonlinear learning using local coordinate coding
- Yu, Zhang, et al.
Citation Context ...ature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized. Some work, however, has considered the impact of oth...

41 | A statistical approach to material classification using image patch exemplars
- Varma, Zisserman
Citation Context ... However, these visualizations also show that similar results can be achieved using clustering algorithms. In particular, while clustering raw data leads to centroids consistent with those in [5] and [27], we see that clustering whitened data yields sharply localized filters that are very similar to those learned by the other algorithms. Thus, it appears that such features are easy to learn with clust...

36 | 3D Object Recognition with Deep Belief Nets
- Nair, Hinton
- 2009
Citation Context ...n accuracy (and error) for NORB (normalized-uniform) Algorithm Test accuracy (and error) Convolutional Neural Networks [14] 93.4% (6.6%) Deep Boltzmann Machines [25] 92.8% (7.2%) Deep Belief Networks [18] 95.0% (5.0%) (Best result of [10]) 94.4% (5.6%) K-means (Triangle) 97.0% (3.0%) K-means (Hard) 96.9% (3.1%) Sparse auto-encoder 96.9% (3.1%) Sparse RBM 96.2% (3.8%) 4.6 Final classification results T...

31 | Measuring invariances in deep networks
- Goodfellow, Le, et al.
- 2009
Citation Context ...ingle-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervise...

25 | Modeling Pixel Means and Covariances Using Factorized Third-Order Boltzmann Machines
- Ranzato, Hinton
Citation Context ...algorithms considered in the literature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized. Some work,...

24 | Convolutional deep belief networks on cifar-10. Unpublished manuscript
- Krizhevsky
- 2010
Citation Context ... [11]) 37.3% RBM with backpropagation [11] 64.8% 3-Way Factored RBM + ZCA (3 layers) [23] 65.3% Mean-covariance RBM (3 layers) [22] 71.0% Improved Local Coordinate Coding [31] 74.5% Convolutional RBM [12] 78.9% K-means (Triangle) 77.9% K-means (Hard) 68.6% Sparse auto-encoder 73.4% Sparse RBM 72.4% K-means (Triangle, 4k features) 79.6% We have shown that whitening, a stride of 1 pixel, a 6 pixel recep...

24 | Improved local coordinate coding using local tangents
- Yu, Zhang
- 2010
Citation Context ...uracy Raw pixels (reported in [11]) 37.3% RBM with backpropagation [11] 64.8% 3-Way Factored RBM + ZCA (3 layers) [23] 65.3% Mean-covariance RBM (3 layers) [22] 71.0% Improved Local Coordinate Coding [31] 74.5% Convolutional RBM [12] 78.9% K-means (Triangle) 77.9% K-means (Hard) 68.6% Sparse auto-encoder 73.4% Sparse RBM 72.4% K-means (Triangle, 4k features) 79.6% We have shown that whitening, a strid...

21 | Learning convolutional feature hierarchies for visual recognition
- Kavukcuoglu, Sermanet, et al.
- 2010
Citation Context ...include shifted copies of edges, increasing the receptive field size also increases the amount of redundancy we can expect in our filters. This caveat might be ameliorated by training convolutionally [19, 16, 12]. Note that small receptive fields might also increase the number of samples used in pooling and thus have a small effect similar to using a smaller stride. 6 Conclusion In this paper we have conducte...

13 | A theoretical analysis of feature pooling in visual recognition
- Boureau, Ponce, et al.
- 2010
Citation Context ...t forms of normalization and rectification between layers. Similarly, Boureau et al. have considered the impact of coding strategies and different types of pooling, both in practice [2] and in theory [3]. Our work follows in this vein, but considers instead the structure of single-layer networks—before pooling, and orthogonal to the choice of algorithm or coding scheme. Many common threads from the c...

9 | Large-Scale Object Recognition with CUDA-Accelerated Hierarchical Neural Networks
- Uetz, S
- 2009
Citation Context ...Algorithm Accuracy (error) Conv. Neural Network [16] 93.4% (6.6%) Deep Boltzmann Machine [26] 92.8% (7.2%) Deep Belief Network [20] 95.0% (5.0%) (Best result of [11]) 94.4% (5.6%) Deep neural network [27] 97.13% (2.87%) Sparse auto-encoder 96.9% (3.1%) Sparse RBM 96.2% (3.8%) K-means (Hard) 96.9% (3.1%) K-means (Triangle) 97.0% (3.0%) K-means (Triangle, 4000 features) 97.21% (2.79%) or more features w...

6 | Factored 3-way restricted Boltzmann machines for modeling natural images
- Ranzato, Krizhevsky, et al.
Citation Context ...er structures. Among the algorithms considered in the literature are sparse-coding [20, 15, 30], RBMs [8, 11], sparse RBMs [16], sparse auto-encoders [6, 24], denoising auto-encoders [28], “factored” [23] and mean-covariance [22] RBMs, as well as many others [21, 17, 32]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavil...
