## Recognizing handwritten digits using mixtures of linear models (1995)

Venue: Advances in Neural Information Processing Systems 7

Citations: 56 (6 self)

### BibTeX

@INPROCEEDINGS{Hinton95recognizinghandwritten,
    author = {Geoffrey E Hinton and Michael Revow and Peter Dayan},
    title = {Recognizing handwritten digits using mixtures of linear models},
    booktitle = {Advances in Neural Information Processing Systems 7},
    year = {1995},
    pages = {1015--1022},
    publisher = {MIT Press}
}

### Abstract

We construct a mixture of locally linear generative models of a collection of pixel-based images of digits, and use them for recognition. Different models of a given digit are used to capture different styles of writing, and new images are classified by evaluating their log-likelihoods under each model. We use an EM-based algorithm in which the M-step is computationally straightforward principal components analysis (PCA). Incorporating tangent-plane information [12] about expected local deformations only requires adding tangent vectors into the sample covariance matrices for the PCA, and it demonstrably improves performance.
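The procedure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it makes several simplifying assumptions: cluster assignments are hard rather than the soft responsibilities of a full EM algorithm, squared reconstruction error stands in for the log-likelihood (code costs are ignored), and the weighting of the tangent-vector term is a guess. All function and parameter names (`fit_pca_model`, `reconstruction_error`, `fit_mixture`, `classify`, `tangent_weight`) are hypothetical.

```python
import numpy as np


def fit_pca_model(X, n_components, tangents=None, tangent_weight=1.0):
    """Fit one linear model: the mean plus the leading principal directions.

    `tangents`, if given, has shape (m, d). Its outer products are added to
    the sample covariance; the scaling used here is an assumption.
    """
    mean = X.mean(axis=0)
    centered = X - mean
    cov = centered.T @ centered / len(X)
    if tangents is not None:
        cov = cov + tangent_weight * (tangents.T @ tangents) / len(X)
    _, eigvecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :n_components]   # (d, k) leading directions
    return mean, basis


def reconstruction_error(X, model):
    """Squared distance from each row of X to the model's affine subspace."""
    mean, basis = model
    centered = X - mean
    residual = centered - (centered @ basis) @ basis.T
    return np.sum(residual ** 2, axis=1)


def fit_mixture(X, n_models, n_components, n_iters=10, seed=0):
    """Alternate between refitting PCA models and reassigning points (hard EM)."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(n_models, size=len(X))
    models = []
    for _ in range(n_iters):
        models = []
        for j in range(n_models):
            members = X[assign == j]
            if len(members) <= n_components:
                # Reseed a nearly empty model from randomly chosen points.
                idx = rng.choice(len(X), n_components + 1, replace=False)
                members = X[idx]
            models.append(fit_pca_model(members, n_components))
        errors = np.stack([reconstruction_error(X, m) for m in models], axis=1)
        assign = errors.argmin(axis=1)
    return models


def classify(x, class_models):
    """Return the label whose best-fitting model reconstructs x most closely.

    `class_models` maps each label to a list of models from `fit_mixture`.
    """
    best = {
        label: min(reconstruction_error(x[None, :], m)[0] for m in models)
        for label, models in class_models.items()
    }
    return min(best, key=best.get)
```

In use, one would call `fit_mixture` once per digit class on that class's training images (flattened to vectors), collect the results in a dictionary keyed by digit, and pass each test image to `classify`. Tangent vectors, where available, would be supplied to `fit_pca_model` for each model.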

### Citations

8134 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...] where vector quantization was used to define sub-classes and PCA was performed within each sub-class (see also [2]). We used an iterative method based on the Expectation Maximisation (EM) algorithm [4] to fit mixtures of linear models. The reductio of the local linear approach would have just one training pattern in each model. This approach would amount to a nearest neighbour method for recognitio...

238 | Efficient pattern recognition using a new transformation distance
- Simard, LeCun, et al.
- 1993
Citation Context ...mbiguous cases. For a given linear model, not counting the code cost implies that deformations of the images along the principal components for the sub-class are free. This is like the metric used by [11] except that they explicitly specified the directions in which deformations should be free, rather than learning them from the data. We wished to incorporate information about the preferred directions...

224 | The wake-sleep algorithm for unsupervised neural networks
- Hinton, Dayan, et al.
- 1995
Citation Context ... maximum likelihood factor analysis (which is closely related to PCA) and the resulting architecture and hierarchical variants of it can be formulated as real valued versions of the Helmholtz machine [3, 6]. The cost of coding the factors relative to their prior is implicitly included in this formulation, as is the possibility that different input pixels are subject to different amounts of noise. Unlike...

170 | A database for handwritten text recognition research
- Hull
- 1994
Citation Context ...es. 3 Results We have evaluated the performance of the system on data from the CEDAR CDROM 1 database containing handwritten digits lifted from mail pieces passing through a United States Post Office [8]. We divided the br training set of binary ...

| Clustering | Recognition | Raw Errors |
| --- | --- | --- |
| None | None | 62 (3.10%) |
| Heavy | Light | 29 (1.45%) |
| Heavy | None | 45 (2.25%) |
| Heavy | Heavy | 90 (4.50%) |

Table 1: Classification errors on ...

111 | Autoencoders, minimum description length and Helmholtz free energy
- Hinton, Zemel
- 1993
Citation Context ...ich the input-hidden weights produce a code for a particular case and the hidden-output weights embody a generative model which turns this code back into a close approximation of the original example [14, 7]. Code costs (under some prior) and reconstruction error (squared error assuming an isotropic Gaussian misfit model) sum to give the overall code length which can be viewed as a lower bound on the log...

78 | Auto-association by multilayer perceptrons and singular value decomposition
- Bourlard, Kamp
- 1988
Citation Context ...nising handwritten digits from grey-level pixel images using linear auto-encoders. Linear hidden units for autoencoders are barely worse than non-linear ones when squared reconstruction error is used [1], but have the great computational advantage during training that input-hidden and hidden-output weights can be derived from principal components analysis (PCA) of the training data. In effect a PCA en...

68 | Tangent prop - A formalism for specifying selected invariances in an adaptive network
- Simard, Victorri, et al.
- 1992
Citation Context ...heir log-likelihoods under each model. We use an EM-based algorithm in which the M-step is computationally straightforward principal components analysis (PCA). Incorporating tangent-plane information [12] about expected local deformations only requires adding tangent vectors into the sample covariance matrices for the PCA, and it demonstrably improves performance. 1 Introduction The usual way of using...

52 | A minimum description length framework for unsupervised learning
- Zemel
- 1993
Citation Context ...ich the input-hidden weights produce a code for a particular case and the hidden-output weights embody a generative model which turns this code back into a close approximation of the original example [14, 7]. Code costs (under some prior) and reconstruction error (squared error assuming an isotropic Gaussian misfit model) sum to give the overall code length which can be viewed as a lower bound on the log...

36 | Transformation invariant autoassociation with application to handwritten character recognition
- Schwenk, Milgram
- 1995
Citation Context ...f these segments. Care should be taken in generalising this picture to high dimensional spaces. The next section develops the theory behind variants of these systems (which is very similar to that in [5, 10]), and section 3 discusses how they perform. 2 Theory Linear auto-encoders embody a model in which variations from the mean of a population along certain directions are cheaper than along others, as m...

29 | Learning prototype models for tangent distance
- Hastie, Simard, et al.
- 1994
Citation Context ...f these segments. Care should be taken in generalising this picture to high dimensional spaces. The next section develops the theory behind variants of these systems (which is very similar to that in [5, 10]), and section 3 discusses how they perform. 2 Theory Linear auto-encoders embody a model in which variations from the mean of a population along certain directions are cheaper than along others, as m...

20 | The Helmholtz machine. Neural Computation
- Dayan, Hinton, et al.
- 1995
Citation Context ... maximum likelihood factor analysis (which is closely related to PCA) and the resulting architecture and hierarchical variants of it can be formulated as real valued versions of the Helmholtz machine [3, 6]. The cost of coding the factors relative to their prior is implicitly included in this formulation, as is the possibility that different input pixels are subject to different amounts of noise. Unlike...

7 | The Component Object Model: A
- Williams, Kindel
- 1994
Citation Context ...rived from principal components analysis (PCA) of the training data. In effect a PCA encoder approximates the entire N dimensional distribution of the data with a lower dimensional "Gaussian pancake" [13], choosing, for optimal data compression, to retain just a few of the PCs. One could build a single PCA model for each digit -- however the many different styles of writing suggest that more than one ...

4 | Nonlinear image interpolation using surface learning
- Bregler, Omohundro
- 1994
Citation Context ...imating each class by its own model. A similar idea for data compression was used by [9] where vector quantization was used to define sub-classes and PCA was performed within each sub-class (see also [2]). We used an iterative method based on the Expectation Maximisation (EM) algorithm [4] to fit mixtures of linear models. The reductio of the local linear approach would have just one training pattern...

2 | Unsupervised learning of object models
- Williams, Zemel, et al.
- 1993
Citation Context ...rived from principal components analysis (PCA) of the training data. In effect a PCA encoder approximates the entire N dimensional distribution of the data with a lower dimensional “Gaussian pancake” [13], choosing, for optimal data compression, to retain just a few of the PCs. One could build a single PCA model for each digit – however the many different styles of writing suggest that more than one G...