## Regularized Principal Manifolds (2001)

### Download Links

- [www.ai.mit.edu]
- [mlg.anu.edu.au]
- [pca.narod.ru]
- [users.cecs.anu.edu.au]
- [jmlr.csail.mit.edu]
- [www.kernel-machines.org]
- DBLP

### Other Repositories/Bibliography

Venue: Computational Learning Theory: 4th European Conference

Citations: 32 (4 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Smola01regularizedprincipal,
  author    = {Alexander J. Smola and Robert C. Williamson},
  title     = {Regularized Principal Manifolds},
  booktitle = {Computational Learning Theory: 4th European Conference},
  year      = {2001},
  pages     = {214--229},
  publisher = {Springer}
}
```

### Abstract

Many settings of unsupervised learning can be viewed as quantization problems - the minimization …

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ... a few terms. For more details on regularization operators see e.g., (Girosi et al., 1995, Smola et al., 1998, Girosi, 1998). Essentially one may use any kernel introduced in support vector machines (Vapnik, 1998), Gaussian processes (Williams, 1998), or reproducing kernel Hilbert spaces (Wahba, 1990) in the expansions described above. The appealing property of this formulation is that it is completely indepe...

8090 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: $\ldots + Q(\alpha_1, \dots, \alpha_M)\bigr]$ (29). The minimization here is achieved in an iterative fashion by coordinate descent over $\alpha$ and $\zeta$. It operates analogously to the EM (expectation maximization) algorithm (Dempster et al., 1977): there the aim is to find (the parameters of) a distribution $p(x, l)$ where $x$ are observations and $l$ are latent variables. Keeping the parameters fixed one proceeds by maximizing $p(x, l)$ with respect to $l$. Th...

2030 | Learning with Kernels - Schölkopf, Smola - 2002

1827 | Robust statistics
- Huber
- 1981
Citation Context: $\ldots \min_{z\in\{1,\dots,k\}} \|x - f_z\|\, d\mu(x)$ (7). This setting is robust against outliers, since the maximum influence of each pattern is bounded. An intermediate setting can be derived from Huber's robust cost function (Huber, 1981). Here we have $c(x, f(z)) = \frac{1}{2\sigma}\|x - f(z)\|^2$ for $\|x - f(z)\| \le \sigma$ and $\|x - f(z)\| - \frac{\sigma}{2}$ otherwise (8), for suitably chosen $\sigma$. Eq. (8) behaves like a k-means vector quantizer for small residuals, however with the buil...
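The piecewise cost in Eq. (8) above translates directly into code. This is a minimal sketch, not the paper's implementation; the threshold symbol $\sigma$ and the function name are my reconstructions from the garbled excerpt.

```python
import numpy as np

def huber_cost(x, fz, sigma=1.0):
    """Huber-type robust cost as quoted in Eq. (8):
    quadratic for residuals up to sigma, linear beyond it."""
    r = np.linalg.norm(np.asarray(x, float) - np.asarray(fz, float))
    if r <= sigma:
        return r ** 2 / (2 * sigma)
    return r - sigma / 2  # continuous at r == sigma (both branches give sigma/2)
```

The bounded influence of outliers follows from the linear branch: the slope of the cost is capped at 1 no matter how large the residual grows.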

1652 | Atomic decomposition by basis pursuit - Chen, Donoho, et al. - 2001

1650 | Vector Quantization and Signal Compression - Gersho, Gray - 1991

1291 | A training algorithm for optimal margin classifiers
- Boser, Guyon, et al.
Citation Context: ...n as $\Phi(x) = \bigl(\sqrt{\lambda_1}\,\psi_1(x), \sqrt{\lambda_2}\,\psi_2(x), \dots\bigr)$ (67). This property has been used to develop and understand learning algorithms for RBF networks (Aizerman et al., 1964), support vector machines (Boser et al., 1992), and kernel PCA (Schölkopf et al., 1998). In the current context we will use the geometrical viewpoint to provide bounds on the entropy numbers of the classes of functions F generated by kernels. ...

1273 | Spline models for observational data - Wahba - 1990

1048 | Nonlinear component analysis as a kernel eigenvalue problem - Schölkopf, Smola, et al. - 1998

842 | Least squares quantization in pcm
- Lloyd
- 1982
Citation Context: ...and again $c(x, f(z)) = \|x - f(z)\|^2$. Then $R[f] := \int_X \min_{z\in\{1,\dots,k\}} \|x - f_z\|^2\, d\mu(x)$ (6) denotes the canonical distortion error of a vector quantizer. In practice one can use the k-means algorithm (Lloyd, 1982) to find a set of vectors $\{f_1, \dots, f_k\}$ minimizing $R_{\mathrm{emp}}[f]$. Also here (Bartlett et al., 1998), one can prove convergence properties of (the minimizer) of $R_{\mathrm{emp}}[f]$ to (one of) the minimizer(...
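The excerpt above defines the distortion that the k-means algorithm minimizes. As a minimal runnable sketch of the Lloyd (1982) iteration (my own illustration; function and variable names are assumptions, not the paper's code):

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd iteration: alternate nearest-codebook assignment and
    centroid updates to reduce the empirical distortion
    (1/m) * sum_i min_z ||x_i - f_z||^2."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    F = X[rng.choice(len(X), size=k, replace=False)]  # codebook {f_1, ..., f_k}
    for _ in range(iters):
        d = ((X[:, None, :] - F[None, :, :]) ** 2).sum(-1)  # squared distances
        z = d.argmin(1)                                     # assignment step
        for j in range(k):                                  # update step
            if (z == j).any():
                F[j] = X[z == j].mean(0)
    d = ((X[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    return F, d.min(1).mean()  # codebook and empirical distortion R_emp
```

Each iteration weakly decreases the empirical distortion, which is why convergence to a (local) minimizer of $R_{\mathrm{emp}}[f]$ can be established.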

824 | Solution of Ill-posed Problems
- Tikhonov, Arsenin
Citation Context: ...(1) analyzes the empirical quantization error $R_{\mathrm{emp}}^m[f] := R_{\mathrm{emp}}[f] := \int_X \min_{z\in Z} c(x, f(z))\, d\mu_m(x) = \frac{1}{m}\sum_{i=1}^m \min_{z\in Z} c(x_i, f(z))$ (3). The general problem of minimizing (3) is ill-posed (Tikhonov and Arsenin, 1977, Morozov, 1984). Even worse - with no further restrictions on F, small values of $R_{\mathrm{emp}}[f]$ do not guarantee small values of $R[f]$ either. Many problems of unsupervised learning can be cast in the fo...

803 | Estimation of Dependencies Based on Empirical Data
- Vapnik
- 1979
Citation Context: ...ppendix. Section 7.4 gives overall sample complexity rates. In order to avoid technical complications arising from unbounded cost functions (like boundedness of some moments of the distribution $\mu(x)$ (Vapnik, 1982)) we will assume that there exists some $r > 0$ such that the probability measure of a ball of radius $r$ is 1, i.e., $\mu(U_r) = 1$. Kégl et al. (2000) showed that under these assumptions also the princi...

567 | Convergence of Stochastic Processes - Pollard - 1984

311 | Regularization theory and neural-network architectures, Neural Computation - Girosi, Jones, et al. - 1995

295 | Principal curves
- Hastie, Stuetzle
- 1989
Citation Context: ...est?" This means that one is looking for a descriptive model of the data, thus also a (possibly quite crude) model of the underlying probability distribution. Generative models like principal curves (Hastie and Stuetzle, 1989), the generative topographic map (Bishop et al., 1998), several linear Gaussian models (Roweis and Ghahramani, 1999), or also simple vector quantizers (Bartlett et al., 1998) are examples thereof. We...

282 | Theoretical foundations of the potential function method in pattern recognition learning, Automation and Remote Control - Aizerman, Braverman, et al. - 1964

280 | GTM: The generative topographic mapping
- Bishop, Svensén, et al.
- 1998
Citation Context: ... to be a grid, identical with the points $z_i$ in our setting, and the different regularizer (called a Gaussian prior in that case), which is of $\ell_2$ type. In other words, instead of using $\|Pf\|^2$ Bishop et al. [2] choose $\sum_i \|\alpha_i\|^2$ as a regularizer. Finally, in the GTM several $\zeta_i$ may take on "responsibility" for having generated a data-point $x_i$ (this follows naturally from the generative model setting in the lat...

263 | A unifying review of linear Gaussian models, Neural Computation 11
- Roweis, Ghahramani
- 1999
Citation Context: ... of the underlying probability distribution. Generative models like principal curves (Hastie and Stuetzle, 1989), the generative topographic map (Bishop et al., 1998), several linear Gaussian models (Roweis and Ghahramani, 1999), or also simple vector quantizers (Bartlett et al., 1998) are examples thereof. We will study the second type of model in the present paper. Since many problems of unsupervised learning can be forma...

255 | Functions of positive and negative type and their connection with the theory of integral equations
- Mercer
Citation Context: ...under consideration. It turns out that a feature space representation of kernels $k$ is useful in this regard. In particular, like (15) we can write any kernel $k(x, x')$ satisfying Mercer's condition (Mercer, 1909) as a dot product in some feature space (see Appendix A.2 for details) by $k(x, x') = \sum_i \lambda_i\, \psi_i(x)\, \psi_i(x')$ (51). Here $(\lambda_i, \psi_i)$ is the eigensystem of the operator $T_k f := \int_Z f(x')\, k(x', \ldots$
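Eq. (51) above writes the kernel as a dot product $\sum_i \lambda_i \psi_i(x)\psi_i(x')$ in feature space. On a finite sample the analogous construction eigendecomposes the Gram matrix; the sketch below is my own illustration of that analogy, not code from the paper.

```python
import numpy as np

def empirical_feature_map(K):
    """Finite-sample analogue of Eq. (51): factor a symmetric PSD Gram
    matrix K as Phi @ Phi.T, so that row i is the feature vector of x_i
    with components sqrt(lambda_j) * u_ij."""
    lam, U = np.linalg.eigh(K)
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative eigenvalues
    return U * np.sqrt(lam)        # scale each eigenvector column
```

For any PSD Gram matrix, `Phi @ Phi.T` reproduces `K` up to numerical error, mirroring how the eigensystem $(\lambda_i, \psi_i)$ reproduces $k(x, x')$.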

246 | Group Theory and its Applications to Physical Problems (Addison-Wesley) - Hamermesh - 1962

203 | An equivalence between sparse approximation and support vector machines - Girosi - 1998

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- Williams
- 1999
Citation Context: ...egularization operators see e.g., (Girosi et al., 1995, Smola et al., 1998, Girosi, 1998). Essentially one may use any kernel introduced in support vector machines (Vapnik, 1998), Gaussian processes (Williams, 1998), or reproducing kernel Hilbert spaces (Wahba, 1990) in the expansions described above. The appealing property of this formulation is that it is completely independent of the dimensionality and parti...

193 | Nonlinear programming - Mangasarian - 1994

148 | Spline models for observational data, volume 59
- Wahba
- 1990
Citation Context: ... Smola et al., 1998, Girosi, 1998). Essentially one may use any kernel introduced in support vector machines (Vapnik, 1998), Gaussian processes (Williams, 1998), or reproducing kernel Hilbert spaces (Wahba, 1990) in the expansions described above. The appealing property of this formulation is that it is completely independent of the dimensionality and particular structure of Z. 3.3 Linear Programming Regular...

146 | The connection between regularization operators and support vector kernels, Neural Networks - Smola, Schölkopf, et al. - 1998

122 | Methods for solving incorrectly posed problems
- Morozov
- 1984
Citation Context: ...uantization error $R_{\mathrm{emp}}^m[f] := R_{\mathrm{emp}}[f] := \int_X \min_{z\in Z} c(x, f(z))\, d\mu_m(x) = \frac{1}{m}\sum_{i=1}^m \min_{z\in Z} c(x_i, f(z))$ (3). The general problem of minimizing (3) is ill-posed (Tikhonov and Arsenin, 1977, Morozov, 1984). Even worse - with no further restrictions on F, small values of $R_{\mathrm{emp}}[f]$ do not guarantee small values of $R[f]$ either. Many problems of unsupervised learning can be cast in the form of finding a m...

113 | Training with noise is equivalent to Tikhonov regularization - Bishop - 1995

74 | Learning and Design of Principal Curves - Kégl, Krzyzak, et al. - 2000

73 | Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators
- Williamson, Smola, et al.
- 2001
Citation Context: ...in X. The distance measure (no metric!) for $N(\varepsilon)$ is defined as $\sup_{x\in U_r} |\Delta(x, f) - \Delta(x, f')| \le \varepsilon$. Here $\Delta(x, f)$ is the minimum distance between a curve $f(\cdot)$ and $x \in U_r$. By using functional analytic tools [13] one can obtain more general results, which then, in turn, can replace (24) to obtain better bounds on the expected quantization error by using the properties of the regularization operator. Denote by...

62 | MINOS 5.1 User's Guide - Murtagh, Saunders - 1987

61 | Entropy, compactness and the approximation of operators
- Carl, Stephani
- 1990
Citation Context: ...dix A for details. The rates obtained in proposition 5 are quite strong. In particular recall that for compact sets in finite dimensional spaces of dimension $d$ the covering number is $N(\varepsilon, F) = O(\varepsilon^{-d})$ (Carl and Stephani, 1990). In view of (52) this means that even though we are dealing with a nonparametric estimator, it behaves almost as if it was a finite dimensional one. All that is left is to substitute (52) and (53) int...

59 | GTM: The generative topographic mapping, Neural Computation 10(1):215–234 - Bishop, Svensén, et al. - 1998

55 | Combining support vector and mathematical programming methods for classification - Bennett - 1999

49 | Massive data discrimination via linear support vector machines, Optimization Methods and Software - Bradley, Mangasarian

49 | Clustering via Concave Minimization - Bradley, Mangasarian, et al. - 1997

30 | Support vector density estimation - Weston, Gammerman, et al. - 1999

27 | The minimax distortion redundancy in empirical quantizer design
- Bartlett, Linder, et al.
- 1998
Citation Context: ...ike principal curves (Hastie and Stuetzle, 1989), the generative topographic map (Bishop et al., 1998), several linear Gaussian models (Roweis and Ghahramani, 1999), or also simple vector quantizers (Bartlett et al., 1998) are examples thereof. We will study the second type of model in the present paper. Since many problems of unsupervised learning can be formalized in a quantization-functional setting (see section 2)...

27 | LOQO users manual - version 3.10 - Vanderbei - 1997

20 | The motion coherence theory
- Yuille, Grzywacz
- 1988
Citation Context: ...ng $R_{\mathrm{reg}}[f]$. Minimizing the regularized quantization functional for a given kernel expansion is equivalent to solving
$$\min_{\{\alpha_1,\dots,\alpha_M\}\subset X,\ \{\zeta_1,\dots,\zeta_m\}\subset Z} \left[\sum_{i=1}^m \Bigl\|x_i - \sum_{j=1}^M \alpha_j k(\zeta_i, z_j)\Bigr\|^2 + \frac{\lambda}{2}\sum_{i,j=1}^M \langle\alpha_i, \alpha_j\rangle\, k(z_i, z_j)\right]. \tag{14}$$
This is achieved in an iterative fashion analogously to how the EM algorithm operates. One iterates over minimizing (14) with respect to $\{\zeta_1, \dots,$...
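The alternating scheme the excerpt describes for Eq. (14) (solve for the coefficients $\alpha$ with the latent sites fixed, then re-project each $x_i$ onto the curve) can be sketched for a 1-D latent space with an RBF kernel. All names, initialization, and parameter choices below are my own assumptions, not the authors' implementation.

```python
import numpy as np

def rbf(a, b, w=0.2):
    """Gaussian RBF kernel k(a, b) on scalars (broadcasts over arrays)."""
    return np.exp(-((a - b) ** 2) / (2 * w ** 2))

def fit_principal_curve(X, M=10, lam=0.1, iters=10):
    """Coordinate descent on a regularized quantization functional in the
    spirit of Eq. (14), with curve f(zeta) = sum_j alpha_j k(zeta, z_j)."""
    X = np.asarray(X, float)
    m = len(X)
    z = np.linspace(0.0, 1.0, M)        # fixed kernel nodes z_j
    grid = np.linspace(0.0, 1.0, 200)   # candidate latent values
    zeta = np.linspace(0.0, 1.0, m)     # initial latent sites (assumes ordered data)
    Kz = rbf(z[:, None], z[None, :])    # M x M regularizer matrix k(z_i, z_j)
    alpha = np.zeros((M, X.shape[1]))
    for _ in range(iters):
        K = rbf(zeta[:, None], z[None, :])   # m x M design matrix
        # alpha-step: regularized least squares with zeta fixed
        alpha = np.linalg.solve(K.T @ K + 0.5 * lam * Kz, K.T @ X)
        # zeta-step: project each x_i onto the nearest sampled curve point
        curve = rbf(grid[:, None], z[None, :]) @ alpha
        d = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(-1)
        zeta = grid[d.argmin(1)]
    return alpha, zeta
```

Each half-step can only decrease the objective, which is the same monotonicity argument that underlies the EM analogy in the excerpt.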

11 | A correspondence between Bayesian estimation on stochastic processes and smoothing by splines - Kimeldorf, Wahba - 1971

10 | IBM optimization subroutine library guide and reference - IBM Corporation - 1992

8 | Linear programming support vector machines for pattern classification and regression estimation and the set reduction algorithm - Friess, Harrison - 1998

6 | Entropy numbers for convex combinations and MLPs - Smola, Elisseeff, et al. - 2000

6 | A Theory of Learning in Artificial Neural Networks - Anthony, Bartlett - 1999

4 | Smoothing and ill-posed problems
- Wahba
- 1978
Citation Context: $\ldots = \frac{1}{2}\|Pf\|^2$ (13). Here $P$ is a regularization operator penalizing unsmooth functions $f$ via a mapping into a dot product space (e.g., a reproducing kernel Hilbert space (Kimeldorf and Wahba, 1971, Wahba, 1979, 1990)). In this case one obtains $R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\|Pf\|^2 = \frac{1}{m}\sum_{i=1}^m \min_{z\in Z}\|x_i - f(z)\|^2 + \frac{\lambda}{2}\|Pf\|^2$ (14). As we will show in section 4, if one requires certain invariances regard...

2 | Nonlinear principal component analysis
- Der, Steinmetz, et al.
- 1998
Citation Context: ...n theory. In particular this leads to a natural generalization (to higher dimensionality and different criteria of regularity) of the principal curves algorithm with a length constraint [7]. See also [4] for an overview and background on principal curves. Experimental results demonstrate the feasibility of this approach. In the second part we use the quantization functional approach to give uniform c...

2 | k-plane clustering, Mathematical Programming - Bradley, Mangasarian - 1998

2 | Generalization bounds for convex combinations of kernel functions - Smola, Williamson, et al. - 1998

1 | A Theory of Learning in Artificial Neural Networks
- Anthony, Bartlett
- 1999
Citation Context: ... for exponential rates of decay ($s$ is an arbitrary positive constant). It would be surprising if we could do any better given that supervised learning rates are typically no better than $O(m^{-1/2})$ (Anthony and Bartlett, 1999, Chapter 19). In the following we assume that F is compact; this is true of all the specific F considered above. Proposition 6 (Learning Rates for Principal Manifolds): Suppose F defined by (41) is c...
