## Some Notes on Applied Mathematics for Machine Learning (2004)

### Download Links

- [research.microsoft.com]
- [csl.anu.edu.au]
- [users.cecs.anu.edu.au]
- DBLP

### Other Repositories/Bibliography

Venue: Advanced Lectures on Machine Learning

Citations: 1 (1 self)

### BibTeX

@INPROCEEDINGS{Burges04somenotes,
  author    = {Christopher J. C. Burges},
  title     = {Some Notes on Applied Mathematics for Machine Learning},
  booktitle = {Advanced Lectures on Machine Learning},
  year      = {2004},
  pages     = {21--40},
  publisher = {Springer}
}

### Abstract

This chapter describes Lagrange multipliers and some selected subtopics from matrix analysis, from a machine learning perspective. The goal is to give a detailed description of a number of mathematical constructions that are widely used in applied machine learning.

### Citations

4660 | Matrix analysis
- Horn, Johnson
- 1986
Citation Context: ...h in 1813, in Paris. 3 Some Notes on Matrices This section touches on some useful results in the theory of matrices that are rarely emphasized in coursework. For a complete treatment, see for example [12] and [11]. Following [12], the set of p by q matrices is denoted Mpq, the set of (square) p by p matrices by Mp, and the set of symmetric p by p matrices by Sp. We work only with real matrices - the gen...

3661 | Convex optimization
- Boyd, Vandenberghe
Citation Context: ...lution by requiring that the gradients of the Lagrangian vanish, and we also have λc(x∗) = 0. This latter condition is one of the important Karush-Kuhn-Tucker conditions of convex optimization theory [15, 4], and can facilitate the search for the solution, as the next exercise shows. For multiple inequality constraints, again at the solution ∇f must lie in the space spanned by the ∇ci, and again if the L...
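The complementarity condition λc(x∗) = 0 mentioned in this excerpt can be checked on a toy problem. A minimal sketch (the problem and all values below are assumed for illustration, not from the chapter): minimize f(x, y) = x² + y² subject to c(x, y) = 1 − x − y ≤ 0.

```python
# Hypothetical toy problem illustrating the KKT complementarity condition
# lambda * c(x*) = 0. Minimize f(x, y) = x^2 + y^2 s.t. c(x, y) = 1 - x - y <= 0.
#
# Stationarity of the Lagrangian L = f + lambda * c gives
#   2x - lambda = 0 and 2y - lambda = 0, so x = y = lambda / 2.
# If the constraint is active, x + y = 1 => x = y = 1/2 and lambda = 1 > 0.
x = y = 0.5
lam = 1.0
c = 1 - x - y

grad_stationarity = (2 * x - lam, 2 * y - lam)  # should vanish at the solution
complementarity = lam * c                       # should be exactly zero

print(grad_stationarity)  # (0.0, 0.0)
print(complementarity)    # 0.0
```

Here the multiplier is strictly positive because the constraint is active; had the unconstrained minimum satisfied the constraint, complementarity would instead force λ = 0.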

2711 | Indexing by Latent Semantic Analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context: ...D. SVD is perhaps less familiar, but it plays important roles in everything from theorem proving to algorithm design (for example, for a classic result on applying SVD to document categorization, see [10]). The key observation is that, given A ∈ Mmn, although we cannot perform an eigendecomposition of A, we can do so for the two matrices AAᵀ ∈ Sm and AᵀA ∈ Sn. Since both of these are positive semidefin...
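The key observation in this excerpt is easy to verify numerically. A minimal sketch (random toy matrix, assumed for illustration): the squared singular values of a rectangular A coincide with the eigenvalues of the square symmetric matrix AAᵀ.

```python
import numpy as np

# For rectangular A (no eigendecomposition available), the squared singular
# values of A equal the eigenvalues of the symmetric matrix A A^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # A in M_{3,5}

s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
eig_AAT = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]  # eigenvalues, descending

print(np.allclose(s**2, eig_AAT))  # True
```

The same squared singular values also appear among the eigenvalues of AᵀA ∈ S5, padded with zeros, which is why either square matrix can be used to construct the SVD.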

2277 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation Context: ... machines, the Lagrange multipliers have a physical force interpretation, and can be used to find the exact solution to the problem of separating points in a symmetric simplex in arbitrary dimensions [6]. For the remaining algorithms mentioned here, see [7] for details on the underlying mathematics. In showing that the principal PCA directions give minimal reconstruction error, one requires that the ...

1610 | Nonlinear dimensionality reduction by locally linear embedding
- Roweis, Saul
- 2000
Citation Context: ...inimal reconstruction error, one requires that the projection directions being sought after are orthogonal, and this can be imposed by introducing a matrix of multipliers. In locally linear embedding [17], the translation invariance constraint is imposed for each local patch by a multiplier, and the constraint that a solution matrix in the reconstruction algorithm be orthogonal is again imposed by a m...

1107 | Statistics for Spatial Data
- Cressie
- 1993
Citation Context: ... classification, where in Gaussian process regression, c is needed to compute the variance in the estimate of the function value f(x) at the test point x, and b and c are needed to compute the mean of f(x) [9, 20]. Lemma 2: Given K ∈ Mn−1 and K+ ∈ Mn as defined above, then the elements of K+ are given by: c = 1/(u − v′K⁻¹v), b = −(1/(u − v′K⁻¹v)) v′K⁻¹, and Aij = K⁻¹ij + (1/(u − v′K⁻¹v))(v′K⁻¹...
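The bordered-matrix formulas quoted in this excerpt can be checked numerically. A minimal sketch (random positive-definite toy matrix, assumed for illustration) verifying that for K+ = [[K, v], [v′, u]], the bottom-right entry of K+⁻¹ is c = 1/(u − v′K⁻¹v) and the bottom row is b = −c v′K⁻¹:

```python
import numpy as np

# Numerical check of the bordered (block) matrix inverse: partition
# K_plus = [[K, v], [v', u]] and compare the closed-form c and b against
# a direct inverse.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
K_plus = M @ M.T + 4 * np.eye(4)   # symmetric positive definite, so invertible

K = K_plus[:3, :3]
v = K_plus[:3, 3]
u = K_plus[3, 3]

c = 1.0 / (u - v @ np.linalg.solve(K, v))   # 1 / Schur complement
b = -c * np.linalg.solve(K, v)              # -c * K^{-1} v (b as a row)

inv = np.linalg.inv(K_plus)
print(np.isclose(inv[3, 3], c))      # True
print(np.allclose(inv[3, :3], b))    # True
```

This is the update that lets Gaussian process implementations grow the inverse kernel matrix by one training point without re-inverting from scratch.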

729 | Laplacian eigenmaps for dimensionality reduction and data representation
- Belkin, Niyogi
- 2003
Citation Context: ...x, subject to the constraint that the length, ρ = ∫₀¹ √(1 + y′²) dx, is fixed (here, prime denotes differentiation with respect to x). The Lagrangian is therefore L = ∫₀¹ y dx + λ(∫₀¹ √(1 + y′²) dx − ρ) (2). Two new properties of the problem appear here: first, integrals appear in the Lagrangian, and second, we are looking for a solution which is a function, not a point. To solve this we will use the...

476 | Probabilistic principal component analysis
- Tipping, Bishop
- 1999
Citation Context: ... ′ = (1 + WW′)⁻¹WW′ = WW′(1 + WW′)⁻¹. (This is used, for example, in the derivation of the conditional distribution of the latent variables given the observed variables, in probabilistic PCA [19].) The Minkowski p norm has the important property that ‖Ax‖p ≤ ‖A‖p ‖x‖p. Let's use this, and the L1 and L∞ matrix norms, to prove a basic fact about stochastic matrices. A matrix P is stochastic if i...
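The norm argument sketched in this excerpt can be illustrated concretely. A minimal sketch (toy row-stochastic matrix, assumed for illustration): the L∞ matrix norm of a row-stochastic P is its largest absolute row sum, which is 1, so every eigenvalue satisfies |λ| ≤ 1, and Pe = e shows the bound is attained.

```python
import numpy as np

# For a row-stochastic P (nonnegative entries, rows summing to 1):
#   ||P||_inf = max absolute row sum = 1,
# and since ||Px||_inf <= ||P||_inf ||x||_inf, every eigenvalue has |lambda| <= 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

inf_norm = np.abs(P).sum(axis=1).max()
eigs = np.linalg.eigvals(P)
e = np.ones(3)

print(np.isclose(inf_norm, 1.0))          # True
print(np.abs(eigs).max() <= 1 + 1e-9)     # True
print(np.allclose(P @ e, e))              # True: eigenvalue 1 is attained
```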

404 | Multidimensional Scaling
- Cox, Cox
- 2000
Citation Context: ...matrix, find the underlying vectors xi ∈ ℝᵈ, where d is chosen to be as small as possible, given the constraint that the distance matrix reconstructed from the xi approximates D with acceptable accuracy [8]. d is chosen to be small essentially to remove unimportant variance from the problem (or, if sufficiently small, for data visualization). Now let e be the column vector of n ones, and introduce the '...
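The reconstruction described in this excerpt is classical MDS via double centering. A minimal sketch (toy planar points, assumed for illustration): build the squared-distance matrix, center it with Pᵉ = I − ee′/n, eigendecompose, and check the embedding reproduces the original distances.

```python
import numpy as np

# Classical MDS sketch: recover coordinates from a squared-distance matrix D2
# by double centering, B = -1/2 * Pe D2 Pe, then eigendecomposing B.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])  # toy points
n = len(X)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances

Pe = np.eye(n) - np.ones((n, n)) / n     # centering projection, Pe = I - e e'/n
B = -0.5 * Pe @ D2 @ Pe                  # Gram matrix of the centered points

w, V = np.linalg.eigh(B)                 # eigenvalues ascending
Y = V[:, -2:] * np.sqrt(w[-2:])          # d = 2 embedding from top eigenpairs

D2_rec = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
print(np.allclose(D2, D2_rec))  # True: distances reconstructed exactly
```

Because the toy points genuinely live in the plane, d = 2 reconstructs D exactly; for noisy data one keeps only the eigenvalues that dominate the spectrum.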

262 | Using maximum entropy for text classification
- Nigam, Lafferty, et al.
- 1999
Citation Context: ...ge multipliers λi, which are themselves constrained by Eq. (8). For an example from the document classification task of how imposing linear constraints on the probabilities can arise in practice, see [16]. 2.8 Some Algorithm Examples Lagrange multipliers are ubiquitous for imposing constraints in algorithms. Here we list their use in a few modern machine learning algorithms; in all of these applicatio...

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- Williams
- 1999
Citation Context: ... classification, where in Gaussian process regression, c is needed to compute the variance in the estimate of the function value f(x) at the test point x, and b and c are needed to compute the mean of f(x) [9, 20]. Lemma 2: Given K ∈ Mn−1 and K+ ∈ Mn as defined above, then the elements of K+ are given by: c = 1/(u − v′K⁻¹v), b = −(1/(u − v′K⁻¹v)) v′K⁻¹, and Aij = K⁻¹ij + (1/(u − v′K⁻¹v))(v′K⁻¹...

193 | Nonlinear programming
- Mangasarian
- 1994
Citation Context: ...lution by requiring that the gradients of the Lagrangian vanish, and we also have λc(x∗) = 0. This latter condition is one of the important Karush-Kuhn-Tucker conditions of convex optimization theory [15, 4], and can facilitate the search for the solution, as the next exercise shows. For multiple inequality constraints, again at the solution ∇f must lie in the space spanned by the ∇ci, and again if the L...

127 | Mathematical Thought from Ancient to Modern Times
- Kline
- 1972
Citation Context: ...while a perimeter is held fixed - were considered in ancient times, but serious work on them began only towards the end of the seventeenth century, with a minor battle between the Bernoulli brothers [14]. It is a fitting example for us, since the general isoperimetric problem had been discussed for fifty years before Lagrange solved it in his first venture into mathematics [1], and it provides an int...

58 | Remarks to Maurice Frechet’s article: Sur la definition axiomatique d’une classe d’espaces vectoriels distancies applicables vectoriellement sur l’espace de
- SCHOENBERG
- 1935
Citation Context: ...r any dot product matrix A ∈ Sm, Aij ≡ xi · xj, i, j = 1, ..., m, xi ∈ ℝⁿ, then (PᵉAPᵉ)ij = (xi − µ) · (xj − µ), where µ is the mean of the xi. The earliest form of the following theorem is due to Schoenberg [18]. For a proof of this version, see [7]. Theorem 2. Consider the class of symmetric matrices A ∈ Sn such that Aij ≥ 0 and Aii = 0 ∀ i, j. Then Ā ≡ −PᵉAPᵉ is positive semidefinite if and only if A is a ...

36 | Bayesian methods: General background
- Jaynes
- 1986
Citation Context: ...the constrained probability distribution be as large as possible. Maximum entropy provides a principled way to encode our uncertainty in a model, and it is the precursor to modern Bayesian techniques [13]. Since the mean number of bits is just the entropy of the distribution, we wish to find that distribution that maximizes −Σi Pi log Pi + Σj λj(Cj − Σi αji Pi) + µ(Σi Pi − 1) (9). [Footnote 4: The factor log2...
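Setting the derivative of the maximum-entropy Lagrangian in this excerpt to zero gives Pi ∝ exp(−λ·(feature of i)) for a single linear constraint. A minimal sketch (toy feature values and target, all assumed for illustration), finding the multiplier by bisection:

```python
import math

# Maximum-entropy toy example: maximize -sum_i P_i log P_i subject to
# sum_i P_i = 1 and one linear constraint sum_i a_i P_i = C. Stationarity
# gives the exponential form P_i proportional to exp(-lambda * a_i);
# lambda is tuned here by bisection so the constraint holds.
a = [0.0, 1.0, 2.0]   # hypothetical feature values
C = 0.6               # hypothetical target expectation

def constrained_mean(lam):
    w = [math.exp(-lam * ai) for ai in a]
    Z = sum(w)
    return sum(ai * wi for ai, wi in zip(a, w)) / Z

lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if constrained_mean(mid) > C:
        lo = mid   # mean decreases as lambda grows, so move right
    else:
        hi = mid

lam = 0.5 * (lo + hi)
w = [math.exp(-lam * ai) for ai in a]
P = [wi / sum(w) for wi in w]

achieved = sum(ai * pi for ai, pi in zip(a, P))
print(abs(achieved - C) < 1e-9)  # True: constraint satisfied
```

With no linear constraints at all, the same stationarity condition forces every Pi to be equal, recovering the uniform distribution as the entropy maximizer.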

31 | Geometric Methods for Feature Extraction and Dimensionality Reduction (Data Mining and Knowledge Discovery)
- Burges
- 2005
Citation Context: ...rce interpretation, and can be used to find the exact solution to the problem of separating points in a symmetric simplex in arbitrary dimensions [6]. For the remaining algorithms mentioned here, see [7] for details on the underlying mathematics. In showing that the principal PCA directions give minimal reconstruction error, one requires that the projection directions being sought after are orthogona...

14 | Short Account of the History of Mathematics
- Ball
- 1960
Citation Context: ...he Bernoulli brothers [14]. It is a fitting example for us, since the general isoperimetric problem had been discussed for fifty years before Lagrange solved it in his first venture into mathematics [1], and it provides an introduction to functional derivatives, which we'll need. Let's consider a classic isoperimetric problem: to find the plane figure with maximum area, given fixed perimeter. Consid...