Results 1 - 10 of 111
An introduction to kernel-based learning algorithms
IEEE Transactions on Neural Networks, 2001
Cited by 377 (48 self)
This paper provides an introduction to support vector machines (SVMs), kernel Fisher discriminant analysis, and ...
Bayesian measures of model complexity and fit
Journal of the Royal Statistical Society, Series B, 2002
Cited by 138 (2 self)
[Read before The Royal Statistical Society at a meeting organized by the Research ...
An introduction to boosting and leveraging
Advanced Lectures on Machine Learning, LNCS, 2003
On a Kernel-based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion
1997
Cited by 77 (25 self)
We present a Kernel-based framework for Pattern Recognition, Regression Estimation, Function Approximation and multiple Operator Inversion. Previous approaches such as ridge regression, Support Vector methods and regression by Smoothing Kernels are included as special cases. We will show connections between the cost function and some properties up to now believed to apply to Support Vector Machines only. The optimal solution of all the problems described above can be found by solving a simple quadratic programming problem. The paper closes with a proof of the equivalence between Support Vector kernels and Green's functions of regularization operators.
A New Discriminative Kernel from Probabilistic Models
2002
Cited by 62 (5 self)
Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so-called "Fisher kernel" has been combined with discriminative classifiers such as SVM and applied successfully in e.g. DNA and protein analysis. Whereas the Fisher kernel (FK) is calculated from the marginal log-likelihood, we propose the TOP kernel derived from Tangent vectors Of Posterior log-odds. Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing the TOP kernel. In experiments our new discriminative TOP kernel compares favorably to the Fisher kernel.
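As a rough illustration of the Fisher kernel idea mentioned in this abstract, the sketch below uses a toy model that is an assumption of this note and not taken from the paper: a univariate Gaussian with unknown mean `mu` and known variance `sigma2`. The Fisher score is the derivative of the log-likelihood with respect to `mu`, and the kernel is the product of two scores weighted by the inverse Fisher information. The function names are hypothetical, and the TOP kernel itself is not shown.

```python
def fisher_score(x, mu, sigma2):
    """Derivative of log N(x | mu, sigma2) with respect to mu."""
    return (x - mu) / sigma2


def fisher_kernel(x, y, mu=0.0, sigma2=1.0):
    """Fisher kernel for the toy Gaussian model (an illustrative assumption).

    K(x, y) = score(x) * I^{-1} * score(y), where I = 1 / sigma2 is the
    Fisher information of mu for a single observation.
    """
    info = 1.0 / sigma2
    return fisher_score(x, mu, sigma2) * fisher_score(y, mu, sigma2) / info


if __name__ == "__main__":
    # For this model the kernel reduces to (x - mu) * (y - mu) / sigma2.
    print(fisher_kernel(3.0, 5.0, mu=1.0, sigma2=4.0))
```

For richer generative models the score becomes a vector and the Fisher information a matrix, but the construction has the same shape.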
Statistical Inference, Occam’s Razor, and Statistical Mechanics on the Space of Probability Distributions
1997
Cited by 55 (3 self)
The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of low-temperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known results in the literature. In particular, I arrive at a precise understanding of how Occam’s razor, the principle that simpler models should be preferred until the data justify more complex models, is automatically embodied by probability theory. These results require a measure on the space of model parameters and I derive and discuss an interpretation of Jeffreys’ prior distribution as a uniform prior over the distributions indexed by a family. Finally, I derive a theoretical index of the complexity of a parametric family relative to some true distribution that I call the razor of the model. The form of the razor immediately suggests several interesting questions in the theory of learning that can be studied using the techniques of statistical mechanics.
Active Learning in Multilayer Perceptrons
1996
Cited by 48 (0 self)
We propose an active learning method with hidden-unit reduction, which is devised specially for multilayer perceptrons (MLP). First, we review our active learning method, and point out that many Fisher-information-based methods applied to MLP have a critical problem: the information matrix may be singular. To solve this problem, we derive the singularity condition of an information matrix, and propose an active learning technique that is applicable to MLP. Its effectiveness is verified through experiments.
1 INTRODUCTION
When one trains a learning machine using a set of data given by the true system, its ability can be improved if one selects the training data actively. In this paper, we consider the problem of active learning in multilayer perceptrons (MLP). First, we review our method of active learning (Fukumizu et al., 1994), in which we prepare a probability distribution and obtain training data as samples from the distribution. This methodology leads us to an information-matrix...
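The singularity problem named in this abstract can be seen in a very small case. The sketch below is an illustration under assumptions of this note, not the paper's method or its singularity condition: a network with a single hidden unit, f(x) = v * tanh(w * x), and regression with unit-variance Gaussian noise, for which the Fisher information matrix is the average outer product of the output gradients. When the output weight `v` is zero the hidden unit is redundant, the gradient with respect to `w` vanishes, and the matrix loses rank. All function names are hypothetical.

```python
import math


def output_gradient(x, v, w):
    """Gradient of f(x) = v * tanh(w * x) with respect to (v, w)."""
    t = math.tanh(w * x)
    return (t, v * x * (1.0 - t * t))


def information_matrix(inputs, v, w):
    """Average outer product of output gradients over the inputs.

    This equals the Fisher information for regression with unit-variance
    Gaussian noise (an assumption of this sketch).
    """
    a = b = c = 0.0
    for x in inputs:
        gv, gw = output_gradient(x, v, w)
        a += gv * gv
        b += gv * gw
        c += gw * gw
    n = len(inputs)
    return ((a / n, b / n), (b / n, c / n))


def determinant(m):
    """Determinant of a 2x2 matrix given as nested tuples."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


if __name__ == "__main__":
    inputs = [-1.0, 0.5, 2.0]
    # Active hidden unit: the matrix has full rank.
    print(determinant(information_matrix(inputs, v=1.0, w=1.0)))
    # Redundant hidden unit (v = 0): the matrix is singular.
    print(determinant(information_matrix(inputs, v=0.0, w=1.0)))
```

Criteria that invert the information matrix are undefined at the second parameter setting.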
Latent Variable Models for Neural Data Analysis
1999
Cited by 46 (5 self)
The brain is perhaps the most complex system to have ever been subjected to rigorous scientific investigation. The scale is staggering: over 10^11 neurons, each making an average of 10^3 synapses, with computation occurring on scales ranging from a single dendritic spine, to an entire cortical area. Slowly, we are beginning to acquire experimental tools that can gather the massive amounts of data needed to characterize this system. However, to understand and interpret these data will also require substantial strides in inferential and statistical techniques. This dissertation attempts to meet this need, extending and applying the modern tools of latent variable modeling to problems in neural data analysis. It is divided ...
Algebraic analysis for nonidentifiable learning machines
Neural Computation
Cited by 45 (15 self)
This paper clarifies the relation between the learning curve and the algebraic geometrical structure of a nonidentifiable learning machine such as a multilayer neural network whose true parameter set is an analytic set with singular points. By using a concept in algebraic analysis, we rigorously prove that the Bayesian stochastic complexity or the free energy is asymptotically equal to $\lambda_1 \log n - (m_1 - 1) \log \log n + \text{constant}$, where $n$ is the number of training samples and $\lambda_1$ and $m_1$ are the rational number and the natural number which are determined as the birational invariant values of the singularities in the parameter space. Also we show an algorithm to calculate $\lambda_1$ and $m_1$ based on the resolution of singularities in algebraic geometry. In regular statistical models, $2\lambda_1$ is equal to the number of parameters and $m_1 = 1$, whereas in nonregular models such as multilayer networks, $2\lambda_1$ is not larger than the number of parameters and $m_1 \geq 1$. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, the nonidentifiable learning machines are better models than the regular ones if the Bayesian ensemble learning is applied.
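To make the asymptotic expression in this abstract concrete, the sketch below evaluates its leading terms for a regular model and for a singular one. The function name, the parameter count `d`, and the singular values `lam=2.0`, `m=2` are hypothetical choices for illustration and are not results from the paper. Only the regular case (lam equal to half the number of parameters, m equal to 1) follows directly from the abstract.

```python
import math


def asymptotic_free_energy(n, lam, m, constant=0.0):
    """Leading terms lam * log(n) - (m - 1) * log(log(n)) + constant."""
    if n <= 1:
        raise ValueError("n must exceed 1 so that log(log(n)) is defined")
    return lam * math.log(n) - (m - 1) * math.log(math.log(n)) + constant


if __name__ == "__main__":
    d = 10    # hypothetical number of parameters
    n = 1000  # hypothetical number of training samples
    # Regular model: 2 * lam equals the number of parameters and m = 1.
    regular = asymptotic_free_energy(n, lam=d / 2, m=1)
    # Singular model: 2 * lam is at most d and m >= 1 (values assumed here).
    singular = asymptotic_free_energy(n, lam=2.0, m=2)
    print(regular, singular)
```

With these assumed values the singular model has the smaller stochastic complexity, which matches the direction of the comparison drawn in the abstract.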