Results 1-10 of 1,620
Generalized Additive Models
, 1990
Cited by 1368 (34 self)
Likelihood-based regression models, such as the normal linear regression model and the linear logistic model, assume a linear (or some other parametric) form for the covariate effects. We introduce the Local Scoring procedure, which replaces the linear form $\sum_j X_j \beta_j$ by a sum of smooth functions $\sum_j s_j(X_j)$. The $s_j(\cdot)$'s are unspecified functions that are estimated using scatterplot smoothers. The technique is applicable to any likelihood-based regression model: the class of Generalized Linear Models contains many of these. In this class, the Local Scoring procedure replaces the linear predictor $\eta = \sum_j X_j \beta_j$ by the additive predictor $\sum_j s_j(X_j)$; hence the name Generalized Additive Models. Local Scoring can also be applied to nonstandard models like Cox’s proportional hazards model for survival data. In a number of real data examples, the Local Scoring procedure proves to be useful in uncovering nonlinear covariate effects. It has the advantage of being completely automatic, i.e. no “detective work” is needed on the part of the statistician. In a further generalization, the technique is modified to estimate the form of the link function for generalized linear models. The Local Scoring procedure is shown to be asymptotically equivalent to Local Likelihood estimation, another technique for estimating smooth covariate functions. They are seen to produce very similar results with real data, with Local Scoring being considerably faster. As a theoretical underpinning, we view Local Scoring and Local Likelihood as empirical maximizers of the expected log-likelihood, and this makes clear their connection to standard maximum likelihood estimation. A method for estimating the “degrees of freedom” of the procedures is also given.
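The idea of estimating each smooth function with a scatterplot smoother can be illustrated for the simplest (Gaussian) case by backfitting: each function is refit in turn to the partial residuals left by the others. The sketch below is an illustration only, not the Local Scoring procedure itself, which would wrap a loop like this inside iteratively reweighted fitting for non-Gaussian likelihoods. The names `running_mean_smoother` and `backfit`, and the choice of a k-nearest-neighbour running mean as the smoother, are assumptions made for the example.

```python
def running_mean_smoother(x, r, k=5):
    """Smooth r against x: average r over the k nearest x-neighbours of each point."""
    n = len(x)
    fitted = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: abs(x[j] - x[i]))[:k]
        fitted.append(sum(r[j] for j in nearest) / k)
    return fitted


def backfit(X, y, n_iter=20, k=5):
    """Fit y ~ alpha + sum_j s_j(X[j]) for a Gaussian response.

    X is a list of covariate columns. Returns the intercept and, for each
    covariate, the fitted function values at the observed points.
    """
    n = len(y)
    alpha = sum(y) / n
    s = [[0.0] * n for _ in X]
    for _ in range(n_iter):
        for j, xj in enumerate(X):
            # Partial residuals: remove the intercept and every other function.
            partial = [
                y[i] - alpha - sum(s[m][i] for m in range(len(X)) if m != j)
                for i in range(n)
            ]
            sj = running_mean_smoother(xj, partial, k)
            mean_sj = sum(sj) / n
            s[j] = [v - mean_sj for v in sj]  # centre each function for identifiability
    return alpha, s
```

The fitted value for observation `i` is `alpha + sum(s[j][i] for j in range(len(X)))`; any other scatterplot smoother could be substituted for the running mean.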
Additive Logistic Regression: a Statistical View of Boosting
 Annals of Statistics
, 1998
Cited by 1250 (21 self)
Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most...
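The reweight-and-vote scheme described in the second sentence can be sketched as discrete AdaBoost with one-dimensional threshold classifiers ("stumps"). This is a minimal illustration of that scheme under stated assumptions, not the likelihood-based variants the abstract goes on to develop. The function names and the use of stumps as the base classifier are choices made for the example; labels are assumed to be +1 or -1.

```python
import math


def stump_predict(xi, t, sign):
    """Threshold classifier: returns sign if xi > t, otherwise -sign."""
    return sign if xi > t else -sign


def fit_stump(x, y, w):
    """Return (weighted error, threshold, sign) of the best stump under weights w."""
    xs = sorted(set(x))
    # A threshold below every point yields a constant classifier.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w)
                      if stump_predict(xi, t, sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost(x, y, rounds=10):
    """Fit `rounds` stumps sequentially on reweighted data."""
    n = len(x)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, sign = fit_stump(x, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Misclassified points gain weight, correctly classified points lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, t, sign))
             for xi, yi, wi in zip(x, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, xi):
    """Weighted majority vote of the fitted stumps."""
    score = sum(alpha * stump_predict(xi, t, sign) for alpha, t, sign in ensemble)
    return 1 if score > 0 else -1
```

On data where the positive class occupies an interval, no single stump separates the classes, but a weighted vote of a few stumps can.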
Hierarchical mixtures of experts and the EM algorithm
 Neural Computation
, 1994
Cited by 737 (19 self)
We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIM’s). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an online learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
Dynamic Bayesian Networks: Representation, Inference and Learning
, 2002
Cited by 579 (3 self)
Modelling sequential data is important in many areas of science and engineering. Hidden Markov models (HMMs) and Kalman filter models (KFMs) are popular for this because they are simple and flexible. For example, HMMs have been used for speech recognition and biosequence analysis, and KFMs have been used for problems ranging from tracking planes and missiles to predicting the economy. However, HMMs and KFMs are limited in their “expressive power”. Dynamic Bayesian Networks (DBNs) generalize HMMs by allowing the state space to be represented in factored form, instead of as a single discrete random variable. DBNs generalize KFMs by allowing arbitrary probability distributions, not just (unimodal) linear-Gaussian. In this thesis, I will discuss how to represent many different kinds of models as DBNs, how to perform exact and approximate inference in DBNs, and how to learn DBN models from sequential data.
In particular, the main novel technical contributions of this thesis are as follows: a way of representing Hierarchical HMMs as DBNs, which enables inference to be done in O(T) time instead of O(T^3), where T is the length of the sequence; an exact smoothing algorithm that takes O(log T) space instead of O(T); a simple way of using the junction tree algorithm for online inference in DBNs; new complexity bounds on exact online inference in DBNs; a new deterministic approximate inference algorithm called factored frontier; an analysis of the relationship between the BK algorithm and loopy belief propagation; a way of applying Rao-Blackwellised particle filtering to DBNs in general, and the SLAM (simultaneous localization and mapping) problem in particular; a way of extending the structural EM algorithm to DBNs; and a variety of different applications of DBNs. However, perhaps the main value of the thesis is its catholic presentation of the field of sequential data modelling.
The Infinite Hidden Markov Model
 Machine Learning
, 2002
Cited by 498 (32 self)
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite; consider, for example, symbols being possible words appearing in English text.
Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties
, 2001
Cited by 354 (25 self)
Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.
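The abstract does not name a specific penalty. One penalty commonly associated with this line of work is the smoothly clipped absolute deviation (SCAD) penalty, and the sketch below evaluates it as an illustration of a function that is symmetric, has a kink at the origin, and is bounded by a constant. The function name `scad_penalty` and the default `a = 3.7` (a commonly used value) are assumptions of the example.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty at coefficient value theta, with lam > 0 and a > 2.

    Linear (lasso-like) near zero, quadratic in a transition zone,
    and constant for large |theta|.
    """
    t = abs(theta)  # the penalty is symmetric in theta
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2  # upper bound of the penalty
```

Because the penalty is flat beyond `a * lam`, large coefficients are not shrunk further, which is the bias-reduction property the abstract refers to.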
Predicting the Semantic Orientation of Adjectives
, 1997
Cited by 317 (5 self)
We identify and validate from a large corpus constraints from conjunctions on the positive or negative semantic orientation of the conjoined adjectives. A log-linear regression model uses these constraints to predict whether conjoined adjectives are of same or different orientations, achieving 82% accuracy in this task when each conjunction is considered independently.
Operations for Learning with Graphical Models
 Journal of Artificial Intelligence Research
, 1994
Cited by 252 (12 self)
This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Well-known examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models are extended to model data analysis and empirical learning using the notation of plates. Graphical operations for simplifying and manipulating a problem are provided including decomposition, differentiation, and the manipulation of probability models from the exponential family. Two standard algorithm schemas for learning are reviewed in a graphical framework: Gibbs sampling and the expectation maximization algorithm. Using these operations and schemas, some popular algorithms can be synthesized from their graphical specification. This includes versions of linear regression, techniques for feedforward networks, and learning Gaussian and discrete Bayesian networks from data. The paper conclu...
Heterogeneous uncertainty sampling for supervised learning
 In Proceedings of the 11th International Conference on Machine Learning (ICML)
, 1994
Cited by 240 (3 self)
Uncertainty sampling methods iteratively request class labels for training instances whose classes are uncertain despite the previous labeled instances. These methods can greatly reduce the number of instances that an expert need label. One problem with this approach is that the classifier best suited for an application may be too expensive to train or use during the selection of instances. We test the use of one classifier (a highly efficient probabilistic one) to select examples for training another (the C4.5 rule induction program). Despite being chosen by this heterogeneous approach, the uncertainty samples yielded classifiers with lower error rates than random samples ten times larger.
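The selection step can be sketched for a two-class problem as follows: a cheap probabilistic classifier scores the unlabeled pool, and the instances whose predicted probability is closest to 0.5 are sent for labeling. This is a minimal sketch under stated assumptions; `most_uncertain` and `toy_prob` are hypothetical names, and the logistic scorer merely stands in for whatever probabilistic classifier is used.

```python
import math


def most_uncertain(pool, prob_positive, batch_size=1):
    """Return the batch_size pool items whose predicted P(class = 1) is nearest 0.5."""
    ranked = sorted(pool, key=lambda item: abs(prob_positive(item) - 0.5))
    return ranked[:batch_size]


def toy_prob(x):
    """Stand-in probabilistic classifier: a logistic curve in one dimension."""
    return 1.0 / (1.0 + math.exp(-x))


# Pick the two instances the cheap classifier is least sure about.
queries = most_uncertain([-3.0, -1.0, 0.2, 2.0, 4.0], toy_prob, batch_size=2)
```

In the heterogeneous setting the abstract describes, the labeled `queries` would then be added to the training set of a different, more expensive learner.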
Bayesian reconstructions from emission tomography data using a modified EM algorithm
 IEEE Trans. Med. Imag
, 1990
Cited by 194 (3 self)
A new method of reconstruction from SPECT data is proposed, which builds on the EM approach to maximum likelihood reconstruction from emission tomography data, but aims instead at maximum posterior probability estimation, that takes account of prior belief about “smoothness” in the isotope concentration. A novel modification to the EM algorithm yields a practical method. The method is illustrated by an application to data from brain scans.
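For orientation, the EM approach to maximum likelihood reconstruction that this method builds on can be sketched as the standard multiplicative update for Poisson count data. The sketch does not include the paper's modification for the smoothness prior. The function name `mlem`, the dense list-of-lists system matrix `A`, and the uniform starting image are assumptions made for the example.

```python
def mlem(A, y, n_iter=50):
    """Maximum likelihood EM reconstruction for emission data.

    A[i][j]: probability that an emission from pixel j is recorded in detector bin i.
    y[i]:    observed count in bin i.
    Assumes every pixel is seen by at least one bin (positive column sums).
    """
    n_bins, n_pix = len(A), len(A[0])
    sens = [sum(A[i][j] for i in range(n_bins)) for j in range(n_pix)]
    lam = [1.0] * n_pix  # uniform positive starting image
    for _ in range(n_iter):
        # Forward-project the current image to expected bin counts.
        proj = [sum(A[i][j] * lam[j] for j in range(n_pix)) for i in range(n_bins)]
        ratio = [y[i] / proj[i] if proj[i] > 0 else 0.0 for i in range(n_bins)]
        # Back-project the observed/expected ratios and rescale each pixel.
        lam = [lam[j] / sens[j] * sum(A[i][j] * ratio[i] for i in range(n_bins))
               for j in range(n_pix)]
    return lam
```

Each iteration keeps the image non-negative and preserves the total expected count, which is part of what makes this update a convenient base for modified schemes.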