## Regularized Adaptation: Theory, Algorithms and Applications (2007)

Citations: 10 (0 self)

### BibTeX

```bibtex
@techreport{Li07regularizedadaptation,
  author      = {Xiao Li},
  title       = {Regularized Adaptation: Theory, Algorithms and Applications},
  institution = {},
  year        = {2007}
}
```


### Citations

8973 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context ... to learn an f that minimizes the expected risk, i.e., R_{p(x,y)}(f) = E_{(x,y)∼p(x,y)}[Q(f(x), y)], under some loss function Q(·), which will be formally discussed in Chapter 2. In transductive learning [25], we are further given a set of unlabeled inputs {x_i}_{i=m+1}^{m+n}, and we only care about predicting as accurately as possible the labels {y_i}_{i=m+1}^{m+n} of these target inputs. Since a transduction alg... |
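The expected risk quoted in this snippet is estimated in practice by an empirical average of the loss over a finite sample; a minimal sketch of that idea (all names and data here are illustrative, not from the thesis):

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """Approximate R(f) = E[Q(f(x), y)] by averaging the loss over samples."""
    return np.mean([loss(f(x), yi) for x, yi in zip(X, y)])

# 0-1 loss: 1 when the sign of the prediction disagrees with the label
zero_one = lambda fx, yi: float(np.sign(fx) != yi)

f = lambda x: x[0] - x[1]  # toy linear decision function
X = np.array([[2.0, 1.0], [0.0, 1.0], [3.0, 0.5]])
y = np.array([1, -1, 1])
print(empirical_risk(f, X, y, zero_one))  # 0.0: all three points classified correctly
```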

8564 | Elements of Information Theory
- Cover, Thomas
- 2003
Citation Context ...e rest of the work. Information theory was originally presented by Shannon in his paper “A Mathematical Theory of Communication”, though the discussion of this section is mainly based on Cover’s work [86]. Regarding the notation used in this section, we again use p(x) as a short-cut for p_X(X = x), and we in general use scalar representations (e.g. x), but bear in mind that all concepts and theorems ar... |

8086 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
Citation Context ...ed set of parameters can generate an observation sequence of arbitrary length. Moreover, generative models have principled ways to treat latent variables, typically using Expectation-Maximization (EM) [65, 66]. Despite these advantages, this approach is in general sub-optimal for classification tasks, as it intends to solve a more difficult density estimation problem rather than to optimize directly for cl... |

7048 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...π_m^d = Σ_{(i,j)∈S_m} #(i, j) / Σ_{m=1}^{M} Σ_{(i,j)∈S_m} #(i, j), (7.3) where #(i, j) is the number of occurrences of event {Q_{t−1} = i, Q_t = j} in the training data. The purpose of this dummy node is to provide soft evidence [187, 188] for D_t, and this evidence is encoded using the histogram of the M pitch transition patterns. Note that for the purposes of inference and decoding, the results should be identical with a π_m^d multipl... |
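The Eq. (7.3)-style soft evidence is simply a histogram of transition counts normalized across the M patterns; a small illustrative sketch (the counts and their grouping into patterns are made up):

```python
import numpy as np

def soft_evidence(counts_per_pattern):
    """pi_m = (counts #(i, j) within pattern S_m) / (total counts over all patterns),
    i.e. an Eq. (7.3)-style normalized histogram."""
    totals = np.array([c.sum() for c in counts_per_pattern], dtype=float)
    return totals / totals.sum()

# hypothetical counts #(i, j), grouped into M = 3 transition patterns S_m
counts = [np.array([4, 1]), np.array([2, 3]), np.array([0, 0])]
print(soft_evidence(counts))  # [0.5 0.5 0. ]
```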

4823 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context ...input space to a feature space in which linear classification takes place. Given this mapping, the objective of SVM training coincides with that of training the last layer of an MLP with weight decay [147], both of which can be expressed as min_{w,b} (λ/2)‖w‖² + (1/m) Σ_{i=1}^{m} Q(f(x_i), y_i) (5.10) where ‖w‖² is the squared ℓ2 norm, Q(·) is a loss function, and λ is an accuracy-regularization tradeoff coefficient.... |
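The shared objective in Eq. (5.10), a regularizer plus an average loss, can be evaluated directly; a sketch using the hinge loss as Q(·) (the toy data and helper names are illustrative, not from the thesis):

```python
import numpy as np

def hinge(fx, y):
    """Hinge loss Q(f(x), y) = max(0, 1 - y*f(x))."""
    return max(0.0, 1.0 - y * fx)

def svm_objective(w, b, X, y, lam):
    """Eq. (5.10)-style objective: (lam/2)*||w||^2 + (1/m) * sum_i Q(f(x_i), y_i)."""
    margins = X @ w + b                      # f(x_i) = w.x_i + b
    data_term = np.mean([hinge(fx, yi) for fx, yi in zip(margins, y)])
    return 0.5 * lam * np.dot(w, w) + data_term

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, -1])
w, b = np.array([1.0, 0.0]), 0.0
print(svm_objective(w, b, X, y, lam=0.1))  # 0.05: both margins are exactly 1, so the hinge terms vanish
```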

2719 | Indexing by latent semantic analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context ...ing or data compression [5]. For example, density estimation [6], clustering [7, 8], principal component (or surface) analysis [8], independent component analysis [9] and latent semantic analysis [10, 11] are all important forms of unsupervised learning. Notice that in some cases there exists another label variable k_i, parallel to y_i, that provides information on x_i at a different level. For example, ... |

2309 | Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- Lafferty, McCallum, et al.
- 2001
Citation Context ...er weight decay or early stopping can be viewed as max-margin training, and [153] showed that the use of relative entropy as a loss function also has a max-margin interpretation. As a side note, CRFs [154] and MaxEnt models [155, 156] have the same decision function form as Equation (5.9), and the same general training objective as Equation (5.10). In both cases, the choice of φ(x) is more flexible than... |

2308 | A decision-theoretic generalization of online learning and an application to boosting
- Freund, Schapire
- 1997
Citation Context ...edundant given the SVs from the unadapted model. In Section 7.1.4, a classifier trained in this fashion will be referred to as a “boosted” classifier since it strongly relates to the idea of boosting [81, 82]. In contrast, [133] took an exactly opposite strategy, which discarded the mis-classified adaptation samples while keeping the correctly-classified ones. The justification for this approach, however,... |

2281 | A tutorial on support vector machines for pattern recognition - Burges - 1998 |

2162 | Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability
- Silverman
- 1986
Citation Context ...data. Unsupervised learning, in fact, often aims at building representations of the input that can be used for prediction, decision making or data compression [5]. For example, density estimation [6], clustering [7, 8], principal component (or surface) analysis [8], independent component analysis [9] and latent semantic analysis [10, 11] are all important forms of unsupervised learning. Notice th... |

2028 | Online learning with kernels - Kivinen, Smola, et al. - 2004 |

1695 | A theory of the learnable - Valiant - 1984 |

1328 | Finding Groups in Data: An Introduction to Cluster Analysis
- Kaufman, Rousseeuw
- 1990
Citation Context ...ed learning, in fact, often aims at building representations of the input that can be used for prediction, decision making or data compression [5]. For example, density estimation [6], clustering [7, 8], principal component (or surface) analysis [8], independent component analysis [9] and latent semantic analysis [10, 11] are all important forms of unsupervised learning. Notice that in some cases th... |

1244 | Combining labeled and unlabeled data with co-training
- Blum, Mitchell
- 1998
Citation Context ...it iteratively trains a seed classifier using the labeled data (sometimes with regularization [17]), and uses high-confidence predictions on the unlabeled data to expand the training set. Co-training [18] assumes that the input features can be split into two conditionally independent subsets, and that each subset is sufficient for classification. Under these assumptions, the algorithm trains two separ... |
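Self-training as described in this snippet (fit a seed classifier, absorb high-confidence predictions on unlabeled data, refit) can be sketched with any base learner; the nearest-centroid learner and confidence rule below are stand-ins chosen purely for illustration:

```python
import numpy as np

def centroid_fit(X, y):
    """Stand-in base learner: one centroid per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def centroid_predict(model, X):
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)], d

def self_train(X_lab, y_lab, X_unlab, margin=1.0, rounds=5):
    """Self-training: absorb unlabeled points whose prediction is confident
    (large gap between nearest and second-nearest centroid), then refit."""
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        model = centroid_fit(X_lab, y_lab)
        pred, d = centroid_predict(model, X_unlab)
        dsort = np.sort(d, axis=1)
        confident = (dsort[:, 1] - dsort[:, 0]) >= margin
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pred[confident]])
        X_unlab = X_unlab[~confident]
    return centroid_fit(X_lab, y_lab)

X_lab = np.array([[0.0, 0.0], [4.0, 0.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[0.5, 0.0], [3.5, 0.0]])
classes, centroids = self_train(X_lab, y_lab, X_unlab)
print(centroids)  # each centroid is pulled toward the absorbed unlabeled point
```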

1122 | Statistical Analysis with Missing Data
- Little, Rubin
- 1987
Citation Context ...ed set of parameters can generate an observation sequence of arbitrary length. Moreover, generative models have principled ways to treat latent variables, typically using Expectation-Maximization (EM) [65, 66]. Despite these advantages, this approach is in general sub-optimal for classification tasks, as it intends to solve a more difficult density estimation problem rather than to optimize directly for cl... |

1083 | A maximum entropy approach to natural language processing
- Berger, Pietra, et al.
- 1996
Citation Context ... stopping can be viewed as max-margin training, and [153] showed that the use of relative entropy as a loss function also has a max-margin interpretation. As a side note, CRFs [154] and MaxEnt models [155, 156] have the same decision function form as Equation (5.9), and the same general training objective as Equation (5.10). In both cases, the choice of φ(x) is more flexible than that in SVMs or MLPs; it may... |

865 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2009
Citation Context ...ed learning, in fact, often aims at building representations of the input that can be used for prediction, decision making or data compression [5]. For example, density estimation [6], clustering [7, 8], principal component (or surface) analysis [8], independent component analysis [9] and latent semantic analysis [10, 11] are all important forms of unsupervised learning. Notice that in some cases th... |

804 | Text classification from labeled and unlabeled documents using EM - Nigam, McCallum, et al. - 2000 |

721 | Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, The Annals of Statistics
- Bartlett, Freund, et al.
Citation Context ...59]. Indeed, many machine learning algorithms adopt such an approach. For example, hinge loss used in SVM [25, 80], logistic loss used in logistic regression [8] and exponential loss used in boosting [81, 82] are all in the form of Q̃(yf(x)), where their respective Q̃(·) are convex surrogates of the indicator function as shown in Figure 2.1. Next we introduce several such examples. First, the hinge loss ... |
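The three surrogate losses named in this snippet are all functions of the margin z = y·f(x); a quick numerical comparison (the sample margins are illustrative):

```python
import numpy as np

# Convex surrogates of the 0-1 loss, each a function of the margin z = y * f(x)
hinge       = lambda z: np.maximum(0.0, 1.0 - z)   # used in SVMs
logistic    = lambda z: np.log1p(np.exp(-z))       # used in logistic regression
exponential = lambda z: np.exp(-z)                 # used in boosting

z = np.array([-2.0, 0.0, 2.0])                     # wrong, on the boundary, confidently right
print("hinge:      ", hinge(z))                    # [3. 1. 0.]
print("logistic:   ", np.round(logistic(z), 3))    # [2.127 0.693 0.127]
print("exponential:", np.round(exponential(z), 3)) # [7.389 1.    0.135]
```

All three decrease with the margin and are convex, which is what makes them tractable stand-ins for the non-convex 0-1 loss.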

701 | Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods
- Platt
- 1999
Citation Context ... probabilities, their outputs can be given probabilistic interpretations. For example, there have been approaches to fit SVM outputs to a probability function (e.g. sigmoid) to enable post-processing [141]. Here we assume that p(y|x, f) exists in all cases. Analogous to our discussion on generative classifiers, if we use the conditional likelihood loss Q(·) = − ln p(y|x, f), the unadapted model is then... |

682 | Transductive inference for text classification using support vector machines - Joachims - 1999 |

611 | Learning in Graphical Models
- Jordan, editor
- 1999
Citation Context ...e slightly abuse notation to let f denote a generative model or its parameters, instead of a decision function. Well-known examples of this approach include Bayesian networks and Markov random fields [64], while the parametric form of p(x, y|f) varies depending on specific models. A Bayesian network is a directed acyclic graph with vertices representing variables and edges representing dependence rela... |

592 | Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language
- Leggetter, Woodland
- 1995
Citation Context ...as become one of the crucial techniques that any state-of-the-art ASR system cannot do without. Maximum likelihood linear regression (MLLR) is one popular framework for adapting Gaussian mixture HMMs [101, 102], where clusters of model parameters are transformed through shared affine functions. These transformations shift the means and alter the covariance matrices of the Gaussians so that each HMM state is... |

553 | Inducing features of random fields
- Pietra, Pietra, et al.
- 1997
Citation Context ... stopping can be viewed as max-margin training, and [153] showed that the use of relative entropy as a loss function also has a max-margin interpretation. As a side note, CRFs [154] and MaxEnt models [155, 156] have the same decision function form as Equation (5.9), and the same general training objective as Equation (5.10). In both cases, the choice of φ(x) is more flexible than that in SVMs or MLPs; it may... |

531 | Probabilistic latent semantic analysis
- Hofmann
- 1999
Citation Context ...ing or data compression [5]. For example, density estimation [6], clustering [7, 8], principal component (or surface) analysis [8], independent component analysis [9] and latent semantic analysis [10, 11] are all important forms of unsupervised learning. Notice that in some cases there exists another label variable k_i, parallel to y_i, that provides information on x_i at a different level. For example, ... |

490 | Semisupervised learning using gaussian fields and harmonic functions - Zhu, Ghahramani, et al. - 2003 |

490 | Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains
- Gauvain, Lee
- 1994
Citation Context ...y p(f|x_{1:m}) is equivalent to max_f { ln p(x_{1:m}|f) + ln p(f) }, where p(x_{1:m}|f) is the incomplete likelihood and p(f) is a prior distribution of f. There are three key problems regarding MAP estimation [106]: (1) how to define the functional form of the prior; (2) how to estimate the hyper-parameters of the prior; and (3) how to estimate model parameters given (1) and (2). The first problem is typically ... |
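For a concrete instance of the MAP objective max_f [ln p(x_{1:m}|f) + ln p(f)]: with a Gaussian likelihood and a Gaussian prior on the mean, the maximizer has a closed form. A sketch (this conjugate-Gaussian setup is an illustrative choice, not taken from the report):

```python
import numpy as np

def map_gaussian_mean(x, prior_mu, prior_var, noise_var=1.0):
    """Closed-form maximizer of ln p(x_{1:m} | mu) + ln p(mu) when the
    likelihood is N(mu, noise_var) and the prior on mu is N(prior_mu, prior_var)."""
    m = len(x)
    precision = m / noise_var + 1.0 / prior_var
    return (x.sum() / noise_var + prior_mu / prior_var) / precision

x = np.array([2.0, 2.0, 2.0, 2.0])  # sample mean is 2.0
print(map_gaussian_mean(x, prior_mu=0.0, prior_var=1.0))  # 1.6: shrunk toward the prior mean
```

With more data the likelihood term dominates and the estimate approaches the maximum-likelihood answer, which is exactly the regularizing behavior MAP adaptation relies on.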

471 | A gentle tutorial of the EM algorithm and its application to parameter estimation for gaussian mixture and hidden markov models
- Bilmes
- 1998
Citation Context ...− (1/m) Σ_{i=1}^{m} ln Σ_k p(x_i, y_i, k_i = k|f) (5.1) which cannot be optimized directly. We can, however, iteratively minimize an upper bound of this using the expectation-maximization (EM) algorithm [144]. Specifically, we let δ_y(i) ≜ I(y_i = y) denote an indicator function which equals one only when x_i belongs to class y, and let L_{k|y}(i) ≜ p(k_i = k|x_i, y_i = y, f^g) denote the component occupancy... |
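The EM procedure this snippet refers to alternates computing component occupancies with re-estimating parameters; a minimal 1-D Gaussian-mixture version (toy data and helper names are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def e_step(x, weights, mus, vars_):
    """Occupancy probabilities p(k_i = k | x_i, f) for a 1-D mixture."""
    joint = weights[None, :] * gaussian_pdf(x[:, None], mus[None, :], vars_[None, :])
    return joint / joint.sum(axis=1, keepdims=True)

def m_step(x, resp):
    """Re-estimate mixture weights, means, and variances from the occupancies."""
    nk = resp.sum(axis=0)
    weights = nk / len(x)
    mus = (resp * x[:, None]).sum(axis=0) / nk
    vars_ = (resp * (x[:, None] - mus[None, :]) ** 2).sum(axis=0) / nk
    return weights, mus, vars_

# toy run on two well-separated clusters
x = np.array([-2.0, -1.8, -2.2, 2.0, 1.8, 2.2])
weights, mus, vars_ = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):
    weights, mus, vars_ = m_step(x, e_step(x, weights, mus, vars_))
print(np.round(mus, 2))  # means settle near [-2.  2.]
```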

467 | Multitask learning
- Caruana
- 1997
Citation Context ...spaces. For example, Fθ can be represented by a space of multi-layer perceptrons where the first two layers, parameterized by θ, are shared by all tasks, while the remaining layers are task-dependent [28, 33]. Another form of relatedness is given by [34] and [32], where the decision function f^k is assumed to be a linear combination of a task-independent and a task-dependent component. Furthermore, [31] d... |

458 | Connectionist Speech Recognition. A Hybrid Approach - BOURLARD, MORGAN - 1993 |

450 | Semi-supervised learning literature survey, 2006. [Online]. Available: http://pages.cs.wisc.edu/ jerryzhu/pub/ssl survey
- Zhu
Citation Context ... but labeling is expensive and time-consuming. The basic idea of semi-supervised learning is to use the input distribution learned from the unlabeled data to influence the supervised learning problem [12]. In the probabilistic framework, semi-supervised learning can be treated as a missing data problem, which can be addressed by generative models using the EM algorithm and extensions thereof [13–15]. ... |

434 | Learning with local and global consistency
- Zhou, Bousquet, et al.
Citation Context ...the conditional relation p(y|x) remains the same. In several learning paradigms, this type of difference has been partially accounted for by explicitly taking into account the test input distribution [21, 136]. A learning setting that has not received as much theoretical attention is that of “adaptive learning”, which studies a more general case where both p(x) and p(y|x) at test time vary from their train... |

397 | Exploiting generative models in discriminative classifiers
- Jaakkola, Haussler
- 1998
Citation Context ...as discriminative training and structural discriminability that we have mentioned, there are approaches that exploit generative models in discriminative classifiers, e.g. the use of the Fisher kernel [75]. 2.2 Loss Functions for Parameter Estimation As mentioned before, a risk function using the 0-1 loss corresponds to classification error rate, which is what we typically use in evaluating a classifie... |

334 | Selective sampling using the query by committee algorithm
- Freund, Seung, et al.
- 1997
Citation Context .... Additionally, active learning is a similar setting as semi-supervised learning, but it allows an intelligent choice of which samples to label. For example, query learning [23] or selective sampling [24] generates or selects the most informative inputs for the human expert to label in the hope of improving classification performance with a minimal number of queries. 1.1.2 Inductive and transductive... |

320 | A framework for learning predictive structures from multiple tasks and unlabeled data
- Ando, Zhang
- 1999
Citation Context ...lti-layer perceptrons where the first two layers, parameterized by θ, are shared by all tasks, while the remaining layers are task-dependent [28, 33]. Another form of relatedness is given by [34] and [32], where the decision function f^k is assumed to be a linear combination of a task-independent and a task-dependent component. Furthermore, [31] defines relatedness between tasks on the basis of simila... |

290 | Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Microsoft Research
- Platt
Citation Context ... in this case. Regarding implementation, our optimization algorithm is developed based on SVMTorch [160], which uses a slightly modified version of the sequential minimal optimization (SMO) algorithm [161]. SMO is a conceptually simple algorithm with better scaling properties than a standard chunking algorithm. In the inner loop of the quadratic programming, SMO empirically chooses two α_i’s at a time a... |

275 | Classification by pairwise coupling
- Hastie, Tibshirani
- 1996
Citation Context ...class. We adopt this scheme in our classification experiments since it is easy to implement and empirically works well [63]. But there are alternative approaches such as building pairwise classifiers [150] or using a training objective designed explicitly for multiclass classification [71, 151]. Secondly, we inspect these aspects in a two-layer MLP. For consistency with a binary SVM, we study a binary ... |

269 | Discrete-Time Processing of Speech Signals
- Deller, Proakis, et al.
- 1993
Citation Context ...ity detection (VAD), and to extract low-level features such as average energy, number of zero-crossings, normalized autocovariance coefficients (NACCs) and mel frequency cepstral coefficients (MFCCs) [171, 172], all of which are ... [figure: processing pipeline from acoustic and mouse-movement signals through VAD and feature extraction to motion parameters] |

268 | Learning from labeled and unlabeled data using graph mincuts
- Blum, Chawla
Citation Context ...les are used to enlarge the training set of the other. This approach often improves over self-training, as compared in [19]. Another school of methods for semi-supervised learning is based on graphs [20], where nodes represent labeled and unlabeled samples, and edges reflect the similarity between the samples. Given such a graph, we desire to find a decision function that satisfies the constraints im... |

252 | SVMTorch: Support Vector Machines for Large-Scale Regression Problems
- Collobert, Bengio
- 2001
Citation Context ... w and b are necessary to classification — and the “extended regularized” algorithm is not applicable in this case. Regarding implementation, our optimization algorithm is developed based on SVMTorch [160], which uses a slightly modified version of the sequential minimal optimization (SMO) algorithm [161]. SMO is a conceptually simple algorithm with better scaling properties than a standard chunking al... |

222 | Generalized plan recognition
- Kautz, Allen
- 1986
Citation Context ...o move the mouse diagonally. 7.2.3 An online adaptive filter There has been research on plan recognition which aims to infer the plans of an intelligent agent from observations of the agent’s actions [192]. A recent trend in approaching the plan recognition problem is to first construct a dynamic Bayesian network (DBN) for plan execution and then to apply inference on this model [193, 194]. To model th... |

217 | Multi-class support vector machines
- Weston, Watkins
- 1998
Citation Context ...ement and empirically works well [63]. But there are alternative approaches such as building pairwise classifiers [150] or using a training objective designed explicitly for multiclass classification [71, 151]. Secondly, we inspect these aspects in a two-layer MLP. For consistency with a binary SVM, we study a binary MLP which is depicted in Figure 5.1. In this example, we have D input units represented by... |

191 | Analyzing the effectiveness and applicability of co-training
- Nigam, Ghani
- 2000
Citation Context ... two subsets of features, and each classifier’s predictions on new unlabeled samples are used to enlarge the training set of the other. This approach often improves over self-training, as compared in [19]. Another school of methods for semi-supervised learning is based on graphs [20], where nodes represent labeled and unlabeled samples, and edges reflect the similarity between the samples. Given such... |

186 | Handwritten digit recognition with a back-propagation network
- LeCun, Boser, et al.
- 1992
Citation Context ...orks, such as the use of radial basis functions [147]. Moreover, the parameters in φ(·), namely the input-to-hidden layer weights, can be optimized systematically using the back-propagation algorithm [147, 152]. In training an MLP, we minimize the conditional likelihood loss (or equivalently the logistic loss) Q(f(x_i), y_i) = − ln p(y_i|x_i, f) which is equivalent to minimizing the relative entropy between the true... |

181 | Introduction to Stochastic Search and Optimization
- Spall
- 2003
Citation Context ...e negative logarithm of the joint likelihood, where we omit the adjectives for simplicity [77]. Usually there is no globally optimal solution to this objective; stochastic gradient descent [78], or, in some cases, the extended Baum-Welch algorithm [79] can be utilized to find a local optimum. Although MMIE demonstrates significant performance advantages over the traditional MLE approach [6... |
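Stochastic gradient descent as invoked in this snippet updates parameters from one randomly drawn sample at a time; a minimal sketch on a toy least-squares objective (the names, data, and decay schedule are illustrative):

```python
import numpy as np

def sgd(grad, w0, data, lr=0.1, epochs=50, seed=0):
    """Plain stochastic gradient descent: one randomly ordered sample per update."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            w -= lr * grad(w, data[i])
        lr *= 0.95  # simple step-size decay, a common heuristic
    return w

# toy problem: minimize the mean of (w*x - y)^2 over samples (x, y); exact solution w = 2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grad = lambda w, s: 2 * (w * s[0] - s[1]) * s[0]  # per-sample gradient
print(round(float(sgd(grad, [0.0], data)[0]), 2))  # ≈ 2.0
```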

168 | Incremental and decremental support vector machine learning
- Cauwenberghs, Poggio
- 2001
Citation Context ... or sequential learning paradigm (see [157] for early work on incremental learning). Incremental SVM learning was originally proposed to scale up inductive learning algorithms for very large datasets [134, 158, 159]. As discussed in the last section, exact SVM training requires solving a quadratic programming problem in a number of coefficients equal to the number of training samples, thereby making large-scale ... |

165 | Learning methods for generic object recognition with invariance to pose and lighting
- LeCun, Huang, et al.
Citation Context ...longing to this class. For each object, there can be a good number of images taken under different lighting conditions and from different angles. Our corpus is a subset of the normalized NORB dataset [168], where the images were segmented, normalized and then composed in the center of 96x96 pixel background images. Our dataset consists of images from 5 classes, i.e., airplanes, cars, trucks, human figu... |

163 | Learning structured prediction models: A large margin approach
- Taskar, Chatalbashev, et al.
- 2005
Citation Context ...e exponential [Figure 2.1: Convex surrogates of the 0-1 loss] and is recently utilized in discriminative training of structured generative models such as Markov networks [71] and Gaussian mixture models [72]. Secondly, the logistic loss is defined as Q(f(x), y) = ln(1 + exp(−yf(x))) (2.11) This coincides with the form of the conditional likelihood loss when the condit... |

153 | Kernels for Multitask Learning - Micchelli, Pontil - 2005 |

153 | The hardness of approximate optima in lattices, codes, and systems of linear equations
- Arora, Babai, et al.
- 1997
(Show Context)
Citation Context ...valuating the classification performance. In this case the expected risk is the true classification error rate. The 0-1 loss, however, is often computationally intractable as an optimization function =-=[58]-=-; surrogates of the 0-1 loss are typically used in actually training a classifier, which will be discussed in Section 2.2 in detail. 15 (2.2) Regardless of what loss function to use, the expected risk... |