## Domain Adaptation of Natural Language Processing Systems (2007)

Citations: 15 (1 self)

### BibTeX

@TECHREPORT{Blitzer07domainadaptation,
  author      = {John Blitzer},
  title       = {Domain Adaptation of Natural Language Processing Systems},
  institution = {},
  year        = {2007}
}


### Abstract

My first thanks must go to Fernando Pereira. He was a wonderful advisor, and every aspect of this thesis has benefitted from his insight. At times I was a difficult, even unruly graduate student, and Fernando had patience with all my ideas, whether good or bad. What I’ll miss most, though, is the quick trip to Fernando’s office, coming away with new insights on everything from numerical underflow to the state of the academic community in machine learning and NLP. In addition to Fernando, this thesis was shaped by a great committee. Having Ben Taskar as committee chairman has given me the perfect excuse to interrupt his workday with new, ostensibly-thesis-related machine learning ideas. Mark Liberman and Mitch Marcus brought a much-needed linguistic perspective to a thesis on language, and many of the techniques described are based on work by Tong Zhang, who kindly served as my external committee member. Although he didn’t directly serve on my committee, Shai Ben-David got me started on the theoretical aspects of this work, and chapter 4 grew out of work I co-authored with him. I was also fortunate to have a great academic family. With brothers (and one sister!)

### Citations

8983 | The nature of statistical learning theory
- Vapnik
- 1995
Citation Context: ...ability at least 1 − δ (over the choice of the samples), for every h ∈ H, |ε̂α(h) − εα(h)| < √(α²/β + (1 − α)²/(1 − β)) · √((d log(2m) − log δ)/(2m)). The proof is similar to standard uniform convergence proofs [64, 6], but it uses Hoeffding’s inequality in a different way because the bound on the range of the random variables underlying the inequality varies with α and β. The lemma shows that as α moves away from ...
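The deviation bound quoted in this snippet is easy to evaluate numerically. A minimal sketch, assuming the reconstructed form √(α²/β + (1−α)²/(1−β)) · √((d log(2m) − log δ)/(2m)) with VC dimension d, sample size m, source fraction β, and mixing weight α; the parameter values below are illustrative:

```python
import math

def alpha_error_bound(alpha, beta, d, m, delta):
    """Uniform deviation |eps_hat_alpha(h) - eps_alpha(h)| holding for all
    h in H with probability at least 1 - delta (reconstructed form)."""
    weight = math.sqrt(alpha**2 / beta + (1 - alpha)**2 / (1 - beta))
    rate = math.sqrt((d * math.log(2 * m) - math.log(delta)) / (2 * m))
    return weight * rate

# The alpha-dependent factor is minimized at alpha = beta, where it equals 1,
# so the bound is tightest when the error weighting matches the sample split.
bounds = {a: alpha_error_bound(a, beta=0.8, d=10, m=1000, delta=0.05)
          for a in (0.2, 0.5, 0.8)}
```

At α = β the α-dependent factor collapses to 1 and the bound reduces to the usual single-sample VC rate; moving α away from β inflates it, which is the trade-off the lemma quantifies.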

2311 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...labels themselves have internal structure that we can take advantage of in designing the mapping ζ(x,y). Because of this, models which solve these tasks are often referred to as structured predictors [42, 57, 60]. Methods for structured prediction must factor problems so as to be able to perform computationally efficient inference and to be able to make accurate predictions. When we investigate adapting part ...

2105 | Building a Large Annotated Corpus of English: The Penn Treebank
- Marcus, Marcinkiewicz, et al.
- 1993
Citation Context: ...speech tagging systems must be deployed in a variety of domains. In this section, we show how to use SCL to adapt a tagger from a standard resource, the Penn Treebank Wall Street Journal (WSJ) corpus [46], to a new corpus of MEDLINE abstracts [52]. The Penn BioIE project [52] focuses on building information extraction and natural language processing systems for biomedical text. We obtained a corpus fro...

2048 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...either positive or negative) and returns the top-scoring label for this input (right). decades. A complete discussion is well beyond the scope of this thesis, but we refer the reader to Hastie et al. [34] for an introduction to supervised learning and to Manning and Schütze [45] and Jelinek [38] for overviews of its use in natural language processing. Shawe-Taylor and Cristianini [58] is a good refere...

1491 | Probability inequalities for sums of bounded random variables
- Hoeffding
- 1963
Citation Context: ...ion D, εD(h) = E(x,y)∼D[h(x) ≠ y]. Suppose that we choose a hypothesis from a class of finite cardinality H. In this case, we may relate training and generalization error via Hoeffding’s inequality [35] and the union bound. For a training sample {xᵢ, yᵢ}, i = 1, …, n, drawn from D, with probability 1 − δ, for every h ∈ H, εD(h) ≤ (1/n) Σⁿᵢ₌₁ [h(xᵢ) ≠ yᵢ] + √((log(2|H|) − log δ)/(2n)). (1.2) This result is a sli...
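The finite-class bound in equation (1.2) can be checked empirically. A toy sketch, assuming H is a set of 21 threshold classifiers on 1-D inputs and a noisy threshold labeling rule — this setup is illustrative, not from the thesis:

```python
import math, random

def finite_class_bound(train_err, H_size, n, delta):
    """Training error plus the Hoeffding/union-bound complexity term (eq. 1.2)."""
    return train_err + math.sqrt((math.log(2 * H_size) - math.log(delta)) / (2 * n))

random.seed(0)
# Finite hypothesis class: threshold classifiers h_t(x) = [x > t].
thresholds = [i / 20 for i in range(21)]

def sample(n):  # true label: x > 0.5, with 10% label noise
    data = []
    for _ in range(n):
        x = random.random()
        y = (x > 0.5) != (random.random() < 0.1)
        data.append((x, y))
    return data

def err(t, data):
    return sum((x > t) != y for x, y in data) / len(data)

train = sample(200)
best_t = min(thresholds, key=lambda t: err(t, train))  # empirical risk minimizer
test = sample(20000)  # large fresh sample as a proxy for generalization error
bound = finite_class_bound(err(best_t, train), len(thresholds), len(train), 0.05)
```

With probability at least 1 − δ over the training sample, the held-out error falls below `bound`; the complexity term shrinks as O(√(log|H| / n)).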

947 | On the uniform convergence of relative frequencies of events to their probabilities
- Vapnik, Chervonenkis
- 1971
Citation Context: ...the bound from equation 1.2 does not apply. We may still state a uniform convergence result, however, through a measure of hypothesis class complexity known as the Vapnik-Chervonenkis (VC) dimension [65]. For a given training sample S of size N, the number of possible unique partitions of the points into two classes is 2^N. But note that for a given dimension d and number of points N, not all part...

739 | Statistical Methods for Speech Recognition
- Jelinek
- 1998
Citation Context: ...es. A complete discussion is well beyond the scope of this thesis, but we refer the reader to Hastie et al. [34] for an introduction to supervised learning and to Manning and Schütze [45] and Jelinek [38] for overviews of its use in natural language processing. Shawe-Taylor and Cristianini [58] is a good reference for support vector machines, a particularly popular method for training and representing...

614 | Thumbs up? sentiment classification using machine learning techniques
- Pang, Lee, et al.
- 2002
Citation Context: ...iment classification system receives as input a document and outputs a label indicating the sentiment (positive or negative) of the document. This problem has received considerable attention recently [51, 63, 32]. While movie reviews have been the most studied domain, sentiment analysis has been extended to a number of new domains, ranging from stock message boards to congressional floor debates [25, 61]. Res...

492 | Unsupervised word sense disambiguation rivaling supervised methods
- Yarowsky
- 1995
Citation Context: ...“processing” are more similar than either one is to “accomplishments” is necessary to give correct distances here. 2.4.2 Bootstrapping. Another paradigm for exploiting unlabeled data is bootstrapping [67, 17, 49, 21, 1, 47]. Bootstrapping methods begin with an initial classifier. They label unlabeled instances with this classifier. Then they choose some subset of the newly-labeled instances to create a new training set ...
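The bootstrapping loop described in this snippet (label the unlabeled data, keep the confident predictions, retrain) can be sketched with a toy nearest-centroid classifier on 1-D data. The classifier, the margin threshold of 1.0, and the data are all illustrative choices, not the thesis's setup:

```python
import random

random.seed(1)

def centroid_fit(labeled):
    """Nearest-centroid 'classifier': the mean of each class's points."""
    cents = {}
    for y in (0, 1):
        pts = [x for x, lab in labeled if lab == y]
        cents[y] = sum(pts) / len(pts)
    return cents

def predict_with_margin(cents, x):
    d0, d1 = abs(x - cents[0]), abs(x - cents[1])
    return (0, d1 - d0) if d0 < d1 else (1, d0 - d1)

# Two 1-D clusters; only one seed label per class.
unlabeled = [random.gauss(0, 1) for _ in range(200)] + \
            [random.gauss(4, 1) for _ in range(200)]
labeled = [(-0.2, 0), (4.3, 1)]

for _ in range(5):  # bootstrapping: label, keep confident points, retrain
    cents = centroid_fit(labeled)
    scored = [(x, *predict_with_margin(cents, x)) for x in unlabeled]
    confident = [(x, y) for x, y, m in scored if m > 1.0]
    labeled = [(-0.2, 0), (4.3, 1)] + confident
```

Each round grows the training set with confidently self-labeled points, moving the centroids toward the true cluster means.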

490 | Semisupervised learning using gaussian fields and harmonic functions
- Zhu, Ghahramani, et al.
- 2003
Citation Context: ...omplete survey of semi-supervised learning methods. 2.4.1 Manifold regularization. Procedurally, the most similar methods to structural learning are those which learn a regularizer from unlabeled data [68, 9, 71]. Like structural learning, these methods regularize parameters by enforcing smoothness in some underlying subspace. The assumptions on the structure of the subspace are quite different, though. The s...

488 | Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms
- Collins
- 2002
Citation Context: ...orrect instance by a margin. Crammer et al. [22] give a more complete description of the MIRA algorithm. The application of MIRA to structured prediction is strongly influenced by the work of Collins [19], who described an application of the perceptron to structured prediction. For a more general discussion and comparison of optimization techniques for structured predictors, we again refer to Taskar [...

450 | Semi-supervised learning literature survey
- Zhu
- 2007
Citation Context: ...though. In this section, we briefly review semi-supervised and unsupervised methods in text, with an emphasis on applicability for domain adaptation. Our survey here is necessarily brief, but see Zhu [70] for a more complete survey of semi-supervised learning methods. 2.4.1 Manifold regularization. Procedurally, the most similar methods to structural learning are those which learn a regularizer from un...

444 | Shallow parsing with conditional random fields
- Sha, Pereira
- 2003
Citation Context: ...labels themselves have internal structure that we can take advantage of in designing the mapping ζ(x,y). Because of this, models which solve these tasks are often referred to as structured predictors [42, 57, 60]. Methods for structured prediction must factor problems so as to be able to perform computationally efficient inference and to be able to make accurate predictions. When we investigate adapting part ...

442 | A maximum entropy model for part-of-speech tagging
- Ratnaparkhi
- 1996
Citation Context: ...features. Given a sentence, the task of a part of speech tagger is to label each word with its grammatical function. The best part of speech taggers encode a sentence label as a chain-structured graph [53, 20, 62]. In this formulation, the part of speech label factors along the cliques of the graph. We will design pivot features for individual cliques and the input features associated with them. Consider the e...

434 | Unsupervised models for named entity classification
- Collins, Singer
- 1999
Citation Context: ...“processing” are more similar than either one is to “accomplishments” is necessary to give correct distances here. 2.4.2 Bootstrapping. Another paradigm for exploiting unlabeled data is bootstrapping [67, 17, 49, 21, 1, 47]. Bootstrapping methods begin with an initial classifier. They label unlabeled instances with this classifier. Then they choose some subset of the newly-labeled instances to create a new training set ...

429 | Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews
- Turney
- 2002
Citation Context: ...iment classification system receives as input a document and outputs a label indicating the sentiment (positive or negative) of the document. This problem has received considerable attention recently [51, 63, 32]. While movie reviews have been the most studied domain, sentiment analysis has been extended to a number of new domains, ranging from stock message boards to congressional floor debates [25, 61]. Res...

345 | Feature-rich part-of-speech tagging with a cyclic dependency network
- Toutanova, Klein, et al.
- 2003
Citation Context: ...features. Given a sentence, the task of a part of speech tagger is to label each word with its grammatical function. The best part of speech taggers encode a sentence label as a chain-structured graph [53, 20, 62]. In this formulation, the part of speech label factors along the cliques of the graph. We will design pivot features for individual cliques and the input features associated with them. Consider the e...

319 | A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
- Ando, Zhang
Citation Context (from the list of figures): Figure 2.2, “An illustration of block SVD by type, from Ando and Zhang [3].”

310 | Neural Network Learning: Theoretical Foundations
- Anthony, Bartlett
- 1999
Citation Context: ...ive theoretical results for generalization to new domains. Once again, we focus on binary classification and are necessarily brief, but we refer to Kearns and Vazirani [41] and Anthony and Bartlett [6] for excellent introductions to the concepts of learning theory. As before, we denote by x ∈ X a feature vector in feature space. Formally, suppose that instances are drawn from a probability distribu...

292 | Online passive-aggressive algorithms
- Crammer, Dekel, et al.
Citation Context: ...ore complex than for binary classification, but many of the basic aspects are similar. In this thesis, when we solve structured prediction problems, we use the margin infused relaxed algorithm (MIRA) [22]. MIRA is an online algorithm which updates the parameter vector on each instance to give the minimum change to the weight vector (as measured by the L2 norm) to separate the correct instance from the to...
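For plain binary classification, the minimum-change update this snippet describes has a closed form; the sketch below is the passive-aggressive variant of Crammer et al., while full MIRA for structured prediction additionally requires inference over the top-scoring incorrect structure:

```python
def pa_update(w, x, y):
    """Passive-aggressive update: smallest L2 change to w that makes
    y * (w . x) reach margin 1 on this instance (y in {-1, +1})."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * score)
    if loss == 0.0:
        return w  # already separated with margin: no change
    tau = loss / sum(xi * xi for xi in x)  # step size from the closed form
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for x, y in [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)]:
    w = pa_update(w, x, y)
```

After each update the current instance is separated with margin exactly 1 unless it already was, and no smaller L2 change to w achieves that.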

255 | A syntax-based statistical translation model
- Yamada, Knight
- 2001
Citation Context: ...cal problem in text processing, and it serves as a first step in many pipelined systems, including higher-level syntactic processing [21, 47], information extraction [56, 52], and machine translation [66]. Because of their fundamental role, part of speech tagging systems must be deployed in a variety of domains. In this section, we show how to use SCL to adapt a tagger from a standard resource, the Pe...

225 | Online large-margin training of dependency parsers
- McDonald, Crammer, et al.
- 2005
Citation Context: ...text processing systems. Here we show that improving a part of speech tagger in a new domain can improve a dependency parser in the new domain as well. We use the parser described by McDonald et al. [48]. That parser assumes that a sentence has been PoS-tagged before parsing, so it is a straightforward match for our experiments here. [figure: dependency parsing accuracy for 561 test sentences] ...

196 | Some statistical issues in the comparison of speech recognition algorithms
- Gillick, Cox
- 1989
Citation Context: ...s. Figure 3.7(b) gives results for 40,000 sentences, and Figure 3.7(c) shows corresponding significance tests, with p < 0.05 being significant. We use a McNemar paired test for labeling disagreements [31]. Even when we use all the WSJ training data available, the SCL model significantly improves accuracy over both the supervised and ASO baselines. SCL is designed to improve the accuracies for unkno...
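The McNemar paired test used here needs only the two discordant counts. A minimal stdlib sketch using the continuity-corrected chi-square form with one degree of freedom; the counts below are made up for illustration:

```python
import math

def mcnemar(b, c):
    """McNemar test on discordant counts: b = only system A wrong,
    c = only system B wrong. Returns (statistic, p-value)."""
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected statistic
    # chi-square(1 dof) survival function: P(X >= stat) = erfc(sqrt(stat / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

stat, p = mcnemar(b=40, c=20)
significant = p < 0.05
```

Only instances the two systems label differently contribute; instances they agree on carry no information about which system is better.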

176 | Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales
- Pang, Lee
- 2005
Citation Context: ...the work of Pang et al. [51], which we use as our baseline. Thomas et al. [61] use discourse structure present in congressional records to perform more accurate sentiment classification. Pang and Lee [50] treat sentiment analysis as an ordinal ranking problem. In our work we only show improvement for the basic model, but all of these new techniques also make use of lexical features. Thus we believe th...

163 | Learning structured prediction models: A large margin approach
- Taskar, Chatalbashev, et al.
- 2005
Citation Context: ...labels themselves have internal structure that we can take advantage of in designing the mapping ζ(x,y). Because of this, models which solve these tasks are often referred to as structured predictors [42, 57, 60]. Methods for structured prediction must factor problems so as to be able to perform computationally efficient inference and to be able to make accurate predictions. When we investigate adapting part ...

162 | Canonical Correlation Analysis: An Overview with Application to Learning Methods
- Hardoon, Szedmak, et al.
Citation Context: ...oint distribution on both views. Canonical correlation analysis finds two sets of basis vectors such that for all k, the projections of X⁽¹⁾ and X⁽²⁾ onto the first k bases are maximally correlated [36, 33]. Let C be the joint covariance matrix for (X⁽¹⁾, X⁽²⁾): C = Ex∼D[xx′]. We may write C in the block form C = [C₁₁ C₁₂; C₂₁ C₂₂], where C₁₁ = Ex⁽¹⁾∼D[x⁽¹⁾x⁽¹⁾′] and likewise for the o...
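The canonical directions described in this snippet can be found as eigenvectors of C₁₁⁻¹C₁₂C₂₂⁻¹C₂₁. A minimal numpy sketch for the first pair of directions; the ridge term `reg` and the synthetic two-view data are illustrative additions for numerical stability, not part of the thesis:

```python
import numpy as np

def cca_first_direction(X1, X2, reg=1e-3):
    """First pair of canonical directions for row-wise samples X1, X2."""
    X1 = X1 - X1.mean(0)
    X2 = X2 - X2.mean(0)
    n = len(X1)
    C11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    C22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    C12 = X1.T @ X2 / n
    # w1 solves C11^-1 C12 C22^-1 C12' w1 = rho^2 w1
    M = np.linalg.solve(C11, C12) @ np.linalg.solve(C22, C12.T)
    vals, vecs = np.linalg.eig(M)
    w1 = np.real(vecs[:, np.argmax(np.real(vals))])
    w2 = np.linalg.solve(C22, C12.T @ w1)  # paired direction in view 2
    return w1, w2

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))  # shared latent signal across the two views
X1 = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
                rng.normal(size=(500, 2))])
X2 = np.hstack([rng.normal(size=(500, 2)),
                z + 0.1 * rng.normal(size=(500, 1))])
w1, w2 = cca_first_direction(X1, X2)
rho = np.corrcoef(X1 @ w1, X2 @ w2)[0, 1]
```

On this data the two views share one latent coordinate, so the top canonical correlation of the projections comes out close to 1.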

162 | Learning to classify text from labeled and unlabeled documents
- Nigam, McCallum, et al.
Citation Context: ...“processing” are more similar than either one is to “accomplishments” is necessary to give correct distances here. 2.4.2 Bootstrapping. Another paradigm for exploiting unlabeled data is bootstrapping [67, 17, 49, 21, 1, 47]. Bootstrapping methods begin with an initial classifier. They label unlabeled instances with this classifier. Then they choose some subset of the newly-labeled instances to create a new training set ...

152 | Domain adaptation with structural correspondence learning
- Blitzer, McDonald, et al.
- 2006
Citation Context: ...t combining SCL with these methods yields still greater improvements, reducing error due to adaptation by as much as forty percent. The results in this chapter are drawn primarily from Blitzer et al. [16] and Blitzer et al. [15]. 3.1 Adapting a sentiment classification system. A sentiment classification system receives as input a document and outputs a label indicating the sentiment (positive or neg...

146 | Domain adaptation for statistical classifiers
- Daumé, Marcu
- 2006
Citation Context: ...mial parameters of a generative parsing model to combine a large amount of training data from a source corpus (WSJ), and a small amount of training data from a target corpus (Brown). Daumé and Marcu [28] use an empirical Bayes model to estimate a latent variable model grouping instances into domain-specific or common across both domains. They also jointly estimate the parameters of the common classif...

130 | Correcting sample selection bias by unlabeled data
- Huang, Smola, et al.
- 2007
Citation Context: ...blem that is very closely related to domain adaptation is the problem of covariate shift (also called sample selection bias), which has been studied in the machine learning and statistics communities [59, 37]. Here we assume the conditional distributions PrDS[y|x] and PrDT[y|x] are identical, but the instance marginal distributions PrDS[x] and PrDT[x] are different. Several researchers have studied al...

109 | Analysis of representations for domain adaptation
- Ben-David, Blitzer, et al.
- 2007
Citation Context: ...f when adaptation techniques work, as well as how to best exploit the resources we have. This chapter develops a theoretical framework for domain adaptation and comprises the work of Ben-David et al. [10] and Blitzer et al. [14]. We first show how to use this framework to prove bounds on the target error for classifiers which are trained in a source domain. We then demonstrate how to use the bound to ...

106 | Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification
- Blitzer, Dredze, et al.
- 2007
Citation Context: ...se methods yields still greater improvements, reducing error due to adaptation by as much as forty percent. The results in this chapter are drawn primarily from Blitzer et al. [16] and Blitzer et al. [15]. 3.1 Adapting a sentiment classification system. A sentiment classification system receives as input a document and outputs a label indicating the sentiment (positive or negative) of the document. ...

94 | Detecting change in data streams
- Kifer, Ben-David, et al.
- 2004
Citation Context: ...vantage over other methods for comparing distributions such as L1 distance or the KL divergence: we can compute dH using finite samples from the distributions D and D′ when H has finite VC dimension [12]. Furthermore, as the following theorem shows, we can compute a finite-sample approximation to dH by finding a classifier h ∈ H that maximally discriminates between instances from D and D′. Theorem 2...
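For a simple hypothesis class, the finite-sample dH can be computed by exhaustively searching for the maximally discriminating classifier, as the theorem in this snippet suggests. A sketch assuming H is the class of 1-D threshold classifiers (an illustrative choice of H):

```python
def d_H(sample_d, sample_dprime):
    """Finite-sample H-distance for H = 1-D threshold classifiers:
    d_H = 2 * max over h of |Pr_D[h(x)=1] - Pr_D'[h(x)=1]|."""
    n, m = len(sample_d), len(sample_dprime)
    best = 0.0
    # Only thresholds at observed points can change the discrepancy.
    for t in sorted(set(sample_d) | set(sample_dprime)):
        p = sum(x > t for x in sample_d) / n
        q = sum(x > t for x in sample_dprime) / m
        best = max(best, abs(p - q))
    return 2 * best

same = d_H([0.1, 0.2, 0.3, 0.4], [0.15, 0.25, 0.35, 0.45])
far = d_H([0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4])
```

Interleaved samples give a small value, while samples a threshold can fully separate reach the maximum of 2; with richer hypothesis classes the same quantity is approximated by training a classifier to discriminate the two domains.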

94 | Get out the vote: Determining support or opposition from Congressional floor-debate transcripts
- Thomas, Pang, et al.
- 2006
Citation Context: ...[51, 63, 32]. While movie reviews have been the most studied domain, sentiment analysis has been extended to a number of new domains, ranging from stock message boards to congressional floor debates [25, 61]. Research results have been deployed industrially in systems that gauge market reaction and summarize opinion from web pages, discussion boards, and blogs. With such widely-varying domains, researche...

78 | Discriminative learning for differing training and test distributions
- Bickel, Brückner, et al.
- 2007
Citation Context: ...authors have empirically studied a special case of this in which each instance is weighted separately in the loss function, and instance weights are set to approximate the target domain distribution [37, 13, 24, 39]. We give a uniform convergence bound for algorithms that minimize a convex combination of multiple empirical source errors and we show that these algorithms can outperform standard empirical error mi...
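Per-instance weighting of this kind can be sketched as importance-weighted least squares, with each source instance weighted by Pr_T(x)/Pr_S(x). The densities are assumed known here for illustration; in practice they must be estimated, which is what the cited methods address:

```python
import math, random

def normal_pdf(x, mu):
    # unit-variance Gaussian density
    return math.exp(-((x - mu) ** 2) / 2) / math.sqrt(2 * math.pi)

def weighted_linear_fit(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    W = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / W
    my = sum(w * y for w, y in zip(ws, ys)) / W
    b = (sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs)))
    return my - b * mx, b

random.seed(3)
# Curved truth y = x^2 with a misspecified linear model; source inputs are
# centered at 0, but the target inputs are centered at 2, where the curve
# is much steeper.
xs = [random.gauss(0, 1) for _ in range(5000)]
ys = [x * x for x in xs]
plain = weighted_linear_fit(xs, ys, [1.0] * len(xs))
ratio = [normal_pdf(x, 2) / normal_pdf(x, 0) for x in xs]  # Pr_T(x) / Pr_S(x)
shifted = weighted_linear_fit(xs, ys, ratio)
```

The unweighted fit has slope near 0 (the best linear fit to x² under a symmetric source), while the reweighted fit steepens toward the target region; when source and target distributions coincide, the weights are constant and the two fits agree.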

75 | PAC generalization bounds for cotraining
- Dasgupta, Littman, et al.
- 2002
Citation Context: ...to split the feature space into multiple “views” [17]. Learning in the two views model proceeds by training separate classifiers for each view and requiring that they “agree” on the unlabeled data [26, 1, 2, 29, 55]. In this section, we show how to relate ASO and SCL to new theoretical work on using canonical correlation analysis for multiple view learning [36, 40]. We show that a variant of the ASO optimization...

72 | Adaptation of maximum entropy capitalizer: little data can help a lot
- Chelba, Acero
- 2004
Citation Context: ...idea of using the regularizer of a linear model to encourage the target parameters to be close to the source parameters has been used previously in domain adaptation. In particular, Chelba and Acero [18] showed how this technique can be effective for capitalization adaptation. The major difference between our approach and theirs is that we penalize deviation from the source parameters for the weights...

69 | A Statistical Model for Multilingual Entity Detection and Tracking
- Florian, Hassan, et al.
- 2004
Citation Context: ...is probably insufficient to significantly change w, but we can correct v, which is much smaller. We augment each labeled target instance xⱼ with the label assigned by the source domain classifier [30, 16]. Then we solve min over (w, v) of Σⱼ L(w′xⱼ + v′θxⱼ, yⱼ) + λ||w||² + µ||v − vₛ||². Since we don’t want to deviate significantly from the source parameters, we set λ = µ = 10⁻¹. Figure 3.4 shows the corre...

66 | Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization
- Goldberg, Zhu
- 2004
Citation Context: ...iment classification system receives as input a document and outputs a label indicating the sentiment (positive or negative) of the document. This problem has received considerable attention recently [51, 63, 32]. While movie reviews have been the most studied domain, sentiment analysis has been extended to a number of new domains, ranging from stock message boards to congressional floor debates [25, 61]. Res...

61 | Learning bounds for domain adaptation
- Blitzer, Crammer, et al.
- 2007
Citation Context: ...ques work, as well as how to best exploit the resources we have. This chapter develops a theoretical framework for domain adaptation and comprises the work of Ben-David et al. [10] and Blitzer et al. [14]. We first show how to use this framework to prove bounds on the target error for classifiers which are trained in a source domain. We then demonstrate how to use the bound to estimate the adaptation ...

60 | Yahoo! for Amazon: Extracting Market Sentiment from Stock Message Boards
- Das, Chen
- 2001
Citation Context: ...[51, 63, 32]. While movie reviews have been the most studied domain, sentiment analysis has been extended to a number of new domains, ranging from stock message boards to congressional floor debates [25, 61]. Research results have been deployed industrially in systems that gauge market reaction and summarize opinion from web pages, discussion boards, and blogs. With such widely-varying domains, researche...

55 | Parsing Biomedical Literature
- Lease, Charniak
- 2005
Citation Context: ...h tagging. While the literature on unsupervised part of speech tagging is quite large, to the best of our knowledge, we are the first to adapt part of speech taggers to new domains. Lease and Charniak [43] adapt a WSJ parser to biomedical text without any biomedical treebanked data. However, they assume other labeled resources in the target domain. In section 3.2.3 we give similar parsing results, but ...

52 | Customizing Sentiment Classifiers to New Domains: A Case Study
- Aue, Gamon
- 2005

46 | Nonparametric transforms of graph kernels for semi-supervised learning
- Zhu, Kandola, et al.
- 2005
Citation Context: ...for a real-valued predictor, we require that the predictions be close for points that are close. How can we decide which points are close, though? One way to proceed is the data manifold assumption [9, 72]: we assume that the input instances x are sampled from a low-dimensional manifold. The neighborhood graph on the unlabeled data can provide us with an indication of the structure of that manifold. Su...

44 | Understanding the yarowsky algorithm
- Abney
- 2004
Citation Context: ...to split the feature space into multiple “views” [17]. Learning in the two views model proceeds by training separate classifiers for each view and requiring that they “agree” on the unlabeled data [26, 1, 2, 29, 55]. In this section, we show how to relate ASO and SCL to new theoretical work on using canonical correlation analysis for multiple view learning [36, 40]. We show that a variant of the ASO optimization...

42 | Manifold regularization: A geometric framework for learning from labeled and unlabeled examples
- Belkin, Niyogi, et al.
- 2006
Citation Context: ...omplete survey of semi-supervised learning methods. 2.4.1 Manifold regularization. Procedurally, the most similar methods to structural learning are those which learn a regularizer from unlabeled data [68, 9, 71]. Like structural learning, these methods regularize parameters by enforcing smoothness in some underlying subspace. The assumptions on the structure of the subspace are quite different, though. The s...

42 | Input-dependent estimation of generalization error under covariate shift
- Sugiyama, Müller
- 2005
Citation Context: ...blem that is very closely related to domain adaptation is the problem of covariate shift (also called sample selection bias), which has been studied in the machine learning and statistics communities [59, 37]. Here we assume the conditional distributions PrDS[y|x] and PrDT[y|x] are identical, but the instance marginal distributions PrDS[x] and PrDT[x] are different. Several researchers have studied al...

35 | Learning from multiple sources
- Crammer, Kearns, et al.
- 2008
Citation Context: ...s result can be found in Appendix A.2. The main step in the proof is a variant of the triangle inequality in which the sides of the triangle represent errors of one decision rule with respect to another [10, 23]. The bound is relative to λ. When the combined error of the ideal hypothesis is large, there is no classifier that performs well on both the source and target domains, so we cannot hope to find a goo...

35 | Two view learning: SVM-2K, theory and practice
- Farquhar, Hardoon, et al.
- 2005
Citation Context: ...to split the feature space into multiple “views” [17]. Learning in the two views model proceeds by training separate classifiers for each view and requiring that they “agree” on the unlabeled data [26, 1, 2, 29, 55]. In this section, we show how to relate ASO and SCL to new theoretical work on using canonical correlation analysis for multiple view learning [36, 40]. We show that a variant of the ASO optimization...

34 | On the difficulty of approximately maximizing agreements
- Ben-David, Eiron, et al.
- 2003