Natural Language Processing (Almost) from Scratch (2011)
Citations: 248 (18 self)
Citations
8903 | Probabilistic reasoning in intelligent systems: networks of plausible inference - Pearl - 1988
Citation Context: ...ribes all the tasks in the same probabilistic framework. Separately training a submodel only makes sense when the training data blocks these additional dependency paths (in the sense of d-separation, Pearl, 1988). This implies that, without joint training, the additional dependency paths cannot directly involve unobserved variables. Therefore, the natural idea of discovering common internal representations a...
5884 | A tutorial on hidden Markov models and selected applications in speech recognition - Rabiner - 1989
3482 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data - Lafferty, McCallum, et al. - 2001
Citation Context: ...e case, the scores could be viewed as the logarithms of conditional transition probabilities, and our model would be subject to the label-bias problem that motivates Conditional Random Fields (CRFs) (Lafferty et al., 2001). The denormalized scores should instead be likened to the potential functions of a CRF. In fact, a CRF maximizes the same likelihood (13) using a linear model instead of a nonlinear neural network. ...
2472 | An algorithm for suffix stripping - Porter - 1980 |
1158 | Head-driven statistical models for natural language processing - Collins - 2003
Citation Context: ...site for semantic role labeling (Gildea and Palmer, 2002). This is why state-of-the-art semantic role labeling systems thoroughly exploit multiple parse trees. The parsers themselves (Charniak, 2000; Collins, 1999) contain considerable prior information about syntax (one can think of this as a kind of informed pre-processing). Our system does not use such parse trees because we attempt to learn this informatio...
986 | Class-based n-gram models of natural language - Brown, DellaPietra, et al. - 1992
Citation Context: ...However, word representations are perhaps more commonly inferred from n-gram language modeling rather than purely continuous language models. One popular approach is the Brown clustering algorithm (Brown et al., 1992a), which builds hierarchical word clusters by maximizing the bigram's mutual information. The induced word representation has been used with success in a wide variety of NLP tasks, including POS (Sch...
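The snippet above notes that Brown clustering builds word clusters by maximizing the mutual information of adjacent class bigrams. A minimal Python sketch of that objective, assuming a hypothetical cluster_of mapping from words to cluster ids (not from the paper):

```python
import math
from collections import Counter

def class_bigram_mutual_information(tokens, cluster_of):
    """Average mutual information of adjacent class bigrams: the
    quantity Brown clustering greedily preserves when merging clusters.
    `cluster_of` maps each word to a cluster id (hypothetical helper)."""
    classes = [cluster_of[w] for w in tokens]
    unigrams = Counter(classes)
    bigrams = Counter(zip(classes, classes[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    mi = 0.0
    for (c1, c2), count in bigrams.items():
        p_bi = count / total_bi
        p_c1, p_c2 = unigrams[c1] / total_uni, unigrams[c2] / total_uni
        mi += p_bi * math.log(p_bi / (p_c1 * p_c2))
    return mi
```

A full implementation would repeatedly merge the pair of clusters whose merge loses the least mutual information; this function only scores a fixed clustering.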
971 | A maximum-entropy-inspired parser - Charniak - 2000
Citation Context: ...cessary prerequisite for semantic role labeling (Gildea and Palmer, 2002). This is why state-of-the-art semantic role labeling systems thoroughly exploit multiple parse trees. The parsers themselves (Charniak, 2000; Collins, 1999) contain considerable prior information about syntax (one can think of this as a kind of informed pre-processing). Our system does not use such parse trees because we attempt to learn ...
970 | A fast learning algorithm for deep belief nets - Hinton, Osindero, et al. - 2006
892 | Transductive inference for text classification using support vector machines - Joachims - 1999
771 | Learning quickly when irrelevant attributes abound: a new linear threshold algorithm - Littlestone - 1988
Citation Context: ...SRL systems use a lot of features, including POS tags, head words, phrase type, path in the parse tree from the verb to the node... Koomen et al. (2005) hold the state-of-the-art with Winnow-like (Littlestone, 1988) classifiers, followed by a decoding stage based on an integer program that enforces specific constraints on SRL tags. They reach 77.92% F1 on CoNLL 2005, thanks to the five top parse trees produced ...
747 | Automatic labeling of semantic roles - Gildea, Jurafsky - 2002
691 | Feature-rich part-of-speech tagging with a cyclic dependency network - Toutanova, Klein, et al. - 2003
Citation Context: ...ks are evaluated by computing the F1 scores over chunks produced by our models. The POS task is evaluated by computing the per-word accuracy, as is the case for the standard benchmark we refer to (Toutanova et al., 2003). We used the conlleval script for evaluating POS, NER and CHUNK. For SRL, we used the srl-eval.pl script included in the srlconll package. 2.6 Discussion When participating in an (open) challe...
663 | RCV1: a new benchmark collection for text categorization research - Lewis, Yang, et al. - 2004 |
581 | Shallow parsing with conditional random fields - Sha, Pereira - 2003
Citation Context: ...moto, 2001) using an ensemble of classifiers trained with different tagging conventions (see Section 3.3.3). Since then, a certain number of systems based on second-order random fields were reported (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008), all reporting around 94.3% F1 score. These systems use features composed of words, POS tags, and tags. More recently, Shen and Sarkar (2005) obtained 95.23%...
580 | A maximum entropy model for part-of-speech tagging - Ratnaparkhi - 1996
556 | The Proposition Bank: An annotated corpus of semantic roles - Palmer, Gildea, et al. - 2005
Citation Context: ...ua.ac.be/conll2003/ner. 2.4 Semantic Role Labeling SRL aims at giving a semantic role to a syntactic constituent of a sentence. In the PropBank (Palmer et al., 2005) formalism one assigns roles ARG0-5 to words that are arguments of a verb (or more technically, a predicate) in the sentence, for example, the following sentence might be tagged “[John]ARG0 [ate]REL ...
507 | Three models for the description of language - Chomsky - 1956
Citation Context: ...n. At first glance, the ranking task appears unrelated to the induction of probabilistic grammars that underlie standard parsing algorithms. The lack of hierarchical representation seems a fatal flaw (Chomsky, 1956). However, ranking is closely related to an alternative description of the language structure: operator grammars (Harris, 1968). Instead of directly studying the structure of a sentence, Harris defin...
486 | Prediction and entropy of printed English - Shannon - 1951 |
447 | A neural probabilistic language model - Bengio, Ducharme, et al. - 2000
443 | A framework for learning predictive structures from multiple tasks and unlabeled data - Ando, Zhang - 2005
435 | Phoneme recognition using time-delay neural networks - Waibel, Hanazawa, et al. - 1989
409 | Learning to order things - Cohen, Schapire, et al. - 1999
Citation Context: ...earn syntax, rare but legal phrases are no less significant than common phrases. It is therefore desirable to define alternative training criteria. We propose here to use a pairwise ranking approach (Cohen et al., 1998). We seek a network that computes a higher score when given a legal phrase than when given an incorrect phrase. Because the ranking literature often deals with information retrieval applications, man...
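The context describes the pairwise ranking idea: legal windows should score above corrupted ones. A minimal sketch, assuming a hypothetical score callable and a margin of 1 (the hinge form is in the spirit of the paper's ranking criterion; the exact setup here is illustrative):

```python
import random

def pairwise_ranking_loss(score, window, dictionary):
    """Hinge-style pairwise ranking loss: an observed (legal) text
    window should score at least 1 higher than the same window with
    its center word replaced by a random dictionary word.
    `score` is any callable mapping a word window to a real number."""
    center = len(window) // 2
    corrupted = list(window)
    corrupted[center] = random.choice(dictionary)  # corrupt the center word
    return max(0.0, 1.0 - score(window) + score(corrupted))
```

Summing this loss over corpus windows and replacement words, then minimizing by stochastic gradient, drives the network to rank legal phrases above incorrect ones.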
394 | Greedy layer-wise training of deep networks - Bengio, Lamblin, et al. - 2007
277 | Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition - Bridle - 1989
Citation Context: ...to the task of interest. To simplify the notation, we drop x from now on, and we write instead $[f_\theta]_i$. This score can be interpreted as a conditional tag probability $p(i|x,\theta)$ by applying a softmax (Bridle, 1990) operation over all the tags: $p(i|x,\theta) = e^{[f_\theta]_i} / \sum_j e^{[f_\theta]_j}$. Defining the log-add operation as $\operatorname{logadd}_i z_i = \log\big(\sum_i e^{z_i}\big)$, we can express the log-likelihood for one training example (x, y) as follows: ...
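The logadd (log-sum-exp) operation reconstructed above is normally computed in a numerically stable way. A small sketch; the max-subtraction trick is standard practice, not something this excerpt specifies:

```python
import math

def logadd(scores):
    """logadd_i z_i = log(sum_i exp(z_i)), computed stably by
    factoring out the maximum so the exponentials cannot overflow."""
    m = max(scores)
    return m + math.log(sum(math.exp(z - m) for z in scores))

def log_softmax(scores, i):
    """log p(i|x, theta) = [f_theta]_i - logadd_j [f_theta]_j."""
    return scores[i] - logadd(scores)
```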
267 | Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons - McCallum, Li - 2003
Citation Context: ...d (13) using a linear model instead of a nonlinear neural network. CRFs have been widely used in the NLP world, such as for POS tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), NER (McCallum and Li, 2003) or SRL (Cohn and Blunsom, 2005). Compared to such CRFs, we take advantage of the nonlinear network to learn appropriate features for each task of interest. 10. In other words, read logadd as ⊕ and +...
249 | Continuous speech recognition by statistical methods - Jelinek - 1976
Citation Context: ...raining. For instance, modern speech recognition systems use Bayes rule to combine the outputs of an acoustic model trained on speech data and a language model trained on phonetic or textual corpora (Jelinek, 1976). This joint decoding approach has been successfully applied to structurally more complex NLP tasks. Sutton and McCallum (2005b) obtain improved results by combining the predictions of independently ...
232 | Word representations: a simple and general method for semi-supervised learning - Turian, Ratinov, et al. - 2010
Citation Context: ...ively small training set (RCV1, 37M words), unlikely to contain enough instances of the rare words. Secondly, they predict the correctness of the final word of each window instead of the center word (Turian et al., 2010), effectively restricting the model to unidirectional prediction. Finally, they do not fine-tune their embeddings after unsupervised training. 22. Available at http://ml.nec-labs.com/senna. 23. Avail...
227 | Mathematical structures of language - Harris - 1968
Citation Context: ...algorithms. The lack of hierarchical representation seems a fatal flaw (Chomsky, 1956). However, ranking is closely related to an alternative description of the language structure: operator grammars (Harris, 1968). Instead of directly studying the structure of a sentence, Harris defines an algebraic structure on the space of all sentences. Starting from a couple of elementary sentence forms, sentences are des...
219 | Chunking with support vector machines - Kudo, Matsumoto - 2002
Citation Context: ...a window around the word of interest containing POS and words as features, as well as surrounding tags. They perform dynamic programming at test time. Later, they improved their results up to 93.91% (Kudo and Matsumoto, 2001) using an ensemble of classifiers trained with different tagging conventions (see Section 3.3.3). Since then, a certain number of systems based on second-order random fields were reported (Sha and Pe...
215 | Efficient backprop - LeCun, Bottou, et al. - 1998
Citation Context: ...1991). This allows us to easily build variants of our networks. For details about gradient computations, see Appendix A. Remark 7 (Tricks) Many tricks have been reported for training neural networks (LeCun et al., 1998). Which ones to choose is often confusing. We employed only two of them: the initialization and update of the parameters of each network layer were done according to the “fan-in” of the layer, that i...
208 | Dependency networks for inference, collaborative filtering, and data visualization - Heckerman - 2000
179 | Simple semi-supervised dependency parsing - Koo, Carreras, et al. - 2008
Citation Context: ...l information. The induced word representation has been used with success in a wide variety of NLP tasks, including POS (Schütze, 1995), NER (Miller et al., 2004; Ratinov and Roth, 2009), or parsing (Koo et al., 2008). Other related approaches exist, like phrase clustering (Lin and Wu, 2009) which has been shown to work well for NER. Finally, Huang and Yates (2009) have recently proposed a smoothed language model...
176 | Shallow semantic parsing using support vector machines - Pradhan, Ward, et al. - 2004
171 | Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data - Sutton, McCallum - 2004
160 | Contrastive estimation: Training log-linear models on unlabeled data - Smith, Eisner - 2005
147 | Curriculum learning - Bengio, Louradour, et al. - 2009
Citation Context: ...of networks using increasingly large dictionaries, each network being initialized with the embeddings of the previous network. Successive dictionary sizes and switching times are chosen arbitrarily. Bengio et al. (2009) provides a more detailed discussion of this (as yet poorly understood) “curriculum” process. By analogy with biological cell lines, we have bred a few network lines. Within each line, child net...
147 | Learning internal representations by back-propagating errors - Rumelhart, Hinton, et al. - 1986
Citation Context: ...roblems (Bottou, 1991, 1998). Stochastic gradient iterations that hit a non-differentiability are simply skipped. Remark 6 (Modular Approach) The well known “back-propagation” algorithm (LeCun, 1985; Rumelhart et al., 1986) computes gradients using the chain rule. The chain rule can also be used in a modular implementation. Our modules correspond to the boxes in Figure 1 and Figure 2. Given derivatives with respect ...
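To make the modular chain-rule idea concrete, here is a minimal sketch of one module with forward and backward passes; the class name and pure-Python representation are illustrative, not from the paper:

```python
class Linear:
    """One 'box' in a modular network. forward() computes the layer
    output; backward() applies the chain rule, accumulating parameter
    gradients and returning the gradient with respect to the input,
    which is handed to the previous module."""
    def __init__(self, weights):
        self.weights = weights  # list of n_out rows, each of length n_in
    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]
    def backward(self, grad_out):
        # dL/dW[i][j] = grad_out[i] * x[j]
        self.grad_w = [[g * xj for xj in self.x] for g in grad_out]
        # dL/dx[j] = sum_i grad_out[i] * W[i][j]
        return [sum(row[j] * g for row, g in zip(self.weights, grad_out))
                for j in range(len(self.x))]
```

Chaining modules is then just calling forward() left to right and backward() right to left; each module needs only its local derivatives.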
142 | Design challenges and misconceptions in named entity recognition - Ratinov, Roth - 2009
Citation Context: ...ters by maximizing the bigram's mutual information. The induced word representation has been used with success in a wide variety of NLP tasks, including POS (Schütze, 1995), NER (Miller et al., 2004; Ratinov and Roth, 2009), or parsing (Koo et al., 2008). Other related approaches exist, like phrase clustering (Lin and Wu, 2009) which has been shown to work well for NER. Finally, Huang and Yates (2009) have recently pro...
140 | Effective self-training for parsing - McClosky, Charniak, et al. - 2006
121 | Three new graphical models for statistical language modelling - Mnih, Hinton - 2007
116 | Named entity recognition through classifier combination - Florian, Ittycheriah, et al. - 2003
Citation Context: ...0) stemmer and obtained the same performance as when using two character suffixes. 6.2 Gazetteers State-of-the-art NER systems often use a large dictionary containing well known named entities (e.g., Florian et al., 2003). We restricted ourselves to the gazetteer provided by the CoNLL challenge, containing 8,000 locations, person names, organizations, and miscellaneous entities. We trained a NER network with 4 additi...
114 | Distributional part-of-speech tagging - Schütze - 1995
Citation Context: ...992a), which builds hierarchical word clusters by maximizing the bigram's mutual information. The induced word representation has been used with success in a wide variety of NLP tasks, including POS (Schütze, 1995), NER (Miller et al., 2004; Ratinov and Roth, 2009), or parsing (Koo et al., 2008). Other related approaches exist, like phrase clustering (Lin and Wu, 2009) which has been shown to work well for NER...
113 | A novel use of statistical parsing to extract information from text - Miller, Ramshaw, et al. - 2000 |
110 | Use of support vector learning for chunk identification - Kudoh, Matsumoto - 2000
93 | An estimation of an upper bound for the entropy of English - Brown, Pietra, et al. - 1992
Citation Context: ...However, word representations are perhaps more commonly inferred from n-gram language modeling rather than purely continuous language models. One popular approach is the Brown clustering algorithm (Brown et al., 1992a), which builds hierarchical word clusters by maximizing the bigram's mutual information. The induced word representation has been used with success in a wide variety of NLP tasks, including POS (Sch...
88 | The necessity of parsing for predicate argument recognition - Gildea, Palmer - 2002
Citation Context: ...d still take advantage of even bigger unlabeled data sets. 4.6 Ranking and Language There is a large agreement in the NLP community that syntax is a necessary prerequisite for semantic role labeling (Gildea and Palmer, 2002). This is why state-of-the-art semantic role labeling systems thoroughly exploit multiple parse trees. The parsers themselves (Charniak, 2000; Collins, 1999) contain considerable prior information ab...
86 | Name tagging with word clusters and discriminative training - Miller, Guinness, et al. - 2004
Citation Context: ...ierarchical word clusters by maximizing the bigram's mutual information. The induced word representation has been used with success in a wide variety of NLP tasks, including POS (Schütze, 1995), NER (Miller et al., 2004; Ratinov and Roth, 2009), or parsing (Koo et al., 2008). Other related approaches exist, like phrase clustering (Lin and Wu, 2009) which has been shown to work well for NER. Finally, Huang and Yates ...
82 | Online algorithms and stochastic approximations - Bottou - 1998 |
81 | SVMTool: A general POS tagger generator based on support vector machines - Giménez, Màrquez - 2004 |
77 | Guided learning for bidirectional sequence classification - Shen, Satta, et al. - 2007
69 | The necessity of syntactic parsing for semantic role labeling - Punyakanok, Roth, et al. - 2005
Citation Context: ...(2003) describes a NER system whose inputs include POS and CHUNK tags, as well as the output of two other NER classifiers. State-of-the-art SRL systems exploit parse trees (Gildea and Palmer, 2002; Punyakanok et al., 2005), related to CHUNK tags, and built using POS tags (Charniak, 2000; Collins, 1999). Table 9 reports results obtained for the CHUNK and NER tasks by adding discrete word features (Section 3.1.1) repres...
61 | Gradient based learning applied to document recognition - LeCun, Bottou, et al. - 1998
60 | A convergent gambling estimate of the entropy of English - Cover, King - 1978 |
58 | Named entity recognition with a maximum entropy approach - Chieu - 2003 |
49 | Stochastic gradient learning in neural networks - Bottou - 1991
Citation Context: ...ach task of interest. 10. In other words, read logadd as ⊕ and + as ⊗. 3.4.3 STOCHASTIC GRADIENT Maximizing (8) with stochastic gradient (Bottou, 1991) is achieved by iteratively selecting a random example (x, y) and making a gradient step: $\theta \leftarrow \theta + \lambda \,\frac{\partial \log p(y|x,\theta)}{\partial \theta}$ (16), where $\lambda$ is a chosen learning rate. Our neural networks described in Figure 1 ...
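A minimal sketch of update (16) above; the callable names are hypothetical, and the plus sign reflects that the update ascends the log-likelihood:

```python
def sgd_step(theta, grad_log_p, example, lr):
    """One stochastic gradient step on a randomly drawn example (x, y):
    theta <- theta + lr * d log p(y|x, theta) / d theta.
    `grad_log_p` returns the gradient of the log-likelihood as a list."""
    x, y = example
    grad = grad_log_p(theta, x, y)
    return [t + lr * g for t, g in zip(theta, grad)]
```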
48 | Generalized inference with multiple semantic role labeling systems - Koomen, Punyakanok, et al. - 2005
Citation Context: ...not surprising to see many top CoNLL systems using external labeled data, like additional NER classifiers for the NER architecture of Florian et al. (2003) or additional parse trees for SRL systems (Koomen et al., 2005). Combining multiple systems or carefully tweaking features is also a common approach, like in the chunking top system (Shen and Sarkar, 2005). However, when comparing systems, we do not learn anythi...
47 | Semi-supervised learning for natural language - Liang - 2005
47 | Distributional representations for handling sparsity in supervised sequence-labeling - Huang, Yates - 2009
43 | Phrase clustering for discriminative learning - Lin, Wu - 2009
Citation Context: ...n a wide variety of NLP tasks, including POS (Schütze, 1995), NER (Miller et al., 2004; Ratinov and Roth, 2009), or parsing (Koo et al., 2008). Other related approaches exist, like phrase clustering (Lin and Wu, 2009) which has been shown to work well for NER. Finally, Huang and Yates (2009) have recently proposed a smoothed language modeling approach based on a Hidden Markov Model, with success on POS and Chunki...
41 | A framework for the cooperation of learning algorithms - Bottou, Gallinari - 1991 |
41 | The entropy of English using PPM-based models - Teahan, Cleary - 1996 |
39 | Flexible text segmentation with structured multilabel classification - McDonald, Crammer, et al. - 2005
Citation Context: ...semble of classifiers trained with different tagging conventions (see Section 3.3.3). Since then, a certain number of systems based on second-order random fields were reported (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008), all reporting around 94.3% F1 score. These systems use features composed of words, POS tags, and tags. More recently, Shen and Sarkar (2005) obtained 95.23% using a voting classif...
37 | Semantic role labelling with tree conditional random fields - Cohn, Blunsom - 2005
Citation Context: ...tead of a nonlinear neural network. CRFs have been widely used in the NLP world, such as for POS tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), NER (McCallum and Li, 2003) or SRL (Cohn and Blunsom, 2005). Compared to such CRFs, we take advantage of the nonlinear network to learn appropriate features for each task of interest. 10. In other words, read logadd as ⊕ and + as ⊗. ...
37 | Connectionist language modeling for large vocabulary continuous speech recognition - Schwenk, Gauvain - 2002
37 | Semi-supervised sequential labeling and segmentation using gigaword scale unlabeled data - Suzuki, Isozaki - 2008
37 | Bayesian learning for neural networks - Neal - 1996
Citation Context: ...best network. This improvement comes of course at the expense of a tenfold increase of the running time. On the other hand, multiple training times could be improved using smart sampling strategies (Neal, 1996). We can also observe that the performance variability is not very large. The local minima found by the training algorithm are usually good local minima, thanks to the oversized parameter space and t...
34 | The BellKor solution to the Netflix Prize - Bell, Koren, et al. - 2007
Citation Context: ...e CoNLL challenge). We consistently obtain moderate improvements. 6.4 Ensembles Constructing ensembles of classifiers is a proven way to trade computational efficiency for generalization performance (Bell et al., 2007). Therefore it is not surprising that many NLP systems achieve state-of-the-art performance by combining the outputs of multiple classifiers. For instance, Kudo ...
33 | Natural language grammar induction using a constituent-context model - Klein, Manning - 2001
33 | Structure compilation: trading structure for features - Liang, Daumé, et al. - 2008
Citation Context: ...eems to boost the performance for the Chunking, NER and SRL tasks, with little advantage for POS. This result is in line with existing NLP studies comparing sentence-level and word-level likelihoods (Liang et al., 2008). The capacity of our network architectures lies mainly in the word lookup table, which contains 50×100,000 parameters to train. In the WSJ data, 15% of the most common words appear about 90% of the ...
32 | Joint parsing and semantic role labeling - Sutton, McCallum - 2005
32 | Transductive learning for statistical machine translation - Ueffing, Haffari, et al. - 2007
Citation Context: ...vious semi-supervised approaches for NLP can be roughly categorized as follows: • Ad-hoc approaches such as (Rosenfeld and Feldman, 2007) for relation extraction. • Self-training approaches, such as (Ueffing et al., 2007) for machine translation, and (McClosky et al., 2006) for parsing. These methods augment the labeled training set with examples from the unlabeled dataset using the labels predicted by the model itse...
32 | A grammar of English on mathematical principles - Harris - 1982
Citation Context: ...me syntactical function and possibly the same meaning. This observation forms the empirical basis for the construction of operator grammars that describe real-world natural languages such as English (Harris, 1982). Therefore there are solid reasons to believe that the ranking criterion (18) has the conceptual potential to capture strong syntactical and semantical information. On the other hand...
31 | Ranking the best instances - Clémençon, Vayatis - 2007
Citation Context: ...erature often deals with information retrieval applications, many authors define complex ranking criteria that give more weight to the ordering of the best ranking instances (see Burges et al., 2007; Clémençon and Vayatis, 2007). However, in our case, we do not want to emphasize the most common phrase over the rare but legal phrases. Therefore we use a simple pairwise criterion. Let f(s, θ) denote the score computed by our ...
31 | A joint model for semantic role labeling - Haghighi, Toutanova, et al. - 2005
Citation Context: ...rse trees computed using both the Charniak (2000) and Collins (1999) parsers. State-of-the-art systems often exploit additional parse trees such as the k top ranking parse trees (Koomen et al., 2005; Haghighi et al., 2005). In contrast our SRL networks so far do not use parse trees at all. They rely instead on internal representations transferred from a language model trained with an objective function that captures ...
28 | Composition of conditional random fields for transfer learning - Sutton, McCallum - 2005
27 | Deep learning for efficient discriminative parsing - Collobert - 2011
Citation Context: ...and using additional lookup tables of dimension 5 for each parse tree level. Table 12 reports the performance improvements obtained by providing increasing levels of parse... In a more recent work (Collobert, 2011), we propose an extension of this approach for the generation of full syntactic parse trees, using a recurrent version of our architecture. ...
27 | Learning sets of filters using back-propagation - Plaut, Hinton - 1987
Citation Context: ...f them: the initialization and update of the parameters of each network layer were done according to the “fan-in” of the layer, that is the number of inputs used to compute each output of this layer (Plaut and Hinton, 1987). The fan-in for the lookup table (1), the l-th linear layer (4) and the convolution layer (6) are respectively 1, $n_{hu}^{l-1}$ and $d_{win} \times n_{hu}^{l-1}$. The initial parameters of the network were drawn from a...
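A minimal sketch of fan-in-scaled initialization, assuming (as the truncated sentence suggests) a centered uniform distribution whose spread shrinks with the fan-in; the exact scaling is not visible in this excerpt, so the 1/sqrt(fan_in) bound below is an assumption:

```python
import random

def init_layer(n_out, fan_in):
    """Draw initial weights from a centered uniform distribution whose
    width depends on the layer's fan-in (inputs per output unit).
    The 1/sqrt(fan_in) bound is an assumed, commonly used choice."""
    bound = 1.0 / (fan_in ** 0.5)
    return [[random.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(n_out)]
```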
27 | Symbolic-neural systems and the use of hints for developing complex systems - Suddarth, Holden - 1991
Citation Context: ...traightforward when the training sets for the individual tasks contain the same patterns with different labels. It is then sufficient to train a model that computes multiple outputs for each pattern (Suddarth and Holden, 1991). Using this scheme, Sutton et al. (2007) demonstrate improvements on POS tagging and noun-phrase chunking using jointly trained CRFs. However the joint labeling requirement is a limitation because s...
23 | Semantic role chunking combining complementary syntactic views - Pradhan, Hacioglu, et al. - 2005
18 | A learning scheme for asymmetric threshold networks - LeCun - 1985
Citation Context: ...entiability problems (Bottou, 1991, 1998). Stochastic gradient iterations that hit a non-differentiability are simply skipped. Remark 6 (Modular Approach) The well known “back-propagation” algorithm (LeCun, 1985; Rumelhart et al., 1986) computes gradients using the chain rule. The chain rule can also be used in a modular implementation. Our modules correspond to the boxes in Figure 1 and Figure 2. Given d...
15 | Semi-supervised learning - Chapelle, Schölkopf, et al. - 2006
Citation Context: ...tained using purely supervised training of the benchmark NLP tasks. 4.5 Semi-supervised Benchmark Results Semi-supervised learning has been the object of much attention during the last few years (see Chapelle et al., 2006). Previous semi-supervised approaches for NLP can be roughly categorized as follows: • Ad-hoc approaches such as Rosenfeld and Feldman (2007) for relation extraction. • Self-training approaches, such...
15 | Using corpus statistics on entities to improve semi-supervised relation extraction from the Web - Rosenfeld, Feldman - 2007
Citation Context: ...een the object of much attention during the last few years (see Chapelle et al., 2006). Previous semi-supervised approaches for NLP can be roughly categorized as follows: • Ad-hoc approaches such as (Rosenfeld and Feldman, 2007) for relation extraction. • Self-training approaches, such as (Ueffing et al., 2007) for machine translation, and (McClosky et al., 2006) for parsing. These methods augment the labeled training set w...
14 | Comparing and combining finite-state and context-free parsers - Hollingshead, Fisher, et al. - 2005 |
11 | Learning to rank with nonsmooth cost functions - Burges, Ragno, et al. - 2007
9 | Voting between multiple data representations for text chunking - Shen, Sarkar - 2005
Citation Context: ...an et al. (2003) or additional parse trees for SRL systems (Koomen et al., 2005). Combining multiple systems or carefully tweaking features is also a common approach, like in the chunking top system (Shen and Sarkar, 2005). However, when comparing systems, we do not learn anything of the quality of each system if they were trained with different labeled data. For that reason, we will refer to benchmark systems, that i...
8 | A discriminative language model with pseudo-negative samples - Okanohara, Tsujii - 2007
5 | Large scale machine learning - Collobert - 2004
Citation Context: ...the hyperbolic tangent as non-linearity. It has the advantage of being slightly cheaper to compute (compared to the exact hyperbolic tangent), while leaving the generalization performance unchanged (Collobert, 2004). The corresponding layer l applies a HardTanh over its input vector: $[f_\theta^l]_i = \mathrm{HardTanh}([f_\theta^{l-1}]_i)$, where $\mathrm{HardTanh}(x) = -1$ if $x < -1$, $x$ if $-1 \le x \le 1$, and $1$ if $x > 1$ (5). Scoring. Finally, the ou...
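The piecewise definition (5) translates directly into code; a one-function sketch:

```python
def hard_tanh(x):
    """HardTanh from equation (5): a piecewise-linear stand-in for
    tanh that clips its input to [-1, 1] and is the identity between."""
    if x < -1.0:
        return -1.0
    if x > 1.0:
        return 1.0
    return x
```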
4 | Global training of document processing systems using graph transformer networks - Bottou, LeCun, et al. - 1997
4 | Modeling latent-dynamic in shallow parsing: A latent conditional model with improved inference - Sun, Morency, et al. - 2008
Citation Context: ...rained with different tagging conventions (see Section 3.3.3). Since then, a certain number of systems based on second-order random fields were reported (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008), all reporting around 94.3% F1 score. These systems use features composed of words, POS tags, and tags. More recently, Shen and Sarkar (2005) obtained 95.23% using a voting classifier scheme, where ...
2 | Robust parsing of the Proposition Bank - Musillo, Merlo - 2006