## Discriminative syntactic language modeling for speech recognition (2005)

Venue: | Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05) |

Citations: | 25 - 3 self |

### BibTeX

@INPROCEEDINGS{Collins05discriminativesyntactic,
  author = {Michael Collins and Brian Roark and Murat Saraclar},
  title = {Discriminative syntactic language modeling for speech recognition},
  booktitle = {Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05)},
  year = {2005},
  pages = {507--514}
}

### Abstract

We describe a method for discriminative training of a language model that makes use of syntactic features. We follow a reranking approach, where a baseline recognizer is used to produce 1000-best output for each acoustic input, and a second “reranking” model is then used to choose an utterance from these 1000-best lists. The reranking model makes use of syntactic features together with a parameter estimation method that is based on the perceptron algorithm. We describe experiments on the Switchboard speech recognition task. The syntactic features provide an additional 0.3% reduction in test-set error rate beyond the model of (Roark et al., 2004a; Roark et al., 2004b) (significant at p < 0.001), which makes use of a discriminatively trained n-gram model, giving a total reduction of 1.2% over the baseline Switchboard system.
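The reranking recipe in the abstract — score each n-best candidate with a learned feature weight vector, and update the weights with the perceptron rule whenever the top-scoring candidate differs from the lowest-error one — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dictionaries, the oracle indices, and the number of passes are all assumptions.

```python
# Minimal perceptron reranker over n-best lists (illustrative sketch;
# the paper's features include syntactic and n-gram features).

def dot(w, feats):
    """Inner product between a sparse weight vector and a feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def perceptron_rerank_train(nbest_lists, oracles, epochs=3):
    """nbest_lists: one list of candidate feature dicts per utterance.
    oracles: index of the lowest-error (oracle) candidate in each list."""
    w = {}
    for _ in range(epochs):
        for cands, oracle in zip(nbest_lists, oracles):
            # Pick the candidate the current model prefers.
            pred = max(range(len(cands)), key=lambda i: dot(w, cands[i]))
            if pred != oracle:
                # Standard perceptron update: reward oracle features,
                # penalize the mistakenly chosen candidate's features.
                for f, v in cands[oracle].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in cands[pred].items():
                    w[f] = w.get(f, 0.0) - v
    return w
```

As the perceptron makes only additive updates on mistakes, a few passes over the training set typically suffice, which is the speed advantage the abstract alludes to.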

### Citations

2310 | Conditional random fields: probabilistic models for segmenting and labeling sequence data - Lafferty, McCallum, et al. - 2001 |

953 | Head-Driven Statistical Models for Natural Language Parsing - Collins - 1999 |

851 | An Empirical Study of Smoothing Techniques for Language Modeling - Chen, Goodman - 1998 |
Citation Context: ...del, a Markov assumption is made, namely that each word depends only on the previous (n − 1) words. The parameters of the language model are usually estimated from a large quantity of text data. See (Chen and Goodman, 1998) for an overview of estimation techniques for n-gram models. This paper describes a method for incorporating syntactic features into the language model, using discriminative parameter estimation tech... |
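The Markov assumption described in this context can be illustrated with a toy bigram model (n = 2) estimated by relative frequency. The corpus and function names here are made up for illustration; real systems would apply the smoothing techniques that Chen and Goodman survey rather than this unsmoothed estimate.

```python
from collections import Counter

# Toy corpus; each word is assumed to depend only on the previous word.
corpus = "the cat sat on the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # counts of words in the history position

def p(word, prev):
    """Unsmoothed relative-frequency estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
```

For example, `p("cat", "the")` is 0.5 here because "the" occurs twice as a history, once followed by "cat"; any unseen bigram gets probability zero, which is exactly the sparsity problem smoothing addresses.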

187 | An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities - Stolcke - 1995 |
Citation Context: ...ree grammars for language modeling have been explored for more than a decade. Early approaches included algorithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and approaches to exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995). The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba and Jelinek, 200... |

148 | Estimators for stochastic “unification-based” grammars - Johnson, Geman, et al. - 1999 |

123 | Exploiting syntactic structure for language modeling - Chelba, Jelinek - 1998 |
Citation Context: ...m, where the syntactic language model has the task of modeling a distribution over strings in the language, in a very similar way to traditional n-gram language models. The Structured Language Model (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000; Xu et al., 2002; Xu et al., 2003) makes use of an incremental shift-reduce parser to enable the probability of words to be conditioned on k previous c-command... |

121 | A smorgasbord of features for statistical machine translation - Och, Gildea, et al. - 2004 |

87 | Immediate-head parsing for language models - Charniak - 2001 |
Citation Context: ... be conditioned on k previous c-commanding lexical heads, rather than simply on the previous k words. Incremental top-down and left-corner parsing (Roark, 2001a; Roark, 2001b) and head-driven parsing (Charniak, 2001) approaches have directly used generative PCFG models as language models. In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004), a constraint dependency gramm... |

76 | Computation of the probability of initial substring generation by stochastic context free grammars - Jelinek, Lafferty - 1991 |
Citation Context: ...loiting stochastic context-free grammars for language modeling have been explored for more than a decade. Early approaches included algorithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and approaches to exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995). The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba a... |

61 | Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm - Roark, Saraclar, et al. - 2004 |

53 | Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods - Collins - 2004 |
Citation Context: ...s for this domain, suggesting that the 1000-best approximation is a reasonable one. ...product, between vectors x and y). For this paper, we train the parameter vector ¯α using the perceptron algorithm (Collins, 2004; Collins, 2002). The perceptron algorithm is a very fast training method, in practice requiring only a few passes over the training set, allowing for a detailed comparison of a wide variety of featur... |

42 | Structured language modeling, Computer Speech and Language 14(4):283–332 - Chelba, Jelinek - 2000 |
Citation Context: ...guage model has the task of modeling a distribution over strings in the language, in a very similar way to traditional n-gram language models. The Structured Language Model (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000; Xu et al., 2002; Xu et al., 2003) makes use of an incremental shift-reduce parser to enable the probability of words to be conditioned on k previous c-commanding lexical heads, rather ... |

39 | Precise N-gram probabilities from stochastic context-free grammars - Stolcke, Segal - 1994 |
Citation Context: ... approaches included algorithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and approaches to exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995). The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000) involved the use of a shift-reduce parser trained on Penn treebank style ann... |

33 | Using a Stochastic Context-Free Grammar as a Language Model for Speech Recognition - Jurafsky, Wooters, et al. - 1995 |
Citation Context: ...rithms for efficiently calculating string prefix probabilities (Jelinek and Lafferty, 1991; Stolcke, 1995) and approaches to exploit such algorithms to produce n-gram models (Stolcke and Segal, 1994; Jurafsky et al., 1995). The work of Chelba and Jelinek (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000) involved the use of a shift-reduce parser trained on Penn treebank style annotations, that maintain... |

26 | The use of a linguistically motivated language model in conversational speech recognition - Wang, Stolcke, et al. - 2004 |
Citation Context: ...k, 2001b) and head-driven parsing (Charniak, 2001) approaches have directly used generative PCFG models as language models. In the work of Wen Wang and Mary Harper (Wang and Harper, 2002; Wang, 2003; Wang et al., 2004), a constraint dependency grammar and a finite-state tagging model derived from that grammar were used to exploit syntactic dependencies. Our approach differs from previous work in a couple of import... |

26 | A study on richer syntactic dependencies for structured language modeling - Xu, Chelba, et al. - 2001 |
Citation Context: ...istribution over strings in the language, in a very similar way to traditional n-gram language models. The Structured Language Model (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000; Chelba, 2000; Xu et al., 2002; Xu et al., 2003) makes use of an incremental shift-reduce parser to enable the probability of words to be conditioned on k previous c-commanding lexical heads, rather than simply on the previous k w... |

16 | Whole-Sentence Exponential Language Models: a Vehicle for Linguistic-Statistical Integration, Computer Speech and Language - Rosenfeld, Chen, et al. - 2001 |
Citation Context: ...features. The first approach that we follow is to map each parse tree to sequences encoding part-of-speech (POS) decisions, and “shallow” parsing decisions. Similar representations have been used by (Rosenfeld et al., 2001; Wang and Harper, 2002). Figure 3 shows the sequential representations that we used. The first simply makes use of the POS tags for each word. The latter representations make use of sequences of non-... |

5 | Probabilistic top-down parsing and language modeling - Roark - 2001a |
Citation Context: ...ng or tagging model, and modifying a generative model is a rather indirect way of changing the features used by a model. In this respect, our approach is similar to that advocated in Rosenfeld et al. (2001), which used Maximum Entropy modeling to allow for the use of shallow syntactic features for language modeling. A second contrast between our work and previous work, including that of Rosenfeld et al.... |

3 | Robust Probabilistic Predictive Syntactic Processing - Roark - 2001b |

2 | Corrective language modeling for large vocabulary ASR with the perceptron algorithm - Roark et al. - 2004a |
Citation Context: ...ncoding of the model has allowed for the use of probabilities calculated off-line from this model to be used in the first pass of decoding, which has provided additional benefits. Finally, Och et al. (2004) use a reranking approach with syntactic information within a machine translation system. Rosenfeld et al. (2001) investigated the use of syntactic features in a Maximum Entropy approach. In their pap... |