## Online Adaptive Learning for Speech Recognition Decoding

Citations: 2 (0 self)

### BibTeX

```bibtex
@MISC{Bilmes_onlineadaptive,
  author = {Jeff Bilmes and Hui Lin},
  title  = {Online Adaptive Learning for Speech Recognition Decoding},
  year   = {}
}
```

### Abstract

We describe a new method for pruning in dynamic models, based on running an adaptive filtering algorithm online during decoding to predict aspects of the scores in the near future. These predictions are used to make well-informed pruning decisions during model expansion. We apply this idea to the case of dynamic graphical models and test it on a speech recognition database derived from Switchboard. Results show that significant speedups (approximately a factor of 2) can be obtained without any increase in word error rate or memory usage.

Index Terms: graphical models, decoding, speech recognition, online learning

### Citations

9119 | Introduction to algorithms
- Cormen, Leiserson, et al.
- 2001
Citation context: ...o the efficiency and quality of any ASR system. There have been many methods proposed for ASR decoding in the past all of which are based, in one way or another, on the concept of dynamic programming [1]. These methods moreover end up being special cases of the junction tree algorithm in graphical models [2]. In fact, [3] was one of the first to show that belief propagation in graphical models was th...

2534 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation context: ...HMM in such a way that the state space is factored. This is the case for dynamic graphical models [8] which includes dynamic Bayesian networks (DBNs) [9], and factored-state conditional random fields [10]. While factored state representations help in that they reduce the pressure on the amount of training data needed to produce robustly estimated parameters (the factorization properties act as a form...

986 | An Introduction to Bayesian Networks
- Jensen
- 1996
Citation context: ...the past all of which are based, in one way or another, on the concept of dynamic programming [1]. These methods moreover end up being special cases of the junction tree algorithm in graphical models [2]. In fact, [3] was one of the first to show that belief propagation in graphical models was the same process as the standard forward-backward algorithm for hidden Markov models (HMM). Most ASR systems...
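The equivalence this excerpt refers to, between chain-structured belief propagation and the HMM forward recursion, can be checked numerically on a toy model. A minimal sketch; all of the HMM parameters below are invented for illustration:

```python
from itertools import product
import numpy as np

# Toy HMM: 2 hidden states, 3 observation symbols (numbers are illustrative).
pi = np.array([0.6, 0.4])                         # initial state distribution
A = np.array([[0.7, 0.3], [0.2, 0.8]])            # A[i, j] = P(s'=j | s=i)
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # B[i, k] = P(o=k | s=i)
obs = [0, 2, 1]

# Forward recursion: alpha_t(j) = P(o_1..o_t, s_t = j).
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
likelihood_forward = alpha.sum()

# Brute-force sum over all state sequences (what sum-product marginalizes).
likelihood_brute = 0.0
for path in product(range(2), repeat=len(obs)):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    likelihood_brute += p

assert np.isclose(likelihood_forward, likelihood_brute)
```

The forward pass costs O(T·S²) versus O(Sᵀ) for the explicit sum, which is the dynamic-programming saving the surrounding text alludes to.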

608 | Dynamic Bayesian Networks: Representation, Inference and Learning
- Murphy
- 2002
Citation context: ...the type of model change that we would hope would reduce computational demands. In the dynamic Bayesian network literature, there have been a number of ways to help reduce the effect of entanglement [12, 13, 14]. An alternative approximate inference procedure is to extend the beam-search methods used in speech recognition to the case of dynamic graphical models. Such methods are critical, as it is often the...

318 | Algorithmic aspects of vertex elimination on graphs
- Rose, Tarjan, et al.
- 1976
Citation context: ...training data needed to produce robustly estimated parameters (the factorization properties act as a form of regularizer), due to the “entanglement problem” (which is a consequence of Rose’s theorem [11]), the inherent state space is often not (significantly) reduced, which is something of a disappointment since the “structuring” of the state space via factorization is exactly the type of model chang...

315 | Spoken Language Processing: A Guide to Theory, Algorithm and System Development
- Huang, Acero, et al.
- 2001
Citation context: ...ilistic inference. In the dynamic case, however, the junction tree has a shape that looks something akin to Figure 1. 3. Search Methods in Speech Recognition Search methods in ASR have a long history [7, 18, 19]. The methods can for the most part be broken down into one of two approaches: synchronous vs. asynchronous. Figure 2: Left: asynchronous search. Right: synchronous search. Green depicts partially exp...

174 | Probabilistic independence networks for hidden markov probability models
- Smyth, Heckerman, et al.
- 1997
Citation context: ...f which are based, in one way or another, on the concept of dynamic programming [1]. These methods moreover end up being special cases of the junction tree algorithm in graphical models [2]. In fact, [3] was one of the first to show that belief propagation in graphical models was the same process as the standard forward-backward algorithm for hidden Markov models (HMM). Most ASR systems are based on,...

111 | The graphical models toolkit: An open source software system for speech and timeseries processing
- Bilmes, Zweig
- 2002
Citation context: ...T. Often, the template G is partitioned into three sections, an (optional) prologue G^p = (V_p, E_p), a chunk G^c = (V_c, E_c) (which is to be repeated in time), and an (optional) epilogue G^e = (V_e, E_e) [17]. Given a value T, an “unrolling” of the template is an instantiation where G^p appears once (on the left), G^c appears T + 1 times arranged in succession, and G^e appears once on the right. Unlike s...
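The unrolling rule quoted in this excerpt (prologue once, chunk T + 1 times, epilogue once) is simple to state in code. A toy sketch, where variables are plain strings and real templates would also carry edges:

```python
def unroll(prologue, chunk, epilogue, T):
    """Unroll a dynamic-model template: prologue once, chunk T+1 times, epilogue once."""
    frames = list(prologue)
    for t in range(T + 1):
        # Tag each chunk copy with its time index so repeated variables stay distinct.
        frames.extend(f"{v}@{t}" for v in chunk)
    frames.extend(epilogue)
    return frames

# Example: 1-variable prologue/epilogue, 2-variable chunk, unrolled for T = 1.
print(unroll(["start"], ["state", "obs"], ["end"], T=1))
# -> ['start', 'state@0', 'obs@0', 'state@1', 'obs@1', 'end']
```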

83 | A Review of Large-Vocabulary Continuous-Speech Recognition
- Young
- 1996
Citation context: ...ilistic inference. In the dynamic case, however, the junction tree has a shape that looks something akin to Figure 1. 3. Search Methods in Speech Recognition Search methods in ASR have a long history [7, 18, 19]. The methods can for the most part be broken down into one of two approaches: synchronous vs. asynchronous. Figure 2: Left: asynchronous search. Right: synchronous search. Green depicts partially exp...

58 | Improvements in beam search for 10000-word continuous speech recognition
- Ney, Haeb-Umbach, et al.
- 1992
Citation context: ...R literature have been of significant help to the process of decoding large vocabulary speech recognition systems using a practical amount of computing demands. In particular, beam-pruning approaches [15, 7, 16] that are common in ASR are extended to the case of dynamic graphical models and are evaluated on a standard graphical-model based speech recognition system. In addition, a new beam pruning approach i...

54 | Dynamic programming search for continuous speech recognition
- Ney, Ortmanns
- 1999
Citation context: ...nt ways of structuring the dynamic programming algorithm when it extends over a temporal signal such as speech: stack (or time-asynchronous) decoding [6], and “Viterbi” (or time-synchronous) decoding [7]. An alternative model for speech recognition extends the HMM in such a way that the state space is factored. This is the case for dynamic graphical models [8] which includes dynamic Bayesian networks...

51 | An efficient A∗ stack decoder algorithm for continuous speech recognition with a stochastic language model
- Paul
- 1991
Citation context: ...decoding choices, which correspond to two different ways of structuring the dynamic programming algorithm when it extends over a temporal signal such as speech: stack (or time-asynchronous) decoding [6], and “Viterbi” (or time-synchronous) decoding [7]. An alternative model for speech recognition extends the HMM in such a way that the state space is factored. This is the case for dynamic graphical m...

43 | Graphical model architectures for speech recognition
- Bilmes, Bartels
- 2005
Citation context: ...“Viterbi” (or time-synchronous) decoding [7]. An alternative model for speech recognition extends the HMM in such a way that the state space is factored. This is the case for dynamic graphical models [8] which includes dynamic Bayesian networks (DBNs) [9], and factored-state conditional random fields [10]. While factored state representations help in that they reduce the pressure on the amount of tra...

36 | An overview of decoding techniques for large vocabulary continuous speech recognition
- Aubert
- 2002
Citation context: ...have been developed for, the hidden Markov model. In that realm, a variety of techniques have been developed that can produce decoders on real-world systems that are fast enough to be quite practical [4, 5]. Loosely speaking, there have been two decoding choices, which correspond to two different ways of structuring the dynamic programming algorithm when it extends over a temporal signal such as speech:...

30 | Discovering the hidden structure of complex dynamic systems
- Boyen, Friedman, et al.
- 1999
Citation context: ...the type of model change that we would hope would reduce computational demands. In the dynamic Bayesian network literature, there have been a number of ways to help reduce the effect of entanglement [12, 13, 14]. An alternative approximate inference procedure is to extend the beam-search methods used in speech recognition to the case of dynamic graphical models. Such methods are critical, as it is often the...

30 | Look-ahead techniques for fast beam search
- Ortmanns, Ney
- 2000
Citation context: ...R literature have been of significant help to the process of decoding large vocabulary speech recognition systems using a practical amount of computing demands. In particular, beam-pruning approaches [15, 7, 16] that are common in ASR are extended to the case of dynamic graphical models and are evaluated on a standard graphical-model based speech recognition system. In addition, a new beam pruning approach i...

27 | Optimal and Adaptive Signal Processing
- Clarkson
- 1993
Citation context: ...e both very fast to compute and accurate. There is likely a tradeoff in that accuracy could be improved by spending more time in estimation. This procedure is the essence of online adaptive filtering [23], and it includes methods such as the well-known LMS and RLS algorithms, both linear models which are quite fast to compute and update but are often quite accurate. We therefore use these algorithms f...
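To make the excerpt concrete, here is a bare-bones LMS predictor of the kind it mentions, adapting online as each new score arrives. The filter order and step size `mu` are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

def lms_predict(scores, order=3, mu=0.05):
    """One-step-ahead LMS prediction of a score sequence.

    Predicts scores[t] from the previous `order` scores; the weight vector is
    updated after every frame, so the predictor adapts during decoding.
    """
    w = np.zeros(order)
    preds = []
    for t in range(order, len(scores)):
        x = np.asarray(scores[t - order:t])
        y_hat = w @ x              # prediction of the upcoming score
        e = scores[t] - y_hat      # error once the true score is observed
        w += mu * e * x            # LMS update: gradient step on squared error
        preds.append(y_hat)
    return preds
```

In a predictive-pruning setting, a prediction of the next frame's best score could set the pruning threshold before that frame's cliques are expanded; the paper's exact formulation may differ.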

25 | Dynamic graphical models
- Bilmes
- 2010
Citation context: ...the type of model change that we would hope would reduce computational demands. In the dynamic Bayesian network literature, there have been a number of ways to help reduce the effect of entanglement [12, 13, 14]. An alternative approximate inference procedure is to extend the beam-search methods used in speech recognition to the case of dynamic graphical models. Such methods are critical, as it is often the...

21 | Improvements in beam search
- Steinbiss, Tran, et al.
- 1994
Citation context: ...]. One of the advantages of synchronous search is that beam-pruning is quite simple. Assuming that Mt is the maximum score of a set of states at time t, two simple widely used beam pruning strategies [21, 16] are as follows: with beam pruning, we remove all partial hypotheses that are some fraction below Mt, and with K-state pruning (sometimes called histogram pruning), only the top K states are allowed t...
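The two strategies described in this excerpt are short enough to sketch directly. Here `hyps` maps states to log scores and `beam` is a log-domain width; the names and numbers are illustrative:

```python
def beam_prune(hyps, beam):
    """Keep hypotheses whose log score is within `beam` of the best score Mt."""
    m = max(hyps.values())
    return {s: v for s, v in hyps.items() if v >= m - beam}

def k_state_prune(hyps, k):
    """Histogram pruning: keep only the top-k scoring states."""
    return dict(sorted(hyps.items(), key=lambda kv: kv[1], reverse=True)[:k])

hyps = {"a": -1.0, "b": -3.5, "c": -9.0, "d": -2.0}
print(beam_prune(hyps, beam=3.0))   # -> {'a': -1.0, 'b': -3.5, 'd': -2.0}
print(k_state_prune(hyps, k=2))     # -> {'a': -1.0, 'd': -2.0}
```

Beam pruning adapts the number of survivors to the score distribution, while K-state pruning caps memory at a fixed number of states; decoders often apply both.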

21 | Sparse forward-backward using minimum divergence beams for fast training of conditional random fields
- Pal, Sutton, et al.
Citation context: ...compared the memory and time usage performances of various pruning methods including beam pruning, K-state pruning, percentage pruning, probability mass pruning (i.e. minimum divergence beam pruning [25]), and our proposed predictive pruning. Beam pruning and K-state pruning have been described in Section 3. Percentage pruning is a pruning method that retains top n% of clique entries with highest sco...
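The two remaining baselines named in this excerpt, percentage pruning and probability-mass pruning, also admit short sketches. This works on (state, score) pairs with non-negative scores; function names and thresholds are illustrative:

```python
def percentage_prune(entries, pct):
    """Percentage pruning: retain the top pct% of clique entries by score."""
    n = max(1, int(len(entries) * pct / 100.0))
    return sorted(entries, key=lambda e: e[1], reverse=True)[:n]

def mass_prune(entries, mass=0.95):
    """Probability-mass pruning: keep the highest-scoring entries whose
    normalized scores accumulate to at least `mass`."""
    total = sum(v for _, v in entries)
    kept, acc = [], 0.0
    for s, v in sorted(entries, key=lambda e: e[1], reverse=True):
        kept.append((s, v))
        acc += v / total
        if acc >= mass:
            break
    return kept

entries = [("a", 8.0), ("b", 1.0), ("c", 0.5), ("d", 0.5)]
print(percentage_prune(entries, pct=50))  # -> [('a', 8.0), ('b', 1.0)]
print(mass_prune(entries, mass=0.9))      # -> [('a', 8.0), ('b', 1.0)]
```

Unlike K-state pruning, both methods scale the survivor set with the clique: percentage pruning by its size, mass pruning by how peaked its score distribution is.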

19 | SVitchboard 1: Small vocabulary tasks from Switchboard 1
- King, Bartels, et al.
- 2005
Citation context: ...ate. We therefore use these algorithms for doing our clique expansion pruning in our experimental section below. 5. Experiments We evaluate our approach on the 500 word task of the SVitchboard corpus [24]. SVitchboard is a subset of Switchboard-I that has been chosen to give a small and closed vocabulary. Standard procedures were used to train state-clustered within-word triphone models. A DBN with tri...

17 | Anatomy of an extremely fast LVCSR decoder
- Saon, Zweig, et al.
- 2005
Citation context: ...have been developed for, the hidden Markov model. In that realm, a variety of techniques have been developed that can produce decoders on real-world systems that are fast enough to be quite practical [4, 5]. Loosely speaking, there have been two decoding choices, which correspond to two different ways of structuring the dynamic programming algorithm when it extends over a temporal signal such as speech:...

9 | Solving influence diagrams using HUGIN, Shafer-Shenoy and Lazy propagation
- Madsen, Nilsson
Citation context: ...a structure. In fact, this is something that a junction tree could easily do. Normally in a junction tree, the form of message passing is based either on the Hugin or the Shenoy-Shafer architectures [22]. Each of these approaches, however, assumes that the cost of a single message is itself tractable, which is not the case in speech recognition due to the very large number of possible random variable...

6 | Probabilistic temporal reasoning (AAAI)
- Dean, Kanazawa
- 1988
Citation context: ...ernative model for speech recognition extends the HMM in such a way that the state space is factored. This is the case for dynamic graphical models [8] which includes dynamic Bayesian networks (DBNs) [9], and factored-state conditional random fields [10]. While factored state representations help in that they reduce the pressure on the amount of training data needed to produce robustly estimated para...

2 | Fast match techniques (in Automatic Speech and Speaker Recognition: Advanced Topics)
- Gopalakrishnan, Bahl
- 1996
(Show Context)
Citation Context ...e continuation heuristic is optimistic (i.e., it is an upper bound of the true score if the scores are probabilities), then the continuation heuristic is admissible and we have an A*-search procedure =-=[6, 20]-=-. Both synchronous and asynchronous search procedures have been used for speech recognition in the past, and the current prevailing wisdom is that synchronous procedures have for the most part bested ... |