## Speech Recognition using Neural Networks (1995)

Citations: 29 (0 self)

### BibTeX

```bibtex
@TECHREPORT{Tebelskis95speechrecognition,
  author      = {Joe Tebelskis},
  title       = {Speech Recognition using Neural Networks},
  institution = {},
  year        = {1995}
}
```


### Abstract

This thesis examines how artificial neural networks can benefit a large vocabulary, speaker independent, continuous speech recognition system. Currently, most speech recognition systems are based on hidden Markov models (HMMs), a statistical framework that supports both acoustic and temporal modeling. Despite their state-of-the-art performance, HMMs make a number of suboptimal modeling assumptions that limit their potential effectiveness. Neural networks avoid many of these assumptions, while they can also learn complex functions, generalize effectively, tolerate noise, and support parallelism. While neural networks can readily be applied to acoustic modeling, it is not yet clear how they can be used for temporal modeling. Therefore, we explore a class of systems called NN-HMM hybrids, in which neural networks perform acoustic modeling, and HMMs perform temporal modeling. We argue that a NN-HMM hybrid has several theoretical advantages over a pure HMM system, including better acoustic ...

### Citations

4288 | A tutorial on hidden Markov models and selected applications in speech recognition
- Rabiner
- 1989
Citation Context: ...) Unfortunately, this equation cannot be solved by either direct analysis or reestimation; the only known way to solve it is by gradient descent, and the proper implementation is complex (Brown 1987, Rabiner 1989). We note in passing that MMI is equivalent to using a Maximum A Posteriori (MAP) criterion, in which the expression to be maximized is P(M_c|Y), rather than P(Y|M_c). To see this, note that accord...
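The MMI/MAP equivalence noted in this snippet follows directly from Bayes' rule; as a sketch:

```latex
P(M_c \mid Y) \;=\; \frac{P(Y \mid M_c)\,P(M_c)}{P(Y)}
           \;=\; \frac{P(Y \mid M_c)\,P(M_c)}{\sum_{c'} P(Y \mid M_{c'})\,P(M_{c'})}
```

Since the priors P(M_c) are constants with respect to the acoustic model parameters, maximizing this posterior amounts to maximizing the ratio of the correct model's likelihood to the summed likelihood of all models, which is the quantity the MMI criterion maximizes.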

2031 | Principal Component Analysis
- Jolliffe
- 2002
Citation Context: ...onality reduction can be performed by a statistical technique called Principal Components Analysis (PCA), which finds a set of M orthogonal vectors that account for the greatest variance in the data (Jolliffe 1986). Dimensionality reduction can also be performed by many types of neural networks. For example, a single layer perceptron, trained by an unsupervised competitive learning rule called Sanger's Rule (E...
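The PCA procedure described in this snippet can be sketched in a few lines of NumPy (the function name and the eigendecomposition route are illustrative choices, not from the thesis): the M orthogonal vectors are the top-M eigenvectors of the data covariance matrix.

```python
import numpy as np

def pca(X, M):
    # Center the data, then eigendecompose the covariance matrix.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalue order
    order = np.argsort(eigvals)[::-1][:M]    # keep the top M by variance
    W = eigvecs[:, order]                    # (D, M) orthonormal basis
    return Xc @ W, W                         # projected data, basis
```

The projected components come out in decreasing order of variance, matching the "greatest variance" property the snippet mentions.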

1785 | Introduction to the Theory of Neural Computation
- Hertz, Krogh, et al.
Citation Context: ...networks that are trained with competitive learning. In fact, k-means clustering is exactly equivalent to the standard competitive learning rule, as given in Equation (38), when using batch updating (Hertz et al 1991). When analyzing high-dimensional data, it is often desirable to reduce its dimensionality, i.e., to project it into a lower-dimensional space while preserving as much information as possible. Dimens...
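The k-means/competitive-learning equivalence mentioned in the snippet can be illustrated with a minimal batch k-means sketch (function name and random initialization are illustrative): each point "activates" its nearest unit, and each unit then moves to the mean of the points it won.

```python
import numpy as np

def kmeans_batch(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Competitive step: each point activates its nearest unit.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        win = d.argmin(axis=1)
        # Batch update: each unit moves to the mean of the points it won.
        for j in range(k):
            pts = X[win == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, win
```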

1586 | Neural networks and physical systems with emergent collective computational abilities - Hopfield - 1982

1545 | Finding Structure in Time
- Elman
- 1990
Citation Context: ...types of recurrent networks have a layered structure with connections that feed back to earlier layers. Figure 3.9 shows two examples, known as the Jordan network (Jordan 1986) and the Elman network (Elman 1990). These networks feature a set of context units, whose activations are copied from either the outputs or the hidden units, respectively, and which are then fed forward into the hidden layer, suppleme...
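The Elman-style context units described in the snippet can be sketched as a single forward step (weight names and tanh nonlinearity are illustrative assumptions): the hidden layer sees the current input plus a copy of the previous hidden activations.

```python
import numpy as np

def elman_step(x, context, W_xh, W_ch, W_hy):
    # Hidden activation combines the input with the context units,
    # which hold a copy of the previous hidden activations.
    h = np.tanh(W_xh @ x + W_ch @ context)
    y = np.tanh(W_hy @ h)
    return y, h  # h becomes the context for the next time step
```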

1495 | Fundamentals of Speech Recognition - Rabiner, Juang - 1993

968 | The organization of behavior
- Hebb
- 1949
Citation Context: ...inding the optimal learning rate; it is usually optimized empirically, by just trying different values. Most training procedures, including Equation (30), are essentially variations of the Hebb Rule (Hebb 1949), which reinforces the connection between two units if their output activations are correlated: (31) Δw_ji = ε y_i y_j. By r...
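A minimal sketch of the Hebb Rule update from this snippet, Δw_ji = ε y_i y_j (function name and the demonstration values are illustrative):

```python
import numpy as np

def hebb_update(w, x, y, lr=0.1):
    # Plain Hebb rule: strengthen w_ji in proportion to the product of
    # the pre-synaptic activation x_i and post-synaptic activation y_j.
    return w + lr * np.outer(y, x)
```

With persistently correlated activations the weights grow without bound, which is why bounded variants such as Sanger's Rule (discussed in a later snippet) are needed.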

846 | Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems
- Cybenko
- 1989
Citation Context: ...le the SLP obtained only 58% accuracy. Thus, a hidden layer is clearly useful for speech recognition. We did not evaluate architectures with more than one hidden layer, because: 1. It has been shown (Cybenko 1989) that any function that can be computed by an MLP with multiple hidden layers can be computed by an MLP with just a single hidden layer, if it has enough hidden units; and 2. Experience has shown tha...

775 | Adaptive mixtures of local experts - Jacobs, Jordan, et al. - 1991

722 | A Logical Calculus of the Ideas Immanent in Nervous Activity. The Bulletin of Mathematical Biophysics - McCulloch, Pitts - 1943

658 | The Cascade-Correlation Learning Architecture
- Fahlman, Lebiere
- 1991
Citation Context: ...second-derivative information. Among constructive algorithms, the Cascade-Correlation algorithm (Fahlman and Lebiere 1990) is one of the most popular and effective. This algorithm starts with no hidden units, but gradually adds them (in depth-first fashion) as long as they help to cut down any remaining output error. At...

591 | Perceptual linear predictive (PLP) analysis of speech
- Hermansky
- 1990
Citation Context: ...der differences of these 13 values. PLP coefficients are the cepstral coefficients of an autoregressive all-pole model of a spectrum that has been specially enhanced to emphasize perceptual features (Hermansky 1990). These coefficients are uncorrelated, so they cannot be interpreted visually. All of these coefficients lie in the range [0,1], except for the PLP-26 coefficients, which had irregular ranges varying...

574 | Connectionism and cognitive architecture: A critical analysis
- Fodor, Pylyshyn
- 1988
Citation Context: ... sort of variables, binding, modularity, and rules --- is clearly required in any system that claims to support natural language processing (Pinker and Prince 1988), not to mention general cognition (Fodor and Pylyshyn 1988). Unfortunately, it has proven very difficult to model compositionality within the pure connectionist framework, although a number of researchers have achieved some early, limited success along these...

461 | Connectionist speech recognition: a hybrid approach
- Bourlard, Morgan
- 1994
Citation Context: ...etworks estimate posterior probabilities which should be divided by priors in order to yield likelihoods for use in an HMM. Subsequent work at ICSI and SRI (Morgan & Bourlard 1990, Renals et al 1992, Bourlard & Morgan 1994) confirmed this insight in a series of experiments leading to excellent results on the Resource Management database. The simple MLPs in these experiments typically used an input window of 9 speech fr...

457 | Parallel networks that learn to pronounce English text
- Sejnowski, Rosenberg
- 1987
Citation Context: ...erved error in the output unit activations (relative to desired outputs). To date, there have been many successful applications of neural networks trained by backpropagation. For instance: NETtalk (Sejnowski and Rosenberg, 1987) is a neural network that learns how to pronounce English text. Its input is a window of 7 characters (orthographic text symbols), scanning a larger text buffer, and its output is a phoneme code (rel...

435 | Dynamic programming algorithm optimization for spoken word recognition
- Sakoe
- 1978
Citation Context: ...2. Dynamic Time Warping In this section we motivate and explain the Dynamic Time Warping algorithm, one of the oldest and most important algorithms in speech recognition (Vintsyuk 1971, Itakura 1975, Sakoe and Chiba 1978). The simplest way to recognize an isolated word sample is to compare it against a number of stored word templates and determine which is the "best match". This goal is complicated by a number of fac...
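The DTW idea can be sketched with the standard textbook cumulative-cost recursion (using the usual three local path moves; this is an illustrative formulation, not the thesis's exact variant):

```python
import numpy as np

def dtw(a, b, dist=lambda x, y: abs(x - y)):
    # D[i][j] = cheapest cost of aligning a[:i] with b[:j], built up
    # from the (i-1,j), (i,j-1), (i-1,j-1) local path constraints.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because the warp can stretch either sequence, a repeated frame costs nothing when it matches, which is exactly why DTW tolerates variable speaking rate.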

433 | A learning algorithm for Boltzmann machines
- Ackley, Hinton, et al.
- 1985
Citation Context: ...hat the network always settles to a stable state (although it may reach a local minimum corresponding to a spurious memory arising from interference between the stored memories). A Boltzmann Machine (Ackley et al 1985) is a Hopfield network with hidden units, stochastic activations, and simulated annealing in its learning procedure. Each of these features contributes to its exceptional power. The hidden units allo...

420 | Optimal brain damage - LeCun, Denker, et al. - 1990

339 | Connectionist Learning Procedures
- Hinton
Citation Context: ...p is the pattern index. In general, networks are nonlinear and multilayered, and their weights can be trained only by an iterative procedure, such as gradient descent on a global performance measure (Hinton 1989). This requires multiple passes of training on the entire training set (rather like a person learning a new skill); each pass is called an iteration or an epoch. Moreover, since the accumulated knowl...

288 | Principles of Neurodynamics
- Rosenblatt
- 1962
Citation Context: ...nits with a single layer of weights, the Delta Rule is known as the Perceptron Learning Rule, and it is guaranteed to find a set of weights representing a perfect solution, if such a solution exists (Rosenblatt 1962). In the context of multilayered networks, the Delta Rule is the basis for the backpropagation training procedure, which will be discussed in greater detail in Section 3.4. Yet another variation of t...

276 | Hidden Markov models for speech recognition
- Huang, Ariki, et al.
- 1990
Citation Context: ...ed by cross-word modeling and multiple pronunciations per word. Decipher: The full context-dependent version of SRI's Decipher system (Renals et al 1992). Sphinx-II: The latest version of Sphinx (Hwang and Huang 1993), which includes senone modeling. The first five systems use context independent phoneme models, therefore they have relatively few parameters, and get only moderate word accuracy (84% to 91%). The l...

268 | Neural network classifiers estimate Bayesian a posteriori probabilities - Richard, Lippmann - 1991

256 | Parallel Distributed Processing - McClelland, Rumelhart, the PDP Research Group - 1986
Citation Context: ...of triphones). We could have represented the 40 possible values of the adjacent phoneme using 40 contextual inputs, but instead we clustered the phonemes by their linguistic features, as proposed by (Rumelhart & McClelland 1986: chapter 18), so that only 10 contextual inputs were necessary. Each phoneme was coded along four dimensions. The first dimension (three bits) was used to divide the phonemes into interrupted consona...

247 | Temporal Credit Assignment in Reinforcement Learning (Doctoral Dissertation)
- Sutton
- 1984
Citation Context: ...network has a distinct error signal coming from the auxiliary network. A similar approach, which applies only to dynamic environments, is to enhance the auxiliary network so that it becomes a critic (Sutton 1984), which maps environmental data plus the reinforcement signal to a prediction of the future reinforcement signal. By comparing the expected and actual reinforcement signal, we can determine whether t...

227 | An empirical study of learning speed in back-propagation networks
- Fahlman
- 1988
Citation Context: ...nd more powerful heuristic is to use second-derivative information to estimate how far down the hillside to travel; this is used in techniques such as conjugate gradient (Barnard 1992) and quickprop (Fahlman 1988). Ordinarily the weights are updated after each training pattern (this is called online training). But sometimes it is more effective to update the weights only after accumulating the gradients over a...

221 | Optimal Unsupervised Learning in a Single-layer Linear Feedforward Neural Network
- Sanger
- 1989
Citation Context: ...tandard Hebb Rule would cause the weights to grow without bounds, hence this rule must be modified to prevent the weights from growing too large. One of several viable modifications is Sanger's Rule (Sanger 1989): (37) This can be viewed as a form of weight decay (Krogh and Hertz, 1992). This rule uses nonlocal information, but it has the nice property that the M weight vectors w_j converge to the first M pr...
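A minimal sketch of Sanger's Rule as described in the snippet (function name, learning rate, and initialization are illustrative): each update is a Hebbian term minus a decay that subtracts the components already captured by earlier outputs, which keeps the weights bounded.

```python
import numpy as np

def sanger_update(W, x, lr=0.01):
    # W: (M, D) weight rows; y = W x gives the M output activations.
    # Row j is pulled toward x minus the reconstruction from the
    # first j outputs, so rows converge (in order) toward the first
    # M principal component directions with roughly unit norm.
    y = W @ x
    for j in range(len(W)):
        recon = y[: j + 1] @ W[: j + 1]   # sum over k <= j of y_k w_k
        W[j] += lr * y[j] * (x - recon)
    return W
```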

204 | The ART of Adaptive Pattern Recognition by a Self-Organising Neural Network - Carpenter, Grossberg - 1988

192 | From basic network principles to neural architecture: emergence of orientation-selective cells - Linsker - 1986

176 | Serial order: A parallel distributed processing approach. Institute for Cognitive Science Report 8694
- Jordan
- 1986
Citation Context: ...lly applied to many problems. Other types of recurrent networks have a layered structure with connections that feed back to earlier layers. Figure 3.9 shows two examples, known as the Jordan network (Jordan 1986) and the Elman network (Elman 1990). These networks feature a set of context units, whose activations are copied from either the outputs or the hidden units, respectively, and which are then fed forw...

172 | A time-delay neural network architecture for isolated word recognition - Lang, Waibel, et al. - 1990

171 | Second order derivatives for network pruning: Optimal brain surgeon
- Hassibi, Stork
- 1992
Citation Context: ...seful elements in the network. One straightforward technique is to delete the weights with the smallest magnitude; this can improve generalization, but sometimes it also eliminates the wrong weights (Hassibi and Stork 1993). A more complex but more reliable approach, called Optimal Brain Damage (Le Cun et al, 1990b), identifies the weights whose removal will cause the least increase in the network's output error functi...

132 | Minimum prediction residual principle applied to speech recognition
- Itakura
- 1975
Citation Context: ...size grows. For example, the 10 digits "zero" to "nine" can be recognized essentially perfectly (Doddington 1989), but vocabulary sizes of 200, 5000, or 100000 may have error rates of 3%, 7%, or 45% (Itakura 1975, Miyatake 1990, Kimura 1990). On the other hand, even a small vocabulary can be hard to recognize if it contains confusable words. For example, the 26 letters of the English alphabet (treated as 26 "...

123 | The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition
- Price, Fisher, et al.
- 1988
Citation Context: ...4). 5.3. Resource Management In order to fairly compare our results against those of researchers outside of CMU, we also ran experiments on the DARPA speaker-independent Resource Management database (Price et al 1988). This is a standard database consisting of 3990 training sentences in the domain of naval resource management, recorded by 109 speakers contributing roughly 36 sentences each; this training set has ...

108 | Neurocomputing: Foundations of Research - Anderson, Rosenfeld - 1988

95 | Review of neural networks for speech recognition - Lippmann - 1989

91 | Self-Organization and Associative Memory (3rd Edition) - Kohonen - 1988

90 | Graded State Machines: The representation of temporal contingencies in simple recurrent networks - Servan-Schreiber, Cleeremans, et al. - 1991

88 | A simple weight decay can improve generalization
- Krogh, Hertz
- 1995
Citation Context: ... hence this rule must be modified to prevent the weights from growing too large. One of several viable modifications is Sanger's Rule (Sanger 1989): (37) This can be viewed as a form of weight decay (Krogh and Hertz, 1992). This rule uses nonlocal information, but it has the nice property that the M weight vectors w_j converge to the first M principal component directions, in order, normalized to unit length. Linsker ...
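The weight-decay idea this snippet alludes to can be sketched in one line (function name and constants are illustrative): each update combines a gradient step with a term that shrinks every weight toward zero.

```python
def decay_update(w, grad, lr=0.1, decay=0.01):
    # Gradient step plus a shrinkage term that pulls w toward zero,
    # bounding weight growth in the same spirit as Sanger's Rule.
    return w - lr * grad - decay * w
```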

86 | The use of a one-stage dynamic programming algorithm for connected word recognition
- Ney
- 1984
Citation Context: ...e the cumulative score, rather than to minimize it. A particularly important variation of DTW is an extension from isolated to continuous speech. This extension is called the One Stage DTW algorithm (Ney 1984). Here the goal is to find the optimal alignment between the speech sample and the best sequence of reference words (see Figure 2.5). The complexity of the extended algorithm is still linear in the l...

83 | The N-Best Algorithm: An Efficient and Exact Procedure for Finding the N Most Likely Sentence Hypotheses
- Schwartz, Chow
- 1990
Citation Context: ...- the sentence hypothesis for the utterance. Actually it is common to return several such sequences, namely the ones with the highest scores, using a variation of time alignment called N-best search (Schwartz and Chow, 1990). This allows a recognition system to make two passes through the unknown utterance: the first pass can use simplified models in order to quickly generate an N-best list, and the second pass can use ...
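A toy sketch of N-best list generation (names and the position-independent word scores are illustrative simplifications; real N-best search runs over an HMM trellis with time alignment): at each position the hypotheses are expanded and pruned back to the n highest-scoring word sequences.

```python
import heapq

def n_best(word_scores, n):
    # word_scores: one dict per word position, mapping word -> log score.
    # Expand every hypothesis by every candidate word, keeping the
    # n best (score, sequence) pairs after each position.
    beams = [(0.0, [])]
    for scores in word_scores:
        beams = heapq.nlargest(
            n,
            ((s + ws, seq + [w]) for s, seq in beams for w, ws in scores.items()),
        )
    return beams
```

The returned list is what a second recognition pass would rescore with more expensive models, as the snippet describes.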

72 | Pattern recognizing stochastic learning automata
- Barto, Anandan
- 1985
Citation Context: ...ither be deterministic or probabilistic. In the case of static environments (with either deterministic or stochastic evaluations), networks can be trained by the associative reward-penalty algorithm (Barto and Anandan 1985). This algorithm assumes stochastic output units (as in Figure 3.4) which enable the network to try out various behaviors. The problem of semi-supervised learning is reduced to the problem of supervi...

71 | The Acoustic-Modelling Problem in Automatic Speech Recognition (unpublished Ph.D. thesis)
- Brown
- 1987
Citation Context: ...rmation: (19) Unfortunately, this equation cannot be solved by either direct analysis or reestimation; the only known way to solve it is by gradient descent, and the proper implementation is complex (Brown 1987, Rabiner 1989). We note in passing that MMI is equivalent to using a Maximum A Posteriori (MAP) criterion, in which the expression to be maximized is P(M_c|Y), rather than P(Y|M_c). To see this, no...

69 | Global optimization of a neural network-hidden Markov model hybrid - Bengio, Mori, et al. - 1992

69 | A distributed connectionist production system - Touretzky, Hinton - 1988

65 | Improving Connected Letter Recognition by Lipreading
- Bregler, Manke, et al.
- 1993
Citation Context: ...ure was initially developed for phoneme recognition (Lang 1989, Waibel et al 1989), but it has also been applied to handwriting recognition (Idan et al, 1992, Bodenhausen and Manke 1993), lipreading (Bregler et al, 1993), and other tasks. The TDNN operates on two-dimensional input fields, where the horizontal dimension is time. Connections are "time delayed" to the extent that their connected units are temporally...

57 | Large vocabulary continuous speech recognition using HTK
- Woodland, Odell, et al.
- 1994
Citation Context: ...from quantization errors if the codebook is too small, while increasing the codebook size would leave less training data for each codeword, likewise degrading performance. Continuous density model (Woodland et al, 1994). Quantization errors can be eliminated by using a continuous density model, instead of VQ codebooks. In this approach, the probability distribution over acoustic space is modeled directly, by assumi...

54 | Equivalence proofs for multi-layer perceptron classifiers and the Bayes discriminant function - Hampshire, Pearlmutter - 1990

53 | Fast Learning
- Moody, Darken
- 1989
Citation Context: ...in order to cluster the data, and then backpropagation is applied at the higher layer(s) to associate these clusters with the desired output patterns. For example, in a Radial Basis Function network (Moody and Darken 1989), the hidden layer contains units that describe hyperspheres (trained with a standard competitive learning algorithm), while the output layer computes normalized linear combinations of these receptiv...
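The RBF network described in the snippet can be sketched as a normalized-Gaussian hidden layer plus a linear output layer (function names, the fixed width, and the least-squares fit for the output weights are illustrative assumptions):

```python
import numpy as np

def rbf_features(X, centers, sigma=1.0):
    # Hidden layer: normalized Gaussian response around each center,
    # i.e. each unit describes a hypersphere in input space.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / (2 * sigma ** 2))
    return phi / phi.sum(axis=1, keepdims=True)

def rbf_fit_output(Phi, T):
    # The output layer is linear in the hidden activations,
    # so its weights can be fit by ordinary least squares.
    W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return W
```

In the hybrid scheme the snippet describes, the centers would come from competitive-learning clustering and the output layer from backpropagation; least squares stands in here because the output layer is linear.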

53 | Continuous speech recognition using multilayer perceptrons with hidden Markov models
- Morgan, Bourlard
- 1990
Citation Context: ...neural networks, establishing that neural networks estimate posterior probabilities which should be divided by priors in order to yield likelihoods for use in an HMM. Subsequent work at ICSI and SRI (Morgan & Bourlard 1990, Renals et al 1992, Bourlard & Morgan 1994) confirmed this insight in a series of experiments leading to excellent results on the Resource Management database. The simple MLPs in these experiments ty...

50 | Discovering the hidden structure of speech - Elman, Zipser - 1988