“Phoneme probability estimation with dynamic sparsely connected artificial neural networks”, The Free Speech (1997)

by N Ström
Venue: Journal

Results 1 - 10 of 12

Support vector machines for speech recognition

by Aravind Ganapathiraju, Jonathan Hamaker, Joseph Picone - Proceedings of the International Conference on Spoken Language Processing, 1998
Cited by 47 (2 self)
Statistical techniques based on hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs trained using maximum likelihood techniques suffer from an inability to learn discriminative information and are prone to overfitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. In this paper, we show that SVMs provide a significant improvement in performance on a static pattern classification task based on the Deterding vowel data. We also describe an application of SVMs to large vocabulary speech recognition, and demonstrate an improvement in error rate on a continuous alphadigit task (OGI Alphadigits) and a large vocabulary conversational speech task (Switchboard). Issues related to the development and optimization of an SVM/HMM hybrid system are discussed.

Trainable articulatory control models for visual speech synthesis

by Jonas Beskow - Journal of Speech Technology, 2004
Cited by 16 (4 self)
This paper deals with the problem of modelling the dynamics of articulation for a parameterised talking head based on phonetic input. Four different models are implemented and trained to reproduce the articulatory patterns of a real speaker, based on a corpus of optical measurements. Two of the models (“Cohen-Massaro” and “Öhman”) are based on coarticulation models from speech production theory, and two are based on artificial neural networks, one of which is specially intended for streaming real-time applications. The different models are evaluated through comparison between predicted and measured trajectories, which shows that the Cohen-Massaro model produces the trajectories that best match the measurements. A perceptual intelligibility experiment is also carried out, where the four data-driven models are compared against a rule-based model as well as an audio-alone condition. Results show that all models give significantly increased speech intelligibility over the audio-alone case, with the rule-based model yielding the highest intelligibility score. Keywords: perceptual evaluation

PARSING SPEECH INTO ARTICULATORY EVENTS

by Kadri Hacioglu, et al.
Cited by 5 (0 self)
In this paper, the speech production process state is defined by a number of categorical articulatory features. We describe a detector that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable depth lattice generator and Viterbi decoder. A bank of classifiers has been previously used for articulatory feature detection by many researchers. However, we extend their work first by creating variable depth lattices for each feature and then by combining them into product lattices for rescoring using the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present our results for place and manner features using TIMIT data, and compare the results to a baseline system. We report performance improvements both at the frame and segment levels.
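
The rescoring step described above can be illustrated with a minimal frame-level Viterbi pass. This is only a sketch under stated assumptions: the abstract describes variable depth lattices and product lattices, which are not reproduced here. The function name `viterbi` and the single transition matrix `log_trans` (standing in for the language and duration constraints) are hypothetical.

```python
import math

def viterbi(log_post, log_trans):
    """Return the best class sequence for one articulatory feature.

    log_post:  list of frames, each a list of per-class log posteriors
               (assumed to come from a classifier such as an RNN).
    log_trans: log_trans[i][j] is the log score for moving from class i
               to class j; an assumed stand-in for duration/language
               constraints.
    """
    n_classes = len(log_post[0])
    score = list(log_post[0])  # best log score ending in each class
    back = []                  # back-pointers, one list per later frame
    for frame in log_post[1:]:
        new_score, pointers = [], []
        for j in range(n_classes):
            best_i = max(range(n_classes),
                         key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + frame[j])
            pointers.append(best_i)
        back.append(pointers)
        score = new_score
    # Trace back from the best final class.
    state = max(range(n_classes), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def to_log(rows):
    """Convert rows of probabilities to rows of log probabilities."""
    return [[math.log(p) for p in row] for row in rows]
```

With "sticky" transitions that penalise class changes, a single weakly supported frame is smoothed away, whereas uniform transitions simply follow the per-frame maximum.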

Predicting Underlying Pitch Targets for Intonation Modeling

by Xuejing Sun - Proc. of the 4th ISCA Tutorial and Research Workshop on Speech Synthesis, Perthshire, 2001
Cited by 3 (2 self)
The present paper reports our preliminary attempt at modeling intonation using underlying pitch targets. The underlying pitch targets were derived using a nonlinear regression technique under the pitch target approximation model [17, 19]. We assume that the use of underlying pitch targets can capture the most important intonation patterns while maintaining critical predictive power. Another important aspect of our approach is that we do not rely on pitch accent as a component in the system. To predict the parameters of the underlying targets, we used a recurrent neural network combined with a time-delay window. Comparing the predicted and original pitch targets, the root mean square error (RMSE) is 7.90 Hz, and the correlation coefficient (r) is 0.78. The results are encouraging and suggest that the use of underlying pitch targets is a promising approach to intonation modeling.
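
For readers unfamiliar with the two evaluation measures quoted above, the following sketch shows how RMSE and the Pearson correlation coefficient are conventionally computed from paired predicted and reference values. The function names and the toy inputs are illustrative assumptions; the abstract does not say how the authors implemented the computation.

```python
import math

def rmse(predicted, target):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

RMSE is expressed in the units of the data (Hz for pitch targets), while r is unitless and lies between -1 and 1.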

Capturing fine-phonetic variation in speech through automatic classification of articulatory features

by Odette Scharenborg, Vincent Wan, Roger K. Moore - In: Proceedings of the workshop on Speech Recognition and Intrinsic Variation, 2006
Cited by 3 (2 self)
The ultimate goal of our research is to develop a computational model of human speech recognition that is able to capture the effects of fine-grained acoustic variation on speech recognition behaviour. As part of this work we are investigating automatic feature classifiers that are able to create reliable and accurate transcriptions of the articulatory behaviour encoded in the acoustic speech signal. In the experiments reported here, we compared support vector machines (SVMs) with multilayer perceptrons (MLPs). MLPs have been widely (and rather successfully) used for the task of multi-value articulatory feature classification, while (to the best of our knowledge) SVMs have not. This paper compares the performance of the two classifiers and analyses the results in order to better understand the articulatory representations. It was found that the MLPs outperformed the SVMs, but it is concluded that both classifiers exhibit similar behaviour in terms of patterns of errors.

Synthetic Visual Speech Driven From Auditory Speech

by Eva Agelfors, Jonas Beskow, Björn Granström, Magnus Lundeberg, Giampiero Salvi, Karl-Erik Spens, Tobias Öhman (in alphabetical order) - Proc. of AVSP 99, 1999
Cited by 3 (0 self)
We have developed two different methods for using auditory, telephone speech to drive the movements of a synthetic face. In the first method, Hidden Markov Models (HMMs) were trained on a phonetically transcribed telephone speech database. The output of the HMMs was then fed into a rule-based visual speech synthesizer as a string of phonemes together with time labels. In the second method, Artificial Neural Networks (ANNs) were trained on the same database to map acoustic parameters directly to facial control parameters. These target parameter trajectories were generated by using phoneme strings from a database as input to the visual speech synthesis. The two methods were evaluated through audiovisual intelligibility tests with ten hearing-impaired persons, and compared to "ideal" articulations (where no recognition was involved), a natural face, and to the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio alone condition (5...

Artificial Neural Networks in Recognition of Phonetic Features in Speech

by Todd A. Stephenson, 1998
Cited by 2 (2 self)
A set of recurrent artificial neural networks is used for speech recognition. By representing speech waves as vectors of mel-cepstral coefficients and energy, we can train a neural network to classify the values of phonetic features in a given sentence of speech. The effectiveness of the training depends on the range of possible values for a given feature classification, on the distribution of the values in training samples, and possibly on how well a given feature is represented by cepstral analysis. The nets are able to identify both specific feature values and broader feature value groupings in speech.
1 Introduction
A popular method for doing speech recognition is to use hidden Markov models (HMMs). The HMM approach to speech recognition is to determine the probability of the speech being a certain utterance. Meanwhile, neural networks take a non-probabilistic approach to problem solving. They can be used to learn the patterns in the input that give the desired output and to gen...

Speech Variation and the Use of Distance Metrics on the Articulatory Feature Space

by Louis Ten Bosch - ITRW Workshop on Speech Recognition and Intrinsic Variation, 2006
Cited by 2 (2 self)
This paper describes ongoing research on the relation between variation in speech in the articulatory-acoustic domain and the variation as represented in the symbolic domain. More specifically, we address variation in speech as represented by articulatory features, and the description of variation in phone annotation and segmentation. Variation in speech is quantified by using distance metrics defined on the space spanned by articulatory features. We will show a very good correspondence between locations of events in the articulatory feature trajectories on the one hand, and the phone boundary locations as defined by manual segmentation on the other. This indicates that the asynchronous articulatory representation at least captures the information in the segmentation at the phone level.
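
The idea of locating events with a distance metric on the articulatory feature space can be sketched as follows. This is an illustration only: the abstract does not name the metric or the event detector, so the Euclidean distance between consecutive frames and the fixed `threshold` are assumptions, as are the function names.

```python
import math

def frame_distances(frames):
    """Euclidean distance between each pair of consecutive feature vectors."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]

def boundary_candidates(frames, threshold):
    """Indices of frames whose distance to the previous frame exceeds threshold.

    A large jump in the articulatory feature trajectory is treated here as a
    candidate event, to be compared with manually placed phone boundaries.
    """
    distances = frame_distances(frames)
    return [t + 1 for t, d in enumerate(distances) if d > threshold]
```

For example, four frames whose feature values switch from all 0 to all 1 between the second and third frame yield a single candidate at frame index 2.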

A Tonotopic Artificial Neural Network Architecture For Phoneme Probability Estimation

by Nikko Ström, 1997
Cited by 1 (0 self)
A novel sparse ANN connection scheme is proposed. It is inspired by the so-called tonotopic organization of the auditory nerve, and allows a more detailed representation of the speech spectrum to be input to an ANN than is commonly used. A consequence of the new connection scheme is that more resources are allocated to analysis within narrow frequency sub-bands -- a concept that has recently been investigated by others with so-called sub-band ASR. ANNs with the proposed architecture have been evaluated on the TIMIT database for phoneme recognition, and are found to give better phoneme recognition performance than ANNs based on standard mel frequency cepstrum input. The lowest achieved phone error-rate, 26.7%, is very close to the lowest published result for the core test set of the TIMIT database.
1. Introduction
In the most widespread type of hybrid HMM/ANN ASR systems, an artificial neural network (ANN) is utilized to compute the observation likelihoods of a hidden Markov model, (e...
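
One way to picture a sparse, frequency-local connection scheme is as a binary mask over an input-to-hidden weight matrix, where each hidden unit sees only a narrow band of spectral channels. The sketch below is a hypothetical illustration of that general idea, not the connection scheme from the paper, whose details are not given in the abstract. The names `band_limited_mask`, `half_width`, and `density` are assumptions.

```python
def band_limited_mask(n_inputs, n_hidden, half_width):
    """Build a 0/1 connectivity mask with one row per hidden unit.

    Hidden units are given centre channels spread evenly along the
    frequency axis; each unit connects only to input channels within
    `half_width` of its centre.
    """
    mask = []
    for h in range(n_hidden):
        if n_hidden == 1:
            centre = 0
        else:
            centre = round(h * (n_inputs - 1) / (n_hidden - 1))
        row = [1 if abs(i - centre) <= half_width else 0
               for i in range(n_inputs)]
        mask.append(row)
    return mask

def density(mask):
    """Fraction of possible connections that are present."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

Multiplying a dense weight matrix element-wise by such a mask keeps only the in-band weights, so the parameter count grows with the band width rather than with the full spectral resolution.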

Using durational cues in a computational model of spoken-word recognition

by Odette Scharenborg
Cited by 1 (0 self)
Evidence that listeners use durational cues to help resolve temporarily ambiguous speech input has accumulated over the past few years. In this paper, we investigate whether durational cues are also beneficial for word recognition in a computational model of spoken-word recognition. Two sets of simulations were carried out using the acoustic signal as input. The simulations showed that the computational model, like humans, benefits from durational cues during word recognition, and uses these to disambiguate the speech signal. These results thus provide support for the theory that durational cues play a role in spoken-word recognition. Index Terms: duration, spoken-word recognition, computational modelling