Results 1 - 9 of 9
Moving Beyond the `Beads-On-A-String' Model of Speech
- In Proc. IEEE ASRU Workshop, 1999
Cited by 48 (0 self)
The notion that a word is composed of a sequence of phone segments, sometimes referred to as `beads on a string', has formed the basis of most speech recognition work for over 15 years. However, as more researchers tackle spontaneous speech recognition tasks, that view is being called into question. This paper raises problems with the phoneme as the basic subword unit in speech recognition, suggesting that finer-grained control is needed to capture the sort of pronunciation variability observed in spontaneous speech. We offer two different alternatives -- automatically derived subword units and linguistically motivated distinctive feature systems -- and discuss current work in these directions. In addition, we look at problems that arise in acoustic modeling when trying to incorporate higher-level structure with these two strategies. 1. INTRODUCTION It has often been noted that automatic speech recognition performance is much worse on spontaneous speech than on carefully planned or r...
Automatic generation of subword units for speech recognition systems
- IEEE Transactions on Speech and Audio Processing
Cited by 23 (1 self)
Abstract—Large vocabulary continuous speech recognition (LVCSR) systems traditionally represent words in terms of smaller subword units. Both during training and during recognition, they require a mapping table, called the dictionary, which maps words into sequences of these subword units. The performance of the LVCSR system depends critically on the definition of the subword units and the accuracy of the dictionary. In current LVCSR systems, both these components are manually designed. While manually designed subword units generalize well, they may not be the optimal units of classification for the specific task or environment for which an LVCSR system is trained. Moreover, when human expertise is not available, it may not be possible to design good subword units manually. There is clearly a need for data-driven design of these LVCSR components. In this paper, we present a complete probabilistic formulation for the automatic design of subword units and dictionary, given only the acoustic data and their transcriptions. The proposed framework permits easy incorporation of external sources of information, such as the spellings of words in terms of a nonideographic script. Index Terms—Learning, lexical representation, maximum-likelihood, speech recognition, subword units.
Speech Recognition System Design Based on Automatically Derived Units
- 1999
Cited by 10 (0 self)
In most speech recognition systems today, acoustic modeling and lexical modeling are viewed as separable problems. Currently the most popular approach is to manually define canonical word pronunciations in terms of phonetic units and let the acoustic models capture differences between actual spoken and canonical pronunciations implicitly with Gaussian mixture models. As a result, these models can be very broad, particularly for casual spontaneous speech. An alternative approach, explored in this thesis, is to learn a unit inventory and pronunciation dictionary from training data using a maximum likelihood objective function. In particular,
Combined Optimisation of Baseforms and Model Parameters in Speech Recognition Based on Acoustic Subword Units
- In Proc. IEEE Workshop on Automatic Speech Recognition, 1997
Cited by 6 (0 self)
A major challenge in speech recognition is creating a lexicon which is robust to inter- and intra-speaker variations. This is even more so in speech recognisers based on non-linguistic units, e.g., acoustic subword units (ASWUs), since no standard pronunciation dictionaries are available. Thus the baseforms describing the vocabulary words in terms of the recognition units need to be generated from training data. In this paper we propose an algorithm for ASWU-based speech recognition which performs a combined optimisation of the baseforms and the subword models. The resulting system has been tested on the DARPA Resource Management task, and is shown to perform comparably to a baseline phoneme-based system. 1 Introduction Most contemporary speech recognisers employ some kind of phonemic units as the basic modelling entity. Such recognisers need a lexicon of baseforms describing the composition of the vocabulary words in terms of the recognition units, e.g., as a concatenation of phonemi...
Incorporating Linguistic Knowledge And Automatic Baseform Generation In Acoustic Subword Unit Based Speech Recognition
Cited by 4 (0 self)
A major challenge in speech recognition based on acoustic subword units is creating a lexicon which is robust to inter- and intra-speaker variations. In this paper we present two different approaches for incorporating simple word-level linguistic knowledge into the labelling step of the training procedure. The proposed systems also utilise a scheme for combined optimisation of baseforms and subword models. For the TI46 database, these methods are shown to greatly improve the performance compared to an acoustic subword based speech recogniser employing unsupervised labelling, and they are found to perform as well as systems utilising whole-word models and context independent phoneme models. 1. INTRODUCTION Traditionally, automatic speech recognisers employ phone-like units based upon a linguistic description of the language. On the other hand, the analysis of the actual speech signal is acoustically based. The resulting system is neither phonetically nor acoustically consistent, but is...
A Joint Segmentation and Labelling Scheme for use in Acoustic Subword Based Speech Recognition
A major challenge in speech recognition based on acoustic subword units is creating a lexicon which is robust to inter- and intra-speaker variations. In this paper we present a joint segmentation and labelling scheme to incorporate word-level linguistic knowledge into the training procedure. The proposed system is also based on a combined optimisation of the baseforms and the subword models. For the TI46 database, this method is shown to greatly improve the performance compared to an acoustic subword based speech recogniser employing unsupervised labelling, and is found to perform as well as systems utilising whole-word models and context independent phoneme models. 1. INTRODUCTION Traditionally, the subword units employed in automatic speech recognition have been defined based upon a linguistic description of the language. A major disadvantage of this approach is the inherent mismatch between the acoustically based analysis of the actual speech signal and the linguistically based des...
On Variable-Scale Piecewise Stationary Spectral Analysis of Speech Signals for ASR
- 2006
On Variable-Scale Piecewise Stationary Spectral Analysis of Speech Signals for ASR
It is often acknowledged that speech signals contain short-term and long-term temporal properties [15] that are difficult to capture and model by using the usual fixed scale (typically 20ms) short time spectral analysis used in hidden Markov models (HMMs), based on piecewise stationarity and state conditional independence assumptions of acoustic vectors. For example, vowels are typically quasi-stationary over 40-80ms segments, while plosives typically require analysis below 20ms segments. Thus, fixed scale analysis is clearly sub-optimal for “optimal” time-frequency resolution and modeling of different stationary phones found in the speech signal. In the present paper, we investigate the potential advantages of using variable size analysis windows towards improving state-of-the-art speech recognition systems. Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, based on the likelihood that a segment was generated by the same AR process. This likelihood is estimated from the Linear Prediction
Automatic determination of sub-word units for automatic speech recognition
- 2008
Current automatic speech recognition (ASR) research is focused on recognition of continuous, spontaneous speech. Spontaneous speech contains a lot of variability in the way words are pronounced, and canonical pronunciations of each word are not true to the variation that is seen in real data.
Two of the components of an ASR system are acoustic models and pronunciation models. The variation within spontaneous speech must be accounted for by these components. Phones, or context-dependent phones, are typically used as the base sub-word unit, and one acoustic model is trained for each sub-word unit. Pronunciation modelling largely takes place in a dictionary, which relates words to sequences of phones. Acoustic modelling and pronunciation modelling overlap, and the two are not clearly separable in modelling pronunciation variation. Techniques that find pronunciation variants in the data and then reflect these in the dictionary have not provided the expected gains in recognition.
An alternative approach to modelling pronunciations in terms of phones is to derive units automatically: using data-driven methods to determine an inventory of sub-word units, their acoustic models, and their relationship to words. This thesis presents a method for the automatic derivation of a sub-word unit inventory, whose main components are
1. automatic and simultaneous generation of a sub-word unit inventory and acoustic model set, using an ergodic hidden Markov model whose complexity is controlled using the Bayesian Information Criterion
2. automatic generation of probabilistic dictionaries using joint multigrams
The prerequisites of this approach are fewer than in previous work on unit derivation; notably, the timings of word boundaries are not required here. The approach is language independent since it is entirely data-driven and no linguistic information is required. The dictionary generation method outperforms a supervised method using phonetic data. The automatically derived units and dictionary perform reasonably on a small spontaneous speech task, although not yet outperforming phones.
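As a rough illustration of how the Bayesian Information Criterion mentioned in the abstract above can trade model fit against model size, here is a minimal Python sketch. It assumes the standard definition BIC = -2 ln L + k ln N; the abstract does not say which exact form the thesis uses. The candidate inventories, log-likelihoods, and parameter counts are hypothetical numbers chosen only for the example.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Standard BIC (assumed form): -2 ln L + k ln N. Lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_by_bic(candidates, n_obs):
    """Return the (label, log_likelihood, n_params) tuple with the lowest BIC."""
    return min(candidates, key=lambda c: bic(c[1], c[2], n_obs))


# Hypothetical candidates: larger inventories fit better but cost more parameters.
candidates = [
    ("8 units", -6000.0, 80),
    ("16 units", -5000.0, 160),
    ("32 units", -4950.0, 320),
]

best = select_by_bic(candidates, n_obs=10_000)
print(best[0])
```

With these made-up numbers the 16-unit candidate is selected: going from 16 to 32 units improves the likelihood too little to pay for the extra parameters.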

