Results 1 - 5 of 5
Ageing voices: The effect of changes in voice parameters on ASR performance
Cited by 1 (0 self)
With ageing, human voices undergo several changes, typically characterized by increased hoarseness and changes in articulation patterns. In this study, we examined the effect on Automatic Speech Recognition (ASR) and found that the Word Error Rate (WER) on older voices is about 9% absolute higher than on adult voices. Subsequently, we compared several voice source parameters, including fundamental frequency, jitter, shimmer, harmonicity and cepstral peak prominence, of adult and older males. Several of these parameters show statistically significant differences between the two groups. However, artificially increasing the jitter and shimmer measures does not affect ASR accuracy significantly. Artificially lowering the fundamental frequency degrades ASR performance marginally, but this drop can be overcome to some extent using Vocal Tract Length Normalisation (VTLN). Overall, we observe that the changes in the voice source parameters do not have a significant impact on ASR performance. Comparison of the likelihood scores of all the phonemes for the two age groups shows that there is a systematic mismatch in the acoustic space of the two age groups. Comparison of the phoneme recognition rates shows that mid vowels, nasals and phonemes that depend on the ability to create constrictions with the tongue tip for articulation are more affected by ageing than other phonemes.
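For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recogniser output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the study; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

An "absolute" difference of 9% between two groups means their WER values differ by 9 percentage points, as opposed to one being 9% larger than the other in relative terms.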
A Corpus Study of the Prosody of Polysyllabic Words in Mandarin Chinese
This paper presents a corpus study of polysyllabic words in Standard Mandarin Chinese. In particular, this study investigates their prosodic features with respect to the notions of prosodic strength and stress. We find a robust strong-weak alternation with respect to F0, but different patterns for duration. In disyllabic words the first syllable tends to be slightly longer than the second. However, for three- and four-syllable words the last syllable is the longest, followed by the first. These patterns suggest that F0, rather than duration, is a reliable phonetic indicator of metrical structure in Mandarin Chinese.
London English? 1 Rhythm in Speech 1.1 Analyzing Speech Rhythm
It has been shown that speech rhythm in languages cannot be assigned to discrete categories such as stress-timed and syllable-timed (Abercrombie 1967). In fact, there is a continuum from more syllable-timed to more stress-timed languages (Dauer 1983, Miller 1984). Different metrics have been proposed to quantify speech rhythm across languages, focusing either on phonotactic structures or on differences in vowel and consonant phonology (Thomas 2011:195-197). Low et al. (2000) have proposed the Pairwise Variability Index (PVI), which is a measure of the average relative difference between successive pairs of units, such as vowels and consonants in adjacent syllables. Duration is the property of these units that is most frequently measured (Nolan and Asu 2009). So-called syllable-timed languages will have a near-equal duration of units (e.g. syllables or vowels). Examples of such languages are Mandarin and Spanish. Stress-timed languages, on the other hand, will have larger durational variability of units. Such languages are English, German and
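As an illustration of the "average relative difference between successive pairs" described above, here is a minimal sketch of the normalised PVI as it is commonly written: the absolute difference of each successive pair of durations is divided by the pair's mean, averaged over all pairs, and multiplied by 100. The function name and the duration values are hypothetical, and the study behind this abstract may use a different variant (for example, the raw, non-normalised index).

```python
from typing import Sequence


def normalised_pvi(durations: Sequence[float]) -> float:
    """Normalised Pairwise Variability Index over successive durations.

    Each successive pair contributes |d1 - d2| divided by the pair's mean;
    the result is the mean of these values, scaled by 100.
    """
    if len(durations) < 2:
        raise ValueError("at least two durations are needed")
    if any(d <= 0 for d in durations):
        raise ValueError("durations must be positive")

    pair_terms = [
        abs(d1 - d2) / ((d1 + d2) / 2)
        for d1, d2 in zip(durations, durations[1:])
    ]
    return 100 * sum(pair_terms) / len(pair_terms)


if __name__ == "__main__":
    # Hypothetical vowel durations in milliseconds.
    print(normalised_pvi([100, 100, 100]))  # equal durations, no variability
    print(normalised_pvi([50, 150, 50]))    # strongly alternating durations
```

Under this formulation, near-equal successive durations give values close to 0, while strongly alternating long and short units give larger values.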
Mining a Year of Speech
The availability of large text corpora has revolutionized linguistics and is of great value in many other areas of scholarship. Our “Mining a Year of Speech” project, funded by the transatlantic “Digging into Data” competition, aims to do the same for spoken language. We present a new generation of speech corpora, characterised by aggregation of datasets, annotated using forced alignment and exposed for public use in standard formats across multiple sites. Index Terms: speech corpora, aggregation, forced alignment
Automatic Phone Alignment: A Comparison between Speaker-Independent Models and Models Trained on the Corpus to Align
Several automatic phonetic alignment tools have been proposed in the literature. They generally use speaker-independent acoustic models of the language to align new corpora. The problem is that the range of provided models is limited: it does not cover all languages and speaking styles (spontaneous, expressive, etc.). This study investigates the possibility of directly training the statistical model on the corpus to align. The main advantage is that this approach is applicable to any language and speaking style. Moreover, comparisons indicate that it provides results as good as or better than those obtained with speaker-independent models of the language: about 2% is gained, with a 20 ms threshold, by using our method. Experiments were carried out on neutral and expressive corpora in French and English. The study also points out that even a small neutral corpus of a few minutes can be exploited to train a model that will provide high-quality alignment.
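The abstract does not spell out how the 20 ms threshold is applied. A common way to score phonetic alignment is the proportion of automatic boundaries that fall within a tolerance of the corresponding manual boundaries, and the sketch below assumes that reading. The function name, the one-to-one pairing of boundaries and the timestamps are all hypothetical.

```python
from typing import Sequence


def boundary_accuracy(
    reference_ms: Sequence[int],
    predicted_ms: Sequence[int],
    tolerance_ms: int = 20,
) -> float:
    """Fraction of predicted boundaries within tolerance_ms of the reference.

    Assumes both sequences describe the same phone sequence, so boundaries
    can be compared pairwise in order.
    """
    if len(reference_ms) != len(predicted_ms):
        raise ValueError("boundary lists must have the same length")
    if not reference_ms:
        raise ValueError("at least one boundary is needed")

    hits = sum(
        1
        for ref, pred in zip(reference_ms, predicted_ms)
        if abs(ref - pred) <= tolerance_ms
    )
    return hits / len(reference_ms)


if __name__ == "__main__":
    # Hypothetical boundary times in milliseconds.
    manual = [100, 250, 400, 620]
    automatic = [105, 275, 395, 640]
    print(boundary_accuracy(manual, automatic))  # 3 of 4 within 20 ms
```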

