Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations
Proc. IEEE, 1998
Incorporating Information From Syllable-length Time Scales into Automatic Speech Recognition
In ICASSP, 1998
Cited by 45 (4 self)
Incorporating the concept of the syllable into speech recognition may improve recognition accuracy through the integration of information over syllable-length time spans. Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit. Nonetheless, the explicit use of such long-timespan units is comparatively unusual in automatic speech recognition systems for English. The work described in this thesis explored the utility of information collected over syllable-related time-scales. The first approach involved integrating syllable segmentation information into the speech recognition process. The addition of acoustically-based syllable onset estimates [184] resulted in a 10% relative reduction in word-error rate. The second approach began with developing four speech recognition systems based on long-time-span features and units, including modulation spectrogram features [80]. Error analysis suggested a strategy of combining systems, which led to the implementation of methods that merged the outputs of syllable-based recognition systems with the phone-oriented baseline system at the frame level, the syllable level, and the whole-utterance level. These combined systems exhibited relative improvements of 20-40% compared to the baseline system for clean and reverberant speech test cases.
What HMMs can do
2002
Cited by 21 (3 self)
Since their inception over thirty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems; today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each of these ways having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial analyzes HMMs by exploring a novel way in which an HMM can be defined, namely in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in the search for a model to supersede the HMM for ASR, rather than trying to correct for HMM limitations in the general case, we should seek new models based on their potential for better parsimony, computational requirements, and noise insensitivity.
Spectral Signal Processing for ASR
Proc. ASRU’99, 1999
Cited by 17 (0 self)
The paper begins by discussing the difficulties in obtaining repeatable results in speech recognition. Theoretical arguments are presented for and against copying human auditory properties in automatic speech recognition. The "standard" acoustic analysis for automatic speech recognition, consisting of mel-scale cepstrum coefficients and their temporal derivatives, is described. Some variations and extensions of the standard analysis (PLP, cepstrum correlation methods, LDA, and variants on log power) are then discussed. These techniques pass the test of having been found useful at multiple sites, especially with noisy speech. The extent to which auditory properties can account for the advantage found for particular techniques is considered. It is concluded that the advantages do not in fact stem from auditory properties, and that there is so far little or no evidence that the study of the human auditory system has contributed to advances in automatic speech recognition. Contributio...
Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations
IEEE, 2006
Cited by 17 (3 self)
We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multilinear dimensionality reduction technique and classified by a support vector machine (SVM). Generalization of the system to signals with high levels of additive noise and reverberation is evaluated and compared to two existing approaches (Scheirer and Slaney, 2002 and Kingsbury et al., 2002). The results demonstrate the advantages of the auditory model over the other two systems, especially at low signal-to-noise ratios (SNRs) and high reverberation.
Perceptually Inspired Signal-processing Strategies for Robust Speech Recognition in Reverberant Environments
1998
Cited by 12 (0 self)
Natural, hands-free interaction with computers is currently one of the great unfulfilled promises of automatic speech recognition (ASR), in part because ASR systems cannot reliably recognize speech under everyday, reverberant conditions that pose no problems for most human listeners. The specific properties of the auditory representation of speech likely contribute to reliable human speech recognition under such conditions. This dissertation explores the use of perceptually inspired signal-processing strategies -- critical-band-like frequency analysis, an emphasis on slow changes in the spectral structure of the speech signal, adaptation, integration of phonetic information over syllabic durations, and use of multiple signal representations for...
Speech processing in vocoder-centric cochlear implants
In "Cochlear and Brainstem Implants", 2006
Cited by 11 (8 self)
The principles of most recent cochlear implant processors are similar to those of the channel vocoder, originally used for transmitting speech over telephone lines with much less bandwidth than that required for transmitting the unprocessed speech signal. An overview of the various vocoder-centric processing strategies proposed for cochlear implants since the late 1990s is provided, including the strategies used in different commercially available implant processors. Special emphasis is placed on reviewing the strategies designed to enhance pitch information for potentially better music perception. The various noise suppression strategies proposed over the years based on multi-microphone and single-microphone inputs are also described.
High-Quality and Flexible Speech Synthesis with Segment Selection and Voice Conversion
2003
Cited by 8 (1 self)
Text-to-Speech (TTS) is a useful technology that converts any text into a speech signal. It can be utilized for various purposes, e.g. car navigation, announcements in railway stations, response services in telecommunications, and e-mail reading. Corpus-based TTS makes it possible to dramatically improve the naturalness of synthetic speech compared with the early TTS. However, no general-purpose TTS has been developed that can consistently synthesize sufficiently natural speech. Furthermore, there is not yet enough flexibility in corpus-based TTS.
Improving ASR Performance for Reverberant Speech
1997
Cited by 8 (3 self)
The performance of current automatic speech recognition (ASR) systems is very sensitive to the presence of room reverberation in the incoming speech signal. We investigate a family of front-end speech representations that focus on slow changes in the gross spectral structure of speech for their ability to improve the robustness of ASR systems to reverberation. A number of the front ends provide a statistically significant improvement in performance over established front ends such as PLP; however, the performance of ASR systems on highly reverberant speech is still disappointing when compared with the performance of human listeners.

