The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions (2000)

by H. G. Hirsch, D. Pearce
Venue: ISCA ITRW ASR2000

Results 1 - 10 of 534 citing documents

The graphical models toolkit: An open source software system for speech and time-series processing

by Jeff Bilmes, Geoffrey Zweig - In Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, 2002
"... This paper describes the Graphical Models Toolkit (GMTK), an open source, publically available toolkit for developing graphical-model based speech recognition and general time series systems. Graphical models are a flexible, concise, and expressive probabilistic modeling framework with which one may ..."
Abstract - Cited by 124 (30 self) - Add to MetaCart
This paper describes the Graphical Models Toolkit (GMTK), an open source, publicly available toolkit for developing graphical-model-based speech recognition and general time-series systems. Graphical models are a flexible, concise, and expressive probabilistic modeling framework with which one may rapidly specify a vast collection of statistical models. This paper begins with a brief description of the representational and computational aspects of the framework. Following that is a detailed description of GMTK's features, including a language for specifying structures and probability distributions, logarithmic-space exact training and decoding procedures, the concept of switching parents, and a generalized EM training method which allows arbitrary sub-Gaussian parameter tying. Taken together, these features endow GMTK with a degree of expressiveness and functionality that significantly complements other publicly available packages. GMTK was recently used in the 2001 Johns Hopkins Summer Workshop, and experimental results are described in detail both herein and in a companion paper.
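
The logarithmic-space training and decoding mentioned above generalises a trick that is easiest to see in the plain-HMM special case. Below is a minimal sketch of the log-space forward recursion for a discrete HMM with toy parameters; it illustrates only the numerical idea and is not GMTK code or its API.

import numpy as np
from scipy.special import logsumexp

def forward_loglik(log_pi, log_A, log_B, obs):
    # log_pi: (S,) log initial state probabilities
    # log_A:  (S, S) log transition matrix, log_A[i, j] = log P(j | i)
    # log_B:  (S, V) log emission matrix over a discrete alphabet
    # obs:    (T,) observed symbol indices
    alpha = log_pi + log_B[:, obs[0]]
    for t in range(1, len(obs)):
        # logsumexp over predecessor states; staying in log space avoids
        # the underflow a probability-space recursion hits on long inputs.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return logsumexp(alpha)  # log P(obs)

# Toy usage: 2 states, 3 output symbols.
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.2, 0.8]])
log_B = np.log([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_loglik(log_pi, log_A, log_B, np.array([0, 1, 2, 2])))

Per the abstract, GMTK applies the same log-space idea to exact inference over arbitrary graphical-model structures, not just chains.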

Citation Context

...[Residue of Table 1, which reports word recognition rates as a function of SNR for a baseline GMTK system emulating an HMM, alongside an HP system from [12].] ...The mapping from parent values to child distributions is specified using a decision tree, allowing a sparse representation of this mapping. A vector observation variable spans over a region of the feature vector...

CMU Arctic Databases for Speech Synthesis

by John Kominek, Alan W. Black, 2003
"... This report introduces the CMU Arctic databases designed for the purpose of speech synthesis research. These single speaker speech databases have been carefully recorded under studio conditions and consist of nearly 1150 phonetically balanced English utterances. They are distributed as free software ..."
Abstract - Cited by 78 (7 self) - Add to MetaCart
This report introduces the CMU Arctic databases, designed for speech synthesis research. These single-speaker speech databases have been carefully recorded under studio conditions and consist of nearly 1150 phonetically balanced English utterances. They are distributed as free software, without restriction on commercial or non-commercial use. The Arctic corpus consists of four primary sets of recordings (3 male, 1 female), plus several ancillary databases. Each database is distributed with automatically segmented phonetic labels. These extra files were derived using the standard voice-building scripts of the Festvox system. In addition to phonetic labels, the databases provide complete support for the Festival Speech Synthesis System, including pre-built voices that may be used as-is. Festival and Festvox are available at

Citation Context

...primarily with automatic speech recognition in mind. Prominent examples include TIDIGITS [15] (isolated word recognition), SWITCHBOARD [10] and CALLHOME [4] (spontaneous phone conversations), and Aurora [12] (noisy speech). Databases that are designed for training and testing of ASR systems require large amounts of speech collected under realistic and noisy conditions, by multiple speakers with broadly v...

The second ’CHiME’ Speech Separation and Recognition Challenge: Datasets, tasks and baselines

by Jon Barker, Emmanuel Vincent, Ning Ma, Heidi Christensen, Phil Green - in ICASSP, 2013
"... Abstract Distant microphone speech recognition systems that operate with humanlike robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple comp ..."
Abstract - Cited by 70 (27 self) - Add to MetaCart
Distant-microphone speech recognition systems that operate with human-like robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple competing sound sources. This paper describes a recent speech recognition evaluation that was designed to bring together researchers from multiple communities in order to foster novel approaches to this problem. The task was to identify keywords from sentences reverberantly mixed into audio backgrounds binaurally recorded in a busy domestic environment. The challenge was designed to model the essential difficulties of the multisource-environment problem while remaining on a scale that would make it accessible to a wide audience. Compared to previous ASR evaluations, a particular novelty of the task is that the utterances to be recognised were provided in a continuous audio background rather than as pre-segmented utterances, thus allowing a range of background modelling techniques to be employed. The challenge attracted thirteen submissions. This paper describes the challenge problem, provides an overview of the systems that were entered, and compares them against both a baseline recognition system and human performance. The paper discusses insights gained from the challenge and lessons learnt for the design of future evaluations of this kind.

Citation Context

...upled systems which treat separation and recognition as independent, consecutive processing stages. One of the primary objectives of the PASCAL CHiME speech separation and recognition challenge has been to draw together the source separation and speech recognition communities in the hope of stimulating fresh and more deeply coupled approaches to distant speech recognition. To this end the task has been designed to be widely accessible while capturing the difficulties that make distant speech recognition a hard problem. Compared to the still widely reported Aurora 2 speech recognition task (Pearce and Hirsch, 2000), the CHiME task is more challenging along a number of dimensions: like Aurora 2 it is built around a small-vocabulary speech corpus, but it contains many acoustically confusable utterances that rely on finer phonetic distinctions than those required to disambiguate Aurora’s digit sequences; the target utterances have been reverberantly mixed into complex multisource noise backgrounds recorded in real everyday living environments; the exploitation of spatial source separation is enabled by the provision of two-channel ‘binaurally recorded’ signals that mimic the signals that would be received b...

Histogram Equalization of the Speech Representation for Robust Speech Recognition

by Ángel de la Torre, Antonio M. Peinado, José C. Segura, José L. Pérez, Carmen Benítez, Antonio J. Rubio, 2001
"... The noise degrades the performance of Automatic Speech Recognition systems mainly due to the mismatch between the training and recognition conditions it introduces. The noise causes a distortion of the feature space which usually presents a non-linear behavior. In order to reduce this mismatch, the ..."
Abstract - Cited by 61 (4 self) - Add to MetaCart
Noise degrades the performance of Automatic Speech Recognition systems mainly through the mismatch it introduces between training and recognition conditions. Noise distorts the feature space, and this distortion usually exhibits non-linear behavior. In order to reduce this mismatch, the methods proposed for robust speech recognition try to compensate for the noise effect either by obtaining an estimate of the clean speech or by adapting the recognizer's acoustic models to properly model the noisy speech. In this paper we propose a method to compensate for the effect of noise on the speech representation. This method is based on the histogram equalization technique frequently applied in digital image processing, which has been adapted to the speech representation. For each component of the feature vectors representing the speech signal, the histogram is estimated and the transformation which converts it into a reference histogram is calculated. Such transformations tend to compensate for the distortion that noise produces in the different components of the feature vector and improve the performance of recognition systems under noisy conditions. We describe how the histogram equalization method can be adapted to robust speech recognition and present some recognition experiments to evaluate the proposed method.
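
As a concrete illustration of the per-component equalization described above, here is a minimal sketch under common assumptions: a rank-based empirical CDF per feature dimension and a standard-normal reference distribution. The paper's exact histogram estimator and reference may differ.

import numpy as np
from scipy.stats import norm

def histogram_equalize(features, reference_ppf=norm.ppf):
    # features: (T, D) array, T frames of D-dimensional features.
    # Maps each dimension through its empirical CDF and then through the
    # inverse CDF of the reference distribution (standard normal here).
    T, D = features.shape
    out = np.empty_like(features, dtype=float)
    for d in range(D):
        ranks = np.argsort(np.argsort(features[:, d]))
        cdf = (ranks + 0.5) / T          # kept strictly inside (0, 1)
        out[:, d] = reference_ppf(cdf)
    return out

# Usage: skewed stand-in features become ~N(0, 1) per dimension.
noisy = np.random.gamma(shape=2.0, size=(500, 13))
equalized = histogram_equalize(noisy)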

Uncertainty decoding with SPLICE for noise robust speech recognition

by Alex Acero, Li Deng - In Proc. ICASSP, 2002
"... Speech recognition front end noise removal algorithms have, in the past, estimated clean speech features from corrupted speech features. The accuracy of the noise removal process varies from frame to frame, and from dimension to dimension in the feature stream, due in part to the instantaneous SR of ..."
Abstract - Cited by 58 (4 self) - Add to MetaCart
Speech recognition front-end noise removal algorithms have, in the past, estimated clean speech features from corrupted speech features. The accuracy of the noise removal process varies from frame to frame, and from dimension to dimension in the feature stream, due in part to the instantaneous SNR of the input. In this paper, we show that localized knowledge of the accuracy of the noise removal process can be directly incorporated into the Gaussian evaluation within the decoder, to produce higher recognition accuracies. To prove this concept, we modify the SPLICE algorithm to output uncertainty information, and show that the combination of SPLICE with uncertainty decoding can remove 74.2% of the errors in a subset of the Aurora2 task.
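
The decoder-side change admits a one-line statement. In a common formulation of uncertainty decoding (notation ours, not necessarily the paper's), the front end outputs an enhanced feature \hat{x}_t together with a per-dimension enhancement variance \sigma_{e,t}^2, and each state-conditional Gaussian is evaluated with that variance added:

    p(\hat{x}_t \mid s) \approx \mathcal{N}\!\left(\hat{x}_t;\; \mu_s,\; \sigma_s^2 + \sigma_{e,t}^2\right)

Frames and dimensions where the noise removal was unreliable (large \sigma_{e,t}^2) are thus smoothly down-weighted in the likelihood, rather than trusted as if they were clean.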

Citation Context

...which is consistent with the bounds described by the dashed lines in the lower half of Figure 2. [Section 4.2, Quantitative:] Several connected-digit experiments were run using the framework provided in the Aurora2 [8] corpus. The acoustic model training data consists of 8440 clean utterances and the same utterances in groups of 422, corrupted 20 different ways. These 20 sets consist of four noise types (subway, ba...

Exemplar-based sparse representations for noise robust automatic speech recognition

by Jort F. Gemmeke, et al., 2010
"... ..."
Abstract - Cited by 55 (30 self) - Add to MetaCart
Abstract not found
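
No abstract is indexed here, but the title points at a well-known family of techniques: approximate a noisy spectral observation as a sparse, non-negative combination of clean-speech and noise exemplars, then recognise from the speech activations. The sketch below uses plain non-negative least squares (which tends to produce sparse weights) as a stand-in for the paper's actual cost function and decoder; all names and dimensions are illustrative.

import numpy as np
from scipy.optimize import nnls

def decompose(noisy_frame, speech_exemplars, noise_exemplars):
    # noisy_frame: (F,) magnitude spectrum.
    # *_exemplars: (F, K) dictionaries of example spectra.
    dictionary = np.hstack([speech_exemplars, noise_exemplars])
    weights, _ = nnls(dictionary, noisy_frame)   # non-negative fit
    k = speech_exemplars.shape[1]
    return weights[:k], weights[k:]              # speech, noise weights

# Usage with random stand-ins: recover the speech part of a mixed frame.
F, Ks, Kn = 64, 40, 20
S, N = np.random.rand(F, Ks), np.random.rand(F, Kn)
noisy = S @ np.random.rand(Ks) + N @ np.random.rand(Kn)
speech_w, noise_w = decompose(noisy, S, N)
speech_estimate = S @ speech_w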

Discrimination of Speech from Non-speech based on Multiscale Spectrotemporal Modulations

by Nima Mesgarani, Shihab Shamma - IEEE Transactions on Audio, Speech, and Language Processing, 2006
"... We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task ex-plored is to discriminate speech from non-speech consisting of animal vocalizations, music and environmental soun ..."
Abstract - Cited by 53 (3 self) - Add to MetaCart
We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from non-speech consisting of animal vocalizations, music and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multi-linear dimensionality reduction technique and classified by a Support Vector Machine (SVM). Generalization of the system to signals at high levels of additive noise and reverberation is evaluated and compared to two existing approaches [1] [2]. The results demonstrate the advantages of the auditory model over the other two systems, especially at low SNRs and under heavy reverberation.
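
As a rough sketch of the pipeline outlined above, the snippet below filters a log-spectrogram with a small bank of 2-D Gabor filters to obtain rate-scale modulation energies and trains an SVM on them. A faithful cortical model would use an auditory filterbank and the paper's multi-linear reduction; the STFT front end, filter parameters, and mean-energy pooling here are simplifying assumptions.

import numpy as np
from scipy.signal import stft, fftconvolve
from sklearn.svm import SVC

def gabor2d(rate, scale, size=15):
    # 2-D Gabor filter tuned to a temporal rate and a spectral scale
    # (both in cycles per sample along the respective axis).
    t = np.arange(size) - size // 2
    f = t[:, None]
    envelope = np.exp(-(f**2 + t**2) / (2 * (size / 4) ** 2))
    return envelope * np.cos(2 * np.pi * (rate * t + scale * f))

def modulation_features(x, fs, rates=(0.01, 0.02, 0.04), scales=(0.1, 0.2)):
    _, _, Z = stft(x, fs=fs, nperseg=256)
    logspec = np.log1p(np.abs(Z))                # (freq, frames)
    feats = []
    for r in rates:
        for s in scales:
            resp = fftconvolve(logspec, gabor2d(r, s), mode="same")
            feats.append(np.mean(resp**2))       # pooled modulation energy
    return np.array(feats)

# Usage: one feature vector per clip, then a standard SVM classifier.
X = np.stack([modulation_features(np.random.randn(16000), 16000)
              for _ in range(20)])
y = np.array([0, 1] * 10)                        # speech / non-speech labels
clf = SVC(kernel="rbf").fit(X, y)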

Citation Context

...Music samples that covered a large variety of musical styles were selected from the RWC Genre Database [34] (349 for training, 185 for test). Environmental sounds were assembled from the Noisex [35] and Aurora [36] databases, which contain stationary and non-stationary sounds including white and pink noise, factory, jets, destroyer engine, military vehicles, cars and several speech-babble recordings made in different envi...

The CHiME corpus: a resource and a challenge for Computational Hearing in Multisource Environments

by Heidi Christensen, Jon Barker, Ning Ma, Phil Green - in Proc. Interspeech’10, Makuhari, 2010
"... We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs sp ..."
Abstract - Cited by 52 (5 self) - Add to MetaCart
We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs spanning 20 dB). The corpus includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment. These have been used to add target utterances from the Grid speech recognition corpus into the CHiME domestic setting. Data has been mixed in a manner that produces a controlled and yet natural range of SNRs over which speech separation, enhancement and recognition algorithms can be evaluated. The paper motivates the design of the corpus, and describes the collection and post-processing of the data. We also present a set of baseline recognition results.
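
A minimal sketch of the mixing step described above, under simple assumptions: single channel rather than binaural, SNR defined as a power ratio, and the gain applied to the reverberated speech. This is illustrative, not the actual CHiME toolchain.

import numpy as np
from scipy.signal import fftconvolve

def mix_at_snr(clean, impulse_response, background, snr_db):
    # Reverberate the utterance, then scale it against the background
    # recording so that 10*log10(speech_power / noise_power) == snr_db.
    reverberant = fftconvolve(clean, impulse_response)
    n = min(len(reverberant), len(background))
    reverberant, background = reverberant[:n], background[:n]
    gain = np.sqrt(np.mean(background**2) / np.mean(reverberant**2)
                   * 10 ** (snr_db / 10))
    return gain * reverberant + background

# Usage with random stand-ins for the real recordings.
fs = 16000
clean = np.random.randn(fs)                          # 1 s utterance
rir = np.exp(-np.arange(4000) / 800.0) * np.random.randn(4000)
background = 0.1 * np.random.randn(3 * fs)           # domestic noise
noisy = mix_at_snr(clean, rir, background, snr_db=6)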

Citation Context

...A wide variety of algorithms exist for handling ‘special case’ noise backgrounds (stationary noise or slowly adapting noise, speech plus speech, speech babble, noise with a predictable temporal structure [1, 2, 3, 4, 5]); however, these algorithms can be very brittle and often fail badly in more general conditions. We wish to record data with a complexity that is representative of everyday listening conditions. We c...

Efficient voice activity detection algorithms using long-term speech information

by Javier Ramírez, José C. Segura, Carmen Benítez, Ángel de la Torre, Antonio Rubio - Speech Communication, 2004
"... ..."
Abstract - Cited by 51 (10 self) - Add to MetaCart
Abstract not found

Citation Context

...recognition systems. A new technique for speech/non-speech detection (SND) using long-term information about the speech signal is studied. The algorithm is evaluated in the context of the AURORA project (Hirsch and Pearce, 2000; ETSI, 2000), and the recently approved Advanced Front-end standard (ETSI, 2002) for distributed speech recognition. The quantifiable benefits of this approach are assessed by means of an exhaustive ...
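
The long-term idea can be sketched compactly: instead of a frame-by-frame energy decision, compare a spectral envelope taken over a window of neighbouring frames against a noise estimate. The snippet below is a rough long-term spectral divergence detector in that spirit; the window order, threshold, noise tracking, and hangover logic of the published algorithm differ, and the parameter values here are illustrative.

import numpy as np
from scipy.signal import stft

def ltsd_vad(x, fs, n_order=6, threshold_db=6.0, noise_frames=10):
    _, _, Z = stft(x, fs=fs, nperseg=256)
    mag = np.abs(Z)                                  # (freq, frames)
    noise = np.mean(mag[:, :noise_frames], axis=1)   # initial noise estimate
    frames = mag.shape[1]
    speech = np.zeros(frames, dtype=bool)
    for l in range(frames):
        lo, hi = max(0, l - n_order), min(frames, l + n_order + 1)
        envelope = mag[:, lo:hi].max(axis=1)         # long-term envelope
        ltsd = 10 * np.log10(np.mean(envelope**2 / (noise**2 + 1e-12)))
        speech[l] = ltsd > threshold_db
    return speech

# Usage: silence followed by a louder "speech" burst (synthetic stand-in).
fs = 8000
x = np.concatenate([0.01 * np.random.randn(fs), np.random.randn(fs)])
print(ltsd_vad(x, fs).astype(int))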

Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion

by L. Deng, J. Droppo, A. Acero - IEEE Trans. Speech and Audio Processing, 2005
"... ..."
Abstract - Cited by 50 (0 self) - Add to MetaCart
Abstract not found