Results 1 - 10 of 534
The graphical models toolkit: An open source software system for speech and time-series processing
In Proceedings of IEEE Int. Conf. Acoust., Speech, and Signal Processing, 2002
"... This paper describes the Graphical Models Toolkit (GMTK), an open source, publically available toolkit for developing graphical-model based speech recognition and general time series systems. Graphical models are a flexible, concise, and expressive probabilistic modeling framework with which one may ..."
Abstract
-
Cited by 124 (30 self)
- Add to MetaCart
(Show Context)
This paper describes the Graphical Models Toolkit (GMTK), an open source, publicly available toolkit for developing graphical-model-based speech recognition and general time-series systems. Graphical models are a flexible, concise, and expressive probabilistic modeling framework with which one may rapidly specify a vast collection of statistical models. This paper begins with a brief description of the representational and computational aspects of the framework. Following that is a detailed description of GMTK's features, including a language for specifying structures and probability distributions, logarithmic-space exact training and decoding procedures, the concept of switching parents, and a generalized EM training method which allows arbitrary sub-Gaussian parameter tying. Taken together, these features endow GMTK with a degree of expressiveness and functionality that significantly complements other publicly available packages. GMTK was recently used in the 2001 Johns Hopkins Summer Workshop, and experimental results are described in detail both herein and in a companion paper.
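The "switching parents" feature mentioned above lets the value of one parent variable select which conditional distribution, and which other parents, a variable uses. Below is a minimal sketch of the idea in Python, not GMTK's own specification language; the variable names A, B, S, and C are our own illustration.

```python
import numpy as np

# Illustrative only: a binary variable C whose distribution is chosen
# by a switching parent S. When S = 0, C depends on parent A; when
# S = 1, C depends on parent B instead.
cpt_given_A = np.array([[0.9, 0.1],   # row a: P(C=0 | A=a), P(C=1 | A=a)
                        [0.2, 0.8]])
cpt_given_B = np.array([[0.6, 0.4],   # row b: P(C=0 | B=b), P(C=1 | B=b)
                        [0.3, 0.7]])

def p_c(c, a, b, s):
    """P(C=c | A=a, B=b, S=s): the switching parent S picks both the
    CPT and the effective parent."""
    return cpt_given_A[a, c] if s == 0 else cpt_given_B[b, c]

# e.g. p_c(1, a=0, b=1, s=1) uses the B-conditioned table -> 0.7
```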
CMU Arctic Databases for Speech Synthesis
2003
"... This report introduces the CMU Arctic databases designed for the purpose of speech synthesis research. These single speaker speech databases have been carefully recorded under studio conditions and consist of nearly 1150 phonetically balanced English utterances. They are distributed as free software ..."
Abstract
-
Cited by 78 (7 self)
- Add to MetaCart
(Show Context)
This report introduces the CMU Arctic databases, designed for speech synthesis research. These single-speaker speech databases have been carefully recorded under studio conditions and consist of nearly 1150 phonetically balanced English utterances. They are distributed as free software, without restriction on commercial or non-commercial use. The Arctic corpus consists of four primary sets of recordings (3 male, 1 female), plus several ancillary databases. Each database is distributed with automatically segmented phonetic labels. These extra files were derived using the standard voice-building scripts of the Festvox system. In addition to phonetic labels, the databases provide complete support for the Festival Speech Synthesis System, including pre-built voices that may be used as-is. Festival and Festvox are available at
The second ‘CHiME’ Speech Separation and Recognition Challenge: Datasets, tasks and baselines
In ICASSP, 2013
"... Abstract Distant microphone speech recognition systems that operate with humanlike robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple comp ..."
Abstract
-
Cited by 70 (27 self)
- Add to MetaCart
(Show Context)
Distant-microphone speech recognition systems that operate with human-like robustness remain a distant goal. The key difficulty is that operating in everyday listening conditions entails processing a speech signal that is reverberantly mixed into a noise background composed of multiple competing sound sources. This paper describes a recent speech recognition evaluation that was designed to bring together researchers from multiple communities in order to foster novel approaches to this problem. The task was to identify keywords from sentences reverberantly mixed into audio backgrounds binaurally recorded in a busy domestic environment. The challenge was designed to model the essential difficulties of the multisource environment problem while remaining on a scale that would make it accessible to a wide audience. Compared to previous ASR evaluations, a particular novelty of the task is that the utterances to be recognised were provided in a continuous audio background rather than as pre-segmented utterances, thus allowing a range of background modelling techniques to be employed. The challenge attracted thirteen submissions. This paper describes the challenge problem, provides an overview of the systems that were entered, and provides a comparison alongside both a baseline recognition system and human performance. The paper discusses insights gained from the challenge and lessons learnt for the design of future such evaluations.
Histogram Equalization of the Speech Representation for Robust Speech Recognition
2001
"... The noise degrades the performance of Automatic Speech Recognition systems mainly due to the mismatch between the training and recognition conditions it introduces. The noise causes a distortion of the feature space which usually presents a non-linear behavior. In order to reduce this mismatch, the ..."
Abstract
-
Cited by 61 (4 self)
- Add to MetaCart
Noise degrades the performance of Automatic Speech Recognition systems mainly due to the mismatch it introduces between the training and recognition conditions. The noise distorts the feature space, and this distortion usually exhibits non-linear behavior. In order to reduce this mismatch, the methods proposed for robust speech recognition try to compensate for the noise effect either by obtaining an estimate of the clean speech or by adapting the recognizer's acoustic models to properly model the noisy speech. In this paper we propose a method to compensate for the noise effect on the speech representation. This method is based on the histogram equalization technique frequently applied in digital image processing, which has been adapted to the speech representation. For each component of the feature vectors representing the speech signal, the histogram is estimated and the transformation which converts it into a reference histogram is calculated. Such transformations tend to compensate for the distortion the noise produces in the different components of the feature vector, and improve the performance of recognition systems under noisy conditions. We describe how the histogram equalization method can be adapted to robust speech recognition and present some recognition experiments to evaluate the proposed method.
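As a rough illustration of the technique described, the sketch below equalizes each feature dimension to a standard Gaussian reference via rank-based quantile mapping. It is our own simplification, not the authors' implementation, which estimates explicit histograms and maps to a reference histogram rather than a parametric Gaussian.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(features, ref_cdf_inv=norm.ppf):
    """Map each feature dimension so its empirical distribution matches
    a reference distribution (here a standard Gaussian).

    features: (T, D) array of feature vectors, one row per frame.
    Returns an equalized (T, D) array.
    """
    T, D = features.shape
    out = np.empty_like(features, dtype=float)
    for d in range(D):
        x = features[:, d]
        # Empirical CDF value of each frame via its rank.
        ranks = np.argsort(np.argsort(x))
        cdf = (ranks + 0.5) / T          # in (0, 1), avoids the endpoints
        out[:, d] = ref_cdf_inv(cdf)     # invert the reference CDF
    return out
```

In deployment the mapping would be estimated on the (noisy) test data per utterance or per speaker, so that the transformed features match the reference distribution the recognizer was trained on.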
Uncertainty decoding with SPLICE for noise robust speech recognition
In Proc. ICASSP, 2002
"... Speech recognition front end noise removal algorithms have, in the past, estimated clean speech features from corrupted speech features. The accuracy of the noise removal process varies from frame to frame, and from dimension to dimension in the feature stream, due in part to the instantaneous SR of ..."
Abstract
-
Cited by 58 (4 self)
- Add to MetaCart
(Show Context)
Speech recognition front-end noise removal algorithms have, in the past, estimated clean speech features from corrupted speech features. The accuracy of the noise removal process varies from frame to frame, and from dimension to dimension in the feature stream, due in part to the instantaneous SNR of the input. In this paper, we show that localized knowledge of the accuracy of the noise removal process can be directly incorporated into the Gaussian evaluation within the decoder, to produce higher recognition accuracies. To prove this concept, we modify the SPLICE algorithm to output uncertainty information, and show that the combination of SPLICE with uncertainty decoding can remove 74.2% of the errors in a subset of the Aurora2 task.
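The core idea, folding per-frame, per-dimension enhancement uncertainty into the decoder's Gaussian evaluation, amounts to inflating each state Gaussian's variance by the estimated uncertainty. A minimal diagonal-Gaussian sketch of that evaluation step follows; it is our own illustration, and in the full method SPLICE itself supplies both the enhanced features and their uncertainties.

```python
import numpy as np

def gaussian_loglike_with_uncertainty(x_hat, unc_var, mu, var):
    """Diagonal-Gaussian log-likelihood where the state Gaussian's
    variance is inflated by the front end's per-dimension enhancement
    uncertainty, so unreliable frames and dimensions count for less.

    x_hat:   (D,) enhanced feature vector for one frame
    unc_var: (D,) variance of the enhancement error for that frame
    mu, var: (D,) mean and variance of one state's Gaussian
    """
    total_var = var + unc_var
    return -0.5 * np.sum(np.log(2 * np.pi * total_var)
                         + (x_hat - mu) ** 2 / total_var)
```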
Exemplar-based sparse representations for noise robust automatic speech recognition
2010
"... ..."
Discrimination of Speech from Non-speech based on Multiscale Spectrotemporal Modulations
IEEE Transactions on Audio, Speech, and Language Processing, 2006
"... We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task ex-plored is to discriminate speech from non-speech consisting of animal vocalizations, music and environmental soun ..."
Abstract
-
Cited by 53 (3 self)
- Add to MetaCart
(Show Context)
We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from non-speech consisting of animal vocalizations, music, and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multilinear dimensionality reduction technique and classified by a Support Vector Machine (SVM). Generalization of the system to signals at high levels of additive noise and reverberation is evaluated and compared to two existing approaches [1], [2]. The results demonstrate the advantages of the auditory model over the other two systems, especially at low SNRs and high reverberation.
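The paper's cortical model is elaborate; as a crude stand-in, the sketch below computes a joint rate-scale modulation representation as the 2-D Fourier magnitude of a log spectrogram, truncates it in place of the paper's multilinear dimensionality reduction, and feeds an SVM. All names and parameter values here are our own assumptions, not the authors' code.

```python
import numpy as np
from sklearn.svm import SVC

def modulation_features(log_spec, n_scale=8, n_rate=8):
    """Crude stand-in for a cortical model: take the 2-D Fourier
    magnitude of a log spectrogram (frequency x time) and keep the
    lowest n_scale spectral-modulation and n_rate temporal-modulation
    bins as a flat feature vector.
    """
    mod = np.abs(np.fft.rfft2(log_spec))   # joint rate-scale plane
    return mod[:n_scale, :n_rate].ravel()

# Hypothetical usage: X_specs is a list of log spectrograms and y
# labels speech (1) vs. non-speech (0).
# X = np.stack([modulation_features(s) for s in X_specs])
# clf = SVC(kernel="rbf").fit(X, y)
```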
The CHiME corpus: a resource and a challenge for Computational Hearing in Multisource Environments
In Proc. Interspeech’10, Makuhari, 2010
"... We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs sp ..."
Abstract
-
Cited by 52 (5 self)
- Add to MetaCart
(Show Context)
We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs spanning 20 dB). The corpus includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment. These have been used to add target utterances from the Grid speech recognition corpus into the CHiME domestic setting. Data has been mixed in a manner that produces a controlled and yet natural range of SNRs over which speech separation, enhancement and recognition algorithms can be evaluated. The paper motivates the design of the corpus, and describes the collection and post-processing of the data. We also present a set of baseline recognition results.
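For intuition on SNR-controlled mixing, the sketch below scales a target utterance so the mixture hits a requested SNR against a fixed background. Note this gain-based approach is our own simplification: the CHiME pipeline convolves utterances with the measured binaural impulse responses and preserves naturalness largely by choosing where in the background each utterance is embedded rather than by rescaling.

```python
import numpy as np

def mix_at_snr(target, background, snr_db):
    """Scale a (reverberated) target so that adding it to a background
    segment of the same length yields the requested SNR, then mix.
    Hypothetical helper; inputs are 1-D float arrays of equal length.
    """
    p_t = np.mean(target ** 2)        # target power
    p_b = np.mean(background ** 2)    # background power
    gain = np.sqrt((p_b / p_t) * 10 ** (snr_db / 10.0))
    return gain * target + background
```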
Efficient voice activity detection algorithms using long-term speech information
Speech Communication, 2004
"... ..."
(Show Context)
Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion
IEEE Trans. Speech and Audio Processing, 2005
"... ..."