Results 1 - 10 of 33
Traps -- Classifiers Of Temporal Patterns
- In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP 98), 1998
Cited by 33 (8 self)
The work proposes a radically different set of features for ASR where TempoRAl Patterns of spectral energies are used in place of the conventional spectral patterns. The approach has several inherent advantages, among them robustness to stationary or slowly varying disturbances.
Heterogeneous Acoustic Measurement And Multiple Classifiers For Speech Recognition
- 1998
Cited by 29 (1 self)
The acoustic-phonetic modeling component of most current speech recognition systems calculates a small set of homogeneous frame-based measurements at a single, fixed time-frequency resolution. This thesis presents evidence indicating that recognition performance can be significantly improved through a contrasting approach using more detailed and more diverse acoustic measurements, which we refer to as heterogeneous measurements.
Temporal patterns (TRAPS) in ASR of noisy speech
- In Proc. ICASSP, 1999
Cited by 29 (3 self)
In this paper we study a new approach to processing temporal information for automatic speech recognition (ASR). Specifically, we study the use of rather long-time TempoRAl Patterns (TRAPs) of spectral energies in place of the conventional spectral patterns for ASR. The proposed Neural TRAPs are found to yield a significant amount of information complementary to that of the conventional spectral-feature-based ASR system. A combination of these two ASR systems is shown to result in improved robustness to several types of additive and convolutive environmental degradations.
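As a rough illustration of what a temporal pattern is, the sketch below pulls the trajectory of a single band's energy around one frame, instead of taking the spectrum across bands at that frame. The function names, the edge padding, the window length, and the mean removal step are all assumptions made for illustration; the abstract does not specify them, and the classifier applied to the patterns is not shown.

```python
def temporal_pattern(energies, band, center, half_width):
    """Return the trajectory of one band's energy around frame `center`.

    energies: list of frames, each a list of per-band (log) energies.
    Frames beyond either end are filled by repeating the edge frame
    (an assumed padding choice).
    """
    last = len(energies) - 1
    trajectory = []
    for t in range(center - half_width, center + half_width + 1):
        clamped = min(max(t, 0), last)  # clamp the index to the valid range
        trajectory.append(energies[clamped][band])
    return trajectory


def remove_mean(pattern):
    """Subtract the pattern's mean (an assumed normalization step).

    In the log-energy domain, a constant offset in one band is cancelled
    by this subtraction.
    """
    mean = sum(pattern) / len(pattern)
    return [x - mean for x in pattern]


# Toy input: 4 frames, 2 bands.
frames = [[0, 10], [1, 11], [2, 12], [3, 13]]
pattern = temporal_pattern(frames, band=1, center=1, half_width=2)
```

With log energies, a fixed gain applied to one band shows up as a constant offset along that band's trajectory, so `remove_mean` returns the same pattern with or without the offset.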
Assessing Local Noise Level Estimation Methods
- Speech Communication, 1999
Cited by 20 (0 self)
In this paper, we assess two well-known methods for the local estimation of noise level in frequency subbands and compare them to a new one based on following the lower envelope of the signal energy. Moreover, we introduce, for all three approaches, a new pre-processing algorithm expected to better follow fast modulations of the noise energy. The periodicity of speech is used to update the noise level estimate during voiced parts of speech (without explicit detection of voiced portions). The evaluation is performed on four different kinds of noise (both artificial and real) added to clean speech. The best approach is used for spectral subtraction in a speech recognition experiment and compared to more classical noise-robust features (J-RASTA).
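The general idea of estimating noise from the lower energy envelope of one subband and then subtracting it can be sketched as follows. This is a stand-in only: it uses a plain sliding-window minimum, which is not claimed to be the estimator proposed in the paper, and the function names, window length, and flooring factor are assumptions.

```python
def lower_envelope(energy, window):
    """Sliding minimum over the last `window` frames of one subband's energy.

    A simple stand-in for a lower-envelope follower, not the paper's method.
    """
    return [min(energy[max(0, t - window + 1): t + 1])
            for t in range(len(energy))]


def spectral_subtract(energy, noise, floor=0.01):
    """Subtract the noise estimate frame by frame.

    The result is never allowed to fall below `floor` times the observed
    energy (an assumed flooring rule) so it stays positive.
    """
    return [max(e - n, floor * e) for e, n in zip(energy, noise)]


# Toy subband: a noise level of 2.0 with a louder burst in the middle.
energy = [2.0, 2.0, 9.0, 8.0, 2.0, 2.0]
noise = lower_envelope(energy, window=3)
cleaned = spectral_subtract(energy, noise)
```

A window shorter than the longest speech burst would let the minimum rise during speech, which is one reason practical estimators need more care than this sketch.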
Using the Multi-Stream Approach for Continuous Audio-Visual Speech Recognition
- 1997
Cited by 20 (0 self)
The Multi-Stream automatic speech recognition approach was investigated in this work as a framework for Audio-Visual data fusion and speech recognition. This method presents many potential advantages for such a task. It particularly allows for synchronous decoding of continuous speech while still allowing for some asynchrony of the visual and acoustic information streams. First, the Multi-Stream formalism is briefly recalled. Then, on top of the Multi-Stream motivations, experiments on the M2VTS multimodal database are presented and discussed. To our knowledge, these are the first experiments about multi-speaker continuous Audio-Visual Speech Recognition (AVSR). It is shown that the Multi-Stream approach can yield improved Audio-Visual speech recognition performance when the acoustic signal is corrupted by noise as well as for clean speech.
Combining Connectionist Multi-Band And Full-Band Probability Streams For Speech Recognition Of Natural Numbers
- 1998
Cited by 17 (2 self)
Multi-band automatic speech recognition is a new and exploratory area of speech recognition which has been getting much attention in the research community. It has been shown that multi-band ASR reduces word error in noisy conditions, particularly in the case of narrow-band noise. In this work we show that multi-band ASR can be used to improve the speech recognition accuracy of natural numbers for clean speech when the multi-band (MB) information stream is used in addition to the full-band (FB) one. We also observe that a similar combination method significantly reduces the error rate on reverberant speech. Finally, we analyze the error patterns of the full-band and multi-band paradigms to understand why the combination of the two streams is effective.
Missing Feature Theory In ASR: Make Sure You Miss The Right Type Of Features
- Proc. Workshop on Robust Methods for Speech Recognition in Adverse Conditions, 1999
Cited by 15 (3 self)
In this paper we investigate acoustic backing-off as an operationalization of Missing Feature Theory to increase recognition robustness in adverse acoustic conditions. Acoustic backing-off effectively removes the detrimental influence of outlier values from the local decisions in the Viterbi algorithm. It does so without prior knowledge about the specific feature vector elements which are unreliable; thus, the technique avoids the need for explicit outlier detection. From the theory underlying Missing Feature Theory it appears that acoustic feature representations which smear local spectro-temporal distortions over all feature vector elements are inherently unsuitable. Experiments in the context of connected digit recognition over the telephone confirm this prediction. Our results show that feature representations which minimize distortion smearing are best suited for use in combination with Missing Feature Theory. Using additive band-limited noise as a distor...
Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations
Cited by 15 (1 self)
Despite the rapid adoption of Voice over IP (VoIP), its security implications are not yet fully understood. Since VoIP calls may traverse untrusted networks, packets should be encrypted to ensure confidentiality. However, we show that when the audio is encoded using variable bit rate codecs, the lengths of encrypted VoIP packets can be used to identify the phrases spoken within a call. Our results indicate that a passive observer can identify phrases from a standard speech corpus within encrypted calls with an average accuracy of 50%, and with accuracy greater than 90% for some phrases. Clearly, such an attack calls into question the efficacy of current VoIP encryption standards. In addition, we examine the impact of various features of the underlying audio on our performance and discuss methods for mitigation.
Using Multiple Time Scales In A Multi-Stream Speech Recognition System
- 1997
Cited by 14 (3 self)
In this paper, we propose and investigate a new approach towards using multiple time scale information in automatic speech recognition (ASR) systems. In this framework, we are using a particular HMM formalism able to process different input streams and to recombine them at some temporal anchor points. While the phonological level of recombination has to be defined a priori, the optimal temporal anchor points are obtained automatically during recognition. In the current approach, those parallel cooperative HMMs will focus on different dynamic properties of the speech signal, defined on different time scales. The speech signal is then defined in terms of several information streams, each stream resulting from a particular way of analyzing the speech signal. More specifically, in the current work, models aimed at capturing the syllable level temporal structure are used in parallel with classical phoneme-based models. Tests on different continuous speech databases show significant performa...
Multi-Resolution Cepstral Features for Phoneme Recognition across Speech Sub-Bands
- Proceedings of Acoustics, Speech, and Signal Processing, 1998
Cited by 13 (0 self)
Multi-resolution sub-band cepstral features strive to exploit discriminative cues in localised regions of the spectral domain by supplementing the full-bandwidth cepstral features with sub-band cepstral features derived from several levels of sub-band decomposition. Multi-resolution feature vectors, formed by concatenation of the sub-band cepstral features into an extended feature vector, are shown to yield better performance than conventional MFCCs for phoneme recognition on the TIMIT database. Possible strategies for the recombination of partial recognition scores from independent multi-resolution sub-band models are explored. By exploiting the sub-band variations in signal-to-noise ratio for linearly weighted recombination of the log-likelihood probabilities, we obtained improved phoneme recognition performance in broadband noise compared to MFCC features. This is an advantage over a purely sub-band approach using nonlinear recombination, which is robust only to narrow-band noise.
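A minimal sketch of linearly weighted recombination of per-band log-likelihoods is given below. The abstract does not say how the weights are derived from the sub-band signal-to-noise ratios, so the mapping in `snr_weights` (normalized linear-scale SNR) is an assumption, as are all function names.

```python
def snr_weights(snr_db):
    """Map per-band SNRs in dB to weights that sum to 1 (assumed mapping)."""
    linear = [10 ** (s / 10.0) for s in snr_db]
    total = sum(linear)
    return [x / total for x in linear]


def recombine(band_loglik, weights):
    """Weighted sum of log-likelihoods across sub-bands, per class.

    band_loglik[b][c] is the log-likelihood of class c from sub-band b.
    """
    n_classes = len(band_loglik[0])
    return [sum(w * ll[c] for w, ll in zip(weights, band_loglik))
            for c in range(n_classes)]


def best_class(scores):
    """Index of the highest combined score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Two sub-bands, two classes: band 0 prefers class 0, band 1 prefers class 1.
band_loglik = [[-1.0, -5.0], [-4.0, -2.0]]
equal = recombine(band_loglik, [0.5, 0.5])
weights = snr_weights([0.0, 20.0])  # band 1 is assumed much cleaner
weighted = recombine(band_loglik, weights)
```

In this toy case equal weights pick class 0, while the SNR-derived weights shift the decision to class 1, the class preferred by the cleaner band.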

