Speech Recognition in Mobile Environments
, 2000
Cited by 10 (0 self)
The growth of cellular telephony, combined with recent advances in speech recognition technology, creates sizeable opportunities for mobile speech recognition applications. Classic robustness techniques previously proposed for speech recognition yield only limited improvement against the degradation introduced by idiosyncrasies of mobile networks. These sources of degradation include distortion introduced by the speech codec as well as artifacts arising from channel errors and discontinuous transmission. In this thesis we focus on characterizing the distortion introduced to the speech signal by the speech codec, and we propose methods for reducing the detrimental effect of coding on recognition accuracy. The initial focus of this thesis is on the full rate GSM codec (FRGSM). We propose a method to generate recognition features directly from codec parameters. It is shown in this work that by selectively constructing a cepstral feature vector from the GSM codec para...
Graceful Degradation of Speech Recognition Performance over Packet-Erasure Networks
- IEEE Trans. On Speech and Audio Processing
, 2002
Cited by 9 (0 self)
This paper explores packet loss recovery for automatic speech recognition (ASR) in spoken dialog systems, assuming an architecture in which a lightweight client communicates with a remote ASR server. Speech is transmitted with source and channel codes optimized for the ASR application, i.e., to minimize word error rate. Unequal amounts of forward error correction, depending on the data's effect on ASR performance, are assigned to protect against packet loss. Experiments with simulated packet loss in a range of loss conditions are conducted on the DARPA Communicator (air travel information) task. Results show that the approach provides robust ASR performance which degrades gracefully as packet loss rates increase. Transmitting at 5.2 kbps with up to 200 ms added delay leads to only a 7% relative degradation in word error rate even under extremely adverse network conditions.
Low-Bitrate Distributed Speech Recognition for Packet-Based and Wireless Communication
- IEEE Transactions on Speech and Audio Processing
, 2002
Cited by 7 (1 self)
In this paper, we present a framework for developing source coding, channel coding and decoding as well as erasure concealment techniques adapted for distributed (wireless or packet-based) speech recognition. It is shown that speech recognition, as opposed to speech coding, is more sensitive to channel errors than channel erasures, and appropriate channel coding design criteria are determined. For channel decoding, we introduce a novel technique for combining soft decision decoding with error detection at the receiver. Frame erasure concealment techniques are used at the decoder to deal with unreliable frames. At the recognition stage, we present a technique to modify the recognition engine itself to take into account the time-varying reliability of the decoded feature after channel transmission. The resulting engine, referred to as weighted Viterbi recognition, further improves recognition accuracy. Together, source coding, channel coding and the modified recognition engine are shown to provide good recognition accuracy over a wide range of communication channels with bitrates of 1.2 kbps or less.
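The weighted Viterbi idea in the abstract above can be illustrated with a minimal sketch. It assumes that each frame carries a reliability weight in [0, 1] and that the weight scales the frame's observation log-likelihood, so an unreliable frame contributes little to the path score. The function name `weighted_viterbi`, its argument layout, and the way the weights are obtained are assumptions for illustration, not the paper's implementation.

```python
import math


def weighted_viterbi(log_init, log_trans, log_obs, weights):
    """Return the best state path and its weighted log score.

    log_init[j]     : log prior of state j
    log_trans[i][j] : log transition probability from state i to state j
    log_obs[t][j]   : log likelihood of frame t under state j
    weights[t]      : reliability of frame t in [0, 1] (assumed to be given)
    """
    n_states = len(log_init)
    # The weight acts as an exponent on the observation probability,
    # i.e. a multiplier in the log domain.
    score = [log_init[j] + weights[0] * log_obs[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(log_obs)):
        prev = score
        score, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            score.append(prev[best_i] + log_trans[best_i][j]
                         + weights[t] * log_obs[t][j])
        backptr.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

With all weights set to 1 this reduces to ordinary Viterbi decoding; setting a frame's weight to 0 makes the decoder rely on the transition model alone for that frame.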
Efficient Scalable Encoding for Distributed Speech Recognition
- IEEE Transactions on Speech and Audio Processing, Submitted
, 2003
Cited by 6 (3 self)
In this paper the remote speech recognition problem is addressed. Speech features are extracted at a client and transmitted to a remote recognizer. This enables a low-complexity client, which does not have the computational and memory resources to host a complex speech recognizer, to make use of distributed resources to provide speech recognition services to the user. The novelties of the proposed work are (i) the extracted features are compressed using scalable encoding techniques providing a multi-resolution bitstream, and (ii) a complete scalable distributed speech recognition (DSR) system is presented wherein the proposed scalable encoding technique is combined with a scalable recognition system. The scalable DSR system provides successive approximation in terms of recognition performance (i.e., as additional bits are transmitted, the recognition can be refined to improve performance) and achieves both bandwidth and complexity (latency) reductions. The proposed encoding schemes are well suited to implementation on light-weight mobile devices, where varying ambient conditions and limited computational capabilities pose a severe constraint on achieving good recognition performance. The scalable DSR system is capable of adapting to varying network, system and user constraints by operating at the "right" trade-off point between transmission rate, recognition performance and complexity to provide good quality of service (QoS) to the user. The system was tested using two case studies. In the first, the scalable encoder along with a dynamic time warping-hidden Markov model (DTW-HMM) system reduced the recognition complexity by 25% compared to a system using only an HMM, with no degradation in word error rate (WER). In the second study, a distributed two-...
Scalable Distributed Speech Recognition Using Multi-Frame GMM-Based Block Quantization
Cited by 5 (2 self)
In this paper, we propose the use of the multi-frame Gaussian mixture model-based block quantizer for the coding of Mel frequency-warped cepstral coefficient (MFCC) features in distributed speech recognition (DSR) applications. This coding scheme exploits intraframe correlation via the Karhunen-Loève transform (KLT) and interframe correlation via the joint processing of adjacent frames together with the computational simplicity of scalar quantization. The proposed coder is bit-rate scalable, which means that the bitrate can be adjusted without the need for re-training of the quantizers. Static parameters such as the probability density function (PDF) model and KLT orthogonal matrices are stored at the encoder and decoder and bit allocations are calculated 'on-the-fly' without intensive processing. This coding scheme is evaluated in this paper on the Aurora-2 database in a DSR framework. It is shown that this coding scheme achieves high recognition performance at lower bitrates, with a word error rate (WER) of 2.5% at 800 bps, which is less than 1% degradation from the baseline word recognition accuracy, and graceful degradation down to a WER of 7% at 300 bps.
Energy aware distributed speech recognition for wireless mobile devices, Hewlett Packard Laboratories
- IEEE Design and Test of Computers: Special Issue on Embedded Systems for Real-Time Multimedia
, 2004
Cited by 5 (0 self)
Keywords: low-power, distributed speech recognition, wireless
The use of a voice-user interface for mobile wireless devices has been an area of interest for some time. However, these devices are generally limited by computation, memory, and battery energy, so performing high quality speech recognition on an embedded device is a difficult challenge. In this paper, we investigate the energy consumption of distributed speech recognition (DSR) on the HP Labs SmartBadge IV embedded system and propose optimizations at both the application and network layers that reduce the overall energy budget for this application while still maintaining adequate quality of service for the end-user. We consider energy consumption in both computation and communication. We present software optimization techniques that reduce the energy consumption of the speech signal processing algorithm by 83%. In addition, we estimate the energy consumption of client-side automatic speech recognition without the use of the network. We present a range of results such that the upper bound may match the results of server-based
Joint Channel Decoding - Viterbi Recognition for Wireless Applications
- in Proceedings of Eurospeech
, 2001
Cited by 4 (1 self)
We introduce the concept of joint channel decoding and Viterbi recognition, by which the Viterbi recognizer is modified to take into account the confidence in the decoded feature after channel transmission. We present a metric for evaluating such confidence based on soft decision decoding. As a case study, we quantize MFCCs using predictive VQ. The overall source-channel coding scheme operating at a combined rate of 1 kbps is shown to provide good recognition accuracy over a wide range of Rayleigh fading channels.
Automatic speech recognition over error-prone wireless networks
- Speech Communication
, 2005
Cited by 4 (2 self)
The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, and the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors. In the paper, a model of network degradations and robustness techniques is presented. These techniques are classified into three categories: error detection, error recovery and error concealment (EC). A one-frame error detection scheme is described and compared with a frame-pair scheme. In contrast to vector-level techniques, a technique for error detection and EC at the sub-vector level is presented. A number of error recovery techniques such as forward error correction and interleaving are discussed, in addition to a review of both feature-reconstruction and ASR-decoder based EC techniques. To enable the comparison of some of these techniques, evaluation has been conducted on the basis of the same speech database and channel. Special attention is given to the unique characteristics of DSR as compared to streaming audio, e.g., voice-over-IP. Additionally, a technique for adapting ASR to the varying quality of networks is presented. The frame-error-rate is here used to adjust the discrimination threshold with the goal of optimising out-of-vocabulary detection.
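As a rough illustration of feature-reconstruction error concealment as mentioned in the abstract above, the sketch below replaces each erased feature frame with a copy of the nearest correctly received frame. Nearest-frame repetition is only one simple option and is not claimed to be the scheme evaluated in the paper; the function name `conceal_erasures`, the `received` flag list, and the tie-breaking rule are assumptions.

```python
def conceal_erasures(frames, received):
    """Replace erased frames by the nearest correctly received frame.

    frames   : list of feature vectors (lists of floats)
    received : list of booleans, True where the frame passed error detection
    Ties between an earlier and a later good frame go to the earlier one
    (an arbitrary choice made for this sketch).
    """
    good = [i for i, ok in enumerate(received) if ok]
    if not good:
        raise ValueError("no reliable frame to copy from")
    out = []
    for i, frame in enumerate(frames):
        if received[i]:
            out.append(list(frame))
        else:
            nearest = min(good, key=lambda g: (abs(g - i), g))
            out.append(list(frames[nearest]))
    return out
```

Interpolating between the surrounding good frames is a common alternative to repetition; both operate on the features before recognition, whereas decoder-based concealment changes how the recognizer scores the unreliable frames.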
Use of model transformations for distributed speech recognition
- in ISCA ITR-Workshop 2001 (Adaptation Methods for Speech Recognition), (Sophia-Antipolis
, 2001
Cited by 4 (2 self)
Due to bandwidth limitations, the speech recognizer in distributed speech recognition (DSR) applications has to use encoded speech, either traditional speech encoding or speech encoding optimized for recognition. The penalty incurred in reducing the bitrate is degradation in speech recognition performance. The diversity of the applications using DSR implies that a variety of speech encoders can be used to compress speech. By treating the encoder variability as a mismatch, we propose using model transformation to reduce the speech recognition performance degradation. The advantage of using model transformation is that only a single model set needs to be trained at the server, which can be adapted on the fly to the input speech data. We were able to reduce the word error rate by 61.9%, 63.3% and 56.3% for MELP, GSM and MFCC-encoded data, respectively, by using MAP adaptation, which shows the generality of our proposed scheme.
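To make the MAP adaptation step in the abstract above concrete, here is a minimal sketch of the standard MAP update for a single Gaussian mean: the adapted mean interpolates between the prior (server-trained) mean and the adaptation data, weighted by how many frames were assigned to that Gaussian. The function name `map_adapt_mean`, the prior weight `tau`, and its default value are assumptions; the abstract does not state which parameters were adapted or how.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP estimate of a Gaussian mean.

    prior_mean : mean vector of the server-trained model
    frames     : adaptation feature vectors (e.g. from codec-distorted speech)
    posteriors : occupation probability of this Gaussian for each frame
    tau        : prior weight; larger values keep the mean closer to the prior
    """
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted_sum = [sum(g * x[d] for g, x in zip(posteriors, frames))
                    for d in range(dim)]
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
            for d in range(dim)]
```

With little adaptation data the result stays near the prior mean, and as the occupancy grows it approaches the sample mean of the adaptation frames, which is why a single model set can be adapted incrementally to different encoders.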
Multi-frame GMM-based block quantisation of line spectral frequencies
- in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing
, 2005
Cited by 3 (1 self)
In this paper, we investigate the use of the Gaussian mixture model-based block quantiser for coding line spectral frequencies that uses multiple frames and mean squared error as the quantiser selection criterion. As a viable alternative to vector quantisers, the GMM-based block quantiser offers low computational and memory requirements as well as bitrate scalability. Jointly quantising multiple frames allows the exploitation of correlation across successive frames, which leads to more efficient block quantisers. The efficiency gained from joint quantisation permits the use of the mean squared error distortion criterion for cluster quantiser selection, rather than the computationally expensive spectral distortion. The distortion performance gains come at the cost of an increase in computational complexity and memory. Experiments on narrowband speech from the TIMIT database demonstrate that the multi-frame GMM-based block quantiser can achieve a spectral distortion of 1 dB at 22 bits/frame, or 21 bits/frame with some added complexity.

