Weighted Finite-State Transducers in Speech Recognition
, 2001
Cited by 101 (3 self)
We survey the use of weighted finite-state transducers (WFSTs) in speech recognition. We show that WFSTs provide a common and natural representation for HMM models, context-dependency, pronunciation dictionaries, grammars, and alternative recognition outputs. Furthermore, general transducer operations combine these representations flexibly and efficiently. Weighted
An Efficient Algorithm for the n-Best-Strings Problem
- In Proceedings of the International Conference on Spoken Language Processing 2002 (ICSLP ’02)
, 2002
Cited by 12 (1 self)
... problem in a weighted automaton. This problem arises commonly in speech recognition applications when a ranked list of unique recognizer hypotheses is desired. We believe this is the first n-best algorithm to remove redundant hypotheses before rather than after the n-best determination. We give a detailed description of the algorithm and demonstrate its correctness. We report experimental results showing its efficiency and practicality even for large n in a 40,000-word vocabulary North American Business News (NAB) task. In particular, we show that 1000-best generation in this task requires negligible added time over recognizer lattice generation.
A Weight Pushing Algorithm for Large Vocabulary Speech Recognition
- In European Conf. on Speech Communication and Technology
, 2001
Cited by 10 (6 self)
Weighted finite-state transducers provide a general framework for the representation of the components of speech recognition systems: language models, pronunciation dictionaries, context-dependent models, HMM-level acoustic models, and the output word or phone lattices can all be represented by weighted automata and transducers. In general, a representation is not unique and there may be different weighted transducers realizing the same mapping. In particular, even when they have exactly the same topology with the same input and output labels, two equivalent transducers may differ in the way the weights are distributed along each path. We present
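The idea that two equivalent automata can differ only in how weights are distributed along paths can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the tropical semiring (weights add along a path, the best path has minimum weight), non-negative arc weights, and a hypothetical arc format `(src, label, weight, dst)`. The function name `push_weights` is invented for this illustration.

```python
import heapq
from collections import defaultdict

def push_weights(arcs, initial, finals):
    """Push weights toward the initial state (tropical semiring sketch).

    arcs:    list of (src, label, weight, dst), weights assumed non-negative
    initial: the initial state
    finals:  dict mapping final state -> final weight
    Returns (new_arcs, initial_weight, new_finals).
    """
    # Index incoming arcs so we can walk the graph backwards from final states.
    incoming = defaultdict(list)
    for src, _label, weight, dst in arcs:
        incoming[dst].append((src, weight))

    # Dijkstra on the reversed graph: dist[q] is the weight of the best
    # path from q to a final state, including the final weight.
    dist = {}
    heap = [(weight, state) for state, weight in finals.items()]
    heapq.heapify(heap)
    while heap:
        d, state = heapq.heappop(heap)
        if state in dist:
            continue
        dist[state] = d
        for prev, weight in incoming[state]:
            if prev not in dist:
                heapq.heappush(heap, (d + weight, prev))

    # Reweight each arc; states that cannot reach a final state are dropped.
    new_arcs = [
        (src, label, weight + dist[dst] - dist[src], dst)
        for src, label, weight, dst in arcs
        if src in dist and dst in dist
    ]
    new_finals = {state: weight - dist[state] for state, weight in finals.items()}
    return new_arcs, dist[initial], new_finals

# Two paths: a c (total weight 5) and b d (total weight 4).
example_arcs = [(0, "a", 1.0, 1), (0, "b", 3.0, 2), (1, "c", 4.0, 3), (2, "d", 1.0, 3)]
pushed_arcs, initial_weight, pushed_finals = push_weights(example_arcs, 0, {3: 0.0})
```

After pushing, every complete path keeps its total weight once the returned initial weight is added, but the cost of each choice is visible as early as possible, which is what makes pruning during search more effective.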
Generalized Optimization Algorithm for Speech Recognition Transducers
- In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003)
, 2003
Cited by 3 (2 self)
Weighted transducers provide a common representation for the components of a speech recognition system. In previous work, we showed that these components can be combined off-line into a single compact recognition transducer that maps HMM state sequences directly to word sequences [11]. The construction of that recognition transducer and its efficiency of use critically depend on the use of a general optimization algorithm, determinization. However, not all weighted automata and transducers used in large-vocabulary speech recognition are determinizable. We present a general algorithm that can make an arbitrary weighted transducer determinizable and generalize our previous optimization technique for building an integrated recognition transducer to deal with arbitrary weighted transducers used in speech recognition. We report experimental results in a large-vocabulary speech recognition task, How May I Help You (HMIHY), showing that our generalized technique leads to a recognition transducer that performs as well as our original solution in the case of classical n-gram models while inserting fewer special symbols, and that it improves recognition speed substantially, by a factor of 2.6, in the same task when using a class-based language model.
Towards a Unified Framework for Sub-lexical and Supra-lexical Linguistic Modeling
, 2002
Conversational interfaces have received much attention as a promising natural communication channel between humans and computers. A typical conversational interface consists of three major systems: speech understanding, dialog management and spoken language generation. In such a conversational interface, speech recognition as the front-end of speech understanding remains one of the fundamental challenges for establishing robust and effective human/computer communications. On the one hand, the speech recognition component in a conversational interface lives in a rich system environment. Diverse sources of knowledge are available and can potentially be beneficial to its robustness and accuracy. For example, the natural language understanding component can provide linguistic knowledge in syntax and semantics that helps constrain the recognition search space. On the other hand, the speech recognition component also faces the challenge of spontaneous speech, and it is important to address the casualness of speech using the knowledge sources available. For example, sub-lexical linguistic information would be very useful in providing linguistic support for previously unseen words, and dynamic reliability modeling may help improve recognition robustness for poorly articulated speech.
Speech Recognition with Dynamic Grammars Using Finite-State Transducers
, 2003
Spoken language systems ranging from interactive voice response (IVR) to mixed-initiative conversational systems make use of a wide range of recognition grammars and vocabularies. The recognition grammars are either static (created at design time) or dynamic (dependent on database lookup at run time). This paper examines the compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate to context-dependent speech recognizers. By casting the problem in the algebra of finite-state transducers (FSTs), we can use the composition operator for fast and efficient compilation and splicing of dynamic recognition grammars within the context of a larger precompiled static grammar.
The Design Principles and Algorithms of a General Weighted Grammar Library
We present the software design principles, algorithms, and utilities of a general weighted grammar library, the GRM Library, that can be used in a variety of applications in text, speech, and biosequence processing. Several of the algorithms and utilities of this library are described, including in some cases their pseudocodes and pointers to their use in applications. The algorithms and the utilities were designed to support a wide variety of semirings and the representation and use of large grammars and automata of several hundred million rules or transitions.
Beamforming with a Maximum Negentropy Criterion
In this paper, we address a beamforming application based on the capture of far-field speech data from a single speaker in a real meeting room. After the position of the speaker is estimated by a speaker tracking system, we construct a subband-domain beamformer in generalized sidelobe canceller (GSC) configuration. In contrast to conventional practice, we then optimize the active weight vectors of the GSC so as to obtain an output signal with maximum negentropy (MN). This implies the beamformer output should be as non-Gaussian as possible. For calculating negentropy, we consider the Γ and the generalized Gaussian (GG) pdfs. After MN beamforming, Zelinski postfiltering is performed to further enhance the speech by removing residual noise. Our beamforming algorithm can suppress noise and reverberation without the signal cancellation problems encountered in conventional beamforming algorithms. We demonstrate this fact through a set of acoustic simulations. Moreover, we show the effectiveness of our proposed technique through a series of far-field automatic speech recognition experiments on the Multi-Channel Wall Street Journal Audio Visual Corpus (MC-WSJ-AV), a corpus of data captured with real far-field sensors, in a realistic acoustic environment, and spoken by real speakers. On the MC-WSJ-AV evaluation data, the delay-and-sum beamformer with post-filtering achieved a word error rate (WER) of 16.5%. MN beamforming with the Γ pdf achieved a 15.8% WER, which was further reduced to 13.2% with the GG pdf, whereas the simple delay-and-sum beamformer provided a WER of 17.8%. To the best of our knowledge, no lower error rates have been reported to date in the literature on this ASR task. Index Terms: microphone arrays, beamforming, speech recognition, speech enhancement, source separation
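The statement that maximum negentropy means "as non-Gaussian as possible" can be made concrete with a small sketch. It is not the paper's estimator, which works on beamformer output samples. It assumes a zero-mean real-valued generalized Gaussian with shape parameter p and scale alpha, and uses the closed-form differential entropies; the function name `gg_negentropy` is invented for this illustration. Negentropy is the entropy of a Gaussian with the same variance minus the entropy of the distribution itself.

```python
import math

def gg_negentropy(shape, scale=1.0):
    """Negentropy (in nats) of a zero-mean generalized Gaussian.

    Assumed density: f(x) = shape / (2 * scale * Gamma(1/shape))
                            * exp(-(|x| / scale) ** shape)
    shape = 2 gives a Gaussian, shape = 1 a Laplacian.
    """
    # Differential entropy of the generalized Gaussian.
    gg_entropy = 1.0 / shape + math.log(
        2.0 * scale * math.gamma(1.0 / shape) / shape
    )
    # Variance of the generalized Gaussian.
    variance = scale ** 2 * math.gamma(3.0 / shape) / math.gamma(1.0 / shape)
    # Differential entropy of a Gaussian with the same variance.
    gaussian_entropy = 0.5 * math.log(2.0 * math.pi * math.e * variance)
    return gaussian_entropy - gg_entropy

for p in (2.0, 1.0, 0.5):
    print(p, gg_negentropy(p))
```

Under these assumptions the negentropy is zero for shape 2 (the Gaussian case) and grows as the shape parameter decreases, that is, as the density becomes more peaked and heavy-tailed, which matches the super-Gaussian character usually attributed to clean speech.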
Filter Bank Design for Subband Adaptive Beamforming and Application to Speech Recognition
, 2008
We present a new filter bank design method for subband adaptive beamforming. Filter bank design for adaptive filtering poses many problems not encountered in more traditional applications such as subband coding of speech or music. The popular class of perfect reconstruction filter banks is not well-suited for applications involving adaptive filtering because perfect reconstruction is achieved through alias cancellation, which functions correctly only if the outputs of individual subbands are not subject to arbitrary magnitude scaling and phase shifts. In this work, we design analysis and synthesis prototypes for modulated filter banks so as to minimize each aliasing term individually. We then show that the total response error can be driven to zero by constraining the analysis and synthesis prototypes to be Nyquist(M) filters. We show that the proposed filter banks are more robust to aliasing caused by adaptive beamforming than conventional methods. Furthermore, we demonstrate the effectiveness of our design technique through a set of automatic speech recognition experiments on the multi-channel, far-field speech data from the PASCAL Speech Separation Challenge. In our system, speech signals are first transformed into the subband domain with the proposed filter banks, and thereafter the subband components are processed with a beamforming algorithm. Following beamforming,

