SRILM—An extensible language modeling toolkit
- In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002)
, 2002
Cited by 449 (13 self)
SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.
Observations on overlap: Findings and implications for automatic processing of multi-party conversation
- Proc. EUROSPEECH
, 2001
Cited by 51 (10 self)
We examine the distribution of overlapping speech in different corpora of natural multi-party conversations, including two types of meetings, and two corpora of telephone conversations. Analyses are based on forced alignment and speech recognition using an identical recognizer across tasks. Three results are discussed. First, all corpora show high overall rates of overlap, with similar rates for meetings and telephone conversations. Second, speech recognition performance in non-overlapped regions of meetings is no worse than that in single-channel telephone conversations, while recognition in overlap regions degrades considerably. Finally, interrupt locations are associated with endpoints of word-level events in a speaker’s turn, including backchannels, discourse markers, and disfluencies. Results suggest that overlap is an important inherent characteristic of conversational speech that should not be ignored; on the contrary, it should be jointly modeled with acoustic and language model information in machine processing of conversation.
Automatic disfluency identification in conversational speech using multiple knowledge sources
- In Proc. Eurospeech
, 2003
Cited by 18 (2 self)
Disfluencies occur frequently in spontaneous speech. Detection and correction of disfluencies can make automatic speech recognition transcripts more readable for human readers, and can aid downstream processing by machine. This work investigates a number of knowledge sources for disfluency detection, including acoustic-prosodic features, a language model (LM) to account for repetition patterns, a part-of-speech (POS) based LM, and rule-based knowledge. Different components are designed for different purposes in the system. Results show that detection of disfluency interruption points is best achieved by a combination of prosodic cues, word-based cues, and POS-based cues. The onset of a disfluency to be removed, in contrast, is best found using knowledge-based rules. Finally, specific disfluency types can be aided by the modeling of word patterns.
Can Prosody Aid the Automatic Processing of Multi-Party Meetings? Evidence from Predicting . . .
- IN PROC. ISCA TUTORIAL AND RESEARCH WORKSHOP ON PROSODY IN SPEECH RECOGNITION AND UNDERSTANDING (PROSODY
, 2001
Cited by 17 (2 self)
We investigate whether probabilistic modeling of prosody can aid various automatic labeling tasks essential for processing of multi-party meetings. Task 1, automatic punctuation, seeks to classify sentence boundaries and disfluencies. Task 2, jump-in points, predicts locations within foreground speech at which background speakers start talking; Task 3, jump-in words, examines characteristics of the speech they use to do so. Data are from the ICSI Meeting Recorder corpus. To infer inherent cues, analyses are based on close-talking microphone signals and recognizer forced alignments. As a generous baseline for word-level cues, we compare prosodic models to those of a language model given the true words. Results for Task 1 show prosody reduces classification error by 10% relative over the cheating language model; furthermore, when this task is run in "online" mode, the prosodic model degrades less than does the language model. For Task 2, the language model provides no information, while the prosodic model reduces entropy by 13% over chance. For Task 3, a prosodic model reduces entropy by 25% over chance. Analyses also show interesting prosodic patterns, which differ over tasks. Task 1 uses cues similar to those for Switchboard (but not Broadcast News) data. Task 2 predicts jump-in points that look prosodically like sentence boundaries but that are not actually such boundaries. And Task 3 shows that speakers "raise" their voice when starting during another's talk, compared to starting during silence. These results provide evidence that prosodic modeling can be of use for the automatic processing of meetings. Further results and implications for future automatic meeting processing systems are discussed.
Corrective Language Modeling For Large Vocabulary ASR With The Perceptron Algorithm
- PROC. ICASSP
, 2004
Cited by 16 (3 self)
This paper investigates error-corrective language modeling using the perceptron algorithm on word lattices. The resulting model is encoded as a weighted finite-state automaton, and is used by intersecting the model with word lattices, making it simple and inexpensive to apply during decoding. We present results for various training scenarios for the Switchboard task, including using n-gram features of different orders, and performing n-best extraction versus using full word lattices. We demonstrate the importance of making the training conditions as close as possible to testing conditions. The best approach yields a 1.3 percent improvement in first pass accuracy, which translates to 0.5 percent improvement after other rescoring passes.
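As a rough illustration of the idea in the abstract above, the following Python sketch trains a perceptron over n-gram count features to rerank n-best lists. It is a simplification under stated assumptions, not the paper's method: it works on n-best lists rather than word lattices, does not encode the model as a weighted finite-state automaton, and uses a plain (unaveraged) perceptron. All function names and the toy data format are hypothetical.

```python
from collections import Counter

def ngram_features(words, max_order=2):
    """Count all n-grams up to max_order, with sentence boundary markers."""
    padded = ["<s>"] + list(words) + ["</s>"]
    feats = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(padded) - n + 1):
            feats[tuple(padded[i:i + n])] += 1
    return feats

def score(weights, words, base_score, max_order=2):
    """Baseline recognizer score plus the learned n-gram correction."""
    feats = ngram_features(words, max_order)
    return base_score + sum(weights.get(f, 0.0) * c for f, c in feats.items())

def word_errors(hyp, ref):
    """Word-level Levenshtein distance between hypothesis and reference."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1]

def train_perceptron(data, epochs=5, max_order=2):
    """data: list of (nbest, reference); nbest is a list of (words, base_score).

    Assumption: the update target is the lowest-error hypothesis in the list.
    """
    weights = {}
    for _ in range(epochs):
        for nbest, ref in data:
            oracle = min(nbest, key=lambda h: word_errors(h[0], ref))
            best = max(nbest, key=lambda h: score(weights, h[0], h[1], max_order))
            if best[0] != oracle[0]:
                # Reward the oracle's n-grams, penalize the current winner's.
                for f, c in ngram_features(oracle[0], max_order).items():
                    weights[f] = weights.get(f, 0.0) + c
                for f, c in ngram_features(best[0], max_order).items():
                    weights[f] = weights.get(f, 0.0) - c
    return weights

def rerank(weights, nbest, max_order=2):
    """Return the word sequence with the highest corrected score."""
    return max(nbest, key=lambda h: score(weights, h[0], h[1], max_order))[0]

# Toy example: the baseline prefers the wrong hypothesis.
nbest = [(["i", "scream"], -1.0), (["ice", "cream"], -1.5)]
weights = train_perceptron([(nbest, ["ice", "cream"])])
print(rerank(weights, nbest))
```

Training and testing on the same list, as in this toy example, only shows the mechanics of the update; the abstract itself stresses that training conditions should match testing conditions as closely as possible.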
Hidden Model Sequence Models for Automatic Speech Recognition
, 2001
Cited by 4 (0 self)
Most modern automatic speech recognition systems make use of acoustic models based on hidden Markov models. To obtain reasonable recognition performance within a large vocabulary framework, the acoustic models usually include a pronunciation model, together with complex parameter tying schemes. In many cases the pronunciation model operates on a phoneme level and is derived independently of the underlying models. In contrast, this work is aimed at improving pronunciation modelling on a sub-phone level in a combined framework. The modelling of pronunciation variation is assumed to be of special importance for recognition of spontaneous speech.
Evaluating factors impacting the accuracy of forced alignments in a multimodal corpus
- In: Proc. of Language Resource and Evaluation Conference
, 2004
Cited by 3 (2 self)
People, when processing human-to-human communication, utilize everything they can in order to understand that communication, including speech and information such as the time and location of an interlocutor’s gesture and gaze. Speech and gesture are known to exhibit a synchronous relationship in human communication; however, the precise nature of that relationship requires further investigation. The construction of computer models of multimodal human communication would be enabled by the availability of multimodal communication corpora annotated with synchronized gesture and speech features. To investigate the temporal relationships of these knowledge sources, we have collected and are annotating several multimodal corpora with time-aligned features. Forced alignment between a speech file and its transcription is a crucial part of multimodal corpus production. This paper investigates a number of factors that may contribute to highly accurate forced alignments to support the rapid production of these multimodal corpora, including the acoustic model, the match between the speech used for training the system and that to be force aligned, the amount of data used to train the ASR system, the availability of speaker adaptation, and the duration of alignment segments.
PanDoRA: A Large-scale Two-way Statistical Machine Translation System for Hand-held Devices
- In the Proceedings of MT Summit XI
Cited by 1 (1 self)
The statistical machine translation (SMT) approach has taken a leading place in the field of machine translation for its better translation quality and lower training cost compared to other approaches. However, due to its high demand for computing resources, an SMT system cannot be run directly on hand-held devices. Most existing hand-held translation systems are either interlingua-based, which requires non-trivial human effort to write grammar rules, or use a client/server architecture, which is constrained by the availability of wireless connections. In this paper we present PanDoRA, a two-way phrase-based statistical machine translation system for stand-alone hand-held devices. Powered by special designs such as integerized computation and compact data structures, PanDoRA can translate dialogue speech on off-the-shelf PDAs in real time. PanDoRA uses a 64K-word vocabulary and millions of phrase pairs for each translation direction. To our knowledge, PanDoRA is the first large-scale SMT system with built-in reordering models running on hand-held devices. We have successfully developed several speech-to-speech translation systems using PanDoRA, and our experiments show that PanDoRA's translation quality is comparable to that of state-of-the-art phrase-based statistical machine translation systems such as Pharaoh and STTK.
Prosodic Cues For Emotion Recognition In Communicator Dialogs
, 2002
Cited by 1 (0 self)
this report) as described in [11]. See [11] for further reference on Kappa and significance

