Multilingual Speech Recognition, 2000
Cited by 43 (2 self)
The speech-to-speech translation system Verbmobil requires a multilingual setting. This consists of recognition engines in the three languages German, English and Japanese that run in one common framework together with a language identification component which is able to switch between these recognizers. This article describes the challenges of multilingual speech recognition and presents different solutions to the problem of the automatic language identification task. The combination of the described components results in a flexible and user-friendly multilingual spoken dialog system.
Meeting Browser: Tracking And Summarizing Meetings, 1998
Cited by 43 (10 self)
To provide rapid access to meetings between human beings, transcription, tracking, retrieval and summarization of on-going human-to-human conversation have to be achieved. In DARPA- and DoD-sponsored work (projects GENOA and CLARITY) we aim to develop strategies to transcribe human discourse and provide rapid access to the structure and content of this human exchange. The system consists of four major components: 1) the speech transcription engine, based on the JANUS recognition toolkit; 2) the summarizer, a statistical tool that attempts to find salient and novel turns in the exchange; 3) the discourse component, which attempts to identify the speech acts; and 4) the non-verbal structure, including speaker types and non-verbal visual cues. The meeting browser also attempts to identify the speech acts found in the turns of the meeting and to track topics. The browser is implemented in Java and also includes video capture of the individuals in the meeting. It attempts to identify the spea...
Multimodal Meeting Tracker - In Proceedings of RIAO2000, 2000
Cited by 21 (7 self)
Face-to-face meetings usually encompass several modalities including speech, gesture, handwriting, and person identification. Recognition and integration of each of these modalities is important to create an accurate record of a meeting. However, each of these modalities presents recognition difficulties. Speech recognition must be speaker- and domain-independent, have low word error rates, and be close to real time to be useful. Gesture and handwriting recognition must be writer-independent and support a wide variety of writing styles. Person identification has difficulty with segmentation in a crowded room. Furthermore, in order to produce the record automatically, we have to solve the assignment problem (who is saying what), which involves people identification and speech recognition. We follow a multimodal approach for people identification to increase the robustness (with the modules: color appearance ID, face ID and speaker ID). This paper will examine a meeting room system under ...
Data-Driven Approach To Designing Compound Words For Continuous Speech Recognition - IEEE Trans. Speech and Audio Processing, 1999
Cited by 19 (2 self)
In this paper we present a new approach to deriving compound words from a training corpus. The motivation for making compound words is that, under some assumptions, errors occur less frequently in longer words. Further, compound words also enable more accurate modeling of pronunciation variability at the boundary between adjacent words in a continuously spoken utterance. We introduce a measure based on the product of the direct and the reverse bigram of a pair of words for finding candidate pairs in order to create compound words. Our experimental results show that by augmenting both the acoustic vocabulary and the language model with these new tokens, the word recognition accuracy can be improved by 2.8% absolute (7% relative) on a voicemail continuous speech recognition task. We also compare the proposed measure for selecting compound words with other measures that have been described in the literature.
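The abstract above describes a pair-selection measure built from the product of the direct and the reverse bigram. A minimal sketch of one way such a score could be computed follows. It assumes both bigrams are estimated as plain relative frequencies over adjacent word pairs; the function name `compound_scores` and the toy corpus are hypothetical, and the paper's actual estimation and thresholding may differ.

```python
from collections import Counter


def compound_scores(sentences):
    """Score each adjacent word pair by P(w2 | w1) * P(w1 | w2).

    Assumption: both probabilities are relative frequencies counted
    over adjacent pairs, with no smoothing.
    """
    left = Counter()   # how often a word is the first element of a pair
    right = Counter()  # how often a word is the second element of a pair
    pairs = Counter()  # how often the pair (w1, w2) occurs
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            left[w1] += 1
            right[w2] += 1
            pairs[(w1, w2)] += 1

    scores = {}
    for (w1, w2), n in pairs.items():
        direct = n / left[w1]    # estimate of P(w2 follows | w1)
        reverse = n / right[w2]  # estimate of P(w1 precedes | w2)
        scores[(w1, w2)] = direct * reverse
    return scores


# Hypothetical toy corpus, for illustration only.
corpus = [
    "please give me a call back".split(),
    "give me a call".split(),
    "call me back".split(),
]
scores = compound_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A pair scores highly only when each word strongly predicts the other, so a frequent word that merely co-occurs with many neighbors is penalized by the reverse term.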
Modeling And Efficient Decoding Of Large Vocabulary Conversational Speech - In Proceedings of the EUROSPEECH99, 1999
Cited by 19 (3 self)
Capturing the large variability of conversational speech in the framework of purely phone-based speech recognizers is virtually impossible. It has been shown earlier that suprasegmental features such as speaking rate, duration and syllabic, syntactic and semantic structure are important predictors of pronunciation variation. To allow for a tighter coupling of these predictors with pronunciation, duration and acoustic modeling, a new recognition toolkit has been developed. The phonetic transcription of speech has been generalized to an attribute-based representation, thus enabling the integration of suprasegmental, non-phonetic features. A pronunciation model is trained to augment the attribute transcription to mark possible pronunciation effects, which are then taken into account by the acoustic model induction algorithm. A finite-state-machine, single-prefix-tree, one-pass, time-synchronous decoder is presented that efficiently decodes highly spontaneous speech within this new representational framework.
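The decoder in the abstract above is described as organized around a single prefix tree. The sketch below illustrates only the general prefix-sharing idea behind such a lexicon tree, not the decoder itself. It assumes plain phone strings rather than the paper's attribute-based units; the lexicon, the phone symbols and the helper names are hypothetical.

```python
def new_node():
    # Each node maps the next phone to a child node and records the
    # words whose pronunciation ends exactly at this node.
    return {"children": {}, "words": []}


def build_prefix_tree(lexicon):
    """Build a pronunciation prefix tree from {word: [phones]}."""
    root = new_node()
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(phone, new_node())
        node["words"].append(word)
    return root


def count_nodes(node):
    """Count all nodes in the tree, including the root."""
    return 1 + sum(count_nodes(c) for c in node["children"].values())


# Hypothetical three-word lexicon, for illustration only.
lexicon = {
    "me": ["m", "iy"],
    "meet": ["m", "iy", "t"],
    "meeting": ["m", "iy", "t", "ih", "ng"],
}
tree = build_prefix_tree(lexicon)
flat_arcs = sum(len(p) for p in lexicon.values())
print("phone arcs without sharing:", flat_arcs)
print("phone arcs in prefix tree:", count_nodes(tree) - 1)
```

Words that begin with the same phones share the same path from the root, so a search over the tree evaluates each common prefix once rather than once per word.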
Progress In Automatic Meeting Transcription - In Proceedings of the EUROSPEECH, 1999
Cited by 11 (6 self)
In this paper we report recent developments on the meeting transcription task, a large vocabulary conversational speech recognition task. Previous experiments showed this is a very challenging task, with about 50% word error rate (WER) using existing recognizers. The difficulty mostly comes from the highly disfluent, conversational nature of meetings and the lack of domain-specific training data. For the first problem, our SWB (Switchboard) system, a conversational telephone speech recognizer, was used to recognize wide-band meeting data; for the latter, we leveraged the large amount of Broadcast News (BN) data to build a robust system. This paper will especially focus on two experiments in the BN system development: model combination and HMM topology/duration modeling. Model combination can be done at various stages of recognition: post-processing schemes such as ROVER can lead to significant improvements; to reduce computation we tried model combination at acoustic score level. We will ...
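The abstract above quotes a word error rate of about 50%. As a reminder of how that figure is conventionally defined, here is a minimal sketch that computes WER as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function name and the example sentences are hypothetical; production scoring tools also apply text normalization that is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word and one inserted word.
wer = word_error_rate("we will meet on monday", "we meet on a monday")
print(f"WER = {wer:.0%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.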
Speech Recognition over NetMeeting Connections, 2001
Cited by 4 (3 self)
In this paper we evaluate the performance of the ISL's German Verbmobil spontaneous speech recognizer on the Nespole! database. In this task, people talk to an agent in a tourist office to plan their holidays via a NetMeeting connection, also sharing screen contents (web pages). Stereo recordings were made both before and after speech transmission over an IP connection using the G.711 codec, so that we are able to directly measure the loss in LVCSR performance due to NetMeeting's segmentation and compression. The aim of this work is to quantify this loss, which is a consequence of using protocols that were not designed for speech recognition purposes. We report on techniques employed to port our existing clean-speech recognizer to this new data quality, using about 1.5 h of labeled adaptation data, but avoiding a complete retraining of the system.
Hidden Model Sequence Models for Automatic Speech Recognition, 2001
Cited by 4 (0 self)
Most modern automatic speech recognition systems make use of acoustic models based on hidden Markov models. To obtain reasonable recognition performance within a large vocabulary framework, the acoustic models usually include a pronunciation model, together with complex parameter tying schemes. In many cases the pronunciation model operates on a phoneme level and is derived independently of the underlying models. In contrast, this work is aimed at improving pronunciation modelling on a sub-phone level in a combined framework. The modelling of pronunciation variation is assumed to be of special importance for recognition of spontaneous speech.
ENGINE
To provide rapid access to meetings between human beings, transcription, tracking, retrieval and summarization of on-going human-to-human conversation have to be achieved. In DARPA- and DoD-sponsored work (projects GENOA and CLARITY) we aim to develop strategies to transcribe human discourse and provide rapid access to the structure and content of this human exchange. The system consists of four major components: 1) the speech transcription engine, based on the JANUS recognition toolkit; 2) the summarizer, a statistical tool that attempts to find salient and novel turns in the exchange; 3) the discourse component, which attempts to identify the speech acts; and 4) the non-verbal structure, including speaker types and non-verbal visual cues. The meeting browser also attempts to identify the speech acts found in the turns of the meeting and to track topics. The browser is implemented in Java and also includes video capture of the individuals in the meeting. It attempts to identify the speakers and their focus of attention from acoustic and visual cues.

