Prosody modeling for automatic speech recognition and understanding
- in Proc. Workshop on Mathematical Foundations of Natural Language Modeling
, 2002
Cited by 17 (2 self)
Abstract. This paper summarizes statistical modeling approaches for the use of prosody (the rhythm and melody of speech) in automatic recognition and understanding of speech. We outline effective prosodic feature extraction, model architectures, and techniques to combine prosodic with lexical (word-based) information. We then survey a number of applications of the framework, and give results for automatic sentence segmentation and disfluency detection, topic segmentation, dialog act labeling, and word recognition. Key words. Prosody, speech recognition and understanding, hidden Markov models.
Sector-Based Detection for Hands-Free Speech Enhancement in Cars
, 2006
Cited by 11 (7 self)
Adaptation control of beamforming interference cancellation techniques is investigated for in-car speech acquisition. Two efficient adaptation control methods are proposed that avoid target cancellation. The “implicit” method varies the step-size continuously, based on the filtered output signal. The “explicit” method decides in a binary manner whether to adapt or not, based on a novel estimate of target and interference energies. It estimates the average delay-sum power within a volume of space, for the same cost as the classical delay-sum. Experiments on real in-car data validate both methods, including a case with 100 km/h background road noise.
Speech Segmentation without Speech Recognition
- Proc. of ICASSP 2003
, 2003
Cited by 7 (1 self)
In this paper, we present a semantic speech segmentation approach, specifically sentence segmentation, without speech recognition. To obtain phoneme-level information without word recognition, a novel vowel/consonant/pause (V/C/P) classification is proposed. An adaptive pause detection method is also presented to adapt to various backgrounds and environments. Three feature sets (pause, rate of speech, and prosody) are used to discriminate sentence boundaries. Experiments on broadcast news indicate that the performance of the proposed algorithm is satisfactory.
Prosody modeling for automatic speech understanding: an overview of recent research at SRI
- In Proc. ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding
, 2001
Cited by 6 (0 self)
Prosody has long been studied as an important knowledge source for speech understanding. In recent years there has been a large amount of computational work aimed at prosodic
ToBI Or Not ToBI?
, 2002
Cited by 5 (0 self)
In the decade that has passed since the introduction of the ToBI system for the transcription of prosody, speech technology has moved out of the laboratory and into commercial applications. [...] virtually none of the commercial products have made large-scale use of prosody. Nevertheless, researchers in both recognition and synthesis continue to agree that better utilization of prosody is essential to improving the performance and acceptability of commercial systems. In this paper, we review the current state of prosody in commercial systems, and examine how the ongoing discussions related to what and how to transcribe with respect to prosody have simultaneously advanced and inhibited the field. In particular, we argue that, in hindsight, the ToBI system contains several flaws that have limited its acceptance and application.
Memory-based Robust Interpretation of Recognised Speech
- In Proceedings of SPECOM ’04, 9th International Conference “Speech and Computer”
, 2004
Cited by 4 (0 self)
We describe a series of experiments in which memory-based machine learning techniques are used for the interpretation of spoken user input in human-machine interactions. In these experiments, the task is to determine the dialogue act of the user input and the type of information slots the user fills, on the basis of a variety of features representing the spoken input (speech measurements and word recognition information) as well as its context (the interaction history). In the first experiment, we perform this task using the complete word graph output of the automatic speech recogniser. This yields an overall accuracy of 76.2%, with an F-score of 91.3 on dialogue act classification and an F-score of 87.7 on filled slot types. In the second experiment, we investigate the usefulness of two approaches to filtering out possibly non-contributing word recognition information from the speech recogniser output: (i) filtering out disfluencies, and (ii) keeping only syntactic chunk heads.
Using Prosodic Features in Language Models for Meetings
Cited by 4 (0 self)
Abstract. Prosody has been actively studied as an important knowledge source for speech recognition and understanding. In this paper, we are concerned with the question of exploiting prosody for language models to aid automatic speech recognition in the context of meetings. Using an automatic syllable detection algorithm, syllable-based prosodic features are extracted to form the prosodic representation for each word. Two modeling approaches are then investigated. One is based on a factored language model, which directly uses the prosodic representation and treats it as a ‘word’. Instead of direct association, the second approach provides a richer probabilistic structure within a hierarchical Bayesian framework by introducing an intermediate latent variable to represent similar prosodic patterns shared by groups of words. Fourfold cross-validation experiments on the ICSI Meeting Corpus show that exploiting prosody for language modeling can significantly reduce perplexity and also yield marginal reductions in word error rate.
Short-Term Spatio–Temporal Clustering Applied to Multiple Moving Speakers
Cited by 3 (0 self)
Abstract—Distant microphones make it possible to process spontaneous multiparty speech with very few constraints on speakers, as opposed to close-talking microphones. Minimizing the constraints on speakers permits a large diversity of applications, including meeting summarization and browsing, surveillance, hearing aids, and more natural human–machine interaction. Such applications of distant microphones require determining where and when the speakers are talking. This is inherently a multisource problem, because of background noise sources, as well as the natural tendency of multiple speakers to talk over each other. Moreover, spontaneous speech utterances are highly discontinuous, which makes it difficult to track the multiple speakers with classical filtering approaches, such as Kalman filtering or particle filters. As an alternative, this paper proposes a probabilistic framework to determine the trajectories of multiple moving speakers in the short-term only, i.e., only while they speak. Instantaneous location estimates that are close in space and time are grouped into “short-term clusters” in a principled manner. Each short-term cluster determines the precise start and end times of an utterance and a short-term spatial trajectory. Contrastive experiments clearly show the benefit of using short-term clustering, on real indoor recordings with seated speakers in meetings, as well as multiple moving speakers. Index Terms—Localization, multiple acoustic sources, short-term clustering, speech segmentation, tracking.
Automatic Sentence Structure Annotation for Spoken Language Processing
, 2008
Cited by 1 (0 self)
Increasing amounts of easily available electronic data are precipitating a need for automatic processing
that can aid humans in digesting large amounts of data. Speech and video are becoming
an increasingly significant portion of on-line information, from news and television broadcasts, to
oral histories, on-line lectures, or user generated content. Automatic processing of audio and video
sources requires automatic speech recognition (ASR) in order to provide transcripts. Typical ASR
generates only words, without punctuation, capitalization, or further structure. Many techniques
available from natural language processing therefore suffer when applied to speech recognition output,
because they assume the presence of reliable punctuation and structure. In addition, errors from
automatic transcription also degrade the performance of downstream processing such as machine
translation, name detection, or information retrieval. We develop approaches for automatically
annotating structure in speech, including sentence and sub-sentence segmentation, and then turn
towards optimizing ASR and annotation for downstream applications.
Learning to Identify Fragmented Words in Spoken Discourse
, 2003
Disfluent speech adds to the difficulty of processing spoken language utterances.

