Automatic Prosodic Prominence Detection in Speech Using Acoustic Features: an Unsupervised System
- In Proceedings of Eurospeech 2003, 2003
Abstract - Cited by 5 (0 self)
This paper presents work in progress on the automatic detection of prosodic prominence in continuous speech. Prosodic prominence involves two different phonetic features: pitch accents, connected with fundamental frequency (F0) movements and overall syllable energy, and stress, which exhibits a strong correlation with syllable nuclei duration and mid-to-high-frequency emphasis. By measuring these acoustic parameters it is possible to build an automatic system that identifies prominent syllables with an agreement with human-tagged data comparable to the inter-human agreement reported in the literature. This system does not require any training phase, additional information or annotation; it is not tailored to a specific set of data and can be easily adapted to different languages.
Prosodic Prominence Detection in Speech
, 2003
Abstract - Cited by 4 (0 self)
This paper presents work in progress on the automatic detection of prosodic prominence in continuous speech. Prosodic prominence involves two different phonetic features: pitch accents, connected with fundamental frequency (F0) movements and overall syllable energy, and stress, which exhibits a strong correlation with syllable nuclei duration and high-frequency emphasis. By measuring these acoustic parameters it is possible to build an automatic system that identifies prominent syllables with an agreement with human-tagged data comparable to the inter-human agreement reported in the literature. These results were achieved without using any information apart from acoustic parameters.
Using Prosodic Features in Language Models for Meetings
Abstract - Cited by 4 (0 self)
Prosody has been actively studied as an important knowledge source for speech recognition and understanding. In this paper, we are concerned with the question of exploiting prosody for language models to aid automatic speech recognition in the context of meetings. Using an automatic syllable detection algorithm, syllable-based prosodic features are extracted to form the prosodic representation for each word. Two modeling approaches are then investigated. The first is based on a factored language model, which directly uses the prosodic representation and treats it as a ‘word’. Instead of direct association, the second approach provides a richer probabilistic structure within a hierarchical Bayesian framework by introducing an intermediate latent variable to represent similar prosodic patterns shared by groups of words. Fourfold cross-validation experiments on the ICSI Meeting Corpus show that exploiting prosody for language modeling can significantly reduce perplexity and also yields marginal reductions in word error rate.
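The perplexity figure that the abstract reports as reduced can be illustrated with a minimal sketch. It assumes the standard definition of perplexity as the exponential of the negative mean log-probability per word. The probability values below are hypothetical and chosen only for illustration; they are not results from the paper, and the function name is not taken from it.

```python
import math


def perplexity(word_probs):
    """Perplexity of a word sequence from per-word model probabilities.

    Uses the standard definition exp(-(1/N) * sum(log p_i)).
    """
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical per-word probabilities, for illustration only.
baseline = perplexity([0.1, 0.1, 0.1, 0.1])
with_prosody = perplexity([0.2, 0.1, 0.2, 0.1])

print(f"baseline perplexity:     {baseline:.2f}")
print(f"with prosodic features:  {with_prosody:.2f}")
```

A model that assigns higher probability to the words actually spoken gets a lower perplexity, which is the sense in which added prosodic information can help a language model.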
Speech Recognition Using Acoustic Landmarks and Binary Phonetic Feature Classifiers
, 2003
Abstract - Cited by 3 (0 self)
In spite of decades of research, Automatic Speech Recognition (ASR) is far from reaching the goal of performance close to Human Speech Recognition (HSR). One of the reasons for the unsatisfactory performance of state-of-the-art ASR systems, which are based largely on Hidden Markov Models (HMMs), is the inferior acoustic modeling of low-level or phonetic-level linguistic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal. However, no acoustic-phonetic system exists that carries out large speech recognition tasks such as connected word or continuous speech recognition. We propose a probabilistic and statistical framework for connected word ASR based on the knowledge of acoustic phonetics. The proposed system is based on the idea of representing speech sounds by bundles of binary-valued articulatory phonetic features. The probabilistic framework requires only binary classifiers of phonetic features and the knowledge-based acoustic correlates of the features for the purpose of connected word speech recognition. We explore the use of Support Vector Machines (SVMs) for binary phonetic feature classification because SVMs offer favorable properties well suited to our recognition task. In the proposed method, probabilistic segmentation of speech is obtained using SVM-based classifiers of manner phonetic features. The linguistically motivated landmarks obtained in each segmentation are used for classification of source and place phonetic features. Probabilistic segmentation paths are constrained using Finite State Automata (FSA) for isolated or connected word recognition. The proposed method could overcome the disadvantages ...
Automatic Annotation of Speech Corpora for Prosodic Prominence
- In Proc. LREC-CPSLC workshop, 2004
Abstract - Cited by 1 (0 self)
This paper presents a study on the automatic detection of prosodic prominence in continuous speech, with particular reference to American English, but with good prospects of application to other languages. Perceptual prosodic prominence is supported by two different prosodic features: pitch accent and stress. Pitch accent is acoustically connected with fundamental frequency (F0) movements and overall syllable energy, whereas stress exhibits a strong correlation with syllable nuclei duration and mid-to-high-frequency emphasis. This paper shows that a careful measurement of these acoustic parameters, together with the identification of their connection to prosodic phenomena, makes it possible to build automatic systems that identify prominent syllables in utterances with performance comparable to the inter-human agreement reported in the literature, without using any information apart from the acoustic parameters derived directly from speech waveforms.
SPEECH RECOGNITION BASED ON PHONETIC FEATURES AND ACOUSTIC LANDMARKS
Abstract
A probabilistic and statistical framework is presented for automatic speech recognition based on a phonetic feature representation of speech sounds. In this acoustic-phonetic approach, the speech recognition problem is posed as a maximization of the joint posterior probability of a set of phonetic features and the corresponding acoustic landmarks. Binary classifiers of the manner phonetic features (syllabic, sonorant and continuant) are applied for the probabilistic detection of speech landmarks. The landmarks include stop bursts, vowel onsets, syllabic peaks, syllabic dips, fricative onsets and offsets, and sonorant consonant onsets and offsets. The classifiers use automatically extracted knowledge-based acoustic parameters (APs) that are acoustic correlates of those phonetic features. For isolated word recognition with a known and limited vocabulary, the landmark sequences are constrained using a manner class pronunciation graph. Probabilistic decisions on place and voicing phonetic features are then made using a separate set of APs extracted using the landmarks. The framework exploits two properties of the knowledge-based acoustic cues of phonetic features: (1) sufficiency of the acoustic cues of a phonetic feature for a decision on that feature and (2) invariance of the acoustic cues with respect to context. The probabilistic framework makes the acoustic-phonetic approach to speech recognition suitable for practical recognition tasks as well as compatible with probabilistic pronunciation and language models. Support vector machines (SVMs) are applied for the binary classification tasks because of their two favorable properties: good generalization and the ability to learn from a relatively small amount of high-dimensional data. Performance comparable to Hidden Markov Model (HMM) based systems is obtained on landmark detection as well as isolated word recognition.
Applications to rescoring of lattices from a large vocabulary continuous speech recognizer are also presented.
Robust Acoustic-Based Syllable Detection
Abstract
In this paper, we describe a method to detect syllabic nuclei in continuous speech. It employs two basic and robust acoustic features, periodicity and energy, to detect syllable landmarks. This method is evaluated on the TIMIT, additive-noise TIMIT and NTIMIT datasets, with typical total error rates of around 30% on all the datasets except in extremely adverse 0 dB signal-to-noise ratio environments, whereas HMM-based systems degrade severely. Based on the landmarks, a vowel classifier is further constructed and achieves the same performance as HMM-based systems. Index Terms: syllable detection, robustness, vowel classification.
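The general idea of locating syllabic nuclei from an energy contour can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the paper: it uses short-time energy only (no periodicity feature), and the frame length, hop size, relative threshold and minimum peak spacing are hypothetical values chosen for the synthetic signal below.

```python
import math


def short_time_energy(samples, frame_len, hop):
    """Mean squared amplitude of each frame of the signal."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def pick_peaks(contour, rel_threshold=0.3, min_gap=3):
    """Indices of local maxima above rel_threshold * max(contour).

    Peaks closer than min_gap frames are merged, keeping the larger one.
    Both parameters are illustrative assumptions.
    """
    if not contour:
        return []
    floor = rel_threshold * max(contour)
    peaks = []
    for i in range(1, len(contour) - 1):
        is_local_max = contour[i - 1] < contour[i] >= contour[i + 1]
        if not is_local_max or contour[i] < floor:
            continue
        if peaks and i - peaks[-1] < min_gap:
            if contour[i] > contour[peaks[-1]]:
                peaks[-1] = i  # replace the weaker neighbouring peak
            continue
        peaks.append(i)
    return peaks


def synthetic_syllable(n, sample_rate=8000, freq=200.0):
    """A tone burst with a smooth rise and fall, standing in for a vowel."""
    return [
        math.sin(math.pi * k / n) * math.sin(2 * math.pi * freq * k / sample_rate)
        for k in range(n)
    ]


# Two synthetic "syllables" separated by silence (hypothetical test signal).
silence = [0.0] * 400
signal = silence + synthetic_syllable(800) + silence + synthetic_syllable(800) + silence

energy = short_time_energy(signal, frame_len=200, hop=100)
nuclei = pick_peaks(energy)
print("candidate nuclei at frames:", nuclei)
```

The threshold is relative to the contour maximum so that the sketch does not depend on absolute signal level; a real system would need a more robust noise floor estimate than this.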
A3: HCI Coding Guideline for Research Using Video Annotation to Assess Behavior of Nonverbal Subjects with Computer-Based Intervention
Abstract
HCI studies assessing nonverbal individuals (especially those who do not communicate through traditional linguistic means: spoken, written, or sign) are a daunting undertaking. Without the use of directed tasks, interviews, questionnaires, or question-answer sessions, researchers must rely fully upon observation of behavior, and the categorization and quantification of the participant’s actions. This problem is compounded further by the lack of metrics to quantify the behavior of nonverbal subjects in computer-based intervention contexts. We present a set of dependent variables called A3 (pronounced A-Cubed), or Annotation for ASD Analysis, to assess the behavior of this demographic of users, specifically focusing on engagement and vocalization. This paper demonstrates how theory from multiple disciplines can be brought together to create a set of dependent variables, and demonstrates these variables in an experimental context. Through an examination of the existing literature, and a detailed analysis of the current state of computer vision and speech detection, we present how computer automation may be integrated with the A3 guidelines to reduce coding time and potentially increase accuracy. We conclude by presenting how and where these variables can be used in multiple research areas and with varied target populations.
SYLLABIFICATION OF CONVERSATIONAL SPEECH USING BIDIRECTIONAL LONG-SHORT-TERM MEMORY NEURAL NETWORKS
Abstract
Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach to segmentation at the syllable level, using a Bidirectional Long-Short-Term Memory Neural Network. It estimates syllable nucleus positions by regression of perceptually motivated input features to a smooth target function. Peak selection is performed to obtain valid nuclei positions. Performance of the model is evaluated on the levels of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results for two common databases, Switchboard and TIMIT, for both read and spontaneous speech, and a favourable comparison with other published results.

