Toward an Affect-Sensitive Multimodal Human-Computer Interaction
Proceedings of the IEEE, 2003
Cited by 98 (24 self)
The ability to recognize affective states of a person... This paper argues that next-generation human-computer interaction (HCI) designs need to include the essence of emotional intelligence -- the ability to recognize a user's affective states -- in order to become more human-like, more effective, and more efficient. Affective arousal modulates all nonverbal communicative cues (facial expressions, body movements, and vocal and physiological reactions). In a face-to-face interaction, humans detect and interpret those interactive signals of their communicator with little or no effort. Yet design and development of an automated system that accomplishes these tasks is rather difficult. This paper surveys the past work in solving these problems by a computer and provides a set of recommendations for developing the first part of an intelligent multimodal HCI -- an automatic personalized analyzer of a user's nonverbal affective feedback.
Analysis and Synthesis of Intonation using the Tilt Model
Journal of the Acoustical Society of America
Cited by 68 (3 self)
This paper introduces the tilt intonational model and describes how this model can be used to automatically analyse and synthesize intonation. In the model, intonation is represented as a linear sequence of events, which can be pitch accents or boundary tones. Each event is characterised by continuous parameters representing amplitude, duration and tilt (a measure of the shape of the event). The paper describes an event detector, in effect an intonational recognition system, which produces a transcription of an utterance's intonation. The features and parameters of the event detector are discussed and performance figures are shown on a variety of read and spontaneous speaker-independent conversational speech databases. Given the event locations, algorithms are described which produce an automatic analysis of each event in terms of the Tilt parameters. Synthesis algorithms are also presented which generate F0 contours from Tilt representations. The accuracy of these is shown by comparing...
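The abstract above names three per-event parameters (amplitude, duration and tilt) but does not give their formulas. The sketch below uses the definition commonly given in the published Tilt literature, where tilt is the average of a normalised amplitude difference and a normalised duration difference between the rise and fall parts of an event. The function name, argument names and the dictionary layout are illustrative assumptions, not taken from the listing.

```python
def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
    """Compute amplitude, duration and tilt for one intonational event.

    a_rise, a_fall: F0 excursion of the rise and fall parts (Hz).
    d_rise, d_fall: duration of the rise and fall parts (seconds).
    Formulas follow the commonly cited Tilt definition (an assumption
    here, since the abstract does not state them).
    """
    amp_sum = abs(a_rise) + abs(a_fall)
    dur_sum = d_rise + d_fall
    if amp_sum == 0 or dur_sum == 0:
        raise ValueError("event must have non-zero amplitude and duration")

    # Each component lies in [-1, 1]: +1 is a pure rise, -1 a pure fall.
    tilt_amp = (abs(a_rise) - abs(a_fall)) / amp_sum
    tilt_dur = (d_rise - d_fall) / dur_sum

    return {
        "amplitude": amp_sum,
        "duration": dur_sum,
        "tilt": 0.5 * (tilt_amp + tilt_dur),
    }


# A symmetric rise-fall accent: 10 Hz up over 0.1 s, 10 Hz down over 0.1 s.
event = tilt_parameters(10.0, -10.0, 0.1, 0.1)
print(event["tilt"])
```

Under this definition a pure rise has tilt 1, a pure fall has tilt -1, and a symmetric rise-fall has tilt 0, which is why a single number can summarise the shape of the event.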
Auto-Summarization of Audio-Video Presentations
1999
Cited by 60 (4 self)
As streaming audio-video technology becomes widespread, there is a dramatic increase in the amount of multimedia content available on the net. Users face a new challenge: How to examine large amounts of multimedia content quickly. One technique that can enable quick overview of multimedia is video summaries; that is, a shorter version assembled by picking important segments from the original. We evaluate three techniques for automatic creation of summaries for online audio-video presentations. These techniques exploit information in the audio signal (e.g., pitch and pause information), knowledge of slide transition points in the presentation, and information about access patterns of previous users. We report a user study that compares automatically generated summaries that are 20%- 25% the length of full presentations to author generated summaries. Users learn from the computer-generated summaries, although less than from authors' summaries. They initially find computer-generated summ...
A Multi-Pitch Tracking Algorithm for Noisy Speech
IEEE Transactions on Speech and Audio Processing, 2002
Cited by 49 (11 self)
We present a robust algorithm for multi-pitch tracking of noisy speech. Our approach integrates an improved channel and peak selection method, a new integration method for extracting periodicity information across different frequency channels, and a hidden Markov model (HMM) for forming continuous pitch tracks, and as a result, our algorithm can reliably track single and double pitch tracks in a noisy environment. The proposed algorithm is evaluated on a database of speech utterances mixed with various interferences and the results show that our algorithm outperforms existing algorithms significantly.
The Rise/Fall/Connection Model of Intonation
1994
Cited by 30 (6 self)
This paper describes a new model of intonation for English. The paper proposes that intonation can be described using a sequence of rise, fall and connection elements. Pitch accents and boundary rises are described using rise and fall elements, and connection elements are used to describe everything else. Equations can be used to synthesize fundamental frequency (F0) contours from these elements. An automatic labelling system is described which can derive a rise/fall/connection description from any utterance without using prior knowledge or top-down processing. Synthesis and analysis experiments are described using utterances from six speakers of various English accents. An analysis/resynthesis experiment is described which shows that the contours produced by the model are similar to within 3.6 to 7.3 Hz of the originals. An assessment of the automatic labeller shows 72% to 92% agreement between automatic and hand labels. The paper concludes with a comparison between this model and o...
Melody description and extraction in the context of music content processing
Journal of New Music Research, 2003
Cited by 26 (5 self)
A huge amount of audio data is accessible to everyone through on-line or off-line information services, and it is necessary to develop techniques to automatically describe and deal with this data in a meaningful way. In the particular context of music content processing it is important to take into account the melodic aspects of the sound. The goal of this article is to review the different techniques proposed for melodic description and extraction. Some ideas around the concept of melody are first presented. Then, an overview of the different ways of describing melody is given. As a third step, the methods proposed for melody extraction are analysed, including pitch detection algorithms. Finally, techniques for melodic pattern induction and matching are also studied, and some useful melodic transformations are reviewed.
Enhanced Pitch Tracking And The Processing Of F0 Contours For Computer Aided Intonation Teaching
In Proceedings of the 3rd European Conference on Speech Communication and Technology, 1993
Cited by 25 (1 self)
A comparative evaluation of several pitch determination algorithms (PDAs) is presented. Fundamental frequency estimates, F0, are compared with laryngeal frequency estimates, Lx. An algorithm is presented which enables Lx contours to be generated from laryngograph data. We seek the most accurate method of F0 extraction in order to minimise errors propagating into subsequent prosodic analysis. The super resolution pitch determinator [3] performs well relative to the other PDAs studied. Modifications made to this algorithm are described, which radically reduce the number of gross F0 errors and improve the classification of voiced and unvoiced sections of speech. The raw F0 contours produced by this enhanced algorithm are processed to form schematised contours used in computer aided intonation teaching. The series of processes used in the schematisation is described.
Keywords: Pitch tracking, Intonation, Language teaching
1 INTRODUCTION
The fundamental frequency of speech plays an imp...
A Phonetic Model of English Intonation
1992
Cited by 14 (6 self)
This thesis proposes a phonetic model of English intonation which is a system for linking the phonological and F0 descriptions of an utterance. It is argued that such a model should take the form of a rigorously defined formal system which does not require any human intuition or expertise to operate. It is also argued that this model should be capable of both analysis (F0 to phonology) and synthesis (phonology to F0). Existing phonetic models are reviewed and it is shown that none meet the specification for the type of formal model required. A new phonetic model is presented that has three levels of description: the F0 level, the intermediate level and the phonological level. The intermediate level uses the three basic elements of rise, fall and connection to model F0 contours. A mathematical equation is specified for each of these elements so that a continuous F0 contour can be created from a sequence of elements. The phonological system uses H and L to describe high and low pi...
Automatic Prosodic Analysis for Computer Aided Pronunciation Teaching
1994
Cited by 14 (0 self)
Correct pronunciation of spoken language requires the appropriate modulation of acoustic characteristics of speech to convey linguistic information at a suprasegmental level. Such prosodic modulation is a key aspect of spoken language and is an important component of foreign language learning, for purposes of both comprehension and intelligibility. Computer aided pronunciation teaching involves automatic analysis of the speech of a non-native talker in order to provide a diagnosis of the learner's performance in comparison with the speech of a native talker. This thesis describes research undertaken to automatically analyse the prosodic aspects of speech for computer aided pronunciation teaching. It is necessary to describe the suprasegmental composition of a learner's speech in order to characterise significant deviations from a native-like prosody, and to offer some kind of corrective diagnosis. Phonological theories of prosody aim to describe the suprasegmental composition of speech...

