Results 1 - 10 of 85
Universal Onset Detection with Bidirectional Long Short-Term Memory Neural Networks
- 11th International Society for Music Information Retrieval Conference (ISMIR 2010)
, 2010
"... Many different onset detection methods have been proposed in recent years. However those that perform well tend to be highly specialised for certain types of music, while those that are more widely applicable give only moderate performance. In this paper we present a new onset detector with superior ..."
Abstract
-
Cited by 33 (17 self)
- Add to MetaCart
(Show Context)
Many different onset detection methods have been proposed in recent years. However, those that perform well tend to be highly specialised for certain types of music, while those that are more widely applicable give only moderate performance. In this paper we present a new onset detector with superior performance and temporal precision for all kinds of music, including complex music mixes. It is based on auditory spectral features and relative spectral differences processed by a bidirectional Long Short-Term Memory recurrent neural network, which acts as a reduction function. The network is trained with a large database of onset data covering various genres and onset types. Due to its data-driven nature, our approach does not require the onset detection method and its parameters to be tuned to a particular type of music. We compare results on the Bello onset data set and conclude that our approach is on par with related results on the same set and outperforms them in most cases in terms of F1-measure. For complex music with mixed onset types, an absolute improvement of 3.6% is reported.
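To make the described pipeline concrete, here is a minimal sketch (not the authors' implementation): log-mel spectral features and their half-wave-rectified first-order differences are fed frame by frame into a bidirectional LSTM whose per-frame output serves as the onset activation (reduction) function. Feature sizes, network sizes, and the library choices (librosa, PyTorch) are assumptions made only for illustration.

```python
# Illustrative sketch of onset detection with a bidirectional LSTM; all
# dimensions and hyper-parameters are assumptions, not the paper's settings.
import numpy as np
import librosa
import torch
import torch.nn as nn

def onset_features(y, sr, hop=441):
    """Log-mel spectrogram plus its positive first-order difference per frame."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop, n_mels=40)
    logmel = np.log(mel + 1e-6).T                           # (frames, 40)
    diff = np.maximum(np.diff(logmel, axis=0, prepend=logmel[:1]), 0.0)
    return np.concatenate([logmel, diff], axis=1)           # (frames, 80)

class BLSTMOnsetDetector(nn.Module):
    def __init__(self, n_in=80, n_hidden=50):
        super().__init__()
        self.blstm = nn.LSTM(n_in, n_hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, 1)               # one activation per frame

    def forward(self, x):                                   # x: (batch, frames, n_in)
        h, _ = self.blstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)       # (batch, frames)

# Usage: train with frame-wise binary onset targets (BCE loss), then peak-pick
# the activation curve to obtain onset times. Synthetic audio keeps this runnable.
sr = 44100
y = np.random.randn(2 * sr).astype(np.float32)
feats = torch.tensor(onset_features(y, sr), dtype=torch.float32).unsqueeze(0)
activation = BLSTMOnsetDetector()(feats)                    # untrained: random output
```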
The SEMAINE API: Towards a standards-based framework for building emotion-oriented systems
"... This paper presents the SEMAINE API, an open source framework for building emotion-oriented systems. By encouraging and simplifying the use of standard representation formats, the framework aims to contribute to interoperability and reuse of system components in the research community. By providing ..."
Abstract
-
Cited by 31 (8 self)
- Add to MetaCart
This paper presents the SEMAINE API, an open source framework for building emotion-oriented systems. By encouraging and simplifying the use of standard representation formats, the framework aims to contribute to interoperability and reuse of system components in the research community. By providing a Java and C++ wrapper around a message-oriented middleware, the API makes it easy to integrate components running on different operating systems and written in different programming languages. The SEMAINE system 1.0 is presented as an example of a full-scale system built on top of the SEMAINE API. Three small example systems are described in detail to illustrate how integration between existing and new components is realised with minimal effort.
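As a rough illustration of the integration pattern the abstract describes (components exchanging messages in standard representation formats over a message-oriented middleware), the toy sketch below stands in for the middleware with an in-process publish/subscribe bus. It is not the SEMAINE API; the topic name and the EmotionML-style payload are assumptions.

```python
# Toy publish/subscribe bus illustrating middleware-based component integration.
# This is NOT the SEMAINE API (a Java/C++ wrapper around a real middleware).
from collections import defaultdict

class MessageBus:
    """In-process topic-based pub/sub standing in for the middleware."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)

def emotion_analyser(message):
    # A downstream component reacting to a standard-format message.
    print("analyser received:", message)

bus = MessageBus()
bus.subscribe("analysis.emotion", emotion_analyser)         # hypothetical topic name

# An upstream component publishes its result in a shared, EmotionML-like format
# (payload shown only as an example of a standard representation).
bus.publish("analysis.emotion",
            '<emotion><dimension name="arousal" value="0.7"/></emotion>')
```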
Acoustic Emotion Recognition: A Benchmark Comparison of Performances
- In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE
, 2009
"... Abstract—In the light of the first challenge on emotion recognition from speech we provide the largest-to-date benchmark comparison under equal conditions on nine standard corpora in the field using the two pre-dominant paradigms: modeling on a frame-level by means of Hidden Markov Models and supras ..."
Abstract
-
Cited by 24 (9 self)
- Add to MetaCart
(Show Context)
In the light of the first challenge on emotion recognition from speech, we provide the largest-to-date benchmark comparison under equal conditions on nine standard corpora in the field using the two predominant paradigms: modeling on a frame level by means of Hidden Markov Models and suprasegmental modeling by systematic feature brute-forcing. Investigated corpora are the ABC, AVIC, DES, EMO-DB, eNTERFACE, SAL, SmartKom, SUSAS, and VAM databases. To provide better comparability among sets, we additionally cluster each database's emotions into binary valence and arousal discrimination tasks. As a result, large differences are found among corpora that mostly stem from naturalistic emotions and spontaneous speech vs. more prototypical events. Further, suprasegmental modeling proves significantly beneficial on average when several classes are addressed at a time.
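The suprasegmental paradigm contrasted here with frame-level HMMs can be sketched as follows: frame-level low-level descriptors are collapsed into one fixed-length vector per utterance by applying statistical functionals, and that vector is classified with a static classifier. The sketch below uses synthetic data and an arbitrary choice of functionals and descriptor count; it is not the paper's feature set.

```python
# Sketch of suprasegmental modeling by applying statistical functionals to
# frame-level descriptors; data, descriptors, and functionals are assumptions.
import numpy as np
from sklearn.svm import SVC

def functionals(lld):
    """lld: (frames, n_descriptors) -> fixed-length utterance-level vector."""
    stats = [np.mean, np.std, np.min, np.max,
             lambda a, axis: np.percentile(a, 90, axis=axis)]
    return np.concatenate([f(lld, axis=0) for f in stats])

# Toy data: 20 utterances of varying length, 26 descriptors per frame,
# binary valence labels as in the clustered discrimination tasks.
rng = np.random.default_rng(0)
X = np.stack([functionals(rng.normal(size=(rng.integers(80, 300), 26)))
              for _ in range(20)])
y = rng.integers(0, 2, size=20)

clf = SVC(kernel="linear").fit(X, y)
print("training accuracy:", clf.score(X, y))
```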
The INTERSPEECH 2011 Speaker State Challenge
- in Interspeech
, 2011
"... While the first open comparative challenges in the field of paralinguistics targeted more ‘conventional ’ phenomena such as emotion, age, and gender, there still exists a multiplicity of not yet covered, but highly relevant speaker states and traits. The INTERSPEECH 2011 Speaker State Challenge thus ..."
Abstract
-
Cited by 22 (4 self)
- Add to MetaCart
(Show Context)
While the first open comparative challenges in the field of paralinguistics targeted more ‘conventional’ phenomena such as emotion, age, and gender, there still exists a multiplicity of not yet covered, but highly relevant speaker states and traits. The INTERSPEECH 2011 Speaker State Challenge thus addresses two new sub-challenges to overcome the usually low compatibility of results: In the Intoxication Sub-Challenge, alcoholisation of speakers has to be determined in two classes; in the Sleepiness Sub-Challenge, another two-class classification task has to be solved. This paper introduces the conditions, the Challenge corpora “Alcohol Language Corpus” and “Sleepy Language Corpus”, and a standard feature set that may be used. Further, baseline results are given.
AVEC 2011 - the first international audio/visual emotion challenge
- In Proceedings of the International Conference on Affective Computing and Intelligent Interaction
, 2011
"... Abstract. The Audio/Visual Emotion Challenge andWorkshop (AVEC 2011) is the first competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and audiovisual emotion analysis, with all participants competing under strictly the same conditio ..."
Abstract
-
Cited by 18 (4 self)
- Add to MetaCart
(Show Context)
The Audio/Visual Emotion Challenge and Workshop (AVEC 2011) is the first competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and audiovisual emotion analysis, with all participants competing under strictly the same conditions. This paper first describes the challenge participation conditions. Next follows the data used – the SEMAINE corpus – and its partitioning into train, development, and test partitions for the challenge, with labelling in four dimensions, namely activity, expectation, power, and valence. Further, audio and video baseline features are introduced, as well as baseline results that use these features for the three sub-challenges of audio, video, and audiovisual emotion recognition.
The INTERSPEECH 2010 paralinguistic challenge
- In Proc. Interspeech
, 2010
"... Most paralinguistic analysis tasks are lacking agreed-upon evaluation procedures and comparability, in contrast to more ‘traditional ’ disciplines in speech analysis. The INTERSPEECH 2010 Paralinguistic Challenge shall help overcome the usually low compatibility of results, by addressing three selec ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
(Show Context)
Most paralinguistic analysis tasks are lacking agreed-upon evaluation procedures and comparability, in contrast to more ‘traditional’ disciplines in speech analysis. The INTERSPEECH 2010 Paralinguistic Challenge shall help overcome the usually low compatibility of results, by addressing three selected sub-challenges. In the Age Sub-Challenge, the age of speakers has to be determined in four groups. In the Gender Sub-Challenge, a three-class classification task has to be solved and finally, the Affect Sub-Challenge asks for speakers’ interest in ordinal representation. This paper introduces the conditions, the Challenge corpora “aGender” and “TUM AVIC” and standard feature sets that may be used. Further, baseline results are given.
Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling
- in Proc. of Interspeech, Makuhari
, 2010
"... In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual info ..."
Abstract
-
Cited by 16 (8 self)
- Add to MetaCart
(Show Context)
In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual information for modeling the evolution of emotion within a conversation. We focus on recognizing dimensional emotional labels, which enables us to classify both prototypical and nonprototypical emotional expressions contained in a large audiovisual database. Subject-independent experiments on various classification tasks reveal that the BLSTM network approach generally prevails over standard classification techniques such as Hidden Markov Models or Support Vector Machines, and achieves F1-measures of the order of 72%, 65%, and 55% for the discrimination of three clusters in emotional space and the distinction between three levels of valence and activation, respectively. Index Terms: emotion recognition, multimodality, long short-term memory, hidden Markov models, context modeling
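A minimal sketch of the feature-level (early) fusion described above, assuming synchronised per-frame acoustic and visual feature streams: the two streams are concatenated frame-wise and passed through a bidirectional LSTM. All dimensions and the three-class output are illustrative assumptions, not the paper's configuration.

```python
# Feature-level fusion of acoustic and visual streams with a BLSTM (illustrative).
import torch
import torch.nn as nn

class FusionBLSTM(nn.Module):
    def __init__(self, n_audio=39, n_video=20, n_hidden=64, n_classes=3):
        super().__init__()
        self.blstm = nn.LSTM(n_audio + n_video, n_hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, audio, video):
        x = torch.cat([audio, video], dim=-1)   # feature-level fusion per frame
        h, _ = self.blstm(x)                    # context over the whole sequence
        return self.out(h.mean(dim=1))          # one prediction per utterance

model = FusionBLSTM()
audio = torch.randn(4, 120, 39)   # 4 utterances, 120 frames, 39 acoustic features
video = torch.randn(4, 120, 20)   # synchronised visual features
logits = model(audio, video)      # shape (4, 3)
```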
Deep Neural Networks for Acoustic Emotion Recognition: Raising the Benchmarks
"... ..."
(Show Context)
Detecting emotional state of a child in a conversational computer game
- Computer Speech and Language
, 2011
"... The automatic recognition of user’s communicative style within a spoken dialog system framework, including the affective aspects, has received increased attention in the past few years. For dialog systems, it is important to know not only what was said but also how something was communicated, so tha ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
(Show Context)
The automatic recognition of a user's communicative style within a spoken dialog system framework, including the affective aspects, has received increased attention in the past few years. For dialog systems, it is important to know not only what was said but also how something was communicated, so that the system can engage the user in a richer and more natural interaction. This paper addresses the problem of automatically detecting “frustration”, “politeness”, and “neutral” attitudes from a child's speech communication cues, elicited in spontaneous dialog interactions with computer characters. Several information sources such as acoustic, lexical, and contextual features, as well as their combinations, are used for this purpose. The study is based on a Wizard-of-Oz dialog corpus of 103 children, 7-14 years of age, playing a voice-activated computer game. Three-way classification experiments, as well as pairwise classification between polite vs. others and frustrated vs. others, were performed. Experimental results show that lexical information has more discriminative power than acoustic and contextual cues for the detection of politeness, whereas context and acoustic features perform best for frustration detection. Furthermore, the fusion of acoustic, lexical and contextual information provided significantly better classification results. Results also showed that classification performance varies with age and gender. Specifically, for the “politeness” detection task, higher classification accuracy was achieved for females and 10-11 year-olds, compared to males and other age groups respectively. Key words: emotion recognition, spoken dialog systems, children's speech, spontaneous speech, natural emotions, child-computer interaction, feature extraction
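As a hedged illustration of combining several information sources (the abstract does not specify the fusion scheme), the sketch below performs simple score-level fusion: one classifier per modality outputs class probabilities, which are averaged before the final decision. Data, classifier choices, and the three-class label set are assumptions.

```python
# Score-level fusion of acoustic, lexical, and contextual cues (illustrative only;
# the fusion scheme is an assumption, not the paper's method).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=60)                       # frustrated / polite / neutral
views = {"acoustic": rng.normal(size=(60, 30)),       # synthetic per-modality features
         "lexical":  rng.normal(size=(60, 50)),
         "context":  rng.normal(size=(60, 5))}

# Train one classifier per modality, then average their class probabilities.
probs = [LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)
         for X in views.values()]
fused = np.mean(probs, axis=0)
prediction = fused.argmax(axis=1)
print("fused training accuracy:", (prediction == y).mean())
```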
The Hinterland of Emotions: Facing the Open-Microphone Challenge
"... We first depict the challenge to address all nonprototypical varieties of emotional states signalled in speech in an open microphone setting, i. e. using all data recorded. In the remainder of the article, we illustrate promising strategies, using the FAU Aibo Emotion Corpus, by showing different de ..."
Abstract
-
Cited by 10 (5 self)
- Add to MetaCart
(Show Context)
We first depict the challenge of addressing all nonprototypical varieties of emotional states signalled in speech in an open-microphone setting, i.e. using all data recorded. In the remainder of the article, we illustrate promising strategies, using the FAU Aibo Emotion Corpus, by showing different degrees of classification performance for different degrees of prototypicality, and by elaborating on the use of ROC curves, classification confidences, and correlation-based analyses.
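To show the mechanics of the ROC-based analysis referred to above, here is a small sketch with synthetic two-class confidences; it reproduces only the procedure (which could be repeated per prototypicality subset), not any result from the paper.

```python
# Sketch of an ROC analysis over classifier confidences; the data is synthetic.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)                  # ground-truth classes
scores = labels + rng.normal(scale=0.8, size=200)      # classifier confidences

fpr, tpr, thresholds = roc_curve(labels, scores)       # sweep decision thresholds
print("AUC:", auc(fpr, tpr))
```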