An overview of text-independent speaker recognition: from features to supervectors
, 2009
Cited by 31 (14 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
Melody description and extraction in the context of music content processing
- Journal of New Music Research
, 2003
Cited by 26 (5 self)
A huge amount of audio data is accessible to everyone through on-line or off-line information services, and it is necessary to develop techniques to automatically describe and deal with this data in a meaningful way. In the particular context of music content processing it is important to take into account the melodic aspects of the sound. The goal of this article is to review the different techniques proposed for melodic description and extraction. Some ideas around the concept of melody are first presented. Then, an overview of the different ways of describing melody is given. As a third step, the methods proposed for melody extraction are analysed, including pitch detection algorithms. Finally, techniques for melodic pattern induction and matching are studied, and some useful melodic transformations are reviewed.
Enhanced Pitch Tracking And The Processing Of F0 Contours For Computer Aided Intonation Teaching
- in Proceedings of the 3rd European Conference on Speech Communication and Technology
, 1993
Cited by 25 (1 self)
A comparative evaluation of several pitch determination algorithms (PDAs) is presented. Fundamental frequency estimates, F0, are compared with laryngeal frequency estimates, Lx. An algorithm is presented which enables Lx contours to be generated from laryngograph data. We seek the most accurate method of F0 extraction in order to minimise errors propagating into subsequent prosodic analysis. The super resolution pitch determinator [3] performs well relative to the other PDAs studied. Modifications made to this algorithm are described, which radically reduce the number of gross F0 errors and improve the classification of voiced and unvoiced sections of speech. The raw F0 contours produced by this enhanced algorithm are processed to form schematised contours used in computer aided intonation teaching. The series of processes used in the schematisation is described.
Keywords: Pitch tracking, Intonation, Language teaching
Automatic Prosodic Analysis for Computer Aided Pronunciation Teaching
, 1994
Cited by 14 (0 self)
Correct pronunciation of spoken language requires the appropriate modulation of acoustic characteristics of speech to convey linguistic information at a suprasegmental level. Such prosodic modulation is a key aspect of spoken language and is an important component of foreign language learning, for purposes of both comprehension and intelligibility. Computer aided pronunciation teaching involves automatic analysis of the speech of a non-native talker in order to provide a diagnosis of the learner's performance in comparison with the speech of a native talker. This thesis describes research undertaken to automatically analyse the prosodic aspects of speech for computer aided pronunciation teaching. It is necessary to describe the suprasegmental composition of a learner's speech in order to characterise significant deviations from a native-like prosody, and to offer some kind of corrective diagnosis. Phonological theories of prosody aim to describe the suprasegmental composition of speech...
Automatic pitch contour stylization using a model of tonal perception
- Comput. Speech Language
, 1995
"... of tonal perception. ..."
Modeling Auxiliary Information in Bayesian Network Based ASR
- In 7th European Conference on Speech Communication and Technology
, 2001
Cited by 9 (3 self)
Automatic speech recognition bases its models on the acoustic features derived from the speech signal. Some have investigated replacing or supplementing these features with information that cannot be precisely measured automatically (articulator positions, pitch, gender, etc.). Consequently, automatic estimations of the desired information would be generated. This data can degrade performance due to its imprecision. In this paper, we describe a system that treats pitch as auxiliary information within the framework of Bayesian networks, resulting in improved performance.
Hierarchical Filtering Method for Content-based Music Retrieval via Acoustic Input
- Proc. ACM Multimedia
, 2001
Cited by 8 (2 self)
This paper presents an implementation of a content-based music retrieval system that can take a user’s acoustic input (S-second clip of singing or humming) via a microphone and then retrieve the intended song from a database containing over 3000 candidate songs. The system, known as Super MBox, demonstrates the feasibility of real-time music retrieval with a high success rate. Super MBox first takes the user’s acoustic input from a microphone and converts it into a pitch vector. Then a hierarchical filtering method (HFM) is used to first filter out the 80% of candidates that are unlikely matches and then compare the query input with the remaining 20% of candidates in a detailed manner. The output of Super MBox is a song list ranked according to the computed similarity scores. A brief mathematical analysis of the two-step HFM is given in the paper to explain how to derive the optimum parameters of the comparison engine. The proposed HFM and its analysis framework can be directly applied to other multimedia information retrieval systems. We have tested Super MBox extensively and found the top-20 success rate is over 85%, based on a dataset of about 2000 singing/humming clips from people with mediocre singing skills. Our studies demonstrate the feasibility of using Super MBox as a prototype for music search engines over the Internet and/or query engines in digital music libraries.
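The two-step filtering idea in the abstract above can be illustrated with a minimal Python sketch. It is not the Super MBox implementation: the abstract does not say which comparison measures are used, so the cheap first-pass measure (`coarse_distance`, a subsampled mean absolute difference), the detailed measure (`dtw_distance`, plain dynamic time warping), and all function and parameter names are assumptions chosen for illustration only. Only the 80%/20% split is taken from the abstract.

```python
def coarse_distance(query, candidate, step=4):
    """Cheap first-pass score (assumed): mean absolute difference
    between subsampled pitch vectors."""
    q = query[::step]
    c = candidate[::step]
    n = min(len(q), len(c))
    return sum(abs(a - b) for a, b in zip(q[:n], c[:n])) / n


def dtw_distance(a, b):
    """Detailed second-pass score (assumed): dynamic time warping
    with absolute pitch difference as the local cost."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cur[j] = abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return prev[len(b)]


def hierarchical_search(query, database, keep_fraction=0.2):
    """Rank songs in two steps: discard most candidates with the cheap
    score, then order the survivors with the detailed score.

    `database` maps a song name to its pitch vector (hypothetical layout).
    """
    by_coarse = sorted(database, key=lambda name: coarse_distance(query, database[name]))
    keep = max(1, int(len(by_coarse) * keep_fraction))
    survivors = by_coarse[:keep]
    return sorted(survivors, key=lambda name: dtw_distance(query, database[name]))


# Toy data: pitch vectors as MIDI-like note numbers (made up for the example).
songs = {
    "a": [60, 60, 62, 62, 64, 64, 65, 65],
    "b": [72, 72, 71, 71, 69, 69, 67, 67],
    "c": [55, 57, 59, 60, 62, 64, 66, 67],
    "d": [48, 48, 48, 50, 50, 50, 52, 52],
    "e": [65, 64, 62, 60, 59, 57, 55, 53],
}
ranking = hierarchical_search([60, 60, 62, 62, 64, 64, 65, 65], songs)
```

The point of the design is cost: the expensive comparison runs on only a fraction of the database, at the risk that the cheap pass discards the correct song. The paper's analysis of optimum parameters addresses that trade-off; this sketch does not.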
Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch
- Advances in Neural Information Processing Systems 15
, 2002
Cited by 6 (2 self)
We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours.
A Probabilistic Approach to AMDF Pitch Detection
Cited by 5 (0 self)
We present a probabilistic error correction technique to be used with an average magnitude difference function (AMDF) based pitch detector. This error correction routine provides a very simple method to correct errors in pitch period estimation. Used in conjunction with the computationally efficient AMDF, the result is a fast and accurate pitch detector. In performance tests on the CSTR (Center for Speech Technology Research) database, probabilistic error correction reduced the gross error rate from 6.07% to 3.29%.
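For readers unfamiliar with the AMDF, the following Python sketch shows a basic AMDF pitch estimator. It covers only the standard AMDF step; the probabilistic error correction described in the abstract is not specified there and is not reproduced here. The function names, the search range (`f_min`, `f_max`), and the rule of taking the smallest lag at the global minimum are assumptions made for illustration.

```python
def amdf(frame, max_lag):
    """Average magnitude difference function for lags 0..max_lag.

    For each lag, average |x[i] - x[i + lag]| over the overlapping samples.
    A periodic signal gives a deep valley at its period.
    """
    n = len(frame)
    if max_lag >= n:
        raise ValueError("frame is too short for the requested maximum lag")
    values = []
    for lag in range(max_lag + 1):
        count = n - lag
        total = sum(abs(frame[i] - frame[i + lag]) for i in range(count))
        values.append(total / count)
    return values


def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return a pitch estimate in Hz from the deepest AMDF valley.

    The search range is an assumed default. Ties are resolved toward the
    smallest lag; no error correction is applied.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    d = amdf(frame, max_lag)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: d[lag])
    return sample_rate / best_lag


# Synthetic test frame: an integer sawtooth with a 40-sample period,
# i.e. 200 Hz at an assumed 8000 Hz sample rate.
frame = [(i % 40) - 20 for i in range(400)]
pitch = estimate_pitch(frame, 8000)
```

On real speech the valleys at multiples of the true period can be almost as deep as the one at the period itself, so a bare minimum search like this is prone to gross errors such as octave jumps. That is the kind of error a correction stage is meant to reduce.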
Automatic Detection of Prosodic Stress in American English Discourse
, 2000
Cited by 4 (1 self)
Due to the incompletely understood nature of prosodic stress, the implementation of an automatic transcriber is very difficult on the basis of the currently available knowledge. For this reason, a number of data-driven approaches are applied to a manually annotated set of files from the OGI English Stories Corpus. The goal of this analysis is twofold. First, it aims to implement an automatic detector of prosodic stress with sufficiently reliable performance. Second, the effectiveness of the acoustic features most commonly proposed in the literature is assessed. That is, the role played by duration, amplitude and fundamental frequency of syllabic nuclei is investigated. Several data-driven algorithms, such as Artificial Neural Networks (ANN), statistical decision trees and fuzzy classification techniques, and a knowledge-based heuristic algorithm are implemented for the automatic transcription of prosodic stress. As reference, two different subsets from the OGI English Stories Corpus were hand-labeled in terms of prosodic stress by two individuals trained in linguistics. The agreement between the two transcribers on a set of common files is only slightly higher than that obtained by the automatic systems. While the ANN-based approach achieves the highest performance (77% primarily stressed vocalic nuclei vs. 79% unstressed vocalic nuclei on average for the two transcribers' data sets), the other methods show that both transcribers grant a major role to duration and (to a slightly lesser degree) to amplitude. Pitch-relevant features of the syllabic nuclei appear to play a much less important role than amplitude and duration.

