Speaker recognition: A tutorial
Cited by 121 (1 self)
A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person’s claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown, and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.
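The two operating modes named in this abstract differ mainly in the decision rule. The following is a minimal Python sketch of that difference, assuming each enrolled speaker already has a numeric match score for the test utterance. The function names, speaker names, scores, and threshold are hypothetical and do not come from the tutorial.

```python
def identify(scores):
    """Closed-set identification: return the enrolled speaker with the highest score."""
    if not scores:
        raise ValueError("no enrolled speakers")
    return max(scores, key=scores.get)


def verify(scores, claimed_id, threshold):
    """Verification: accept the claimed identity only if its score reaches the threshold."""
    return scores[claimed_id] >= threshold


# Hypothetical match scores (for example, log-likelihoods) for one test utterance.
example_scores = {"alice": -41.2, "bob": -37.5, "carol": -44.0}
best_speaker = identify(example_scores)
claim_accepted = verify(example_scores, "alice", threshold=-40.0)
```

Identification compares the utterance against every enrolled speaker, whereas verification looks only at the claimed speaker's score, so the threshold sets the trade-off between false acceptances and false rejections.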
An overview of text-independent speaker recognition: from features to supervectors
, 2009
Cited by 31 (14 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions.
Estimation of Glottal Closure Instants in Voiced Speech using the DYPSA Algorithm
- IEEE Trans. Speech Audio Processing
, 2007
Cited by 12 (8 self)
We present the Dynamic Programming Projected Phase-Slope Algorithm (DYPSA) for automatic estimation of glottal closure instants (GCIs) in voiced speech. Accurate estimation of GCIs is an important tool that can be applied to a wide range of speech processing tasks including speech analysis, synthesis and coding. DYPSA is automatic and operates using the speech signal alone without the need for an EGG signal. The algorithm employs the phase-slope function and a novel phase-slope projection technique for estimating GCI candidates from the speech signal. The most likely candidates are then selected using a dynamic programming technique to minimize a cost function that we define. We review and evaluate three existing methods of GCI estimation and compare the new DYPSA algorithm to them. Results are presented for the APLAWD and SAM databases, for which 95.7% and 93.1% of GCIs are correctly identified. Index Terms: closed-phase, glottal closure, speech processing, speech segmentation.
Spectral Features for Automatic Text-Independent Speaker Recognition
, 2003
Cited by 9 (2 self)
The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end. In other words, classification can be at most as accurate as the features.
A quantitative assessment of group delay methods for identifying glottal closures in voiced speech
- IEEE Trans. Speech Audio Process
, 2006
Cited by 5 (5 self)
Measures based on the group delay of the LPC residual have been used by a number of authors to identify the time instants of glottal closure in voiced speech. In this paper, we discuss the theoretical properties of three such measures and we also present a new measure having useful properties. We give a quantitative assessment of each measure’s ability to detect glottal closure instants, evaluated using a speech database that includes a direct measurement of glottal activity from a Laryngograph/EGG signal. We find that when using a fixed-length analysis window, the best measures can detect the instant of glottal closure in 97% of larynx cycles with a standard deviation of 0.6 ms, and that in 9% of these cycles an additional excitation instant is found that normally corresponds to glottal opening. We show that some improvement in detection rate may be obtained if the analysis window length is adapted to the speech pitch. If the measures are applied to the preemphasized speech instead of to the LPC residual, we find that the timing accuracy worsens but the detection rate improves slightly. We assess the computational cost of evaluating the measures and we present new recursive algorithms that give a substantial reduction in computation in all cases. Index Terms: closed phase, glottal closure, group delay, speech analysis.
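To illustrate how a group-delay measure can locate an excitation instant, here is a minimal Python sketch of one such quantity, the energy-weighted group delay of a frame. In the time domain this equals the offset of the frame's center of energy from the window midpoint. Whether it corresponds exactly to any of the measures assessed in the paper is an assumption, and the function names and toy residual are hypothetical.

```python
def energy_weighted_group_delay(frame):
    """Offset of the frame's center of energy from the window midpoint, in samples."""
    energy = [s * s for s in frame]
    total = sum(energy)
    if total == 0.0:
        return 0.0  # silent frame: no meaningful delay, report zero
    centroid = sum(n * e for n, e in enumerate(energy)) / total
    return centroid - (len(frame) - 1) / 2.0


def sliding_measure(residual, window_len):
    """Evaluate the measure on every full window of a (hypothetical) LPC residual."""
    last_start = len(residual) - window_len
    return [
        energy_weighted_group_delay(residual[i:i + window_len])
        for i in range(last_start + 1)
    ]


# Toy residual: a single impulse at sample 6 stands in for a glottal closure.
toy_residual = [0.0] * 10
toy_residual[6] = 1.0
delays = sliding_measure(toy_residual, window_len=5)
```

In this toy case the measure falls from positive to negative as the window slides past the impulse and is zero for the window starting at sample 4, whose midpoint is sample 6. Real residuals are noisy, so a practical detector would need more than this zero-crossing test.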
DATA-DRIVEN VOICE SOURCE WAVEFORM MODELLING
Cited by 3 (2 self)
This paper presents a data-driven approach to the modelling of voice source waveforms. The voice source is a signal that is estimated by inverse-filtering speech signals with an estimate of the vocal tract filter. It is used in speech analysis, synthesis, recognition and coding to decompose a speech signal into its source and vocal tract filter components. Existing approaches parameterize the voice source signal with physically- or mathematically-motivated models. Though the models are well-defined, estimation of their parameters is not well understood and few are capable of reproducing the large variety of voice source waveforms. Here we present a data-driven approach to classify types of voice source waveforms based upon their mel-frequency cepstrum coefficients with Gaussian mixture modelling. A set of ‘prototype’ waveform classes is derived from a weighted average of voice source cycles from real data. An unknown speech signal is then decomposed into its prototype components and resynthesized. Results indicate that with sixteen voice source classes, low resynthesis errors can be achieved. Index Terms: voice source, inverse-filtering, closed-phase analysis, LPC
Pitch and MFCC dependent GMM models for speaker identification systems
, 2004
Cited by 3 (0 self)
Improving the performance of speaker identification systems remains an active research topic. Recently, we proposed an approach that jointly exploits information from the vocal tract and the glottal source. The approach synchronously takes into account the correlation between the two sources of information. The proposed theoretical model, which uses a joint probability law, is presented in this work. Some restrictions and simplifications were applied to show the practical significance of the approach. The fundamental frequency and the Mel Frequency Cepstrum Coefficients (MFCCs) were used to represent the source and the vocal tract information, respectively. The probability density of the source, in particular, was assumed to be uniform. Tests were carried out with only the female speakers of the SPIDRE telephone speech database, recorded from various telephone handsets. In this article, we propose modelling the source information with a Gaussian Mixture Model (GMM) rather than the uniform probabilistic model, and tests are extended to all speakers of the SPIDRE database. Four systems are proposed and compared. The first is a baseline system based on the MFCCs that does not use any source information. The second examines only the voiced segments of the speech signal. The last two implement the suggested approach, with the source information assumed to follow a normal distribution in one and a log-normal distribution in the other. With the proposed approach, the performance gain is 10.5% for the women, 7% for the men, and 8% for all speakers.
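The normal and log-normal source models mentioned in this abstract can be sketched in Python as log-densities of the fundamental frequency. The `combined_score` function below simply adds a source term to a vocal-tract log-likelihood, which treats the two as independent. That is a deliberate simplification for illustration and not the paper's joint model, which accounts for their correlation. All names and parameter values are hypothetical.

```python
import math


def normal_logpdf(value, mean, std):
    """Log-density of a normal distribution with the given mean and standard deviation."""
    variance = std * std
    return -0.5 * math.log(2.0 * math.pi * variance) - (value - mean) ** 2 / (2.0 * variance)


def lognormal_logpdf(f0, mu, sigma):
    """Log-density of f0 when log(f0) is normal with mean mu and standard deviation sigma."""
    if f0 <= 0.0:
        raise ValueError("fundamental frequency must be positive")
    return normal_logpdf(math.log(f0), mu, sigma) - math.log(f0)


def combined_score(mfcc_loglik, f0, mu, sigma):
    """Simplified score: vocal-tract log-likelihood plus an independent log-normal F0 term."""
    return mfcc_loglik + lognormal_logpdf(f0, mu, sigma)


# Hypothetical values: a frame with F0 = 200 Hz scored against a speaker whose
# log-F0 has mean log(210) and standard deviation 0.1.
frame_score = combined_score(-52.3, 200.0, math.log(210.0), 0.1)
```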
Chirp Decomposition of Speech Signals for Glottal Source Estimation
Cited by 2 (2 self)
In a previous work, we showed that the glottal source can be estimated from speech signals by computing the Zeros of the Z-Transform (ZZT). Decomposition was achieved by separating the roots inside (causal contribution) and outside (anticausal contribution) the unit circle. In order to guarantee a correct deconvolution, time alignment on the Glottal Closure Instants (GCIs) was shown to be essential. This paper extends the formalism of ZZT by evaluating the Z-transform on a contour possibly different from the unit circle. A method is proposed for automatically determining this contour by inspecting the root distribution. The derived Zeros of the Chirp Z-Transform (ZCZT)-based technique turns out to be much more robust to GCI location errors.
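The root-splitting idea in this abstract can be written out as a short sketch. The notation is assumed for illustration: an N-sample frame x(n) with x(0) nonzero, zeros Z_m, and a circular contour of radius R. The abstract says only that the contour may differ from the unit circle, so the circular form is an assumption.

```latex
\begin{align}
X(z) &= \sum_{n=0}^{N-1} x(n)\, z^{-n}
      = x(0)\, z^{-(N-1)} \prod_{m=1}^{N-1} \left( z - Z_m \right), \\
\text{causal part:}     &\quad \{\, Z_m : |Z_m| < R \,\}, \\
\text{anticausal part:} &\quad \{\, Z_m : |Z_m| > R \,\}.
\end{align}
```

Setting R = 1 recovers the unit-circle split described for ZZT. The chirp variant corresponds to choosing R from the observed root distribution.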
ON SEPARATING GLOTTAL SOURCE AND VOCAL TRACT INFORMATION IN TELEPHONY SPEAKER VERIFICATION
Cited by 2 (1 self)
The popular mel-frequency cepstral coefficients (MFCCs) capture a mixture of speaker-related, phonemic and channel information. Speaker-related information could be further broken down according to articulatory criteria. How these underlying components are exactly mixed in the features is not well understood. To this end, in this paper we aim at separating the spectra of glottal source and vocal tract using glottal inverse filtering, with an application to speaker recognition over telephone lines. Our experiments on the 10sec-10sec condition of the NIST 2006 SRE corpus suggest that the mel-frequency cepstrum of the voice source is not too useful for recognizing speakers. On the contrary, fusing the vocal tract spectrum with conventional MFCCs improves accuracy, suggesting that vocal tract information should be enhanced. Index Terms: glottal inverse filtering, speaker recognition, source-filter model, mel-frequency cepstrum
On the Potential of Glottal Signatures for Speaker Recognition
Cited by 2 (2 self)
Most current speaker recognition systems are based on features extracted from the magnitude spectrum of speech. However, the excitation signal produced by the glottis is expected to convey complementary, relevant information about the speaker identity. This paper explores the use of two proposed glottal signatures, derived from the residual signal, for speaker identification. Experiments using these signatures are performed on both the TIMIT and YOHO databases. The results are promising and outperform other approaches based on glottal features. In addition, the signatures can be used for text-independent speaker recognition, and only several seconds of voiced speech are sufficient for estimating them reliably.

