Results 1 - 10 of 64
Support vector machines for speaker and language recognition
Computer Speech and Language, 2006
An overview of text-independent speaker recognition: from features to supervectors
2009
Cited by 31 (14 self)
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate on advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
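The move from vectors to supervectors can be illustrated with a minimal sketch, assuming a diagonal-covariance universal background model (UBM) whose means are relevance-MAP adapted to one utterance and then stacked into a single long vector. This is a common recipe, not one the abstract itself specifies; the function names, the relevance factor `r = 16`, and the toy model are illustrative assumptions.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def posteriors(x, weights, means, variances):
    """Component posteriors gamma_k(x) for one frame (log-sum-exp for stability)."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_supervector(frames, weights, means, variances, r=16.0):
    """Hypothetical helper: relevance-MAP adapt UBM means, then concatenate them.

    Adapted mean: m_k' = (sum_t gamma_k(t) * x_t + r * m_k) / (n_k + r).
    """
    K, D = len(means), len(means[0])
    n = [0.0] * K                      # soft occupation counts n_k
    f = [[0.0] * D for _ in range(K)]  # first-order statistics
    for x in frames:
        g = posteriors(x, weights, means, variances)
        for k in range(K):
            n[k] += g[k]
            for d in range(D):
                f[k][d] += g[k] * x[d]
    return [(f[k][d] + r * means[k][d]) / (n[k] + r)
            for k in range(K) for d in range(D)]
```

With no frames the supervector collapses to the stacked UBM means; with data, each mean moves toward the frames it explains, more strongly as its soft count grows relative to `r`.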
Phonetic speaker recognition with support vector machines
In Advances in Neural Information Processing Systems, 2004
Cited by 22 (2 self)
A recent area of significant progress in speaker recognition is the use of high-level features: idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long-term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.
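The term-frequency idea can be sketched as follows, assuming phone bigrams as the "terms" and a weighting of each relative frequency by 1/sqrt(p(d | background)), which is our reading of a likelihood-ratio-style linearized kernel. Treat that weighting, the probability floor, the function names, and the toy phone strings as assumptions rather than the paper's exact recipe. The sparse inner product at the end is the quantity a linear-kernel SVM would consume.

```python
import math
from collections import Counter

def bigram_frequencies(phones):
    """Relative frequencies p(d | utterance) of phone bigrams d."""
    grams = Counter(zip(phones, phones[1:]))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}

def tfllr_vector(freqs, background, floor=1e-6):
    """Scale each frequency by 1/sqrt(p(d | background)); the floor avoids /0."""
    return {g: p / math.sqrt(max(background.get(g, 0.0), floor))
            for g, p in freqs.items()}

def linear_kernel(u, v):
    """Sparse dot product over the bigrams both vectors share."""
    return sum(val * v[g] for g, val in u.items() if g in v)

# Toy usage (hypothetical phone strings).
background = bigram_frequencies("a b a b c a b".split())
speaker = tfllr_vector(bigram_frequencies("a b a b".split()), background)
utterance = tfllr_vector(bigram_frequencies("a b c".split()), background)
print(round(linear_kernel(speaker, utterance), 4))  # 0.6667
```

Dividing by the background probability down-weights bigrams that every speaker produces, so rarer, more speaker-specific phone patterns contribute more to the score.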
Language Identification of Encrypted VoIP Traffic: Alejandra y Roberto or Alice and Bob
In Proceedings of the USENIX Security Symposium, 2007
Cited by 20 (2 self)
Voice over IP (VoIP) has become a popular protocol for making phone calls over the Internet. Due to the potential transit of sensitive conversations over untrusted network infrastructure, it is well understood that the contents of a VoIP session should be encrypted. However, we demonstrate that current cryptographic techniques do not provide adequate protection when the underlying audio is encoded using bandwidth-saving Variable Bit Rate (VBR) coders. Explicitly, we use the length of encrypted VoIP packets to tackle the challenging task of identifying the language of the conversation. Our empirical analysis of 2,066 native speakers of 21 different languages shows that a substantial amount of information can be discerned from encrypted VoIP traffic. For instance, our 21-way classifier achieves 66% accuracy, almost a 14-fold improvement over random guessing. For 14 of the 21 languages, the accuracy is greater than 90%. On binary classification tasks (e.g., “Is this a Spanish or English conversation?”), we achieve an overall accuracy of 86.6%. Our analysis highlights what we believe to be interesting new privacy issues in VoIP.
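To make the side channel concrete, here is a minimal sketch of guessing a call's language from encrypted packet sizes alone. It assumes a naive-Bayes model over packet-length bigrams with add-one smoothing; the paper's own features and classifier may differ, and the class name and packet sizes below are hypothetical toy values, not measurements.

```python
import math
from collections import Counter

def length_bigrams(lengths):
    """Consecutive pairs of packet lengths (bytes) in one call."""
    return list(zip(lengths, lengths[1:]))

class PacketLengthClassifier:
    """Hypothetical naive-Bayes classifier over packet-length bigrams."""

    def __init__(self):
        self.counts = {}   # language -> Counter of length bigrams
        self.vocab = set()

    def train(self, language, calls):
        c = self.counts.setdefault(language, Counter())
        for lengths in calls:
            grams = length_bigrams(lengths)
            c.update(grams)
            self.vocab.update(grams)

    def log_likelihood(self, language, lengths):
        c = self.counts[language]
        total = sum(c.values())
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen bigrams
        return sum(math.log((c[g] + 1) / (total + v))
                   for g in length_bigrams(lengths))

    def classify(self, lengths):
        return max(self.counts, key=lambda lang: self.log_likelihood(lang, lengths))

# Toy usage with invented packet sizes.
clf = PacketLengthClassifier()
clf.train("english", [[40, 54, 54, 40, 54]])
clf.train("spanish", [[40, 66, 40, 66, 66]])
print(clf.classify([40, 66, 66, 40]))  # spanish
```

The point of the sketch is that encryption hides the payload but not its size, and with a VBR codec the size sequence still carries the rhythm of the spoken sounds.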
Language Identification Using Gaussian Mixture Model Tokenization
IEEE ICASSP
Cited by 11 (0 self)
Phone tokenization followed by n-gram language modeling has consistently provided good results for the task of language identification. In this paper, this technique is generalized by using Gaussian mixture models as the basis for tokenizing. Performance results are presented for a system employing a GMM tokenizer in conjunction with multiple language processing and score combination techniques. On the 1996 CallFriend LID evaluation set, a 12-way closed-set error rate of 17% was obtained.
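GMM tokenization can be pictured as swapping the phone recognizer for a frame labeler: each acoustic frame is mapped to the index of its highest-scoring Gaussian component, and the resulting token stream is then modeled with n-grams as if it were a phone sequence. The sketch below assumes diagonal covariances and one token per frame; the function names and toy model are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_tokenize(frames, weights, means, variances):
    """Label every frame with the index of its most likely weighted component."""
    tokens = []
    for x in frames:
        scores = [math.log(w) + log_gauss_diag(x, m, v)
                  for w, m, v in zip(weights, means, variances)]
        tokens.append(max(range(len(scores)), key=scores.__getitem__))
    return tokens

def token_bigrams(tokens):
    """Bigram stream that a per-language n-gram model would be trained on."""
    return list(zip(tokens, tokens[1:]))

# Toy usage: a 1-D, two-component mixture.
toks = gmm_tokenize([[0.1], [2.9], [3.1], [-0.2]],
                    [0.5, 0.5], [[0.0], [3.0]], [[1.0], [1.0]])
print(toks)                 # [0, 1, 1, 0]
print(token_bigrams(toks))  # [(0, 1), (1, 1), (1, 0)]
```

The appeal of this design is cost: scoring a GMM per frame is far cheaper than running a full phone recognizer, while the n-gram back end still captures sequential structure.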
Approaches to language identification using Gaussian mixture models and shifted delta cepstral features
Proc. ICSLP, 2002
Cited by 11 (2 self)
Published results indicate that automatic language identification (LID) systems that rely on multiple-language phone recognition and n-gram language modeling produce the best performance in formal LID evaluations. By contrast, Gaussian mixture model (GMM) systems, which measure acoustic characteristics, are far more efficient computationally but have tended to provide inferior levels of performance. This paper describes two GMM-based approaches to language identification that use shifted delta cepstra (SDC) feature vectors to achieve LID performance comparable to that of the best phone-based systems. The approaches include both acoustic scoring and a recently developed GMM tokenization system that is based on a variation of phonetic recognition and language modeling. System performance is evaluated on both the CallFriend and OGI corpora.
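Shifted delta cepstra are commonly parameterized as N-d-P-k: N cepstral coefficients per frame, a delta spread of d frames, a shift of P frames between blocks, and k stacked blocks (7-1-3-7 is a frequently cited setting). For frame t, block i = 0, ..., k-1 contributes Δc(t + iP) = c(t + iP + d) - c(t + iP - d), and the k deltas are concatenated into one kN-dimensional vector. The sketch below follows that definition; clamping out-of-range indices to the first or last frame is our own assumption, since implementations treat utterance edges differently.

```python
def shifted_delta_cepstra(cepstra, d=1, P=3, k=7):
    """SDC vectors from a list of frames, each a list of N cepstral coefficients.

    Returns one k*N-dimensional vector per input frame.
    Edge handling (index clamping) is an illustrative assumption.
    """
    T = len(cepstra)

    def frame(i):
        # Clamp indices that fall outside the utterance.
        return cepstra[min(max(i, 0), T - 1)]

    sdc = []
    for t in range(T):
        vec = []
        for i in range(k):
            center = t + i * P
            plus, minus = frame(center + d), frame(center - d)
            vec.extend(a - b for a, b in zip(plus, minus))
        sdc.append(vec)
    return sdc

# Toy usage: a 1-coefficient linear ramp, so interior deltas equal 2*d.
ceps = [[float(t)] for t in range(20)]
print(shifted_delta_cepstra(ceps)[0])  # [1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
```

Stacking deltas from k shifted blocks lets a frame-level GMM see roughly (k-1)P + 2d frames of temporal context, which is how SDC features let acoustic GMM systems approach phone-based performance.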
Segment-Based Automatic Language Identification
1997
Cited by 10 (1 self)
This paper discusses the formulation, development and analysis of a segment-based approach to the Automatic Language Identification (LID) problem. This system utilizes phonotactic, acoustic-phonetic and prosodic information within a unified probabilistic framework. The implementation of this framework allows the relative contributions of different sources of information to be determined empirically and provides the mechanism for combining them within one system. The system has been evaluated using the OGI Multi-Language Telephone Speech Corpus and the results are competitive with other current LID systems. The results have also indicated that, while the phonotactic information of a spoken utterance is the most useful information for LID, acoustic-phonetic and prosodic information can be useful for increasing a system's accuracy, especially when the utterance is short.
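The abstract does not spell out how the information sources are combined, so the sketch below assumes one simple alternative: a weighted sum of per-source log-likelihoods, with the weights chosen by grid search for accuracy on held-out data. The source names, the grid, and the data layout are hypothetical; this is not the paper's probabilistic framework.

```python
import itertools

SOURCES = ("phonotactic", "acoustic_phonetic", "prosodic")

def fused_decision(scores, weights):
    """scores: language -> {source: log-likelihood}; return the best language."""
    return max(scores, key=lambda lang: sum(
        weights[src] * scores[lang][src] for src in SOURCES))

def tune_weights(dev_set, grid=(0.0, 0.5, 1.0)):
    """Grid-search source weights that maximize accuracy on (scores, label) pairs."""
    best, best_acc = None, -1.0
    for combo in itertools.product(grid, repeat=len(SOURCES)):
        weights = dict(zip(SOURCES, combo))
        acc = sum(fused_decision(s, weights) == label
                  for s, label in dev_set) / len(dev_set)
        if acc > best_acc:
            best, best_acc = weights, acc
    return best, best_acc
```

Inspecting the tuned weights gives a rough, empirical measure of how much each source contributes, in the spirit of the abstract's observation that phonotactics dominate while the other sources help at the margin.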
Language Identification With Language-Independent Acoustic Models
Proc. Eurospeech, 1997
Cited by 9 (2 self)
In this paper we explore the use of language-independent acoustic models for language identification (LID). The phone sequence output by a single language-independent phone recognizer is rescored with language-dependent phonotactic models approximated by phone bigrams. The language-independent phoneme inventory was obtained by agglomerative hierarchical clustering, using a measure of similarity between phones. This system is compared with a parallel language-dependent phone architecture, which makes optimal use of both the acoustic log-likelihood and the phonotactic score for language identification. Experiments were carried out on the 4-language telephone speech corpus IDEAL, containing calls in British English, Spanish, French and German. Results show that the language-independent approach performs as well as the language-dependent one: a 9% versus 10% error rate on 10-second chunks for the 4-language task.
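The back end described above, in which one phone sequence is rescored by language-dependent phone bigram models, can be sketched as follows. Add-one smoothing, the sentence-boundary symbols, the function names, and the toy phone strings are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def train_bigram_lm(sequences):
    """Count phone bigrams (with boundary symbols) for one language."""
    bigrams, unigrams, vocab = Counter(), Counter(), {EOS}
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams, vocab

def log_prob(lm, seq, vocab_size):
    """Add-one smoothed log P(seq) under a phone bigram model."""
    bigrams, unigrams, _ = lm
    padded = [BOS] + list(seq) + [EOS]
    return sum(math.log((bigrams[(p, c)] + 1) / (unigrams[p] + vocab_size))
               for p, c in zip(padded, padded[1:]))

def identify(lms, seq):
    """Pick the language whose phonotactic model scores `seq` highest."""
    vocab = set().union(*(lm[2] for lm in lms.values())) | set(seq)
    return max(lms, key=lambda lang: log_prob(lms[lang], seq, len(vocab)))

# Toy usage with hypothetical phone strings.
lms = {"en": train_bigram_lm([["dh", "ax", "k", "ae", "t"]]),
       "es": train_bigram_lm([["e", "l", "g", "a", "t", "o"]])}
print(identify(lms, ["dh", "ax", "k", "ae", "t"]))  # en
```

Because a single shared recognizer produces the phone string, only the small bigram tables differ per language, which is what makes the language-independent design cheap to extend.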
Automatic language identification
2001
Cited by 8 (0 self)
Automatic language identification of speech is the process by which the language of a digitized speech utterance is recognized by a computer. In this paper, we will describe the set of available cues for language identification of speech and discuss the different approaches to building working systems. This overview includes a range of historical approaches, contemporary systems that have been evaluated on standard databases, and promising future approaches. Comparative ...
FST-based recognition techniques for multi-lingual and multi-domain spontaneous speech
Proceedings of the European Conference on Speech Communication and Technology, 2001
Cited by 8 (1 self)
In this paper we present techniques for building multi-domain and multi-lingual recognizers within a finite-state transducer (FST) framework. The flexibility of the FST approach is also demonstrated on the task of incorporating networks modeling different types of non-speech events into an existing word lattice network. The ability to create robust multi-domain and/or multi-lingual recognizers for spontaneous speech will enable a conversational system to switch seamlessly and automatically among different domains and/or languages. Preliminary results using a bi-domain recognizer show only a small degradation in recognition accuracy compared with domain-dependent recognition. Similarly promising results were observed using a bilingual recognizer which performs simultaneous language identification and recognition. When using the FST techniques to add non-speech models to the recognizer, experiments show a 10% reduction in word error rate across all utterances and a 30% reduction on utterances containing non-speech events.

