Results 1 - 10 of 51
Tandem connectionist feature extraction for conventional HMM systems
"... Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estim ..."
Abstract
-
Cited by 242 (24 self)
- Add to MetaCart
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate the subword probability posteriors, then using transformations of these estimates as the base features for a conventionally-trained Gaussian-mixture based system, we achieve relative error rate reductions of 35% or more on the multicondition Aurora noisy continuous digits task.
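The first half of the tandem idea described above is a discriminatively trained network that maps a context window of acoustic frames to per-frame subword posteriors. The sketch below is a minimal illustration of that forward pass in Python/NumPy, assuming a single tanh hidden layer and hypothetical pre-trained weights W1, b1, W2, b2; none of these details come from the abstract.

# Minimal sketch of the posterior-generating front end of a tandem system:
# a trained MLP maps a context window of acoustic frames to per-frame
# subword posteriors via softmax. Weights are assumed to come from
# discriminative training on phonetically labeled data.
import numpy as np

def phone_posteriors(frames, W1, b1, W2, b2, context=4):
    """frames: (T, D) acoustic features; returns (T, n_phones) posteriors."""
    T, D = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    # Stack +/- context frames so each input vector sees 2*context+1 frames.
    windows = np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
    hidden = np.tanh(windows @ W1 + b1)          # hidden layer
    logits = hidden @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)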
Locating Singing Voice Segments within Music Signals
, 2001
"... A sung vocal line is the prominent feature of much popular music. It would be useful to reliably locate the portions of a musical track during which the vocals are present, both as a ‘signature’ of the piece and as a precursor to automatic recognition of lyrics. Here, we approach this problem by usi ..."
Abstract
-
Cited by 77 (6 self)
- Add to MetaCart
(Show Context)
A sung vocal line is the prominent feature of much popular music. It would be useful to reliably locate the portions of a musical track during which the vocals are present, both as a ‘signature’ of the piece and as a precursor to automatic recognition of lyrics. Here, we approach this problem by using the acoustic classifier of a speech recognizer as a detector for speech-like sounds. Although singing (including a musical background) is a relatively poor match to an acoustic model trained on normal speech, we propose various statistics of the classifier’s output in order to discriminate singing from instrumental accompaniment. A simple HMM allows us to find a best labeling sequence for this uncertain data. On a test set of forty 15 second excerpts of randomly-selected music, our classifier achieved around 80% classification accuracy at the frame level. The utility of different features, and our plans for eventual lyrics recognition, are discussed.
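The abstract does not spell out which statistics of the classifier output are used, so the following is only one plausible example: the mean of the per-frame maximum phone posterior over a sliding window, which tends to be higher when the classifier is confident (speech-like or sung frames) than over diffuse instrumental accompaniment. The function name and window sizes are hypothetical.

# Hypothetical clip-level statistic of a speech classifier's output: the mean
# per-frame maximum phone posterior within each analysis window. Frames
# dominated by a single phone class (speech-like, e.g. singing) tend to score
# higher than diffuse instrumental frames.
import numpy as np

def mean_max_posterior(posteriors, win=100, hop=50):
    """posteriors: (T, n_phones). Returns one score per analysis window."""
    peaks = posteriors.max(axis=1)               # per-frame peak posterior
    scores = [peaks[s:s + win].mean()
              for s in range(0, len(peaks) - win + 1, hop)]
    return np.array(scores)                      # higher -> more vocal-like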
Computational Auditory Scene Recognition
- In IEEE Int’l Conf. on Acoustics, Speech, and Signal Processing
, 2001
"... v 1 ..."
(Show Context)
Connectionist speech recognition of Broadcast News
, 2002
"... This paper describes connectionist techniques for recognition of Broadcast News. The fundamental difference between connectionist systems and more conventional mixture-of-Gaussian systems is that connectionist models directly estimate posterior probabilities as opposed to likelihoods. Access to post ..."
Abstract
-
Cited by 38 (15 self)
- Add to MetaCart
This paper describes connectionist techniques for recognition of Broadcast News. The fundamental difference between connectionist systems and more conventional mixture-of-Gaussian systems is that connectionist models directly estimate posterior probabilities as opposed to likelihoods. Access to posterior probabilities has enabled us to develop a number of novel approaches to confidence estimation, pronunciation modelling and search. In addition we have investigated a new feature extraction technique based on the modulation-filtered spectrogram (MSG), and methods for combining multiple information sources. We have incorporated all of these techniques into a system for the transcription ...
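In hybrid connectionist-HMM decoding, the network's posteriors P(q|x) are commonly converted into scaled likelihoods by dividing by the class priors P(q) (Bayes' rule, with P(x) dropped as a constant during decoding). A minimal sketch, assuming priors estimated from training alignments:

# Convert framewise posteriors P(q|x) into scaled log-likelihoods
# log P(q|x) - log P(q), the usual observation scores in hybrid HMM/ANN decoding.
import numpy as np

def scaled_log_likelihoods(posteriors, priors, floor=1e-8):
    """posteriors: (T, Q) network outputs; priors: (Q,) state relative frequencies."""
    post = np.clip(posteriors, floor, 1.0)
    return np.log(post) - np.log(np.clip(priors, floor, 1.0))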
Speech/music segmentation using entropy and dynamism features in a HMM classification framework
, 2003
"... Inthi paper, we present a new approach towardshia performance speech/musi diech/musiFT onrealiTGI tasks related to theautomati transcriL'Vof broadcast news. In the approach presented here, anarti'#FI neural network (ANN)traiIV on clean speech only (as usedi a standard large vocabulary spee ..."
Abstract
-
Cited by 38 (3 self)
- Add to MetaCart
In this paper, we present a new approach towards high performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, an artificial neural network (ANN) trained on clean speech only (as used in a standard large vocabulary speech recognition system) is used as a channel model at the output of which the entropy and "dynamism" will be measured every 10 ms. These features are then integrated over time through an ergodic 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints on each HMM state. For instance, in the case of entropy, it is intuitively clear (and observed in practice) that, on average, the entropy at the output of the ANN will be larger for non-speech segments than for speech segments presented at its input. In our case, the ANN acoustic model was a multilayer perceptron (MLP, as often used in hybrid HMM/ANN systems) generating at its output estimates of the phonetic posterior probabilities based on the acoustic vectors at its input. It is from these outputs, thus from "real" probabilities, that the entropy and dynamism are estimated. The 2-state speech/non-speech HMM will take these two-dimensional features (entropy and dynamism), whose distributions will be modeled through multi-Gaussian densities or a secondary MLP. The parameters of this HMM are trained in a supervised manner using the Viterbi algorithm. Although the proposed method can easily be adapted to other speech/non-speech discrimination applications, the present paper only focuses on speech/music segmentation. Different experiments, involving different speech and music styles, as well as different temporal distributions of the speech and music signals (real data distributions, mostly speech, or mostly music), illustrate the robustness of the approach, always resulting in a correct segmentation performance higher than 90%. Finally, we will ...
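As a rough illustration of the two features described above, the sketch below computes framewise entropy of the MLP posteriors and a simple "dynamism" measure (mean squared change between consecutive posterior vectors), followed by a plain 2-state Viterbi decode. The minimum-duration constraints and the Gaussian/MLP modeling of the feature distributions are omitted, and the exact feature definitions here are assumptions rather than the paper's.

# Entropy and a simple "dynamism" measure from framewise phone posteriors,
# plus an unconstrained 2-state Viterbi decode over precomputed state
# log-likelihoods of those features.
import numpy as np

def entropy_dynamism(posteriors, eps=1e-12):
    """posteriors: (T, Q) framewise phone posteriors -> (T, 2) features."""
    p = np.clip(posteriors, eps, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)                 # tends to be high for non-speech
    diff = np.diff(posteriors, axis=0, prepend=posteriors[:1])
    dynamism = (diff ** 2).mean(axis=1)                    # low when posteriors are static
    return np.column_stack([entropy, dynamism])

def viterbi_2state(log_lik, log_trans, log_init):
    """log_lik: (T, 2) per-state log-likelihoods of the (entropy, dynamism) features;
    log_trans: (2, 2) log transition matrix; log_init: (2,) log initial probabilities.
    Returns the best speech(0)/non-speech(1) label sequence (no duration constraints)."""
    T = log_lik.shape[0]
    delta = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    delta[0] = log_init + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans         # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                         # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path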
Speech/music discrimination for multimedia applications
- in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing ICASSP 2000
"... Automatic discrimination of speech and music is an important tool in many multimedia applications. Previous work has focused on using long-term features such as differential parameters, variances, and time-averages of spectral parameters. These classifiers use features estimated over windows of 0.5– ..."
Abstract
-
Cited by 38 (1 self)
- Add to MetaCart
(Show Context)
Automatic discrimination of speech and music is an important tool in many multimedia applications. Previous work has focused on using long-term features such as differential parameters, variances, and time-averages of spectral parameters. These classifiers use features estimated over windows of 0.5–5 seconds, and are relatively complex. In this paper, we present our results of combining the line spectral frequencies (LSFs) and zero-crossing-based features for frame-level narrowband speech/music discrimination. Our classification results for different types of music and speech show the good discriminating power of these features. Our classification algorithms operate using only a frame delay of 20 ms, making them suitable for real-time multimedia applications.
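Of the two cue families mentioned above, the zero-crossing-based one is easy to sketch: the fraction of sign changes in each 20 ms frame, which tends to be higher and more erratic for unvoiced speech than for much music. LSF extraction (from an LPC analysis) is not reproduced here, and the 8 kHz narrowband sampling rate is an assumption of this sketch.

# Zero-crossing rate per non-overlapping 20 ms frame of a narrowband signal.
import numpy as np

def zero_crossing_rate(signal, sample_rate=8000, frame_ms=20):
    """signal: 1-D array of samples -> fraction of sign changes per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    signs = np.sign(frames)
    crossings = (np.abs(np.diff(signs, axis=1)) > 0).sum(axis=1)
    return crossings / frame_len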
Connectionist feature extraction for conventional HMM systems
- Proc. of ICASSP 00
, 2000
"... Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estim ..."
Abstract
-
Cited by 26 (9 self)
- Add to MetaCart
(Show Context)
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate the subword probability posteriors, then using transformations of these estimates as the base features for a conventionally-trained Gaussian-mixture based system, we achieve relative error rate reductions of 35% or more on the multicondition AURORA noisy continuous digits task.
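Complementing the posterior-generation sketch under the first entry above, the second half of the tandem pipeline transforms the network's posterior estimates before handing them to a conventional GMM-HMM recognizer. A common choice, assumed here rather than taken from the abstract, is a logarithm followed by a PCA decorrelation estimated on training data.

# Log + PCA decorrelation of framewise posteriors, producing observation
# vectors for a conventionally trained Gaussian-mixture-based recognizer.
import numpy as np

def fit_pca(log_posteriors):
    """log_posteriors: (N, Q) training log-posteriors -> (mean, projection)."""
    mean = log_posteriors.mean(axis=0)
    cov = np.cov(log_posteriors - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # components sorted by variance
    return mean, eigvecs[:, order]

def tandem_features(posteriors, mean, projection, eps=1e-10):
    """Log then decorrelate framewise posteriors: (T, Q) -> (T, Q)."""
    logp = np.log(np.clip(posteriors, eps, 1.0))
    return (logp - mean) @ projection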
Audio-based semantic concept classification for consumer video
- IEEE TASLP
, 2010
"... Abstract—This paper presents a novel method for automatically classifying consumer video clips based on their soundtracks. We use a set of 25 overlapping semantic classes, chosen for their usefulness to users, viability of automatic detection and of annotator labeling, and sufficiency of representat ..."
Abstract
-
Cited by 26 (10 self)
- Add to MetaCart
(Show Context)
This paper presents a novel method for automatically classifying consumer video clips based on their soundtracks. We use a set of 25 overlapping semantic classes, chosen for their usefulness to users, viability of automatic detection and of annotator labeling, and sufficiency of representation in available video collections. A set of 1873 videos from real users has been annotated with these concepts. Starting with a basic representation of each video clip as a sequence of mel-frequency cepstral coefficient (MFCC) frames, we experiment with three clip-level representations: single Gaussian modeling, Gaussian mixture modeling, and probabilistic latent semantic analysis of a Gaussian component histogram. Using such summary features, we produce support vector machine (SVM) classifiers based on the Kullback–Leibler, Bhattacharyya, or Mahalanobis distance measures. Quantitative evaluation shows that our approaches are effective for detecting interesting concepts in a large collection of real-world consumer video clips. Index Terms—Audio classification, consumer video classification, semantic concept detection, soundtrack analysis.
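As an illustration of the simplest clip-level representation mentioned above, the sketch below fits a single full-covariance Gaussian to a clip's MFCC frames and compares two clips with the closed-form symmetrised Kullback-Leibler divergence. Wrapping this in an SVM kernel (e.g. exp(-gamma * KL)) is a further assumption, not a detail taken from the abstract.

# Single-Gaussian clip model over MFCC frames, plus closed-form KL divergence
# between two such Gaussians and its symmetrised version.
import numpy as np

def clip_gaussian(mfcc, reg=1e-6):
    """mfcc: (T, D) frames -> (mean, covariance), with a small ridge for stability."""
    mu = mfcc.mean(axis=0)
    cov = np.cov(mfcc, rowvar=False) + reg * np.eye(mfcc.shape[1])
    return mu, cov

def kl_gauss(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ), closed form for Gaussians."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d + logdet1 - logdet0)

def kl_symmetric(g0, g1):
    """Symmetrised KL between two (mean, covariance) clip models."""
    return kl_gauss(*g0, *g1) + kl_gauss(*g1, *g0)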
Audio Segmentation, Classification and Clustering in a Broadcast News Task
, 2003
"... This paper describes our work on the development of an audio segmentation, classification and clustering system applied to a Broadcast News task for the European Portuguese language. ..."
Abstract
-
Cited by 24 (5 self)
- Add to MetaCart
This paper describes our work on the development of an audio segmentation, classification and clustering system applied to a Broadcast News task for the European Portuguese language.
Content-based Video Retrieval: An overview
, 2000
"... Content-based Image Retrieval systems (CBIRS) start ourishing on the Web. Their performances are continuously improving and their base principles span a wide range of diversity. Content-based Video Retrieval systems (CBVRS) are less common and seem at a first glance to be a natural extension of CBIR ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
Content-based Image Retrieval systems (CBIRS) are starting to flourish on the Web. Their performances are continuously improving and their base principles span a wide range of diversity. Content-based Video Retrieval systems (CBVRS) are less common and seem at first glance to be a natural extension of CBIRS. In this document, we summarise advances made in the development of CBVRS and analyse their relationship to CBIRS. While doing so, we show that CBVRS are actually not so obvious extensions of CBIRS.