Active And Unsupervised Learning for Automatic Speech Recognition
, 2003
Cited by 15 (4 self)
State-of-the-art speech recognition systems are trained using human transcriptions of speech utterances. In this paper, we describe a method to combine active and unsupervised learning for automatic speech recognition (ASR). The goal is to minimize the human supervision for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples, and then selecting the most informative ones with respect to a given cost function. For unsupervised learning, we utilize the remaining untranscribed data by using their ASR output and word confidence scores. Our experiments show that the amount of labeled data needed for a given word accuracy can be reduced by 75% by combining active and unsupervised learning.
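The split between manually transcribed and automatically labeled data described above can be sketched roughly as follows. This is only an illustration, not the authors' procedure: the abstract does not say how word confidences are combined or what cost function is used, so the mean word confidence as an utterance score, the `budget` parameter, the `min_confidence` threshold, and the function name `split_for_labeling` are all assumptions.

```python
from statistics import mean


def split_for_labeling(utterances, budget, min_confidence=0.8):
    """Split an untranscribed pool into manual and automatic training sets.

    utterances: dict mapping utterance id -> non-empty list of word
                confidence scores in [0, 1] from a first ASR pass.
    budget:     number of utterances we can afford to transcribe by hand.

    Assumption: utterance confidence is the mean of its word confidences.
    """
    # Active learning step: lowest-confidence utterances are treated as
    # the most informative and are sent to human transcribers.
    ranked = sorted(utterances, key=lambda uid: mean(utterances[uid]))
    to_transcribe = ranked[:budget]

    # Unsupervised step: keep the ASR output of the remaining utterances,
    # but only when the recognizer was reasonably confident (assumed cutoff).
    auto_labeled = [
        uid for uid in ranked[budget:]
        if mean(utterances[uid]) >= min_confidence
    ]
    return to_transcribe, auto_labeled


# Hypothetical pool of four utterances with made-up confidence scores.
pool = {
    "u1": [0.9, 0.95],
    "u2": [0.3, 0.5],
    "u3": [0.6, 0.7],
    "u4": [0.85, 0.8],
}
manual, automatic = split_for_labeling(pool, budget=1)
```

Here `manual` holds the utterances to transcribe by hand and `automatic` holds those whose ASR output would be reused as training labels; anything in between is simply left out in this sketch.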
Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription
Cited by 12 (0 self)
Deploying an automatic speech recognition system with reasonable performance requires expensive and time-consuming in-domain transcription. Previous work demonstrated that non-professional annotation through Amazon’s Mechanical Turk can match professional quality. We use Mechanical Turk to transcribe conversational speech for as little as one thirtieth the cost of professional transcription. The higher disagreement of non-professional transcribers does not have a significant effect on system performance. While previous work demonstrated that redundant transcription can improve data quality, we found that resources are better spent collecting more data. Finally, we describe a quality control method that does not require professional transcription.
Classification-based melody transcription
- Machine Learning Journal
, 2006
Cited by 10 (2 self)
The melody of a musical piece (informally, the part you would hum along with) is a useful and compact summary of a full audio recording. The extraction of melodic content has practical applications ranging from content-based audio retrieval to the analysis of musical structure. Whereas previous systems generate transcriptions based on a model of the harmonic (or periodic) structure of musical pitches, we present a classification-based system for performing automatic melody transcription that makes no assumptions beyond what is learned from its training data. We evaluate the success of our algorithm by predicting the melody of the ADC 2004 Melody Competition evaluation set, and we show that a simple frame-level note classifier, temporally smoothed by post-processing with a hidden Markov model, produces results comparable to state-of-the-art model-based transcription systems.
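The idea of temporally smoothing a frame-level classifier with a hidden Markov model can be sketched with a small Viterbi decoder. This is a minimal illustration under stated assumptions, not the system from the paper: the classifier posteriors are used directly as emission scores, the transition model is a single assumed `stay_prob` shared by all classes, and the function name `viterbi_smooth` is hypothetical.

```python
import math


def viterbi_smooth(frame_posteriors, stay_prob=0.9):
    """Return the most likely class index per frame under a sticky HMM.

    frame_posteriors: list of per-frame lists of class probabilities
                      (at least two classes), e.g. from a note classifier.
    stay_prob:        assumed probability of keeping the same class from
                      one frame to the next; the rest is shared equally.
    """
    n_classes = len(frame_posteriors[0])
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_classes - 1))
    tiny = 1e-12  # avoids log(0) for zero posteriors

    def trans(i, j):
        return log_stay if i == j else log_switch

    # Best log score of any path ending in each class at the first frame.
    score = [math.log(p + tiny) for p in frame_posteriors[0]]
    backpointers = []
    for frame in frame_posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n_classes):
            best_i = max(range(n_classes), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j) + math.log(frame[j] + tiny))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final class.
    state = max(range(n_classes), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Made-up posteriors for two classes; frame 3 is an isolated outlier.
frames = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.85, 0.15]]
smoothed = viterbi_smooth(frames)
```

Taking the per-frame argmax of `frames` would flip to class 1 at the third frame; the sticky transition model makes a one-frame excursion more expensive than the weak evidence for it, so the decoded path stays in class 0.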
Automatic alignment and error correction of human generated transcripts for long speech recordings
- In Proc. Interspeech
, 2006
Cited by 8 (1 self)
In this paper we examine the issues of aligning and correcting approximate human-generated transcripts for long audio files. Accurate time-aligned transcriptions help provide easier access to audio materials by aiding downstream applications such as the indexing, summarizing and retrieving of audio segments. Accurate time alignments are also necessary when incorporating audio data into the training data for a speech recognizer’s acoustic model. We provide some initial analysis of manual transcriptions which shows that there can be significant differences between the “approximate” manual transcripts generated by typical commercial transcription services and what was actually spoken in the recording. We then present a new alignment approach for approximate transcriptions of long audio files which is designed to discover and correct errors in the manual transcription during the alignment process.
Exploitation of unlabeled sequences in hidden Markov models
- IEEE Trans. on Pattern Analysis and Machine Intelligence
, 2003
Cited by 7 (0 self)
This paper presents a method for effectively using unlabeled sequential data in the learning of hidden Markov models (HMMs). With the conventional approach, class labels for unlabeled data are assigned deterministically by HMMs learned from labeled data. Such labeling often becomes unreliable when the amount of labeled data is small. We propose an extended Baum-Welch (EBW) algorithm in which the labeling is undertaken probabilistically and iteratively so that the labeled and unlabeled data likelihoods are improved. Unlike the conventional approach, the EBW algorithm guarantees convergence to a local maximum of the likelihood. Experimental results on gesture data and speech data show that when labeled training data are scarce, by using unlabeled data, the EBW algorithm improves the classification performance of HMMs more robustly than the conventional naive labeling (NL) approach. Index Terms: unlabeled data, sequential data, hidden Markov models, extended Baum-Welch algorithm.
Unsupervised Spoken Keyword Spotting via Segmental DTW on Gaussian Posteriorgrams
Cited by 7 (4 self)
In this paper, we present an unsupervised learning framework to address the problem of detecting spoken
Unsupervised training for Mandarin broadcast news and conversation transcription
- In Proc. ICASSP
, 2007
Cited by 4 (1 self)
A significant cost in obtaining acoustic training data is the generation of accurate transcriptions. For some sources closed-caption data is available. This allows the use of lightly-supervised training techniques. However, for some sources and languages closed-caption data is not available. In these cases unsupervised training techniques must be used. This paper examines the use of unsupervised techniques for discriminative training. In unsupervised training, automatic transcriptions from a recognition system are used for training. As these transcriptions may be errorful, data selection may be useful. Two forms of selection are described, one to remove non-target language shows, the other to remove segments with low confidence. Experiments were carried out on a Mandarin transcription task. Two types of test data were considered, Broadcast News (BN) and Broadcast Conversations (BC). Results show that the gains from unsupervised
Development of the CU-HTK 2004 broadcast news transcription systems
- In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing
, 2005
Cited by 3 (2 self)
This paper describes our recent work on improving broadcast news transcription and presents details of the CU-HTK Broadcast News English (BN-E) transcription system for the DARPA/NIST Rich Transcription 2004 Speech-to-Text (RT04) evaluation. A key focus has been building a system using an order of magnitude more acoustic training data than we have previously attempted. We have also investigated a range of techniques to improve both Minimum Phone Error (MPE) training and the efficient creation of MPE-based narrow-band models. The paper describes two alternative system structures that run in under 10×RT and a further system that runs in less than 1×RT. This final system gives lower word error rates than our 2003 system that ran in 10×RT.
Learning N-Best Correction Models from Implicit User Feedback in a Multi-Modal Local Search Application
Cited by 3 (1 self)
We describe a novel n-best correction model that can leverage implicit user feedback (in the form of clicks) to improve performance in a multi-modal speech-search application. The proposed model works in two stages. First, the n-best list generated by the speech recognizer is expanded with additional candidates, based on confusability information captured via user click statistics. In the second stage, this expanded list is rescored and pruned to produce a more accurate and compact n-best list. Results indicate that the proposed n-best correction model leads to significant improvements over the existing baseline, as well as other traditional n-best rescoring approaches.
Unsupervised Training with Directed Manual Transcription for Recognising Mandarin Broadcast Audio
Cited by 3 (1 self)
The performance of unsupervised discriminative training has been found to be highly dependent on the accuracy of the initial automatic transcription. This paper examines a strategy where a relatively small amount of poorly recognised data is manually transcribed to supplement the automatically transcribed data. Experiments were carried out on a Mandarin broadcast transcription task using both Broadcast News (BN) and Broadcast Conversation (BC) data. A range of experimental conditions are compared for both maximum likelihood and discriminative training using directed manual transcription. For BC data, using fully unsupervised discriminative training, only 17% of the reduction in character error rate (CER) from supervised training is obtained. Automatically selecting 18% of the data for manual transcription yields 50% of the CER gain from supervised training. The directed approach to selecting data outperforms the use of a random set of data for manual transcription.
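Figures such as "17% of the reduction in CER from supervised training" are most naturally read as a ratio of CER reductions relative to a common baseline. The sketch below shows that reading; the exact definition used in the paper is not given in the abstract, so the formula, the function name `fraction_of_supervised_gain`, and the example CER values are assumptions for illustration only.

```python
def fraction_of_supervised_gain(baseline_cer, method_cer, supervised_cer):
    """Share of the supervised CER reduction that a method recovers.

    All arguments are character error rates in percent. The baseline is
    the system before any additional training; the supervised system is
    the reference trained on fully manual transcriptions.
    """
    full_gain = baseline_cer - supervised_cer
    if full_gain <= 0:
        raise ValueError("supervised training must improve on the baseline")
    return (baseline_cer - method_cer) / full_gain


# Made-up CER values, not results from the paper.
share = fraction_of_supervised_gain(
    baseline_cer=20.0, method_cer=19.0, supervised_cer=18.0
)
```

With these illustrative numbers the method closes half of the gap between the baseline and the fully supervised system.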

