Results 1 - 10 of 26
An application of recurrent neural networks to discriminative keyword spotting
"... Keyword spotting is a detection task consisting in discovering the presence of specific spoken words in unconstrained speech. The majority of keyword spotting systems are based on generative hidden Markov models and lack discriminative capabilities. However, discriminative keyword spotting systems ..."
Abstract
-
Cited by 18 (5 self)
- Add to MetaCart
(Show Context)
Keyword spotting is a detection task that consists in detecting the presence of specific spoken words in unconstrained speech. The majority of keyword spotting systems are based on generative hidden Markov models and lack discriminative capabilities. Existing discriminative keyword spotting systems, however, rely on the estimation of a posteriori probabilities at the frame level and hence only use information from short time spans. This paper presents a discriminative keyword spotting system based solely on recurrent neural networks, which uses information from long time spans to estimate keyword probabilities. On a keyword spotting task over a large database of unconstrained speech, where an HMM-based speech recogniser achieves a word accuracy of only 65%, the system achieved a keyword spotting accuracy of 84.5%.
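As a rough illustration of scoring keywords with a recurrent network that integrates information over long time spans, the following Python sketch runs a generic Elman-style RNN over a feature sequence and emits a per-frame keyword posterior. The architecture, layer sizes, and feature dimensionality are placeholders, not the network described in the paper.

```python
import numpy as np

def rnn_keyword_posteriors(frames, W_in, W_rec, W_out, b_h, b_o):
    """Run a simple Elman RNN over a (T, D) feature sequence and return a
    per-frame keyword posterior.  Purely illustrative; the paper's actual
    network architecture and training procedure are not reproduced here."""
    T, _ = frames.shape
    H = W_rec.shape[0]
    h = np.zeros(H)
    posteriors = np.zeros(T)
    for t in range(T):
        # the hidden state carries information across the whole utterance,
        # i.e. "long time spans" rather than a fixed frame window
        h = np.tanh(frames[t] @ W_in + h @ W_rec + b_h)
        posteriors[t] = 1.0 / (1.0 + np.exp(-(h @ W_out + b_o)))  # sigmoid
    return posteriors

# toy usage with random, untrained weights
rng = np.random.default_rng(0)
D, H = 13, 32                                  # assumed: 13 MFCCs, 32 hidden units
frames = rng.standard_normal((200, D))
W_in = rng.standard_normal((D, H)) * 0.1
W_rec = rng.standard_normal((H, H)) * 0.1
W_out = rng.standard_normal(H) * 0.1
post = rnn_keyword_posteriors(frames, W_in, W_rec, W_out, 0.0, 0.0)
detected = post.max() > 0.5                    # threshold chosen arbitrarily
```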
Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks
In Proc. of ICASSP, 2009
"... In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative lea ..."
Abstract
-
Cited by 17 (15 self)
- Add to MetaCart
(Show Context)
In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, the spotter is able to make use of past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database, but only on the TIMIT corpus of read speech. We show that our method prevails over a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been proven to outperform HMM-based techniques.
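A minimal sketch of the feature-enhancement step described above, assuming PyTorch and a folded 39-phoneme set as in TIMIT: a bidirectional LSTM produces framewise phoneme posteriors, which are concatenated with the raw acoustic features. Layer sizes are illustrative, and the discriminative keyword spotter that consumes the enhanced features is not shown.

```python
import torch
import torch.nn as nn

class BLSTMFeatureEnhancer(nn.Module):
    """Appends framewise phoneme posteriors from a bidirectional LSTM to the
    raw acoustic features.  A sketch only; layer sizes and the 39-phoneme
    set are assumptions, not the paper's exact configuration."""
    def __init__(self, n_feats=39, n_hidden=128, n_phones=39):
        super().__init__()
        self.blstm = nn.LSTM(n_feats, n_hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_phones)

    def forward(self, feats):                  # feats: (B, T, n_feats)
        h, _ = self.blstm(feats)               # (B, T, 2*n_hidden); uses
                                               # past *and* future context
        phone_post = torch.softmax(self.out(h), dim=-1)
        return torch.cat([feats, phone_post], dim=-1)   # enhanced features

enhancer = BLSTMFeatureEnhancer()
enhanced = enhancer(torch.randn(1, 300, 39))   # -> shape (1, 300, 39 + 39)
```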
Perceptual audio features for unsupervised key-phrase detection
In IEEE ICASSP, 2010
"... We propose a new type of audio feature (HFCC-ENS) as well as an unsupervised method for detecting short sequences of spoken words (key-phrases) within long speech recordings. Our technical contributions are threefold: Firstly, we propose to use bandwidth-adapted filterbanks instead of classical MFCC ..."
Abstract
-
Cited by 6 (3 self)
- Add to MetaCart
(Show Context)
We propose a new type of audio feature (HFCC-ENS) as well as an unsupervised method for detecting short sequences of spoken words (key-phrases) within long speech recordings. Our technical contributions are threefold: Firstly, we propose to use bandwidth-adapted filterbanks instead of classical MFCC-style filters in the feature extraction step. Secondly, the time resolution of the resulting features is adapted to account for the temporal characteristics of the spoken phrases. Thirdly, the key-phrase detection step is performed by matching sequences of the resulting HFCC-ENS features with features extracted from a target speech recording. We evaluate the proposed method using the German Kiel Corpus and furthermore investigate speech-related properties of the proposed feature.
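The matching step (the third contribution) can be pictured with the following sketch, which slides a query feature sequence over a long recording and reports the mean frame distance at every offset. The HFCC-ENS feature extraction itself is not reproduced; generic fixed-dimensional features and a rigid, non-warping match are assumed.

```python
import numpy as np

def match_key_phrase(recording_feats, query_feats):
    """Slide the query feature sequence over the long recording and return,
    for every start frame, the mean Euclidean frame distance.  Low values
    mark candidate key-phrase occurrences.  Illustrative only: the paper's
    matching procedure may differ (e.g. allow temporal warping)."""
    T, _ = recording_feats.shape
    L = len(query_feats)
    costs = np.full(T - L + 1, np.inf)
    for start in range(T - L + 1):
        window = recording_feats[start:start + L]
        costs[start] = np.linalg.norm(window - query_feats, axis=1).mean()
    return costs

rng = np.random.default_rng(1)
recording = rng.standard_normal((1000, 20))   # assumed 20-dimensional features
query = recording[400:450]                    # plant the phrase for the demo
costs = match_key_phrase(recording, query)
best = int(np.argmin(costs))                  # ~400 for this toy example
```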
Robust vocabulary independent keyword spotting with graphical models
In Proc. IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2009
"... Abstract—This paper introduces a novel graphical model architecture for robust and vocabulary independent keyword spotting which does not require the training of an explicit garbage model. We show how a graphical model structure for phoneme recognition can be extended to a keyword spotter that is ro ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
This paper introduces a novel graphical model architecture for robust and vocabulary-independent keyword spotting which does not require the training of an explicit garbage model. We show how a graphical model structure for phoneme recognition can be extended to a keyword spotter that is robust with respect to phoneme recognition errors. We use a hidden garbage variable together with the concept of switching parents to model keywords as well as arbitrary speech. This implies that keywords can be added to the vocabulary without having to re-train the model. The design of our model architecture is thereby optimised to reliably detect keywords rather than to decode keyword phoneme sequences as arbitrary speech, while offering a parameter to adjust the operating point on the receiver operating characteristic curve. Experiments on the TIMIT corpus reveal that our graphical model outperforms a ...
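For orientation, the sketch below shows the conventional keyword-filler decoding that this line of work builds on: a left-to-right keyword model competes with a single garbage state over framewise phoneme posteriors, and a fixed per-frame garbage score plays the role of an operating-point parameter. The paper's switching-parent graphical model, which avoids training an explicit garbage model, is not reproduced here.

```python
import numpy as np

def keyword_filler_score(log_post, kw_phones, garbage_logp=-3.0):
    """Score a keyword against framewise phoneme log-posteriors (T, P) using
    a left-to-right keyword model embedded in a garbage/filler loop.
    `garbage_logp` is a fixed per-frame filler score acting as the
    operating-point knob.  Returns the log-score gap between the best path
    containing the keyword and the all-garbage path; > 0 suggests a hit."""
    T, _ = log_post.shape
    K = len(kw_phones)
    NEG = -1e30

    garbage_before = 0.0               # score of an all-garbage prefix
    kw = np.full(K, NEG)               # best score ending in keyword state k
    after = NEG                        # keyword finished, back in garbage
    for t in range(T):
        emit = log_post[t, kw_phones]                 # keyword emissions (K,)
        new_kw = np.empty(K)
        new_kw[0] = max(garbage_before, kw[0]) + emit[0]   # enter or stay
        for k in range(1, K):
            new_kw[k] = max(kw[k], kw[k - 1]) + emit[k]    # stay or advance
        after = max(after, kw[K - 1]) + garbage_logp       # leave the keyword
        garbage_before += garbage_logp
        kw = new_kw
    return max(after, kw[K - 1]) - garbage_before

# toy usage with fake posteriors over an assumed 40-phoneme inventory
rng = np.random.default_rng(2)
post = rng.dirichlet(np.ones(40), size=200)
score = keyword_filler_score(np.log(post), kw_phones=[5, 12, 7, 30])
detected = score > 0.0
```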
Learning the inter-frame distance for discriminative template-based keyword detection
In INTERSPEECH, 2007
"... This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective of producing a detector that maximizes the area under the receiver operating characteristic curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. Experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system: the proposed approach reaches an averaged AUC of 93.8%, compared to 87.9% for the HMM.
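The role of a learned frame distance can be illustrated as follows: a diagonally weighted squared Euclidean distance is plugged into an otherwise standard DTW template match. The AUC-driven learning of the weights is not shown; the weights below are hand-set placeholders.

```python
import numpy as np

def weighted_frame_distance(x, y, w):
    """Parameterised frame distance d_w(x, y) = sum_i w_i * (x_i - y_i)^2.
    In the paper the parameters are learned to maximise AUC; here w is
    simply given."""
    return np.sum(w * (x - y) ** 2)

def dtw_cost(template, utterance, w):
    """Standard DTW alignment cost between a keyword template and an
    utterance, using the parameterised frame distance above."""
    n, m = len(template), len(utterance)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = weighted_frame_distance(template[i - 1], utterance[j - 1], w)
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)          # length-normalised alignment cost

rng = np.random.default_rng(3)
w = np.ones(13)                       # placeholder: uniform, unlearned weights
template = rng.standard_normal((40, 13))
utterance = rng.standard_normal((120, 13))
score = -dtw_cost(template, utterance, w)   # higher score = better match
```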
Discriminative spoken term detection with limited data
"... We study spoken term detection—the task of determining whether and where a given word or phrase appears in a given segment of speech—in the setting of limited training data. This setting is becoming increasingly important as interest grows in porting spoken term detection to multiple lowresource lan ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
We study spoken term detection, the task of determining whether and where a given word or phrase appears in a given segment of speech, in the setting of limited training data. This setting is becoming increasingly important as interest grows in porting spoken term detection to multiple low-resource languages and acoustic environments. We propose a discriminative algorithm that aims at maximizing the area under the receiver operating characteristic curve, often used to evaluate the performance of spoken term detection systems. We implement the approach using a set of feature functions based on multilayer perceptron classifiers of phones and articulatory features, and experiment on data drawn from the Switchboard database of conversational telephone speech. Our approach outperforms a baseline HMM-based system by a large margin across a number of training set sizes. Index Terms: spoken term detection, discriminative training, AUC, structural SVM
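A common surrogate for maximizing the area under the ROC curve is a pairwise loss over the scores of positive and negative segments; the sketch below shows that generic surrogate in numpy. It is not the paper's structural-SVM formulation, and the scores are made-up numbers.

```python
import numpy as np

def auc_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Pairwise surrogate for 1 - AUC: every (positive, negative) pair whose
    positive score does not beat the negative score by `margin` is
    penalised.  A generic relaxation, not the paper's exact objective."""
    diff = pos_scores[:, None] - neg_scores[None, :]     # (P, N) score gaps
    return np.maximum(0.0, margin - diff).mean()

def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs that are ranked correctly."""
    return (pos_scores[:, None] > neg_scores[None, :]).mean()

pos = np.array([2.1, 1.4, 0.9])        # scores on segments containing the term
neg = np.array([0.2, -0.5, 1.0, 0.1])  # scores on segments without it
print(auc_hinge_loss(pos, neg), empirical_auc(pos, neg))
```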
Music Information Retrieval: An Inspirational Guide to Transfer from Related Disciplines
"... The emerging field of Music Information Retrieval (MIR) has been influenced by neighboring domains in signal processing and machine learning, including automatic speech recognition, image processing and text information retrieval. In this contribution, we start with concrete examples for methodology ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
The emerging field of Music Information Retrieval (MIR) has been influenced by neighboring domains in signal processing and machine learning, including automatic speech recognition, image processing and text information retrieval. In this contribution, we start with concrete examples of methodology transfer between speech and music processing, oriented on the building blocks of pattern recognition: preprocessing, feature extraction, and classification/decoding. We then take a higher-level viewpoint when describing sources of mutual inspiration derived from text and image information retrieval. We conclude that dealing with the peculiarities of music in MIR research has contributed to advancing the state of the art in other fields, and that many future challenges in MIR are strikingly similar to those that other research areas have been facing.
Strategies for High Accuracy Keyword Detection in Noisy Channels
"... We present design strategies for a keyword spotting (KWS) sys-tem that operates in highly degraded channel conditions with very low signal-to-noise ratio levels. We employ a system combination approach by combining the outputs of multiple large vocabulary automatic speech recognition (LVCSR) sys-tem ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
We present design strategies for a keyword spotting (KWS) system that operates in highly degraded channel conditions with very low signal-to-noise ratio levels. We employ a system combination approach by combining the outputs of multiple large vocabulary automatic speech recognition (LVCSR) systems, each of which employs a different system design approach targeting three different levels of information: front-end signal processing features (standard cepstra-based, noise-robust modulation and multi-layer perceptron features), statistical acoustic models (Gaussian mixture models (GMM) and subspace GMMs) and keyword search strategies (word-based and phone-based). We also use keyword-aware capabilities in the system at two levels: in the LVCSR language models, by assigning higher weights to n-grams with keywords in them, and in LVCSR search, by using a relaxed pruning threshold for keywords. The LVCSR system outputs are represented as lattice-based unigram indices whose scores are fused by a logistic-regression based classifier to produce the final system combination output. We present the performance of our system in the phase II evaluations of DARPA's Robust Automatic Transcription of Speech (RATS) program for both Levantine Arabic and Farsi conversational speech corpora.
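The final fusion stage can be pictured with the following sketch, which trains a logistic-regression classifier on per-system detection scores using scikit-learn. The three-system score layout and the synthetic data are assumptions for illustration; the actual lattice-based unigram index representation is more involved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds the detection scores that the individual LVCSR systems
# assign to one candidate keyword occurrence; the label says whether that
# occurrence is a true hit.  Synthetic data, assumed three-system layout.
rng = np.random.default_rng(4)
system_scores = rng.random((500, 3))          # [cepstral, noise-robust, MLP]
labels = (system_scores.mean(axis=1) + 0.1 * rng.standard_normal(500)) > 0.5

fuser = LogisticRegression()
fuser.fit(system_scores, labels)

new_candidate = np.array([[0.9, 0.2, 0.7]])   # one candidate, three systems
fused_posterior = fuser.predict_proba(new_candidate)[0, 1]
```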
Discriminative articulatory models for spoken term detection in low-resource conversational settings
"... We study spoken term detection (STD) – the task of determining whether and where a given word or phrase appears in a given segment of speech – using articulatory feature-based pronunciation models. The models are motivated by the requirements of STD in low-resource settings, in which it may not be ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
We study spoken term detection (STD), the task of determining whether and where a given word or phrase appears in a given segment of speech, using articulatory feature-based pronunciation models. The models are motivated by the requirements of STD in low-resource settings, in which it may not be feasible to train a large-vocabulary continuous speech recognition system, as well as by the need to address pronunciation variation in conversational speech. Our STD system is trained to maximize the expected area under the receiver operating characteristic curve, often used to evaluate STD performance. In experimental evaluations on the Switchboard corpus, we find that our approach outperforms a baseline HMM-based system across a number of training set sizes, as well as a discriminative phone-based model in some settings. Index Terms: spoken term detection, articulatory features, AUC, structural SVM, discriminative training
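The "expected AUC" objective can be sketched as a sigmoid-smoothed version of the pairwise ranking indicator, as below. This is a generic illustration only; the paper's exact objective and its articulatory feature functions are not reproduced.

```python
import numpy as np

def expected_auc(pos_scores, neg_scores, temperature=1.0):
    """Smooth surrogate for the AUC: the pairwise indicator
    1[score_pos > score_neg] is replaced by a sigmoid so the quantity is
    differentiable and can be maximised with gradient methods."""
    diff = (pos_scores[:, None] - neg_scores[None, :]) / temperature
    sig = 1.0 / (1.0 + np.exp(-diff))     # (P, N) soft pairwise comparisons
    return sig.mean()

pos = np.array([1.8, 0.6, 1.1])           # scores on true occurrences (toy)
neg = np.array([0.3, -0.2, 0.9])          # scores on non-occurrences (toy)
print(expected_auc(pos, neg))             # value in (0, 1), higher is better
```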
Beyond Deep Learning: Scalable Methods and Models for Learning
2013
"... Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, ..."
Abstract
- Add to MetaCart
(Show Context)
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.