Results 1 - 10 of 59
Speech recognition with deep recurrent neural networks
2013
Cited by 104 (8 self)
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
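As an aside on the stacking idea this abstract describes, here is a minimal sketch of a deep RNN forward pass: each recurrent layer re-processes the full hidden sequence of the layer below. Plain tanh units stand in for the paper's LSTM cells, and all names and sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def deep_rnn_forward(x_seq, weights):
    """Forward pass through a stack of simple tanh RNN layers.

    x_seq: (T, d_in) input sequence; weights: list of (W_in, W_rec) per layer.
    Each layer consumes the hidden sequence produced by the layer below,
    giving the 'multiple levels of representation' the abstract refers to.
    """
    seq = x_seq
    for W_in, W_rec in weights:
        h = np.zeros(W_rec.shape[0])
        outputs = []
        for x_t in seq:
            h = np.tanh(W_in @ x_t + W_rec @ h)  # one recurrent step
            outputs.append(h)
        seq = np.array(outputs)  # becomes the input to the next layer
    return seq

rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
weights = [
    (rng.normal(scale=0.1, size=(d_h, d_in)), rng.normal(scale=0.1, size=(d_h, d_h))),
    (rng.normal(scale=0.1, size=(d_h, d_h)), rng.normal(scale=0.1, size=(d_h, d_h))),
]
hidden = deep_rnn_forward(rng.normal(size=(T, d_in)), weights)
print(hidden.shape)  # (5, 4): one top-layer hidden vector per time step
```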
Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
Cited by 50 (9 self)
Offline handwriting recognition—the transcription of images of handwritten text—is an interesting task, in that it combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. By combining two recent innovations in neural networks—multidimensional recurrent neural networks and connectionist temporal classification—this paper introduces a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet-specific preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic.
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Cited by 38 (4 self)
In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
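The encode-to-a-fixed-vector, then decode-to-another-sequence structure can be sketched in a few lines. This is a toy tanh-RNN version with random weights and hypothetical names, not the paper's trained model; it only shows how a variable-length source is compressed into one vector and unrolled into a target of a different length.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # hidden size shared by encoder and decoder (illustrative)

W_ex, W_eh = rng.normal(scale=0.3, size=(d, d)), rng.normal(scale=0.3, size=(d, d))
W_dh, W_dy = rng.normal(scale=0.3, size=(d, d)), rng.normal(scale=0.3, size=(d, d))

def encode(src):
    """Compress a source sequence into one fixed-length vector (final state)."""
    h = np.zeros(d)
    for x in src:
        h = np.tanh(W_ex @ x + W_eh @ h)
    return h

def decode(c, steps):
    """Unroll a decoder RNN from the context vector c, one output per step."""
    h, outs = c, []
    for _ in range(steps):
        h = np.tanh(W_dh @ h)
        outs.append(W_dy @ h)
    return np.array(outs)

src = rng.normal(size=(6, d))
context = encode(src)           # fixed-length summary, regardless of source length
out = decode(context, steps=3)  # target sequence may have a different length
print(context.shape, out.shape)
```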
Universal Onset Detection with Bidirectional Long Short-Term Memory Neural Networks
11th International Society for Music Information Retrieval Conference (ISMIR 2010), 2010
Cited by 33 (17 self)
Many different onset detection methods have been proposed in recent years. However, those that perform well tend to be highly specialised for certain types of music, while those that are more widely applicable give only moderate performance. In this paper we present a new onset detector with superior performance and temporal precision for all kinds of music, including complex music mixes. It is based on auditory spectral features and relative spectral differences processed by a bidirectional Long Short-Term Memory recurrent neural network, which acts as a reduction function. The network is trained with a large database of onset data covering various genres and onset types. Due to its data-driven nature, our approach does not require the onset detection method and its parameters to be tuned to a particular type of music. We compare results on the Bello onset data set and conclude that our approach is on par with related results on the same set and outperforms them in most cases in terms of F1-measure. For complex music with mixed onset types, an absolute improvement of 3.6% is reported.
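The "relative spectral differences" the abstract mentions are closely related to classic spectral flux. A minimal, purely illustrative reduction (not the paper's BLSTM stage) sums the positive frame-to-frame spectral change into a 1-D onset strength curve:

```python
import numpy as np

def spectral_flux(spec):
    """Positive spectral difference per frame, a classic onset feature.

    spec: (frames, bins) magnitude spectrogram. Summing only the positive
    changes highlights energy onsets while ignoring decays.
    """
    diff = np.diff(spec, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)

# Toy spectrogram: energy jumps at frame 3, so the flux peaks there.
spec = np.zeros((6, 4))
spec[3:] = 1.0
flux = spectral_flux(spec)
print(flux)  # [0. 0. 4. 0. 0.]
```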
Towards End-to-End Speech Recognition with Recurrent Neural Networks
Cited by 21 (3 self)
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
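Central to CTC, which this system builds on, is the many-to-one collapse that maps a frame-level label path to a transcription: merge consecutive repeats, then delete blanks. A short sketch of that mapping (the symbol choices are ours):

```python
BLANK = "-"  # the CTC blank symbol (our notation)

def ctc_collapse(path):
    """Map a frame-level CTC path to its transcription:
    first merge consecutive repeated symbols, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse("--hh-e-ll-ll--oo-"))  # hello
print(ctc_collapse("aa-a"))               # aa  (blank separates a true repeat)
```

The second example shows why the blank exists: without it, a genuinely doubled letter could not be distinguished from one letter held across several frames.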
Enhanced Beat Tracking with Context-Aware Neural Networks
Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), 2011
Cited by 18 (3 self)
We present two new beat tracking algorithms based on autocorrelation analysis, which showed state-of-the-art performance in the MIREX 2010 beat tracking contest. Unlike the traditional approach of processing a list of onsets, we propose to use a bidirectional Long Short-Term Memory recurrent neural network to perform a frame-by-frame beat classification of the signal. As inputs to the network, the spectral features of the audio signal and their relative differences are used. The network transforms the signal directly into a beat activation function. An autocorrelation function is then used to determine the predominant tempo, to eliminate erroneously detected beats, and to complement missing ones. The first algorithm is tuned for music with constant tempo, whereas the second is also capable of following changes in tempo and time signature.
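The autocorrelation tempo stage can be illustrated with a toy stand-in: autocorrelate a per-frame beat activation curve and pick the strongest lag inside a plausible BPM range. This is a simplified sketch under our own assumptions, not the paper's algorithm.

```python
import numpy as np

def dominant_tempo(activation, fps, bpm_range=(40, 240)):
    """Estimate the predominant tempo of a beat activation curve.

    activation: per-frame beat probability; fps: frames per second.
    The lag with maximum autocorrelation inside the BPM range wins.
    """
    a = activation - activation.mean()
    ac = np.correlate(a, a, mode="full")[len(a) - 1:]  # lags 0..N-1
    lo = int(round(60.0 * fps / bpm_range[1]))  # shortest beat period, in frames
    hi = int(round(60.0 * fps / bpm_range[0]))  # longest beat period, in frames
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return 60.0 * fps / lag

fps = 100
t = np.arange(10 * fps)
activation = (t % 50 == 0).astype(float)  # an impulse every 0.5 s, i.e. 120 BPM
print(round(dominant_tempo(activation, fps)))  # 120
```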
Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling
Proc. of Interspeech, Makuhari, 2010
Cited by 16 (8 self)
In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual information for modeling the evolution of emotion within a conversation. We focus on recognizing dimensional emotional labels, which enables us to classify both prototypical and nonprototypical emotional expressions contained in a large audiovisual database. Subject-independent experiments on various classification tasks reveal that the BLSTM network approach generally prevails over standard classification techniques such as Hidden Markov Models or Support Vector Machines, and achieves F1-measures of the order of 72%, 65%, and 55% for the discrimination of three clusters in emotional space and the distinction between three levels of valence and activation, respectively. Index Terms: emotion recognition, multimodality, long short-term memory, hidden Markov models, context modeling
ICDAR 2009 Arabic Handwriting Recognition Competition
10th International Conference on Document Analysis and Recognition, 2009
Cited by 13 (0 self)
This paper describes the Arabic handwriting recognition competition held at ICDAR 2009. This third competition (the first was at ICDAR 2005 and the second at ICDAR 2007) again used the IfN/ENIT database of handwritten Arabic Tunisian town names. Today, more than 82 research groups from universities, research centers, and industry are working with this database worldwide. This year, 7 groups participated in the competition with 17 systems. The systems were tested on known data and on two data sets unknown to the participants. The systems were compared on the most important characteristic, the recognition rate; additionally, the relative speed of the different systems was compared. A short description of the participating groups, their systems, and the results achieved is presented.
Localization of non-linguistic events in spontaneous speech by non-negative matrix factorization and long short-term memory
Proc. of ICASSP, 2011
Cited by 10 (8 self)
Features generated by Non-Negative Matrix Factorization (NMF) have successfully been introduced into robust speech processing, including noise-robust speech recognition and detection of nonlinguistic vocalizations. In this study, we introduce a novel tandem approach by integrating likelihood features derived from NMF into Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) in order to dynamically localize non-linguistic events, i.e., laughter, vocal, and non-vocal noise, in highly spontaneous speech. We compare our tandem architecture to a baseline conventional phoneme-HMM-based speech recognizer, and achieve a relative reduction of the frame error rate of 37.5% in the discrimination of speech and different non-speech segments.
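For readers unfamiliar with the NMF building block, here is the generic Lee–Seung multiplicative update for V ≈ W·H under a Euclidean cost. It is only the basic factorization, not the paper's likelihood-feature derivation, and the setup is a made-up toy problem.

```python
import numpy as np

def nmf(V, k, iters=50, seed=0):
    """Factor a nonnegative matrix V into W @ H (both nonnegative)
    using Lee-Seung multiplicative updates for the Euclidean cost."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + 0.1
    H = rng.random((k, V.shape[1])) + 0.1
    eps = 1e-9  # avoids division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update basis vectors
    return W, H

V = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # rank-1 by construction
W, H = nmf(V, k=1)
err = np.linalg.norm(V - W @ H)
print(err < 1e-6)  # True: a rank-1 nonnegative matrix is recovered exactly
```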
ICDAR 2009 Handwriting Recognition Competition
2009
Cited by 9 (0 self)
This paper describes the handwriting recognition competition held at ICDAR 2009. This competition is based on the RIMES database of French written text documents. These documents are classified in three categories: complete text pages, words, and isolated characters. This year, 10 systems were submitted for the handwriting recognition competition on snippets of French words. The systems were evaluated in three subtasks depending on the size of the dictionary used. A comparison between the different classification and recognition systems shows interesting results. A short description of the participating groups, their systems, and the results achieved is presented.