


Supervised sequence labelling with recurrent neural networks (2012)

by A. Graves

Results 1-10 of 59

Speech recognition with deep recurrent neural networks

by Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton, 2013
"... Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the L ..."
Abstract - Cited by 104 (8 self)
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

Citation Context

...5] have also seen a recent revival [6, 7], but do not currently perform as well as deep networks. Instead of combining RNNs with HMMs, it is possible to train RNNs ‘end-to-end’ for speech recognition [8, 9, 10]. This approach exploits the larger state-space and richer dynamics of RNNs compared to HMMs, and avoids the problem of using potentially incorrect alignments as training targets. The combination of L...
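
A minimal sketch of the deep bidirectional LSTM architecture this abstract describes, using PyTorch; the layer count, hidden size, feature dimensionality, and class count here are illustrative assumptions, not the paper's exact configuration:

import torch
import torch.nn as nn

class DeepBiLSTM(nn.Module):
    """Sketch of a deep bidirectional LSTM sequence labeller in the
    spirit of Graves et al. (2013). All sizes are illustrative."""
    def __init__(self, n_features=123, n_hidden=250, n_layers=3, n_classes=62):
        super().__init__()
        # Stacked bidirectional LSTM: each layer sees the full forward
        # and backward context computed by the layer below it.
        self.lstm = nn.LSTM(n_features, n_hidden, num_layers=n_layers,
                            bidirectional=True, batch_first=True)
        # Per-frame output layer (e.g. phoneme posteriors for CTC).
        self.out = nn.Linear(2 * n_hidden, n_classes)

    def forward(self, x):          # x: (batch, time, n_features)
        h, _ = self.lstm(x)        # h: (batch, time, 2 * n_hidden)
        return self.out(h)         # per-frame class scores

model = DeepBiLSTM()
scores = model(torch.randn(4, 100, 123))   # -> (4, 100, 62)

Stacking bidirectional layers in this way gives each level access to the full past and future context of the level below, which is the combination of depth and long-range context the paper investigates.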

Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks

by Alex Graves, Jürgen Schmidhuber
"... Offline handwriting recognition—the transcription of images of handwritten text—is an interesting task, in that it combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features ..."
Abstract - Cited by 50 (9 self)
Offline handwriting recognition—the transcription of images of handwritten text—is an interesting task, in that it combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. By combining two recent innovations in neural networks—multidimensional recurrent neural networks and connectionist temporal classification—this paper introduces a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet-specific preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic.

Citation Context

...rchical structure. In what follows we describe each component in turn, then show how they fit together to form a complete system. For a more detailed description of (1) and (2) we refer the reader to [4]. 2.1 Multidimensional Recurrent Neural Networks The basic idea of multidimensional recurrent neural networks (MDRNNs) [7] is to replace the single recurrent connection found in standard recurrent netw...
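
The idea named in this snippet, replacing the single (temporal) recurrent connection with one recurrent connection per data dimension, can be sketched in plain NumPy for the two-dimensional case; the plain tanh cell and all shapes are simplifying assumptions (the cited MDRNN work uses LSTM cells and scans the image from all four corners):

import numpy as np

def mdrnn_2d(x, W, U_h, U_v, b):
    """Minimal 2D RNN scan: each hidden state depends on the hidden
    states of both spatial predecessors (left and above), instead of
    a single temporal predecessor.
    x: (H, W_img, n_in) image -> returns (H, W_img, n_hid) states."""
    H, W_img, _ = x.shape
    n_hid = b.shape[0]
    h = np.zeros((H, W_img, n_hid))
    for i in range(H):
        for j in range(W_img):
            left = h[i, j - 1] if j > 0 else np.zeros(n_hid)
            up = h[i - 1, j] if i > 0 else np.zeros(n_hid)
            # Two recurrent connections, one per dimension.
            h[i, j] = np.tanh(x[i, j] @ W + left @ U_h + up @ U_v + b)
    return h

rng = np.random.default_rng(0)
n_in, n_hid = 1, 8
h = mdrnn_2d(rng.standard_normal((16, 16, n_in)),
             rng.standard_normal((n_in, n_hid)) * 0.1,
             rng.standard_normal((n_hid, n_hid)) * 0.1,
             rng.standard_normal((n_hid, n_hid)) * 0.1,
             np.zeros(n_hid))
print(h.shape)   # (16, 16, 8)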

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

by Kyunghyun Cho, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
"... In this paper, we propose a novel neu-ral network model called RNN Encoder– Decoder that consists of two recurrent neural networks (RNN). One RNN en-codes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another se-quence of symbols. The ..."
Abstract - Cited by 38 (4 self)
In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder–Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
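
The joint training criterion described here can be written as maximizing the conditional log-likelihood over the N training pairs, with θ denoting the parameters of both networks (a reconstruction in standard notation):

\max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n)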

Citation Context

...emory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit. For details on LSTM networks, see, e.g., (Graves, 2012). [Figure 2: An illustration of the proposed hidden activation function.] The update gate z selects whether the hidden state is to be updated with a new hidden state h̃. The reset gate r deci...
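
The two-gate hidden unit this snippet contrasts with the LSTM is what is now known as the gated recurrent unit; its update equations can be reconstructed as follows, with σ the logistic sigmoid, ⊙ the elementwise product, and conventional weight names (not taken from this page):

\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)} \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)} \\
\tilde{h}_t &= \tanh\bigl(W x_t + U\,(r_t \odot h_{t-1})\bigr) && \text{(candidate state)} \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}

The final line interpolates between the previous state and the candidate, which is how the unit adaptively keeps or overwrites its memory.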

Universal Onset Detection with Bidirectional Long Short-Term Memory

by Florian Eyben, Sebastian Böck, Björn Schuller, Alex Graves - 11th International Society for Music Information Retrieval Conference (ISMIR 2010), 2010
"... Many different onset detection methods have been proposed in recent years. However those that perform well tend to be highly specialised for certain types of music, while those that are more widely applicable give only moderate performance. In this paper we present a new onset detector with superior ..."
Abstract - Cited by 33 (17 self)
Many different onset detection methods have been proposed in recent years. However, those that perform well tend to be highly specialised for certain types of music, while those that are more widely applicable give only moderate performance. In this paper we present a new onset detector with superior performance and temporal precision for all kinds of music, including complex music mixes. It is based on auditory spectral features and relative spectral differences processed by a bidirectional Long Short-Term Memory recurrent neural network, which acts as a reduction function. The network is trained with a large database of onset data covering various genres and onset types. Due to its data-driven nature, our approach does not require the onset detection method and its parameters to be tuned to a particular type of music. We compare results on the Bello onset data set and can conclude that our approach is on par with related results on the same set and outperforms them in most cases in terms of F1-measure. For complex music with mixed onset types, an absolute improvement of 3.6% is reported.

Citation Context

...etwork we use a bidirectional recurrent neural network with Long Short-Term Memory [13] hidden units. Such networks were proven to work well on other audio detection tasks, such as speech recognition [10]. This section gives a short introduction to ANN with a focus on bidirectional Long Short-Term Memory (BLSTM) networks, which are used for the proposed onset detector. 3.1 Feed forward neural networks...
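
A rough sketch of the input features the abstract names, auditory (mel) spectral features plus their relative differences, in Python; the use of librosa and every parameter value here are illustrative assumptions rather than the paper's exact front end:

import numpy as np
import librosa

def onset_features(path, n_mels=80):
    """Sketch: log mel spectrogram plus its positive frame-to-frame
    difference, stacked into one feature vector per frame."""
    y, sr = librosa.load(path, sr=44100)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=441, n_mels=n_mels)
    logmel = np.log1p(mel)
    # Relative spectral difference: keep only energy rises, since
    # rising energy is what signals note onsets.
    diff = np.maximum(0.0, np.diff(logmel, axis=1, prepend=logmel[:, :1]))
    return np.vstack([logmel, diff]).T   # (frames, 2 * n_mels)

In the system described above, a frame-wise feature matrix like this would be fed to the BLSTM, whose output serves as the onset detection function.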

Towards End-to-End Speech Recognition with Recurrent Neural Networks

by Alex Graves, Navdeep Jaitly - Google DeepMind
"... This paper presents a speech recognition sys-tem that directly transcribes audio data with text, without requiring an intermediate phonetic repre-sentation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Tem-poral Class ..."
Abstract - Cited by 21 (3 self)
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.

Citation Context

...em where as much of the speech pipeline as possible is replaced by a single recurrent neural network (RNN) architecture. Although it is possible to directly transcribe raw speech waveforms with RNNs (Graves, 2012, Chapter 9) or features learned with restricted Boltzmann machines (Jaitly & Hinton, 2011), the computational cost is high and performance tends to be worse than conventional preprocessing. We have t...
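
The Connectionist Temporal Classification objective this system builds on is available as a standard loss; a minimal usage sketch with PyTorch's nn.CTCLoss, where all shapes and label conventions are illustrative:

import torch
import torch.nn as nn

# Minimal CTC training step sketch. The network emits per-frame
# log-probabilities over characters plus a blank symbol; CTC sums over
# all alignments of the target text, so no frame-level alignment is
# needed (which is the point made in the citation context above).
T, B, C = 120, 4, 30          # frames, batch size, characters + blank
log_probs = torch.randn(T, B, C).log_softmax(dim=-1).requires_grad_()
targets = torch.randint(1, C, (B, 25), dtype=torch.long)    # 0 = blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 25, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()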

Enhanced Beat Tracking with Context-Aware Neural Networks

by Sebastian Böck, Markus Schedl - in Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), 2011
"... We present two new beat tracking algorithms based on the autocorrelation analysis, which showed state-of-the-art performance in the MIREX 2010 beat tracking contest. Unlike the traditional approach of processing a list of onsets, we propose to use a bidirectional Long Short-Term Memory recurrent neu ..."
Abstract - Cited by 18 (3 self)
We present two new beat tracking algorithms based on autocorrelation analysis, which showed state-of-the-art performance in the MIREX 2010 beat tracking contest. Unlike the traditional approach of processing a list of onsets, we propose to use a bidirectional Long Short-Term Memory recurrent neural network to perform a frame-by-frame beat classification of the signal. As inputs to the network, the spectral features of the audio signal and their relative differences are used. The network transforms the signal directly into a beat activation function. An autocorrelation function is then used to determine the predominant tempo, which is used to eliminate erroneously detected beats and to complement missing ones. The first algorithm is tuned for music with constant tempo, whereas the second algorithm is additionally capable of following changes in tempo and time signature.

Citation Context

...d. It has the ability to model any temporal context around a given input value. BLSTM networks performed very well in areas like phoneme and handwriting recognition and are described in more detail in [6]. 4. ALGORITHM DESCRIPTION This section describes our algorithm for beat detection in audio signals. It is based on bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Due to their...
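
The tempo step described in the abstract, autocorrelating the network's beat activation function and picking the predominant lag, might look roughly like this in NumPy; the frame rate and tempo bounds are assumptions:

import numpy as np

def predominant_tempo(activation, fps=100, bpm_min=40, bpm_max=240):
    """Sketch: autocorrelate the beat activation function and return
    the strongest lag inside a plausible tempo range, as BPM."""
    a = activation - activation.mean()
    acf = np.correlate(a, a, mode='full')[len(a) - 1:]   # lags >= 0
    lag_min = int(round(fps * 60.0 / bpm_max))           # fast tempo -> short lag
    lag_max = int(round(fps * 60.0 / bpm_min))
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return 60.0 * fps / lag

# e.g. a noisy activation with a beat every 50 frames (120 BPM at 100 fps)
act = np.zeros(3000)
act[::50] = 1.0
act += 0.05 * np.random.default_rng(1).standard_normal(3000)
print(predominant_tempo(act))   # ~120.0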

Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling

by Martin Wöllmer, Angeliki Metallinou, Florian Eyben, Björn Schuller, Shrikanth Narayanan - in Proc. of Interspeech, Makuhari, 2010
"... In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual info ..."
Abstract - Cited by 16 (8 self)
In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual information for modeling the evolution of emotion within a conversation. We focus on recognizing dimensional emotional labels, which enables us to classify both prototypical and nonprototypical emotional expressions contained in a large audiovisual database. Subject-independent experiments on various classification tasks reveal that the BLSTM network approach generally prevails over standard classification techniques such as Hidden Markov Models or Support Vector Machines, and achieves F1-measures of the order of 72%, 65%, and 55% for the discrimination of three clusters in emotional space and the distinction between three levels of valence and activation, respectively. Index Terms: emotion recognition, multimodality, long short-term memory, hidden Markov models, context modeling

Citation Context

...lem and gives access to long range context information. The combination of bidirectional networks and LSTM is called bidirectional LSTM. A detailed explanation of BLSTM networks can be found, e.g., in [13]. The LSTM networks applied for our experiments consist of 128 memory blocks with one memory cell per block. The number of input nodes corresponds to the number of different features per utterance whe...
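
Feature-level fusion as used here simply means concatenating the synchronised acoustic and visual feature vectors frame by frame before the BLSTM; a trivial sketch, with all dimensionalities assumed:

import numpy as np

def fuse_features(acoustic, visual):
    """Feature-level fusion: frame-aligned acoustic and visual feature
    streams are concatenated into one vector per frame, which is then
    fed to a single BLSTM. Shapes are illustrative."""
    assert acoustic.shape[0] == visual.shape[0], "streams must be frame-aligned"
    return np.concatenate([acoustic, visual], axis=1)

frames = 500
fused = fuse_features(np.random.randn(frames, 39),   # e.g. acoustic features
                      np.random.randn(frames, 20))   # e.g. facial features
print(fused.shape)   # (500, 59)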

ICDAR 2009 Arabic Handwriting Recognition Competition

by Volker Märgner, Haikal El Abed - 10th International Conference on Document Analysis and Recognition, 2009
"... This paper describes the Arabic handwriting recognition competition held at ICDAR 2009. This third competition (the first was at ICDAR 2005 and the second at ICDAR 2007) again used the IfN/ENIT-database with Arabic handwritten Tunisian town names. Today, more than 82 research groups from universitie ..."
Abstract - Cited by 13 (0 self)
This paper describes the Arabic handwriting recognition competition held at ICDAR 2009. This third competition (the first was at ICDAR 2005 and the second at ICDAR 2007) again used the IfN/ENIT database of Arabic handwritten Tunisian town names. Today, more than 82 research groups from universities, research centers, and industry are working with this database worldwide. This year, 7 groups participated in the competition with 17 systems. The systems were tested on known data and on two data sets unknown to the participants, and were compared on the most important characteristic: the recognition rate. Additionally, the relative speed of the different systems was compared. A short description of the participating groups, their systems, and the results achieved is presented.

Localization of non-linguistic events in spontaneous speech by non-negative matrix factorization and long short-term memory

by Felix Weninger, Björn Schuller, Martin Wöllmer, Gerhard Rigoll - In: Proc. of ICASSP , 2011
"... Features generated by Non-Negative Matrix Factorization (NMF) have successfully been introduced into robust speech processing, including noise-robust speech recognition and detection of nonlinguistic vocalizations. In this study, we introduce a novel tandem approach by integrating likelihood feature ..."
Abstract - Cited by 10 (8 self)
Features generated by Non-Negative Matrix Factorization (NMF) have successfully been introduced into robust speech processing, including noise-robust speech recognition and detection of non-linguistic vocalizations. In this study, we introduce a novel tandem approach by integrating likelihood features derived from NMF into Bidirectional Long Short-Term Memory Recurrent Neural Networks (BLSTM-RNNs) in order to dynamically localize non-linguistic events, i.e., laughter, vocal, and non-vocal noise, in highly spontaneous speech. We compare our tandem architecture to a baseline conventional phoneme-HMM-based speech recognizer, and achieve a relative reduction of the frame error rate by 37.5% in the discrimination of speech and different non-speech segments.

Citation Context

...egments in spontaneous speech, we briefly address theoretical differences between those types of networks, and especially motivate the use of LSTM for the task. For a detailed discussion, we refer to [8]. In contrast to basic feedforward neural networks, recurrent connections from the output to the input provide an RNN with a kind of memory, which may influence the network output in the future. Next,...
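
A sketch of deriving NMF-based features from a magnitude spectrogram with scikit-learn, in the spirit of the tandem approach above; note the paper uses likelihood features derived from NMF, whereas this simplified version returns the raw per-frame activations, and the component count is an assumption:

import numpy as np
from sklearn.decomposition import NMF

def nmf_activation_features(spectrogram, n_components=20):
    """Factorise a non-negative magnitude spectrogram V ~= W @ H and
    use the per-frame activations H as features for a downstream
    BLSTM. spectrogram: (n_bins, n_frames), non-negative."""
    model = NMF(n_components=n_components, init='nndsvd', max_iter=400)
    W = model.fit_transform(spectrogram)   # (n_bins, n_components) spectral bases
    H = model.components_                  # (n_components, n_frames) activations
    return H.T                             # one feature vector per frame

feats = nmf_activation_features(np.abs(np.random.randn(513, 200)))
print(feats.shape)   # (200, 20)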

ICDAR 2009 Handwriting Recognition Competition

by Emmanuèle Grosicki, Haikal El Abed, 2009
"... This paper describes the handwriting recognition competition held at ICDAR 2009. This competition is based on the RIMES-database, with French written text documents. These document are classified in three different categories, complete text pages, words, and isolated characters. This year 10 systems ..."
Abstract - Cited by 9 (0 self)
This paper describes the handwriting recognition competition held at ICDAR 2009. This competition is based on the RIMES database of French written text documents, which are classified into three categories: complete text pages, words, and isolated characters. This year, 10 systems were submitted for the handwriting recognition competition on snippets of French words. The systems were evaluated in three subtasks depending on the size of the dictionary used. A comparison between the different classification and recognition systems shows interesting results. A short description of the participating groups, their systems, and the results achieved is presented.