Results 1 - 9 of 9
Maxout networks
In ICML, 2013
"... We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to bot ..."
Cited by 68 (17 self)
Abstract:
We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout’s fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state-of-the-art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
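Since the abstract defines the unit precisely (each output is the max over a set of affine projections of the input), a minimal numpy sketch of one maxout layer may help; the shapes and variable names here are illustrative, not taken from the paper:

```python
import numpy as np

def maxout_layer(x, W, b):
    """Maxout layer: each of m output units takes the max over k
    affine projections (pieces) of the input.
    x: (d,) input vector
    W: (k, m, d) weights for k pieces of m units
    b: (k, m) biases
    returns: (m,) activations, h_i = max_j (W[j, i] @ x + b[j, i])
    """
    z = np.einsum('kmd,d->km', W, x) + b  # (k, m) pre-activations
    return z.max(axis=0)                  # elementwise max over the k pieces

rng = np.random.default_rng(0)
x = rng.normal(size=8)                    # toy 8-dimensional input
W = rng.normal(size=(3, 4, 8)) * 0.1      # k=3 pieces, m=4 maxout units
b = np.zeros((3, 4))
print(maxout_layer(x, W, b))              # 4 maxout activations
```

Because the max is taken over learned linear pieces, each unit behaves as a piecewise-linear convex approximator, which is why it pairs naturally with dropout's averaging.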
RECENT ADVANCES IN DEEP LEARNING FOR SPEECH RESEARCH AT MICROSOFT
"... Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light to the basic capabilities and limitations of the ..."
Cited by 23 (10 self)
Abstract:
Deep learning is becoming a mainstream technology for speech recognition at industrial scale. In this paper, we provide an overview of the work by Microsoft speech researchers since 2009 in this area, focusing on more recent advances which shed light on the basic capabilities and limitations of the current deep learning technology. We organize this overview along the feature-domain and model-domain dimensions according to the conventional approach to analyzing speech systems. Selected experimental results, including speech recognition and related applications such as spoken dialogue and language modeling, are presented to demonstrate and analyze the strengths and weaknesses of the techniques described in the paper. Potential improvement of these techniques and future research directions are discussed. Index Terms — deep learning, neural network, multilingual, speech recognition, spectral features, convolution, dialogue
Towards deeper understanding: Deep convex networks for semantic utterance classification
In ICASSP, 2012
"... Following the recent advances in deep learning techniques, in this paper, we present the application of special type of deep architecture — deep convex networks (DCNs) — for semantic utterance classification (SUC). DCNs are shown to have several advantages over deep belief networks (DBNs) including ..."
Cited by 18 (13 self)
Abstract:
Following the recent advances in deep learning techniques, in this paper we present the application of a special type of deep architecture — deep convex networks (DCNs) — for semantic utterance classification (SUC). DCNs are shown to have several advantages over deep belief networks (DBNs), including classification accuracy and training scalability. However, adoption of DCNs for SUC comes with non-trivial issues. Specifically, SUC has an extremely sparse input feature space encompassing a very large number of lexical and semantic features. This is a few thousand times larger than the feature space for acoustic modeling, yet with a much smaller number of training samples. Experimental results we obtained on a domain classification task for spoken language understanding demonstrate the effectiveness of DCNs. The DCN-based method produces higher SUC accuracy than the Boosting-based discriminative classifier with word trigrams. Index Terms — deep convex networks, spoken language understanding, domain detection, semantic utterance classification, deep learning
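The abstract cites training scalability as a DCN advantage; in the broader deep convex (also called deep stacking) network literature, each module is a single-hidden-layer network whose upper-layer weights are fit by a convex least-squares problem, and each new module receives the original input concatenated with earlier modules' predictions. A rough numpy sketch under those assumptions (all names, the ridge term, and the random lower-layer weights are illustrative, not this paper's exact procedure):

```python
import numpy as np

def fit_dcn_module(X, T, n_hidden, rng):
    """One DCN/DSN-style module: random lower weights, sigmoid hidden
    layer, and upper weights U solved by ridge-regularized least squares
    (the convex subproblem). X: (n, d) inputs, T: (n, c) one-hot targets."""
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    H = 1.0 / (1.0 + np.exp(-X @ W))                        # (n, n_hidden)
    U = np.linalg.solve(H.T @ H + 1e-3 * np.eye(n_hidden), H.T @ T)
    return W, U

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                   # toy stand-in for sparse features
T = np.eye(3)[rng.integers(0, 3, size=200)]      # one-hot domain labels

modules, Z = [], X
for _ in range(3):                               # three stacked modules
    W, U = fit_dcn_module(Z, T, n_hidden=50, rng=rng)
    H = 1.0 / (1.0 + np.exp(-Z @ W))
    Z = np.hstack([Z, H @ U])    # feed this module's predictions to the next
    modules.append((W, U))
```

The appeal for the sparse SUC feature space is that the per-module fit is a closed-form solve rather than a full backpropagation run.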
Discriminative Recurrent Sparse Auto-Encoders
, 2013
"... We present the discriminative recurrent sparse auto-encoder model, comprising a recurrent encoder of rectified linear units, unrolled for a fixed number of iterations, and connected to two linear decoders that reconstruct the input and predict its supervised classification. Training via backpropagat ..."
Cited by 4 (1 self)
Abstract:
We present the discriminative recurrent sparse auto-encoder model, comprising a recurrent encoder of rectified linear units, unrolled for a fixed number of iterations, and connected to two linear decoders that reconstruct the input and predict its supervised classification. Training via backpropagation-through-time initially minimizes an unsupervised sparse reconstruction error; the loss function is then augmented with a discriminative term on the supervised classification. The depth implicit in the temporally-unrolled form allows the system to exhibit far more representational power, while keeping the number of trainable parameters fixed. From an initially unstructured network, the hidden units differentiate into categorical-units, each representing an input prototype with a well-defined class, and part-units representing deformations of these prototypes. The learned organization of the recurrent encoder is hierarchical: part-units are driven directly by the input, whereas the activity of categorical-units builds up over time through interactions with the part-units. Even using a small number of hidden units per layer, discriminative recurrent sparse auto-encoders achieve excellent performance on MNIST.
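The abstract specifies the architecture concretely: a ReLU encoder unrolled for a fixed number of steps, two linear decoders, and a loss combining sparse reconstruction with a discriminative term. A minimal numpy sketch along those lines (shapes, names, and loss weights are illustrative, not the paper's notation):

```python
import numpy as np

def drsae_forward(x, W_in, W_rec, D, C, T=10):
    """Forward pass: a ReLU encoder unrolled for T steps, then a linear
    decoder D reconstructing the input and a linear classifier C."""
    h = np.zeros(W_rec.shape[0])
    for _ in range(T):                     # temporally-unrolled recurrence
        h = np.maximum(0.0, W_in @ x + W_rec @ h)
    return h, D @ h, C @ h                 # code, reconstruction, logits

def drsae_loss(x, y, h, x_hat, logits, lam_sparse=0.1, lam_cls=1.0):
    """Sparse reconstruction error plus a discriminative term, mirroring
    the two-stage objective described in the abstract."""
    recon = np.sum((x - x_hat) ** 2)
    sparsity = lam_sparse * np.sum(np.abs(h))          # L1 sparsity on the code
    p = np.exp(logits - logits.max()); p /= p.sum()    # softmax
    return recon + sparsity - lam_cls * np.log(p[y] + 1e-12)

rng = np.random.default_rng(0)
d, m, c = 784, 100, 10                     # e.g. MNIST-sized input
x, y = rng.random(d), 3
W_in = rng.normal(scale=0.05, size=(m, d))
W_rec = rng.normal(scale=0.05, size=(m, m))
D = rng.normal(scale=0.05, size=(d, m))
C = rng.normal(scale=0.05, size=(c, m))
h, x_hat, logits = drsae_forward(x, W_in, W_rec, D, C)
print(drsae_loss(x, y, h, x_hat, logits))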
Top-down regularization of deep belief networks
In Advances in Neural Information Processing Systems (NIPS), 2013
"... Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap betwe ..."
Cited by 3 (1 self)
Abstract:
Designing a principled and effective algorithm for learning deep architectures is a challenging problem. The current approach involves two training phases: a fully unsupervised learning followed by a strongly discriminative optimization. We suggest a deep learning strategy that bridges the gap between the two phases, resulting in a three-phase learning procedure. We propose to implement the scheme using a method to regularize deep belief networks with top-down information. The network is constructed from building blocks of restricted Boltzmann machines learned by combining bottom-up and top-down sampled signals. A global optimization procedure that merges samples from a forward bottom-up pass and a top-down pass is used. Experiments on the MNIST dataset show improvements over the existing algorithms for deep belief networks. Object recognition experiments on the Caltech-101 dataset also yield competitive results.
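The abstract describes RBM building blocks learned by combining bottom-up and top-down sampled signals. As a loose illustration only, here is a contrastive-divergence step in which the hidden activation blends the usual bottom-up drive with a top-down signal from the layer above; the blending rule and all names are assumptions, not the paper's algorithm:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_with_topdown(v, W, b_h, b_v, h_topdown, alpha=0.5, lr=0.01, rng=None):
    """One CD-1 update for an RBM whose hidden activation mixes the
    bottom-up signal with a top-down signal h_topdown (illustrative
    guess at 'merging bottom-up and top-down sampled signals').
    v: (d,) visible vector, W: (m, d), b_h: (m,), b_v: (d,)."""
    rng = rng or np.random.default_rng(0)
    h_up = sigmoid(W @ v + b_h)                   # bottom-up pass
    h = alpha * h_up + (1 - alpha) * h_topdown    # merge with top-down signal
    h_s = (rng.random(h.shape) < h).astype(float) # sample hidden states
    v_recon = sigmoid(W.T @ h_s + b_v)            # top-down reconstruction
    h_recon = sigmoid(W @ v_recon + b_h)
    W += lr * (np.outer(h, v) - np.outer(h_recon, v_recon))
    return W
```

In a full three-phase scheme this blended update would sit between the purely unsupervised pre-training and the final discriminative fine-tuning.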
Language models with meta-information
PhD thesis, Technische Universiteit Delft, 2014
"... ter verkrijging van de graad van doctor aan de Technische Universiteit Delft, op gezag van de Rector Magnificus prof. ir. K.C.A.M. Luyben, voorzitter van het College voor Promoties, in het openbaar te verdedigen op dinsdag 11 maart 2014 om 12:30 uur door ..."
Cited by 1 (0 self)
Abstract:
For the degree of doctor at the Technische Universiteit Delft, by authority of the Rector Magnificus prof. ir. K.C.A.M. Luyben, chair of the Board for Doctorates, to be defended in public on Tuesday, 11 March 2014 at 12:30, by
Modeling genre with the music genome project: Comparing human-labeled attributes and audio features
In Proc. of the International Society for Music Information Retrieval Conference
"... Genre provides one of the most convenient categorizations of music, but it is often regarded as a poorly defined or largely subjective musical construct. In this work, we provide evidence that musical genres can to a large ex-tent be objectively modeled via a combination of musi-cal attributes. We e ..."
Cited by 1 (1 self)
Abstract:
Genre provides one of the most convenient categorizations of music, but it is often regarded as a poorly defined or largely subjective musical construct. In this work, we provide evidence that musical genres can to a large extent be objectively modeled via a combination of musical attributes. We employ a data-driven approach utilizing a subset of 48 hand-labeled musical attributes comprising instrumentation, timbre, and rhythm across more than one million examples from Pandora® Internet Radio’s Music Genome Project®. A set of audio features motivated by timbre and rhythm is then implemented to model genre both directly and through audio-driven models derived from the hand-labeled musical attributes. In most cases, machine learning models built directly from hand-labeled attributes outperform models based on audio features. Among the audio-based models, those that combine audio features and learned musical attributes perform better than those derived from audio features alone.
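The three-way comparison in the abstract (attribute-based models vs. audio-based models vs. a combination) can be pictured with a small cross-validated experiment; the synthetic data, feature dimensions, and classifier choice below are placeholders, not the paper's data or models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins: 48 hand-labeled attributes vs. generic audio features,
# where the attributes carry a cleaner genre signal than the audio.
rng = np.random.default_rng(0)
n, n_genres = 600, 5
genre = rng.integers(0, n_genres, size=n)
centers = rng.normal(size=(n_genres, 48))
attributes = centers[genre] + rng.normal(scale=1.0, size=(n, 48))
audio = centers[genre][:, :20] + rng.normal(scale=3.0, size=(n, 20))

for name, X in [("attributes", attributes),
                ("audio", audio),
                ("audio+attributes", np.hstack([audio, attributes]))]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, genre, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```

Under this toy setup the attribute model wins and the combined model beats audio alone, the same qualitative ordering the abstract reports.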
Efficient learning for spoken language understanding tasks with word embedding based pre-training
In Proceedings of Interspeech, 2015
"... Spoken language understanding (SLU) tasks such as goal estimation and intention identifi-cation from user’s commands are essential components in spoken dialog systems. In recent years, neural network approaches have shown great success in various SLU tasks. However, one major difficulty of SLU is th ..."
Cited by 1 (1 self)
Abstract:
Spoken language understanding (SLU) tasks such as goal estimation and intention identification from user’s commands are essential components in spoken dialog systems. In recent years, neural network approaches have shown great success in various SLU tasks. However, one major difficulty of SLU is that the annotation of collected data can be expensive. Often this results in insufficient data being available for a task. The performance of a neural network trained in low-resource conditions is usually inferior because of over-training. To improve the performance, this paper investigates the use of unsupervised training methods with large-scale corpora based on word embedding and latent topic models to pre-train the SLU networks. In order to capture long-term characteristics over the entire dialog, we propose a novel Recurrent Neural Network (RNN) architecture. The proposed RNN uses two sub-networks to model the different time scales represented by word and turn sequences. The combination of pre-training and RNN gives an 18% relative error reduction compared to a baseline system.
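As described, the architecture uses two sub-networks over different time scales: a word-level recurrence within each turn and a turn-level recurrence across the dialog. A minimal numpy sketch of that two-timescale forward pass, with all names and dimensions invented for illustration:

```python
import numpy as np

def rnn_step(x, h, W_x, W_h):
    return np.tanh(W_x @ x + W_h @ h)

def two_timescale_encode(dialog, E, params):
    """A word-level RNN runs over each turn's word ids, and its final
    state feeds a slower turn-level RNN that carries context across the
    dialog. dialog: list of turns, each a list of word ids; E: (vocab, d)
    embedding table (stand-in for pre-trained word embeddings)."""
    Wx_w, Wh_w, Wx_t, Wh_t = params
    h_turn = np.zeros(Wh_t.shape[0])
    for turn in dialog:
        h_word = np.zeros(Wh_w.shape[0])
        for wid in turn:                   # fast, word-scale recurrence
            h_word = rnn_step(E[wid], h_word, Wx_w, Wh_w)
        h_turn = rnn_step(h_word, h_turn, Wx_t, Wh_t)  # slow, turn-scale recurrence
    return h_turn                          # dialog-level state for goal estimation

rng = np.random.default_rng(0)
V, d, m = 100, 16, 32
E = rng.normal(scale=0.1, size=(V, d))     # pre-training would initialize this
params = (rng.normal(scale=0.1, size=(m, d)), rng.normal(scale=0.1, size=(m, m)),
          rng.normal(scale=0.1, size=(m, m)), rng.normal(scale=0.1, size=(m, m)))
dialog = [[3, 14, 15], [9, 26, 5, 35]]     # two toy turns of word ids
print(two_timescale_encode(dialog, E, params).shape)
```

The embedding table E is where the paper's word-embedding pre-training would plug in, with the final dialog state feeding a goal or intention classifier.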
LEARNING DEEP REPRESENTATIONS, EMBEDDINGS AND CODES FROM THE PIXEL LEVEL OF NATURAL AND MEDICAL IMAGES
2013
"... Significant research has gone into engineering representations that can identify high-level semantic structure in images, such as objects, people, events and scenes. Recently there has been a shift towards learning representations of images either on top of dense features or directly from the pixel ..."
Abstract:
Significant research has gone into engineering representations that can identify high-level semantic structure in images, such as objects, people, events and scenes. Recently there has been a shift towards learning representations of images either on top of dense features or directly from the pixel level. These features are often learned in hierarchies using large amounts of unlabeled data with the goal of removing the need for hand-crafted representations. In this thesis we consider the task of learning two specific types of image representations from standard size RGB images: a semi-supervised dense low-dimensional embedding and an unsupervised sparse binary code. We introduce a new algorithm called the deep matching pursuit network