Results 1 – 8 of 8
A Neural Probabilistic Language Model
Journal of Machine Learning Research, 2003
Abstract

Cited by 145 (12 self)
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that it allows taking advantage of longer contexts.
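The architecture this abstract describes — a shared embedding matrix feeding a feed-forward network whose softmax output gives next-word probabilities — can be sketched in a few lines of numpy. This is a minimal illustration under assumed toy sizes (the values of V, d, n, h and the single tanh layer are arbitrary choices, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, n, h = 50, 16, 3, 32  # vocab size, embedding dim, context length, hidden units

# (1) distributed representation: one d-dimensional vector per word
C = rng.normal(0, 0.1, (V, d))
# (2) probability function over the next word, computed from the context embeddings
H = rng.normal(0, 0.1, (n * d, h))
U = rng.normal(0, 0.1, (h, V))

def next_word_probs(context):
    """P(w_t | w_{t-n}, ..., w_{t-1}) from the concatenated context embeddings."""
    x = C[context].reshape(-1)         # concatenate the n context word vectors
    a = np.tanh(x @ H)                 # hidden layer
    logits = a @ U
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

p = next_word_probs([3, 17, 42])
```

Because the word vectors in C are shared across all contexts, gradient updates from one training sentence move the probabilities of every sequence built from nearby word vectors — which is the generalization mechanism the abstract describes.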
Data Mining in Soft Computing Framework: A Survey
IEEE Transactions on Neural Networks, 2001
Abstract

Cited by 61 (3 self)
The present article provides a survey of the available literature on data mining using soft computing. A categorization has been provided based on the different soft computing tools and their hybridizations used, the data mining function implemented, and the preference criterion selected by the model. The utility of the different soft computing methodologies is highlighted. Generally, fuzzy sets are suitable for handling the issues related to understandability of patterns, incomplete/noisy data, mixed-media information and human interaction, and can provide approximate solutions faster. Neural networks are nonparametric, robust, and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms provide efficient search algorithms to select a model, from mixed-media data, based on some preference criterion/objective function. Rough sets are suitable for handling different types of uncertainty in data. Some challenges to data mining and the application of soft computing methodologies are indicated. An extensive bibliography is also included.
Quick Training of Probabilistic Neural Nets by Importance Sampling
, 2003
Abstract

Cited by 11 (5 self)
Our previous work on statistical language modeling introduced the use of probabilistic feedforward neural networks to help deal with the curse of dimensionality. Training this model by maximum likelihood, however, requires as many network passes per example as there are words in the vocabulary. Inspired by the contrastive divergence model, we propose and evaluate sampling-based methods which require network passes only for the observed "positive example" and a few sampled negative example words. A very significant speedup is obtained with adaptive importance sampling.
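The bottleneck and the sampling fix can be illustrated numerically. The costly part of the maximum-likelihood gradient is an expectation under the full softmax over the vocabulary; importance sampling estimates it from a handful of words drawn from a cheap proposal distribution. The toy setup below (random scores, uniform proposal, self-normalized estimator) is only a sketch of the estimator family, not the paper's adaptive scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 1_000
scores = rng.normal(0, 1, V)   # s(w): the network's score for every vocabulary word
q = np.full(V, 1.0 / V)        # proposal Q (uniform here; a unigram fit works better)

# Exact softmax expectation E_P[s(w)] -- requires touching all V outputs,
# which is exactly the per-example cost the abstract complains about.
p = np.exp(scores) / np.exp(scores).sum()
exact = (p * scores).sum()

# Self-normalized importance-sampling estimate from only k sampled words:
# weight each sample by exp(s(w)) / Q(w), then normalize by the weight sum.
k = 200
idx = rng.choice(V, size=k, p=q)
w = np.exp(scores[idx]) / q[idx]
estimate = (w * scores[idx]).sum() / w.sum()
```

The estimate needs k network passes instead of V, and the same weighting applies to the gradient terms; the paper's adaptive variant additionally reshapes Q during training to keep the estimator's variance under control.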
Probabilistic Neural Network Models for Sequential Data
Abstract

Cited by 1 (0 self)
It has already been shown how Artificial Neural Networks (ANNs) can be incorporated into probabilistic models. In this paper we review some of the approaches which have been proposed to incorporate them into probabilistic models of sequential data, such as Hidden Markov Models (HMMs). We also discuss new developments and new ideas in this area, in particular how ANNs can be used to model high-dimensional discrete and continuous data to deal with the curse of dimensionality, and how the ideas proposed in these models could be applied to statistical language modeling to represent longer-term context than allowed by trigram models, while keeping word-order information.
A Neural Probabilistic Language Model
Abstract
A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.

1 Introduction

A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary V of size 100,000, there are potentially 100,000^10 − 1 = 10^50 − 1 free parameters.
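The curse-of-dimensionality arithmetic in that introduction is easy to verify: a full joint probability table over 10 word positions and a 100,000-word vocabulary needs one entry per possible sequence, minus one for normalization.

```python
# One probability per possible 10-word sequence over a 100,000-word vocabulary.
V, n = 100_000, 10
n_sequences = V ** n          # (10^5)^10 = 10^50 possible sequences
free_params = n_sequences - 1 # the entries sum to 1, so one is determined
```

This count is what makes any tabular estimate hopeless and motivates the distributed representation, which shares parameters across all sequences.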
The Restricted Boltzmann Machine (Smolensky, 1986;
"... This is a discussion of Larochelle and Murray (2011). ..."
Input Variable Selection Using Parallel Processing of RBF Neural Networks
, 2007
Abstract
In this paper we propose a new technique focused on the selection of the important input variables for modelling complex systems in function approximation problems, in order to avoid the exponential increase in the complexity of the system that is usual when dealing with many input variables. The proposed parallel processing approach is composed of complete radial basis function (RBF) neural networks, each in charge of a reduced set of input variables depending on the general behaviour of the problem. For the optimization of the parameters of each RBF neural network in the system, we propose a new method to select the more important input variables, capable of deciding which of the chosen variables go alone or together to each RBF neural network when building the parallel structure, thus reducing the dimension of the input variable space for each RBF neural network. We also provide an algorithm which automatically finds the most suitable topology of the proposed parallel processing structure and selects the more important input variables for it. Our goal, therefore, is to find the most suitable of the proposed families of parallel processing architectures to approximate a system from which a set of input/output observations is available. The proposed parallel processing structure outperforms other algorithms not only with respect to the final approximation error but also with respect to the number of parameters computed by the system.
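The selection step this abstract describes can be caricatured as fitting one small RBF network per candidate variable subset and keeping the subset with the lowest residual. The target function, the candidate subsets, and the least-squares fit below are illustrative assumptions, not the paper's actual algorithm or topology search:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_design(X, centers, width):
    """Gaussian RBF activations for each (sample, center) pair."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

# Toy system: the output depends only on variables 0 and 2; variable 1 is irrelevant.
X = rng.uniform(-1, 1, (400, 3))
y = np.sin(3 * X[:, 0]) + X[:, 2] ** 2

def fit_rbf(X, y, n_centers=25, width=0.5):
    """Fit output weights of an RBF net by least squares; return training MSE."""
    centers = X[rng.choice(len(X), n_centers, replace=False)]
    Phi = rbf_design(X, centers, width)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid = Phi @ w - y
    return float((resid ** 2).mean())

# One RBF net per candidate input-variable subset; keep the best-fitting subset.
subsets = [(0,), (1,), (2,), (0, 2), (0, 1, 2)]
errors = {s: fit_rbf(X[:, list(s)], y) for s in subsets}
best = min(errors, key=errors.get)
```

Networks given only the irrelevant variable cannot explain the output, so the comparison naturally routes the relevant variables to the fitted model, which is the intuition behind dividing the input space among parallel RBF networks.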