Results 1–10 of 223
Factor Graphs and the Sum-Product Algorithm
IEEE TRANSACTIONS ON INFORMATION THEORY, 1998
"... A factor graph is a bipartite graph that expresses how a "global" function of many variables factors into a product of "local" functions. Factor graphs subsume many other graphical models including Bayesian networks, Markov random fields, and Tanner graphs. Following one simple computational rule, t ..."
Abstract

Cited by 1175 (67 self)
A factor graph is a bipartite graph that expresses how a "global" function of many variables factors into a product of "local" functions. Factor graphs subsume many other graphical models including Bayesian networks, Markov random fields, and Tanner graphs. Following one simple computational rule, the sum-product algorithm operates in factor graphs to compute, either exactly or approximately, various marginal functions by distributed message-passing in the graph. A wide variety of algorithms developed in artificial intelligence, signal processing, and digital communications can be derived as specific instances of the sum-product algorithm, including the forward/backward algorithm, the Viterbi algorithm, the iterative "turbo" decoding algorithm, Pearl's belief propagation algorithm for Bayesian networks, the Kalman filter, and certain fast Fourier transform algorithms.
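As a rough illustration of the message-passing rule the abstract refers to, the NumPy sketch below runs exact sum-product on a chain-structured factor graph with pairwise factors. The function names and toy factors are invented for illustration; they are not taken from the paper.

import numpy as np

def chain_marginals(unaries, pairwise):
    """Exact marginals of a chain via sum-product message passing.

    unaries: list of n length-K nonnegative arrays (local factors);
    pairwise: list of n-1 KxK nonnegative arrays (edge factors)."""
    n = len(unaries)
    fwd = [np.ones_like(unaries[0])]                 # message into variable 0
    for i in range(n - 1):
        m = (fwd[-1] * unaries[i]) @ pairwise[i]     # sum out x_i
        fwd.append(m / m.sum())                      # normalize for stability
    bwd = [np.ones_like(unaries[-1])]                # message into variable n-1
    for i in range(n - 1, 0, -1):
        m = pairwise[i - 1] @ (bwd[0] * unaries[i])  # sum out x_i
        bwd.insert(0, m / m.sum())
    marginals = []
    for i in range(n):
        p = fwd[i] * unaries[i] * bwd[i]             # product of incoming messages
        marginals.append(p / p.sum())
    return marginals

# Toy three-variable chain whose edge factors favor agreeing neighbors.
u = [np.array([0.6, 0.4]) for _ in range(3)]
P = [np.array([[0.9, 0.1], [0.1, 0.9]]) for _ in range(2)]
print(chain_marginals(u, P))

On a chain this recovers the forward/backward algorithm mentioned in the abstract; on graphs with cycles, running the same rule iteratively gives the approximate ("loopy") variants.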
Gradient-based learning applied to document recognition
Proceedings of the IEEE, 1998
"... Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradientbased learning technique. Given an appropriate network architecture, gradientbased learning algorithms can be used to synthesize a complex decision surface that can classify hi ..."
Abstract

Cited by 742 (59 self)
Multilayer neural networks trained with the backpropagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two-dimensional (2D) shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN’s), allows such multi-module systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.
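To make the weight-sharing idea behind convolutional networks concrete, here is a minimal sketch, not the paper's code: a single valid-mode convolution in which one small kernel is reused at every image position, so a handful of shared parameters detect the same feature everywhere. The kernel and image here are invented.

import numpy as np

def conv2d_valid(image, kernel):
    """Valid-mode 2D cross-correlation: the same kxk kernel (one shared
    set of weights) is applied at every position of the image."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

img = np.random.rand(28, 28)                    # e.g. one handwritten-digit image
edge = np.array([[1.0, 0.0, -1.0]] * 3)         # hand-set vertical-edge kernel
feature_map = np.tanh(conv2d_valid(img, edge))  # squashing nonlinearity

In a trained network the kernel entries are learned by gradient descent rather than set by hand, and many such kernels run in parallel to produce a stack of feature maps.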
Hierarchical mixtures of experts and the EM algorithm
Neural Computation, 1994
"... We present a treestructured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIM’s). Learning is treated as a maximum likelihood ..."
Abstract

Cited by 724 (19 self)
We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIM’s). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an online learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
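As a hedged sketch of the E-step such an EM algorithm performs, the code below computes posterior responsibilities for a flat (single-level) mixture of linear experts with Gaussian outputs and a softmax gate; the hierarchical version nests this computation, and all names and the Gaussian-output assumption are illustrative simplifications.

import numpy as np

def e_step(X, y, V, W, sigma2):
    """Posterior responsibilities h[n, k] that expert k generated (x_n, y_n).

    X: (n, d) inputs; y: (n,) targets; V: (K, d) gating weights;
    W: (K, d) expert weights; sigma2: output noise variance."""
    logits = X @ V.T                                 # gating scores
    gate = np.exp(logits - logits.max(axis=1, keepdims=True))
    gate /= gate.sum(axis=1, keepdims=True)          # softmax gate g_k(x)
    resid = y[:, None] - X @ W.T                     # residuals per expert
    lik = np.exp(-0.5 * resid ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    post = gate * lik
    return post / post.sum(axis=1, keepdims=True)    # normalized responsibilities

An M-step would then refit each expert by responsibility-weighted least squares and the gate by responsibility-weighted logistic regression; alternating the two steps increases the likelihood.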
The Infinite Hidden Markov Model
Machine Learning, 2002
"... We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. Th ..."
Abstract

Cited by 490 (32 self)
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite: consider, for example, symbols being possible words appearing in English text.
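One way to see how finitely many active states can stand in for a countably infinite set is the standard stick-breaking construction of the Dirichlet process, sketched below; the hyperparameter value and truncation level are illustrative, not from the paper.

import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Sample weights from GEM(alpha), truncated to a finite number of atoms:
    break off a Beta(1, alpha) fraction of the remaining stick each time."""
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining     # weight_k = beta_k * prod_{j<k} (1 - beta_j)

rng = np.random.default_rng(0)
weights = stick_breaking(alpha=2.0, truncation=50, rng=rng)
print(weights[:5], weights.sum())  # mass concentrates on the first few states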
Parallel Networks that Learn to Pronounce English Text
COMPLEX SYSTEMS, 1987
"... This paper describes NETtalk, a class of massivelyparallel network systems that learn to convert English text to speech. The memory representations for pronunciations are learned by practice and are shared among many processing units. The performance of NETtalk has some similarities with observed h ..."
Abstract

Cited by 457 (5 self)
This paper describes NETtalk, a class of massively parallel network systems that learn to convert English text to speech. The memory representations for pronunciations are learned by practice and are shared among many processing units. The performance of NETtalk has some similarities with observed human performance. (i) The learning follows a power law. (ii) The more words the network learns, the better it is at generalizing and correctly pronouncing new words. (iii) The performance of the network degrades very slowly as connections in the network are damaged: no single link or processing unit is essential. (iv) Relearning after damage is much faster than learning during the original training. (v) Distributed or spaced practice is more effective for long-term retention than massed practice. Network models can be constructed that have the same performance and learning characteristics on a particular task, but differ completely at the levels of synaptic strengths and single-unit responses. However, hierarchical clustering techniques applied to NETtalk reveal that these different networks have similar internal representations of letter-to-sound correspondences within groups of processing units. This suggests that invariant internal representations may be found in assemblies of neurons intermediate in size between highly localized and completely distributed representations.
Understanding Normal and Impaired Word Reading: Computational Principles in Quasi-Regular Domains
PSYCHOLOGICAL REVIEW, 1996
"... We develop a connectionist approach to processing in quasiregular domains, as exemplified by English word reading. A consideration of the shortcomings of a previous implementation (Seidenberg & McClelland, 1989, Psych. Rev.) in reading nonwords leads to the development of orthographic and phonologi ..."
Abstract

Cited by 368 (87 self)
We develop a connectionist approach to processing in quasi-regular domains, as exemplified by English word reading. A consideration of the shortcomings of a previous implementation (Seidenberg & McClelland, 1989, Psych. Rev.) in reading nonwords leads to the development of orthographic and phonological representations that better capture the relevant structure among the written and spoken forms of words. In a number of simulation experiments, networks using the new representations learn to read both regular and exception words, including low-frequency exception words, and yet are still able to read pronounceable nonwords as well as skilled readers. A mathematical analysis of the effects of word frequency and spelling-sound consistency in a related but simpler system serves to clarify the close relationship of these factors in influencing naming latencies. These insights are verified in subsequent simulations, including an attractor network that reproduces the naming latency data directly in its time to settle on a response. Further analyses of the network's ability to reproduce data on impaired reading in surface dyslexia support a view of the reading system that incorporates a graded division-of-labor between semantic and phonological processes. Such a view is consistent with the more general Seidenberg and McClelland framework and has some similarities with, but also important differences from, the standard dual-route account.
Connectionist Learning Procedures
ARTIFICIAL INTELLIGENCE, 1989
"... A major goal of research on networks of neuronlike processing units is to discover efficient learning procedures that allow these networks to construct complex internal representations of their environment. The learning procedures must be capable of modifying the connection strengths in such a way ..."
Abstract

Cited by 339 (6 self)
A major goal of research on networks of neuron-like processing units is to discover efficient learning procedures that allow these networks to construct complex internal representations of their environment. The learning procedures must be capable of modifying the connection strengths in such a way that internal units which are not part of the input or output come to represent important features of the task domain. Several interesting gradient-descent procedures have recently been discovered. Each connection computes the derivative, with respect to the connection strength, of a global measure of the error in the performance of the network. The strength is then adjusted in the direction that decreases the error. These relatively simple gradient-descent learning procedures work well for small tasks, and the new challenge is to find ways of improving their convergence rate and their generalization abilities so that they can be applied to larger, more realistic tasks.
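A minimal sketch of the update rule the abstract describes, shown for a single sigmoid unit with a squared-error measure (the names and error measure are illustrative): each weight moves against the derivative of the global error with respect to that weight.

import numpy as np

def train_step(w, x, target, lr=0.1):
    """One gradient-descent update for a single sigmoid unit."""
    y = 1.0 / (1.0 + np.exp(-w @ x))         # unit output
    error = 0.5 * (y - target) ** 2          # global error measure
    grad = (y - target) * y * (1.0 - y) * x  # dE/dw via the chain rule
    return w - lr * grad, error              # move against the derivative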
The empirical case for two systems of reasoning
Psychological Bulletin, 1996
"... Distinctions have been proposed between systems of reasoning for centuries. This article distills properties shared by many of these distinctions and characterizes the resulting systems in light of recent findings and theoretical developments. One system is associative because its computations refle ..."
Abstract

Cited by 330 (3 self)
Distinctions have been proposed between systems of reasoning for centuries. This article distills properties shared by many of these distinctions and characterizes the resulting systems in light of recent findings and theoretical developments. One system is associative because its computations reflect similarity structure and relations of temporal contiguity. The other is "rule-based" because it operates on symbolic structures that have logical content and variables and because its computations have the properties that are normally assigned to rules. The systems serve complementary functions and can simultaneously generate different solutions to a reasoning problem. The rule-based system can suppress the associative system but not completely inhibit it. The article reviews evidence in favor of the distinction and its characterization. One of the oldest conundrums in psychology is whether people are best conceived as parallel processors of information who operate along diffuse associative links or as analysts who operate by deliberate and sequential manipulation of internal representations. Are inferences drawn through a network of learned associative pathways or through application of a kind of "psychologic"?
Simple statistical gradient-following algorithms for connectionist reinforcement learning
Machine Learning, 1992
"... Abstract. This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinfor ..."
Abstract

Cited by 322 (0 self)
This article presents a general class of associative reinforcement learning algorithms for connectionist networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing gradient estimates or even storing information from which such estimates could be computed. Specific examples of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while others are novel but potentially interesting in their own right. Also given are results that show how such algorithms can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well as further considerations that might be used to help develop similar but potentially more powerful reinforcement learning algorithms.
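A standard instance, treated in the paper, is a Bernoulli-logistic unit, for which the REINFORCE weight change is alpha * r * (y - p) * x: it follows the gradient of expected reinforcement without that gradient ever being computed. The toy task below is invented to show the rule in motion.

import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                             # weights of one stochastic unit
alpha = 0.1                                 # learning rate
for _ in range(1000):
    x = rng.standard_normal(3)              # input pattern
    p = 1.0 / (1.0 + np.exp(-w @ x))        # probability the unit fires
    y = float(rng.random() < p)             # stochastic binary output
    desired = 1.0 if x[0] > 0 else 0.0      # invented task: fire iff x[0] > 0
    r = 1.0 if y == desired else 0.0        # immediate reinforcement
    w += alpha * r * (y - p) * x            # REINFORCE update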
On Language and Connectionism: Analysis of a Parallel Distributed Processing Model of Language Acquisition
COGNITION, 1988
"... Does knowledge of language consist of mentallyrepresented rules? Rumelhart and McClelland have described a connectionist (parallel distributed processing) model of the acquisition of the past tense in English which successfully maps many stems onto their past tense forms, both regular (walk/walked) ..."
Abstract

Cited by 280 (10 self)
Does knowledge of language consist of mentally-represented rules? Rumelhart and McClelland have described a connectionist (parallel distributed processing) model of the acquisition of the past tense in English which successfully maps many stems onto their past tense forms, both regular (walk/walked) and irregular (go/went), and which mimics some of the errors and sequences of development of children. Yet the model contains no explicit rules, only a set of neuron-style units which stand for trigrams of phonetic features of the stem, a set of units which stand for trigrams of phonetic features of the past form, and an array of connections between the two sets of units whose strengths are modified during learning. Rumelhart and McClelland conclude that linguistic rules may be merely convenient approximate fictions and that the real causal processes in language use and acquisition must be characterized as the transfer of activation levels among units and the modification of the weights of their connections. We analyze both the linguistic and the developmental assumptions of the model in detail and discover that (1) it cannot represent certain words, (2) it cannot learn many rules, (3) it can learn rules found in no human language, (4) it cannot explain morphological and phonological regularities, (5) it cannot explain the differences between irregular and regular forms, (6) it fails at its assigned task of mastering the past tense of English, (7) it gives an incorrect explanation for two developmental phenomena: stages of overregularization of irregular forms such as bringed, and the appearance of doubly-marked forms such as ated, and (8) it gives accounts of two others (infrequent overregularization of verbs ending in t/d, and the order of acquisition of different irregula...