## Recurrent Neural Networks and Prior Knowledge for Sequence Processing: A Constrained Nondeterministic Approach (1995)

Citations: 14 (5 self)

### BibTeX

@MISC{Frasconi95recurrentneural,
    author = {Paolo Frasconi and Marco Gori and Giovanni Soda},
    title = {Recurrent Neural Networks and Prior Knowledge for Sequence Processing: A Constrained Nondeterministic Approach},
    year = {1995}
}

### Abstract

In this paper we focus on processing sequential streams of data by recurrent neural networks.

### Citations

3837 |
Introduction to Automata Theory, Languages and Computation
- Hopcroft
- 1979
Citation Context ...aper deals with first-order recurrent network models and prior knowledge expressed as a collection of transition rules for a finite automaton. Finite automata can be deterministic or nondeterministic [24]. A deterministic finite automaton (DFA) is a 5-tuple $A_d = (q_0, Q, \Sigma, \delta, F)$. The state transition function $\delta$ maps a state-input pair $(q_t, \sigma_t)$ into the next state $q_{t+1} = \delta(q_t,$ ... |

2724 |
Learning Internal Representations by Error Propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context ...learning, particularly on $N_L$'s neurons. The weights of network $N_L$ were adjusted by plain gradient descent and no constraints. Gradient was computed using the backpropagation through time algorithm [37]. The weights of network $N_k$ were adjusted by constrained gradient descent; the initial weights were computed using a linear programming technique described in [23] and constraints were enforced by si... |

1040 |
Linear and Nonlinear Programming
- Luenberger
- 1984
Citation Context ...strained according to inequalities (27). This guarantees that the injected rules are preserved during learning. Constraints can be enforced using classic numerical techniques such as reduced gradient [31]. A different approach is used in [23], where weights are constrained to lie inside the largest hypersphere contained in the feasible weight region. A CASE STUDY: ISOLATED WORD RECOGNITION K-L network... |

609 |
Neural networks and the bias/variance dilemma
- Geman, Bienenstock, et al.
- 1992
Citation Context ...le learning schemes based on adjusting coefficients can indeed be practical and valuable when the partial functions are reasonably matched on the task,... More recently, Geman, Bienenstock & Doursat [7] clearly make it evident that poorly structured connectionist models undergo the same problems encountered in nonparametric statistical inference. The variance contribution to the estimation error can... |

315 |
What size net gives valid generalization
- Baum, Haussler
- 1989
Citation Context ...s can be drawn using different points of view. In the PAC learning framework, for example, analyses based on the VC-dimension show how training flexible neural networks may be costly in terms of data [8]. Introducing partial knowledge about the input/output mapping to be learned may help to substantially reduce the number of training examples required to achieve a satisfactory level of generalization... |

306 |
Inductive inference: theory and methods
- Angluin, Smith
- 1983
Citation Context ...d $d_{ij}$ is the delay associated to the link. The automaton operates in a symbolic domain, defined by a finite input alphabet $\Sigma$ and a finite set of states $Q$. In tasks such as grammatical inference [29], the input stream is merely symbolic and can be trivially converted into RNN inputs using unary encoding. However, in some real-world applications data may contain approximately symbolic information ... |

272 | Parallel Distributed Processing - Rumelhart, McClelland - 1986 |

253 | Learning long-term dependencies with gradient descent is difficult
- Bengio, Simard, et al.
- 1994
Citation Context ...cerned, satisfactory learning may become even more problematic. Despite of the powerful representational capabilities of recurrent networks, many have reported difficulties in training them optimally [14, 15]. In fact, it can be proven that any parametric dynamical system, such as a recurrent neural network, will be increasingly difficult to train with gradient descent as the duration of the dependencies ... |

196 | Extracting refined rules from knowledge-based neural networks
- Towell, Shavlik
- 1993
Citation Context ...the knowledge-based collection of clauses. A related research topic is the extraction of the learned rules from the trained network, in order to complete the refinement process in the symbolic domain [3, 21, 27, 28]. Rule extraction methods have also been proved to be successful to improve generalization to new instances, especially for long input strings. Why nondeterminism? The papers referred to in the previ... |

180 | Refinement of approximate domain theories by knowledge-based neural networks
- Towell, Shavlik, et al.
- 1990
Citation Context ...connectionist models have been proposed in the literature. Abu-Mostafa [16] and Al-Mashouq & Reed [9] have studied the problem of learning from examples and hints in static networks. Towell & Shavlik [17] have proposed methods for mapping a set of propositional rules into a two-layered architecture, referred to as knowledge-based artificial neural networks (Kbann). Another interesting approach, introd... |

172 |
Learning and extracting finite state automata with second-order recurrent neural networks
- Giles, Miller, et al.
- 1992
Citation Context ...prior knowledge, rule insertion, nondeterministic finite automata. INTRODUCTION The integration of connectionist and symbolic processing approaches has recently received attention by many researchers [1, 2, 3, 4, 5], mainly because it allows to jointly exploit the bottom-up (learning from data) and top-down (deductive reasoning) kinds of inference. A relevant aspect to such integration is the development of tech... |

166 |
Neural Network Learning and Expert Systems
- Gallant
- 1993
Citation Context ...prior knowledge, rule insertion, nondeterministic finite automata. INTRODUCTION The integration of connectionist and symbolic processing approaches has recently received attention by many researchers [1, 2, 3, 4, 5], mainly because it allows to jointly exploit the bottom-up (learning from data) and top-down (deductive reasoning) kinds of inference. A relevant aspect to such integration is the development of tech... |

103 |
Mechanisms of implicit learning: Connectionist models of sequence processing
- Cleeremans
- 1993
Citation Context ...prior knowledge, rule insertion, nondeterministic finite automata. INTRODUCTION The integration of connectionist and symbolic processing approaches has recently received attention by many researchers [1, 2, 3, 4, 5], mainly because it allows to jointly exploit the bottom-up (learning from data) and top-down (deductive reasoning) kinds of inference. A relevant aspect to such integration is the development of tech... |

93 |
Neural network design and the complexity of learning
- Judd
- 1990
Citation Context ...such as backpropagation may fail to discover optimal solutions for highly structured problems. These feelings are theoretically confirmed by results showing the NP-completeness of the loading problem [12, 13]. As far as recurrent networks are concerned, satisfactory learning may become even more problematic. Despite of the powerful representational capabilities of recurrent networks, many have reported di... |

87 |
Learning from hints in neural networks
- Abu-Mostafa
- 1990
Citation Context ...mit these difficulties, inasmuch as learning begins closer to a good solution. Many methods for introducing prior knowledge into connectionist models have been proposed in the literature. Abu-Mostafa [16] and Al-Mashouq & Reed [9] have studied the problem of learning from examples and hints in static networks. Towell & Shavlik [17] have proposed methods for mapping a set of propositional rules into a ... |

82 |
Modular construction of time-delay neural networks for speech recognition
- Waibel
- 1989
Citation Context ...e lexicons. Basically, this is due to the intrinsic limitations of all the methods that rely only on learning by example. Although some solutions have been proposed for building modular architectures [35, 33], the scaling up to large lexicons appears to be a very serious problem. In order to overcome these difficulties, we propose to model each word of a given dictionary with a K-L net. Each one must de... |

80 |
Integrating Rules and Connectionism for Robust Commonsense Reasoning
- Sun
- 1994

72 | On the Problem of Local Minima in Backpropagation
- Gori, Tesi
- 1992
Citation Context ... generalization [9]. In addition to the sample complexity, adaptive systems may also face difficulties due to the computational complexity. Analyses on the problem of local minima in the cost surface [10, 11], for example, suggest that learning algorithms such as backpropagation may fail to discover optimal solutions for highly structured problems. These feelings are theoretically confirmed by results sho... |

70 | Constructing deterministic finite-state automata in recurrent neural networks
- Omlin, Giles
- 1996
Citation Context ...tion, since the output of such neurons is supposed to be nearly zero. The value $H$ is referred to as hint strength and it may have a significant influence on the training time. Recently, Omlin & Giles [26] studied the insertion of the full set of transition rules for a given DFA. They obtained a lower bound on $H$ that guarantees the equivalence of the recurrent network and the original DFA (in the sense... |

55 |
Induction of multiscale temporal structure
- Mozer
- 1992
Citation Context ...cerned, satisfactory learning may become even more problematic. Despite of the powerful representational capabilities of recurrent networks, many have reported difficulties in training them optimally [14, 15]. In fact, it can be proven that any parametric dynamical system, such as a recurrent neural network, will be increasingly difficult to train with gradient descent as the duration of the dependencies ... |

54 | A Framework for Combining Symbolic and Neural Learning
- Shavlik
- 1994

40 | Local feedback multilayered networks
- Frasconi, Gori, et al.
- 1992
Citation Context ...$= C_Q(q_t)$ and $u^b_t := C_\Sigma(\sigma_t)$ (i.e., with a superscript $b$), to enlighten their close relationship with the corresponding analog quantities in the network. State memorization. As shown in [30], recurrent neurons have interesting properties concerning the possibility of latching the information represented by their activation. The long-term memory capability of recurrent networks is essenti... |

36 | Network structuring and training using rule-based knowledge
- Tresp, Hollatz, et al.
- 1993
Citation Context ...ing a set of propositional rules into a two-layered architecture, referred to as knowledge-based artificial neural networks (Kbann). Another interesting approach, introduced by Tresp, Hollatz & Ahmad [18], exploits radial basis function units to "decode" the minterms occurring in symbolic rules; this last approach has the advantage of allowing a probabilistic interpretation of the resulting model. Pri... |

36 | Representation of finite state automata in recurrent radial basis function networks
- Frasconi, Gori, et al.
- 1996
Citation Context ...the knowledge-based collection of clauses. A related research topic is the extraction of the learned rules from the trained network, in order to complete the refinement process in the symbolic domain [3, 21, 27, 28]. Rule extraction methods have also been proved to be successful to improve generalization to new instances, especially for long input strings. Why nondeterminism? The papers referred to in the previ... |

35 | Training second-order recurrent neural networks using hints
- Omlin, Giles
- 1992
Citation Context ...is last approach has the advantage of allowing a probabilistic interpretation of the resulting model. Prior knowledge injection algorithms for recurrent networks have been introduced by Omlin & Giles [19], Maclin & Shavlik [20]. Both the algorithms (shortly described further on in this paper) assume that the available prior knowledge can be expressed in terms of state transition rules for a finite aut... |

34 |
Extraction, insertion and refinement of symbolic rules in dynamically driven recurrent neural networks
- Giles, Omlin
- 1993
Citation Context ...arning. These approaches, however, do not allow to model eventual uncertainty about the rules to be injected. Problems of convergence may arise if malicious rules are inserted instead of genuine ones [21]. Moreover, no explicit mechanism was provided to guarantee that the inserted knowledge will not be destroyed by the learning algorithm. In [22, 23] we proposed an algorithm that allows to specify dom... |

33 |
Training a 3-node neural net is NP-Complete
- Blum, Rivest
- 1989
Citation Context ...such as backpropagation may fail to discover optimal solutions for highly structured problems. These feelings are theoretically confirmed by results showing the NP-completeness of the loading problem [12, 13]. As far as recurrent networks are concerned, satisfactory learning may become even more problematic. Despite of the powerful representational capabilities of recurrent networks, many have reported di... |

33 |
Unified integration of explicit rules and learning by example in recurrent networks
- Frasconi, Gori, et al.
- 1995
Citation Context ... APPENDIX B. Proofs. Proof of Theorem 1. In [23] it is proven that for $w_{ii} > 1$ and $I_{i,t} = 0$ the equation $x_{i,t} = \tanh(w_{ii} x_{i,t-1})$ has two asymptotically stable equilibrium points $x^+$ and $x^-$. Because of symmetry of sigmoid function,... |

26 | Credit assignment through time: Alternative to backpropagation
- Bengio
- 1994
Citation Context ...the same time they make learning by gradient descent more difficult. Research in our group is in progress for investigating training algorithms alternative to gradient descent, like those proposed in [38]. ACKNOWLEDGMENTS We wish to thank Marco Maggini for his assistance and Yoshua Bengio for his comments on an earlier draft of this paper. This research was supported by the Italian Government under MURS... |

24 |
A unified approach for integrating explicit knowledge and learning by example in recurrent networks
- Frasconi, Gori, et al.
- 1991
Citation Context ...arameters and can be "forgotten" during learning. The longer a network is trained, the more likely it is to use information from the training data to arrive at a solution. Constructive methods (e.g., [17, 22]). A portion of the network is built using prior rules. Then more units and/or connections are added. In this approach there are two options: The prebuilt network may be frozen or its parameters may b... |

24 | First-order vs. Second-order Single Layer Recurrent Neural Networks
- Goudreau, Giles, et al.
- 1994
Citation Context ...second-order connections is in the more powerful representational power of these networks, that turns out to be particularly natural for modeling state transitions of finite automata. It can be shown [25] that there exist simple regular languages that cannot be recognized by a recurrent network with first-order connections, unless a static layer of units is added. In [19] a local representation of the... |

15 | Refining domain theories expressed as finite-state automata, in - Maclin, Shavlik - 1991 |

14 | On the problem of local minima in recurrent neural networks
- Bianchini, Gori, et al.
- 1994
Citation Context ... generalization [9]. In addition to the sample complexity, adaptive systems may also face difficulties due to the computational complexity. Analyses on the problem of local minima in the cost surface [10, 11], for example, suggest that learning algorithms such as backpropagation may fail to discover optimal solutions for highly structured problems. These feelings are theoretically confirmed by results sho... |

10 |
Including hints in training neural nets
- Al-Mashouq, Reed
- 1991
Citation Context ... Introducing partial knowledge about the input/output mapping to be learned may help to substantially reduce the number of training examples required to achieve a satisfactory level of generalization [9]. In addition to the sample complexity, adaptive systems may also face difficulties due to the computational complexity. Analyses on the problem of local minima in the cost surface [10, 11], for examp... |

4 |
On the Use of Neural Networks for Speaker Independent Isolated Word Recognition
- DeMichelis, Fissore, et al.
- 1989
Citation Context ...ility of the proposed model to deal with isolated word recognition (IWR) in large lexicons. So far, many attempts to build neural-based classifiers for IWR have assumed "small" lexicons; see e.g. [32, 33, 34]. Neural classifiers have succeeded in problems of acoustic feature extraction, but have not exhibited significant results for applications to large lexicons. Basically, this is due to the intrinsic l... |

3 |
The multi-layer perceptron as a tool for speech pattern processing research
- Peeling, Moore, et al.
- 1986
Citation Context ...ility of the proposed model to deal with isolated word recognition (IWR) in large lexicons. So far, many attempts to build neural-based classifiers for IWR have assumed "small" lexicons; see e.g. [32, 33, 34]. Neural classifiers have succeeded in problems of acoustic feature extraction, but have not exhibited significant results for applications to large lexicons. Basically, this is due to the intrinsic l... |

3 |
Recurrent Networks for Continuous Speech Recognition
- Frasconi, Gori, et al.
- 1990
Citation Context ... 10 other nets $NW_i$, $i = 1, \ldots, 10$ (see Figure 4), fed by $N_P$'s outputs, are used for modeling the words. The phonetic network $N_P$ has a local feedback architecture [30] and is described in detail in [36]. Each net $NW_i$ was devoted to detect a word of the dictionary and the highest network output criterion was used to perform word prediction. Figure 4 shows the particular net ... |

2 |
Design of hierarchical perceptron structures and their application to the task of isolated word recognition
- Kammerer, Kupper
- 1989
Citation Context ...ility of the proposed model to deal with isolated word recognition (IWR) in large lexicons. So far, many attempts to build neural-based classifiers for IWR have assumed "small" lexicons; see e.g. [32, 33, 34]. Neural classifiers have succeeded in problems of acoustic feature extraction, but have not exhibited significant results for applications to large lexicons. Basically, this is due to the intrinsic l... |

2 | C. Lee Giles (Eds.) - Cowan - 1992 |

1 | Sun and Y. C. Lee, Learning and extracting finite state automata with second-order recurrent neural networks, Neural Computation, Vol. 4, pp. 393-405 - Chen, Chen, et al. - 1992 |

1 | Combining symbolic and neural learning - Shavlik - 1994 |

1 | Neural networks and the bias/variance dilemma - Bienenstock, Doursat - 1992 |

1 | The induction of multiscale temporal structure in - Mozer - 1992 |

1 | Chakradhar and D. Chen, First-order vs. second-order single layer recurrent neural networks - Goudreau, Giles, et al. - 1994 |

1 | [29] D. Angluin and - No - 1993 |

1 | and G. Soda, Local feedback multilayered networks - Frasconi, Gori - 1992 |

1 | Linear and Nonlinear Programming - Luenberger - 1984 |

1 | A. Kupper, Design of hierarchical perceptron structures and their application to the task of isolated word recognition - Kammerer, W. - 1989 |

1 | The multilayer perceptron as a tool for speech pattern processing research - Peeling, Moore, et al. - 1986 |