## Machine learning for sequential data: A review (2002)

Venue: Structural, Syntactic, and Statistical Pattern Recognition

Citations: 84 (1 self)

### BibTeX

@INPROCEEDINGS{Dietterich02machinelearning,
  author = {Thomas G. Dietterich},
  title = {Machine learning for sequential data: A review},
  booktitle = {Structural, Syntactic, and Statistical Pattern Recognition},
  year = {2002},
  pages = {15--30},
  publisher = {Springer-Verlag}
}

### Abstract

Statistical learning problems in many fields involve sequential data. This paper formalizes the principal learning tasks and describes the methods that have been developed within the machine learning research community for addressing these problems. These methods include sliding window methods, recurrent sliding windows, hidden Markov models, conditional random fields, and graph transformer networks. The paper also discusses some open research issues.

### Citations

8134 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...te variables st in addition to the output labels yt. Sequential interactions are modeled by the st variables. To handle these hidden variables during training, the Expectation-Maximization (EM; [6]) algorithm is applied. Bengio and Frasconi [2] report promising results on various artificial sequential supervised learning and sequence classification problems. Unfortunately, the MEMM and IOHMM mo...

4956 | C4.5: Programs for machine learning
- Quinlan
- 1993
Citation Context: ...ar measures) forms the basis of recursive-partitioning algorithms for growing classification and regression trees. These methods incorporate the choice of relevant features into the tree-growing process [3, 21]. Unfortunately, this measure does not capture interactions between features. Several methods have been developed that identify such interactions including RELIEFF [14], Markov blankets [13], and feat...

3927 | Classification and Regression Trees
- Breiman
- 1984
Citation Context: ...ar measures) forms the basis of recursive-partitioning algorithms for growing classification and regression trees. These methods incorporate the choice of relevant features into the tree-growing process [3, 21]. Unfortunately, this measure does not capture interactions between features. Several methods have been developed that identify such interactions including RELIEFF [14], Markov blankets [13], and feat...

2732 | Learning internal representations by error propagation
- Rumelhart, Hinton, et al.
- 1986
Citation Context: ...ks are usually trained iteratively via a procedure known as backpropagation-through-time (BPTT) in which the network structure is “unrolled” for the length of the input and output sequences xi and yi [22]. Recurrent networks have been applied to a variety of sequence-learning problems [9]. 3.3 Hidden Markov Models and Related Methods The hidden Markov Model (HMM; see Figure 2(a)) is a probabilistic mo...

2320 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data
- Lafferty, McCallum, et al.
- 2001
Citation Context: ...y3 = 2 remains equally split. So the MEMM has completely ignored the “i”! The same problem occurs with the hidden states st of the IOHMM. 3.4 Conditional Random Fields Lafferty, McCallum, and Pereira [15] introduced the conditional random field (CRF; Figure 2(d)) to try to overcome the label bias problem. In the CRF, the relationship among adjacent pairs yt−1 and yt is modeled as a Markov Random Fiel...

1544 | Finding Structure in Time
- Elman
- 1990
Citation Context: ...at time t. This allows the network to develop a representation for the recurrent information that is separate from the representation of the output y values. This architecture was introduced by Elman [7]. These networks are usually trained iteratively via a procedure known as backpropagation-through-time (BPTT) in which the network structure is “unrolled” for the length of the input and output sequen...

1036 | Wrappers for feature subset selection
- Kohavi, John
- 1997
Citation Context: ...e predictions. In standard supervised learning, this is known as the feature selection problem, and there are four primary strategies for solving it. The first strategy, known as the wrapper approach [12], is to generate various subsets of features and evaluate them by running the learning algorithm and measuring the accuracy of the resulting classifier (e.g., via cross-validation or by applying the A...

942 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
Citation Context: ... useless features to become very small (perhaps even zero). Examples of this approach include ridge regression [10], neural network weight elimination [24], and L1-norm support vector machines (SVMs; [5]). The third strategy is to compute some measure of feature relevance and remove low-scoring features. One of the simplest measures is the mutual information between a feature and the class. This (or ...
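
The relevance measure named in this context, the mutual information between a feature and the class, can be sketched for discrete data as follows. This is a minimal illustration, not code from the reviewed paper; the function name `mutual_information` and the toy data are assumptions made for the example.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """Estimate I(feature; class) in bits from paired discrete observations.

    Hypothetical helper: probabilities are plain empirical frequencies,
    with no smoothing, so it is only meaningful for reasonably large samples.
    """
    if len(feature) != len(labels):
        raise ValueError("feature and labels must have the same length")
    n = len(feature)
    joint = Counter(zip(feature, labels))   # counts of (feature value, class)
    f_counts = Counter(feature)             # marginal counts of feature values
    c_counts = Counter(labels)              # marginal counts of classes
    mi = 0.0
    for (f, c), k in joint.items():
        p_fc = k / n
        mi += p_fc * log2(p_fc / ((f_counts[f] / n) * (c_counts[c] / n)))
    return mi


# A feature that determines the class carries one full bit here;
# a feature that is independent of the class carries none.
informative = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
useless = mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
```

Features would then be ranked by this score and the low-scoring ones removed. As the context notes, a per-feature score of this kind does not capture interactions between features.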

741 | Gradient-based learning applied to document recognition
- LeCun, Bottou, et al.
- 1998
Citation Context: ...eas the HMM must rely on default observation probabilities for these words. 3.5 Graph Transformer Networks In a landmark paper on handwritten character recognition, LeCun, Bottou, Bengio, and Haffner [16] describe a neural network methodology for solving complex sequential supervised learning problems. The architecture that they propose is shown in Figure 3. A graph transformer network is a neural net...

637 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context: ...], Markov blankets [13], and feature racing [17]. The fourth strategy is to first fit a simple model and then analyze the fitted model to identify the relevant features. For example, Chow and Liu [4] describe an efficient algorithm for fitting a tree-structured Bayesian network to a data set. This network can then be analyzed to remove features that have low influence on the class. Kristin Bennet...

493 | Ridge regression: Biased estimation for nonorthogonal problems
- Hoerl, Kennard
- 1970
Citation Context: ... the values of parameters in the fitted model. This causes the parameters associated with useless features to become very small (perhaps even zero). Examples of this approach include ridge regression [10], neural network weight elimination [24], and L1-norm support vector machines (SVMs; [5]). The third strategy is to compute some measure of feature relevance and remove low-scoring features. One of th...

457 | Parallel networks that learn to pronounce English text
- Sejnowski, Rosenberg
- 1987
Citation Context: ...ating the yt’s to form the predicted sequence y. The obvious advantage of this sliding window method is that it permits any classical supervised learning algorithm to be applied. Sejnowski and Rosenberg [23] applied the backpropagation neural network algorithm with a 7-letter sliding window to the task of pronouncing English words. A similar approach (but with a 15-letter window) was employed by Qian and...
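
The sliding window conversion described in this context can be sketched as below. This is an illustrative sketch only: the helper names `make_windows` and `to_examples`, the padding symbol `_`, and the toy label sequence are assumptions, not details taken from the cited work.

```python
def make_windows(x, width, pad="_"):
    """Return one window of `width` symbols centered on each position of x.

    Positions near the ends are padded with `pad` (a hypothetical choice)
    so that every element of the sequence gets a full-width window.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center")
    half = width // 2
    padded = [pad] * half + list(x) + [pad] * half
    return [tuple(padded[t:t + width]) for t in range(len(x))]


def to_examples(x, y, width, pad="_"):
    """Pair each window with the label y_t of its center position."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return list(zip(make_windows(x, width, pad), y))


# Each (window, label) pair is an ordinary supervised example, so any
# classical classifier can be trained on the resulting list.
examples = to_examples("cat", ["k", "ae", "t"], width=3)
```

At prediction time the same windows are built for a new sequence, the classifier is applied to each one, and the predicted labels are concatenated to form the output sequence.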

440 | Maximum entropy Markov models for information extraction and segmentation
- McCallum, Freitag
Citation Context: ...nsure that the probabilities sum to 1. Each fα is a boolean feature that can depend on yt and on any properties of the input sequence x. For example, in their experiments with MEMMs, McCallum, et al. [18] employed features such as “x begins with a number”, “x ends with a question mark”, etc. Hence, MEMMs support long-distance interactions. The IOHMM is similar to the MEMM except that it introduces hid...

365 | Toward optimal feature selection
- Koller, Sahami
- 1996
Citation Context: ...rocess [3, 21]. Unfortunately, this measure does not capture interactions between features. Several methods have been developed that identify such interactions including RELIEFF [14], Markov blankets [13], and feature racing [17]. The fourth strategy is to first fit a simple model and then analyze the fitted model to identify the relevant features. For example, Chow and Liu [4] describe an efficie...

202 | Predicting the secondary structure of globular proteins using neural network models
- Qian, Sejnowski
- 1988
Citation Context: ...e backpropagation neural network algorithm with a 7-letter sliding window to the task of pronouncing English words. A similar approach (but with a 15-letter window) was employed by Qian and Sejnowski [20] to predict protein secondary structure from the protein’s sequence of amino acid residues. Provost and Fawcett [8] addressed the problem of cellular telephone cloning by applying the RL rule learning...

176 | Serial order: A parallel distributed processing approach (Institute for Cognitive Science Report 8694)
- Jordan
- 1986
Citation Context: ...[Fig. 1. Two recurrent network architectures: (a) outputs are fed back to hidden units; (b) hidden units are fed back to hidden units. The ∆ symbol indicates a delay of one time step.] ...by Jordan [11]. Part (b) shows a network in which the hidden unit activations at time t − 1 are fed as additional inputs at time t. This allows the network to develop a representation for the recurrent information ...
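
The architecture in part (b), where hidden activations from time t − 1 are fed back as extra inputs at time t, can be sketched as a forward pass in plain Python. This is a rough illustration under stated assumptions: the function name `recurrent_forward`, the tanh activation, the linear output layer, the zero initial state, and the weight layout (lists of rows) are all choices made for the example, not details from the cited work.

```python
import math


def recurrent_forward(xs, w_in, w_rec, w_out):
    """Run a hidden-to-hidden recurrent network over a sequence.

    xs    : list of input vectors, one per time step
    w_in  : hidden x input weight rows
    w_rec : hidden x hidden weight rows (the delayed feedback connections)
    w_out : output x hidden weight rows
    """
    n_hidden = len(w_rec)
    h = [0.0] * n_hidden                      # assumed initial hidden state
    outputs = []
    for x in xs:
        # New hidden state depends on the current input and on h from t - 1.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hk for w, hk in zip(w_rec[j], h))
            )
            for j in range(n_hidden)
        ]
        outputs.append([sum(w * hj for w, hj in zip(row, h)) for row in w_out])
    return outputs


# With a single hidden unit, the second output is nonzero even though the
# second input is zero, because the hidden state carries information forward.
ys = recurrent_forward([[1.0], [0.0]], [[1.0]], [[1.0]], [[1.0]])
```

Training such a network with backpropagation-through-time would unroll this loop over the length of the sequence; the sketch covers only the forward computation.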

164 | Adaptive fraud detection
- Fawcett, Provost
- 1997
Citation Context: ...problems. For example, in cellular telephone fraud detection, each x describes a telephone call, and y is 0 if the call is legitimate and 1 if the call originated from a stolen (or cloned) cell phone [8]. Another example involves computer intrusion detection where each x describes a request for a computer network connection and y indicates whether that request is part of an intrusion attempt. A third...

137 | Generalization by Weight-Elimination with Application to Forecasting
- Weigend, Huberman, et al.
- 1991
Citation Context: ...model. This causes the parameters associated with useless features to become very small (perhaps even zero). Examples of this approach include ridge regression [10], neural network weight elimination [24], and L1-norm support vector machines (SVMs; [5]). The third strategy is to compute some measure of feature relevance and remove low-scoring features. One of the simplest measures is the mutual inform...

101 | Hoeffding races: Accelerating model selection search for classification and function approximation
- Maron, Moore
- 1994
Citation Context: ...tely, this measure does not capture interactions between features. Several methods have been developed that identify such interactions including RELIEFF [14], Markov blankets [13], and feature racing [17]. The fourth strategy is to first fit a simple model and then analyze the fitted model to identify the relevant features. For example, Chow and Liu [4] describe an efficient algorithm for fitting ...

98 | Input-output HMMs for sequence processing
- Bengio, Frasconi
- 1996
Citation Context: ...s yt. Sequential interactions are modeled by the st variables. To handle these hidden variables during training, the Expectation-Maximization (EM; [6]) algorithm is applied. Bengio and Frasconi [2] report promising results on various artificial sequential supervised learning and sequence classification problems. Unfortunately, the MEMM and IOHMM models suffer from a problem known as the label b...

24 | Achieving High-Accuracy Text-to-Speech with Machine Learning
- Bakiri, Dietterich
- 1999
Citation Context: ...ter

| Method | ...processing | Level of Aggregation: Word | Level of Aggregation: Letter |
| --- | --- | --- | --- |
| Sliding Window | | 12.5 | 69.6 |
| Recurrent Sliding Window | Left-to-Right | 17.0 | 67.9 |
| Recurrent Sliding Window | Right-to-Left | 24.4 | 74.2 |

Bakiri and Dietterich [1] applied this technique to the English pronunciation problem using a 7-letter window and a decision-tree algorithm. Table 1 summarizes the results they obtained when training on 1000 words and evaluat...

7 | Special issue on dynamic recurrent neural networks
- Giles, Kuhn, et al.
- 1994
Citation Context: ...me (BPTT) in which the network structure is “unrolled” for the length of the input and output sequences xi and yi [22]. Recurrent networks have been applied to a variety of sequence-learning problems [9]. 3.3 Hidden Markov Models and Related Methods The hidden Markov Model (HMM; see Figure 2(a)) is a probabilistic model of the way in which the xi and yi strings are generated; that is, it is a represen...

3 | Overcoming the myopia of inductive learning algorithms with RELIEFF
- Kononenko, Šimec, Robnik-Šikonja
- 1997
Citation Context: ...into the tree-growing process [3, 21]. Unfortunately, this measure does not capture interactions between features. Several methods have been developed that identify such interactions including RELIEFF [14], Markov blankets [13], and feature racing [17]. The fourth strategy is to first fit a simple model and then analyze the fitted model to identify the relevant features. For example, Chow and Liu [...