## Nonparametric time series prediction through adaptive model selection (2000)


### Download Links

- [www-ee.technion.ac.il]
- [www.ee.technion.ac.il]
- [webee.technion.ac.il]
- DBLP

### Other Repositories/Bibliography

Venue: Machine Learning

Citations: 28 (0 self)

### BibTeX

@ARTICLE{Meir00nonparametrictime,
  author  = {Ron Meir},
  title   = {Nonparametric time series prediction through adaptive model selection},
  journal = {Machine Learning},
  volume  = {39},
  year    = {2000},
  pages   = {5--34}
}


### Abstract

We consider the problem of one-step-ahead prediction for time series generated by an underlying stationary stochastic process obeying the condition of absolute regularity, which describes the mixing nature of the process. We make use of recent results from the theory of empirical processes, and adapt the uniform convergence framework of Vapnik and Chervonenkis to the problem of time series prediction, obtaining finite-sample bounds. Furthermore, by allowing both the model complexity and the memory size to be adaptively determined by the data, we derive nonparametric rates of convergence through an extension of the method of structural risk minimization suggested by Vapnik. All our results are derived for general L_p error measures, and apply to both exponentially and algebraically mixing processes.
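The selection procedure the abstract describes can be illustrated with a minimal sketch: fit predictors of increasing memory size d by least squares and pick the d minimizing empirical error plus a complexity penalty. This is only a toy analogue of structural risk minimization, not the paper's method; the penalty form `d / N` and the names `fit_ar` / `select_memory` are illustrative assumptions.

```python
import numpy as np

def fit_ar(x, d):
    """Least-squares AR(d) fit: regress x[t] on the previous d values."""
    X = np.column_stack([x[i:len(x) - d + i] for i in range(d)])
    y = x[d:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, float(np.mean((y - X @ coef) ** 2))

def select_memory(x, d_max, penalty=lambda d, N: d / N):
    """SRM-style selection: minimize empirical error plus an
    illustrative complexity penalty over memory sizes 1..d_max."""
    N = len(x)
    scores = {d: fit_ar(x, d)[1] + penalty(d, N) for d in range(1, d_max + 1)}
    return min(scores, key=scores.get)

# Simulate a stationary AR(2) series; the data-driven choice should
# favor a small memory rather than the largest one allowed.
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] - 0.5 * x[t - 2] + rng.normal(scale=0.1)
d_star = select_memory(x, d_max=8)
print(d_star)
```

The penalty plays the role of the complexity term in the SRM bound: without it, the empirical error alone would always select the largest memory.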

### Citations

1491 | Probability inequalities for sums of bounded random variables - Hoeffding - 1963

Citation Context: ...lts quoted in Remark 3. The main technical trick needed to establish this result has to do with the use of the Bernstein-Craig inequality (Craig, 1933), rather than the standard Hoeffding inequality (Hoeffding, 1963) used in the usual derivations of uniform laws of large numbers. Unfortunately, this approach does not seem to work for more general L_p norms, with which we are concerned in this paper. Finally, it ...
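The Hoeffding inequality cited in this context bounds the deviation of the mean of N independent variables taking values in [a, b]: P(|mean − E[mean]| ≥ t) ≤ 2 exp(−2Nt²/(b − a)²). A minimal sketch checking the bound against simulation (the function name is an illustrative assumption):

```python
import math
import numpy as np

def hoeffding_bound(N, t, a=0.0, b=1.0):
    """Two-sided Hoeffding bound: 2 * exp(-2 N t^2 / (b - a)^2)."""
    return 2.0 * math.exp(-2.0 * N * t ** 2 / (b - a) ** 2)

# Simulated means of Uniform[0, 1] samples should violate the
# deviation threshold no more often than the bound allows.
rng = np.random.default_rng(1)
N, trials, t = 200, 20000, 0.1
means = rng.uniform(0.0, 1.0, size=(trials, N)).mean(axis=1)
empirical = float(np.mean(np.abs(means - 0.5) >= t))
assert empirical <= hoeffding_bound(N, t)
```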

994 | A probabilistic theory of pattern recognition - Devroye, Györfi, et al. - 1996

Citation Context: ...margin which is allowed to shrink as N → ∞. For the sake of clarity we do not proceed in this direction. For this function, it is easy to establish the following result (see for example Lemma 8.2 in (Devroye, Györfi, & Lugosi, 1996), the proof of which does not depend on the independence property). Lemma 2.1. Let f̂_d,n,N be a function in F_d,n which minimizes the empirical error. Then L(f̂_d,n,N) − inf L(f) ≤ 2 sup |L(f) ...

945 | On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications - Vapnik, Chervonenkis - 1971

802 | Estimation of Dependencies Based on Empirical Data - Vapnik - 1982

Citation Context: ...by Grenander (1981) and studied further by Geman and Hwang (1982). This type of approach had in fact been introduced in the late 1970s by Vapnik and titled by him Structural Risk Minimization (SRM) (Vapnik, 1982). The basic idea behind this approach, applied so far in the context of independent data, is the construction of a sequence of models of increasing complexity, where each model within the hierarchy i...

647 | Time Series: Theory and Methods - Brockwell, Davis - 1991

627 | Constructive Approximation - DeVore, Lorentz - 1993

372 | Decision theoretic generalizations of the PAC model for neural net and other learning applications - Haussler - 1992

Citation Context: ...obability is taken with respect to the product measure on Z^N. Note that by using (4) we have written the covering number in Lemma 3.2 in terms of F rather than L_F. Finally, we recall a result from (Haussler, 1992), which allows for extra flexibility and improved rates of convergence under certain conditions. We make use of this result in Sections 5 and 6. Lemma 3.3 (Haussler, 1992, Theorem 2). Let F be a perm...

371 | Universal approximation bounds for superpositions of a sigmoidal function - Barron - 1993

341 | Weak Convergence and Empirical Processes - van der Vaart, Wellner - 1996

235 | Optimal global rates of convergence for nonparametric regression - Stone - 1982

Citation Context: ...imilar results. The major advantage of these approaches is that while being adaptive in the above sense, they can often be shown to achieve the minimax rates of convergence in nonparametric settings (Stone, 1982) under i.i.d. conditions, showing that they are effective estimation schemes in this regime as well. In this work we extend the SRM idea to the case of time series. This extension is not entirely str...

217 | Mixing: Properties and examples - Doukhan - 1994

205 | Minimum complexity density estimation - Barron, Cover - 1991

Citation Context: ...and data compression (Feder & Merhav, 1996). We should also note that a similar approach based on the so-called index of resolvability has been pursued by Barron and co-workers in a series of papers (Barron & Cover, 1991; Barron, 1994), with similar results. The major advantage of these approaches is that while being adaptive in the above sense, they can often be shown to achieve the minimax rates of convergence in n...

176 | Nonparametric Statistics for Stochastic Processes - Bosq - 1996

152 | Abstract Inference - Grenander - 1981

115 | Approximation and estimation bounds for artificial neural networks - Barron - 1994

Citation Context: ...Feder & Merhav, 1996). We should also note that a similar approach based on the so-called index of resolvability has been pursued by Barron and co-workers in a series of papers (Barron & Cover, 1991; Barron, 1994), with similar results. The major advantage of these approaches is that while being adaptive in the above sense, they can often be shown to achieve the minimax rates of convergence in nonparametric s...

106 | A Theory of Learning and Generalization - Vidyasagar - 1997

Citation Context: ...a certain Lipschitz condition is obeyed by the loss functions ℓ_f(x, y). In particular, assume that for all y, x1, x2 and f, |ℓ_f(x1, y) − ℓ_f(x2, y)| ≤ η|f(x1) − f(x2)|. Then it can easily be shown (Vidyasagar, 1996, Sec. 7.1.3) that N(ε, L_F(Z^N), l1,N) ≤ N(ε/η, F(X^N), l1,N), (4) where Z^N = {Z1,...,ZN} = {(X1,Y1),...,(XN,YN)}. Note that the empirical seminorm l1,N on the l.h.s. of (4) is taken with resp...
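The Lipschitz reduction in this context (a cover of the function class F at scale ε/η induces a cover of the loss class L_F at scale ε) can be checked numerically for the absolute loss, where η = 1 by the reverse triangle inequality. A minimal sketch under that assumption:

```python
import numpy as np

# For the absolute loss l_f(x, y) = |f(x) - y| the Lipschitz constant is eta = 1:
# | l_f(x1, y) - l_f(x2, y) | <= |f(x1) - f(x2)|  (reverse triangle inequality),
# so an (eps/eta)-cover of F in the empirical l1 seminorm covers L_F at scale eps.
rng = np.random.default_rng(2)
f1, f2, y = rng.normal(size=(3, 10_000))   # predictions of two functions, plus targets
lhs = np.abs(np.abs(f1 - y) - np.abs(f2 - y))
rhs = np.abs(f1 - f2)
assert np.all(lhs <= rhs + 1e-12)
```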

93 | Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension - Haussler - 1995

Citation Context: ...no such value exists the pseudo-dimension is infinite. The pseudo-dimension becomes useful due to the following result of Haussler and Long (1995), which relates it to the covering number. Lemma 5.2 (Haussler, 1995, Corollary 3). For any set X, any probability measure P on X, any set F of P-measurable functions taking values in the interval [0, B] with pseudo-dimension Pdim(F), and any ε > 0, N(ε, F, l1(P)) ≤ e(Pdim(F) + 1)(2eB/ε)^Pdim(F) ...

73 | Nonparametric maximum likelihood estimation by the method of sieves. The Annals of Statistics 10 - Geman, Hwang - 1982

73 | On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers - Yule - 1927

Citation Context: ...el selection, structural risk minimization, mixing processes 1. Introduction The problem of time series modeling and prediction has a long history, dating back to the pioneering work of Yule in 1927 (Yule, 1927). Most of the work since then until the 1970s has been concerned with parametric approaches to the problem whereby a simple, usually linear, model is fitted to the data (for a review of this approach...

68 | Efficient agnostic learning of neural networks with bounded fan-in - Lee, Bartlett, et al. - 1996

Citation Context: ...Remark 2. We have made Assumption 5.1 for convenience. It is known that there are situations where the pseudo-dimension is not the optimal quantity for computing upper bounds for the covering number (Lee et al., 1996; Lugosi & Zeger, 1995). However, in all these cases one obtains covering number bounds which behave like O(ε^−D) for some generalized dimension D. If this is the case, replace the pseudo-dimension b...

61 | Fat-shattering and the learnability of real-valued functions - Bartlett, Long, et al. - 1996

60 | On-line algorithms in machine learning - Blum - 1997

Citation Context: ...f mixing parameters. Another related and very fruitful line of recent research has been devoted to the so-called on-line approach to learning, where very few assumptions are made about the data (see (Blum, 1996) for a recent survey). In the most extreme case, no assumptions whatsoever are made, and an attempt is made to compare the performance of various on-line algorithms to that of the best algorithm with...

53 | Markov Processes: Structure and Asymptotic Behavior - Rosenblatt - 1971

48 | Risk bounds for model selection via penalization - Barron, Birgé, et al. - 1999

47 | Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks - Karpinski, Macintyre - 1997

Citation Context: ...erm of the form K_d,n ε^−Pdim(F_d,n). Many examples of classes with finite pseudo-dimension are known. Two recently studied examples are neural networks with the standard sigmoidal activation function (Karpinski & Macintyre, 1997) or with piecewise polynomial activation functions (Goldberg & Jerrum, 1995). In the latter case, rather tight bounds on the pseudo-dimension have recently been derived in (Bartlett et al., 1998). Re...

43 | Nonparametric Curve Estimation from Time Series - Györfi, Härdle, et al. - 1989

Citation Context: ...is mixing, unless it is Gaussian, Markov etc. Thus, Assumption 4.1, stringent as it is, cannot be avoided at this point. This type of assumption is used both in the work on nonparametric prediction (Györfi et al., 1989) and in the results using complexity regularization, as in Modha & Masry (1998). In order to motivate the mixing assumption, we recall two examples where exponential mixing has been established. Firs...

43 | Neural networks for optimal approximation of smooth and analytic functions - Mhaskar - 1996

Citation Context: ...space F_d,n is such that inf_{f∈F_d,n} L(f) ≤ c n^−(k+1)/d for any f in the Sobolev space. This type of result is well known for spline functions, and has recently been demonstrated for neural networks (Mhaskar, 1996) and mixture-of-experts architectures (Zeevi et al., 1998). Using the results of Theorem 6.1, and assuming that the optimal memory size d is known, as in the nonparametric setting above, we can comput...

41 | Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability - Yu - 1994

Citation Context: ...of the Bernstein inequality to dependent data (White, 1991; Modha & Masry, 1998). The second approach, to be pursued here, is based on mapping the problem onto one characterized by an i.i.d. process (Yu, 1994), and the utilization of the standard results for the latter case. A comment is in order here concerning notation. Hatted variables will denote empirical estimates, while starred variables denote opt...

37 | Adaptive model selection using empirical complexities. Ann - Lugosi, Nobel - 1999

37 | Nonparametric estimation via empirical risk minimization - Lugosi, Zeger - 1995

Citation Context: ...made Assumption 5.1 for convenience. It is known that there are situations where the pseudo-dimension is not the optimal quantity for computing upper bounds for the covering number (Lee et al., 1996; Lugosi & Zeger, 1995). However, in all these cases one obtains covering number bounds which behave like O(ε^−D) for some generalized dimension D. If this is the case, replace the pseudo-dimension by the dimension D, and...

31 | Concept learning using complexity regularization - Lugosi, Zeger - 1996

Citation Context: ...rate which is similar to the one that would be attained had we known the true model in advance. In fact, exactly this type of adaptivity has been demonstrated recently for the case of classification (Lugosi & Zeger, 1996), regression (Lugosi & Nobel, 1996) and data compression (Feder & Merhav, 1996). We should also note that a similar approach based on the so-called index of resolvability has been pursued by Barron a...

29 | Central Limit Theorems for Empirical and U-Processes of Stationary Mixing Sequences - Arcones, Yu - 1994

26 | Universal prediction of stationary random processes
- Modha, Masry
- 1996
(Show Context)
Citation Context ...e complexity regularization, is described in this work. In particular, the optimal memory size that should be used in order to form a predictor is in principle derivable from the procedure (see also (=-=Modha & Masry, 1998-=-)), given information about the mixing nature of the time series (see Section 4 for a definition of mixing). It is thus hoped that many of the successful Machine Learning approaches to modeling static... |

16 | Hierarchical universal coding - Feder, Merhav - 1996

Citation Context: ...odel in advance. In fact, exactly this type of adaptivity has been demonstrated recently for the case of classification (Lugosi & Zeger, 1996), regression (Lugosi & Nobel, 1996) and data compression (Feder & Merhav, 1996). We should also note that a similar approach based on the so-called index of resolvability has been pursued by Barron and co-workers in a series of papers (Barron & Cover, 1991; Barron, 1994), with ...

13 | Learning dynamical systems in a stationary environment - Campi, Kumar - 1998

13 | On the Tchebycheff inequality of Bernstein. The - Craig - 1933

13 | Some results on sieve estimation with dependent observations - White, Wooldridge - 1991

Citation Context: ...s in the i.i.d. case (Pollard, 1984) will not work here. To circumvent this problem, two approaches have been proposed. The first makes use of extensions of the Bernstein inequality to dependent data (White, 1991; Modha & Masry, 1998). The second approach, to be pursued here, is based on mapping the problem onto one characterized by an i.i.d. process (Yu, 1994), and the utilization of the standard results for...

12 | A new approach to least-squares estimation, with applications - Geer - 1987

8 | Mixing properties of ARMA processes - Mokkadem - 1988

8 | Convergence of Stochastic Processes - Pollard - 1984

Citation Context: ...tic process X̄, in order that a uniform law of large numbers may be established. In any event, it is obvious that the standard approach of using randomization and symmetrization as in the i.i.d. case (Pollard, 1984) will not work here. To circumvent this problem, two approaches have been proposed. The first makes use of extensions of the Bernstein inequality to dependent data (White, 1991; Modha & Masry, 1998)....

6 | Error bounds for functional approximation and estimation using mixtures of experts. Information Theory - Zeevi, Meir, et al. - 1998

Citation Context: ...)/d for any f in the Sobolev space. This type of result is well known for spline functions, and has recently been demonstrated for neural networks (Mhaskar, 1996) and mixture-of-experts architectures (Zeevi et al., 1998). Using the results of Theorem 6.1, and assuming that the optimal memory size d is known, as in the nonparametric setting above, we can compute the value for the complexity index n which yields faste...

5 | Bounding the VC dimension of concept classes parameterized by real numbers - Goldberg, Jerrum - 1995

Citation Context: ...dimension are known. Two recently studied examples are neural networks with the standard sigmoidal activation function (Karpinski & Macintyre, 1997) or with piecewise polynomial activation functions (Goldberg & Jerrum, 1995). In the latter case, rather tight bounds on the pseudo-dimension have recently been derived in (Bartlett et al., 1998). Remark 2. We have made Assumption 5.1 for convenience. It is known that there ...

2 | Almost linear VC dimension bounds for piecewise polynomial networks - Bartlett, Maiorov, et al. - 1998

Citation Context: ...d. In the derivation we have assumed that γ_d,n = Pdim(F_d,n) ∝ n^q for some positive value of q, which is a typical situation (see examples in Vidyasagar (1996)). For example, we have recently shown (Bartlett, Maiorov, & Meir, 1998) that q = 1 + ε (ε > 0 arbitrarily small) for feedforward neural networks composed of piecewise polynomial activation functions, while Karpinski and Macintyre (1997) have established q = 4 for networks ...