
## Learning with a Slowly Changing Distribution (1992)

Venue: In Proc. 5th Annu. Workshop on Comput. Learning Theory

Citations: 15 (3 self)

### Citations

1984 | A theory of the learnable - Valiant - 1984
Citation Context: ...class of hypotheses). The aim is for the algorithm to guess the label only if its hypothesis is accurate with high probability (taken over all sequences of random examples, as in Valiant's PAC model [Val84]). Kramer is concerned with the minimum number of labelled examples that a successful algorithm of this type must store. In contrast, the results presented here give bounds on the misclassification pr...

1148 | On the Uniform Convergence of Relative Frequencies of Events to their Probabilities - Vapnik, Chervonenkis - 1971
Citation Context: ...|{{x ∈ S : f(x) = 1} : f ∈ F}| = 2^|S|. The Vapnik-Chervonenkis dimension (VC-dimension) of F is the size of the largest shattered subset of X, VCdim(F) = max{m : ∃S ⊆ X, |S| = m, and F shatters S} (see [VC71]). The learning model described here is similar to the prediction model of learning described in [HLW90]. We have a domain X, a class F of functions that map from X to {0, 1} (the target class), and a...
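The shattering and VC-dimension definitions quoted in this context can be checked directly for small finite classes by brute force. A minimal sketch (the threshold class and five-point domain are illustrative examples, not from the paper):

```python
from itertools import combinations

def shatters(F, S):
    """F shatters S iff the labelings {(f(x) for x in S) : f in F}
    realize all 2^|S| possible 0/1 patterns on S."""
    labelings = {tuple(f(x) for x in S) for f in F}
    return len(labelings) == 2 ** len(S)

def vc_dimension(F, X):
    """VCdim(F) = max{m : some S ⊆ X with |S| = m is shattered by F},
    computed exhaustively over the finite domain X."""
    d = 0
    for m in range(1, len(X) + 1):
        if any(shatters(F, S) for S in combinations(X, m)):
            d = m
    return d

# Threshold functions f_t(x) = 1 iff x >= t on the domain {0,...,4}.
# Any single point is shattered, but no pair {a < b} admits the
# labeling (f(a), f(b)) = (1, 0), so the VC-dimension is 1.
X = list(range(5))
F = [lambda x, t=t: int(x >= t) for t in range(6)]
print(vc_dimension(F, X))  # 1
```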

963 | Estimation of Dependences Based on Empirical Data - Vapnik - 1982

726 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989
Citation Context: ...f(x)) when the meaning is clear from the context. We assume throughout that every set is measurable ([HL91] gives a supporting argument, claiming that in practice the domain we consider is countable; [BEHW89] gives sufficient conditions for the assumption when the domain is R^n). If x = (x_1, x_2, ..., x_t) ∈ X^t and σ is a permutation on {1, 2, ..., t}, define x_σ = (x_σ(1), x_σ(2), ...

655 | On measures of entropy and information - Rényi - 1961
Citation Context: ...P, Q) is also known as the information of order 1 of P with respect to Q. It can be interpreted as the amount of information obtained from observing an event E for which P(·) = Q(·|E) (see [Ren61]). The following proposition shows that a bound on d_KL is a stronger requirement than a bound on d. Proposition 7: The Kullback-Leibler divergence d_KL is related to d by d² ≤ d_KL/2. Moreover, there are...

93 | Equivalence of models for polynomial learnability - Haussler, Kearns, et al. - 1991
Citation Context: ...tly PAC-learnable (that is, learnable in polynomial time), then there is an efficient randomized consistent hypothesis finder (and hence an efficient randomized consistent prediction strategy) for F ([HKLW88], Theorem 4.1). We use a bound on the probability that a consistent deterministic strategy makes a mistake on the last example. Lemma 16: If H is a set of functions from X to {0, 1}, with VCdim(H) = d...

86 | A lower bound for discrimination information in terms of variation - Kullback - 1967
Citation Context: ...on d. Proposition 7: The Kullback-Leibler divergence d_KL is related to d by d² ≤ d_KL/2. Moreover, there are distributions P and Q for which d(P, Q) ≤ γ but d_KL(P, Q) = ∞, for 0 < γ ≤ 1. Proof: Kullback [Kul67] shows that d_KL ≥ d_V²/2 + d_V⁴/12. Proposition 5 gives the desired inequality. To see that d does not provide an upper bound on d_KL, consider the distributions P and Q and the set {x_1, x_2} ⊆ X...
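Both directions of the Proposition 7 quoted in this context can be illustrated numerically for finite distributions: the Pinsker-style bound d² ≤ d_KL/2 holds, while no reverse bound is possible, since d can be small while d_KL is infinite. A sketch (the example distributions are hypothetical, not from the paper):

```python
import math

def total_variation(P, Q):
    """d(P, Q) = sup_A |P(A) - Q(A)| = (1/2) sum_x |P(x) - Q(x)|
    for distributions given as dicts over a common finite support."""
    return 0.5 * sum(abs(P[x] - Q[x]) for x in P)

def kl_divergence(P, Q):
    """d_KL(P, Q) = sum_x P(x) log(P(x)/Q(x)); infinite whenever
    Q(x) = 0 at a point with P(x) > 0."""
    d = 0.0
    for x, p in P.items():
        if p == 0:
            continue
        if Q[x] == 0:
            return math.inf
        d += p * math.log(p / Q[x])
    return d

# Pinsker-style direction: d(P, Q)^2 <= d_KL(P, Q) / 2.
P = {0: 0.6, 1: 0.4}
Q = {0: 0.5, 1: 0.5}
assert total_variation(P, Q) ** 2 <= kl_divergence(P, Q) / 2

# No reverse bound: d(P2, Q2) = gamma, yet d_KL(P2, Q2) is infinite
# because Q2 puts zero mass on a point where P2 has mass gamma.
gamma = 0.1
P2 = {0: 1 - gamma, 1: gamma}
Q2 = {0: 1.0, 1: 0.0}
print(total_variation(P2, Q2), kl_divergence(P2, Q2))  # d ≈ 0.1, d_KL = inf
```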

49 | Predicting {0,1}-functions on randomly drawn points - Haussler, Littlestone, et al. - 1990
Citation Context: ...characteristics (and therefore error rates) change, and as parts of the network fail. We consider two models of learning. The first is similar to Haussler, Littlestone and Warmuth's prediction model [HLW90]: the aim of learning is to minimize the probability over all sequences of examples of misclassifying the last example. The second is a more general model that allows noise and errors in the examples...

37 | A result of Vapnik with applications - Anthony, Shawe-Taylor - 1993
Citation Context: ...immediately, using D(A) = E_D(1_A) for any distribution D, where 1_A is the indicator function for A (1_A(x) is 1 when x ∈ A and 0 otherwise). The following lemma is due to Anthony and Shawe-Taylor ([AST90], Proposition 3.2). It improves on a similar result presented by Blumer et al. ([BEHW89], Theorem A3.1). Lemma 22: Define B_D, H, t, β, ε, d, δ as in Theorem 20. For any distribution D on S, D^t(...

17 | Tracking Drifting Concepts Using Random Examples - Helmbold, Long - 1991
Citation Context: ...t, the results presented here give bounds on the misclassification probability for an optimal algorithm as a function of the number of examples and the amount of distribution drift. Helmbold and Long [HL91] consider learning a slowly changing subset of the domain, when the distribution of examples is constant. This problem, and the problem of learning a fixed subset with a changing distribution, are two...

1 | Learnability and formal concept analysis - Anthony, Biggs, et al. - 1990
Citation Context: ...e proof of Theorem 4.1 in [HLW90]. It is based on Theorem A2.1 in [BEHW89] and Sauer's Lemma ([BEHW89], Proposition A2.1). Instead, we could use the corresponding exponential bound in Theorem 3.12 of [ABST90], since it has better constants. However this would complicate the statement and proof of the following theorem. Theorem 17: For any hypothesis class H with VCdim(H) = d and 1 ≤ d < ∞, any consistent pre...

1 | Learning despite distribution drift - Kramer - 1988