## Sequential PAC Learning (1995)

Venue: Proceedings of COLT-95

Citations: 14 (5 self)

### BibTeX

@INPROCEEDINGS{Schuurmans95sequentialpac,
  author = {Dale Schuurmans and Russell Greiner},
  title = {Sequential PAC Learning},
  booktitle = {Proceedings of COLT-95},
  year = {1995},
  pages = {377--384}
}

### Abstract

We consider the use of "on-line" stopping rules to reduce the number of training examples needed to pac-learn. Rather than collect a large training sample that can be proved sufficient to eliminate all bad hypotheses a priori, the idea is instead to observe training examples one-at-a-time and decide "on-line" whether to stop and return a hypothesis, or continue training. The primary benefit of this approach is that we can detect when a hypothesizer has actually "converged," and halt training before the standard fixed-sample-size bounds are reached. This paper presents a series of such sequential learning procedures for distribution-free pac-learning, "mistake-bounded to pac" conversion, and distribution-specific pac-learning. We analyze the worst case expected training sample size of these procedures, and show that this is often smaller than existing fixed sample size bounds --- while providing the exact same worst case pac-guarantees. We also provide lower bounds that show these r...
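
The one-example-at-a-time loop described above can be illustrated with a small sketch (hypothetical names throughout; the toy stopping rule below, halting after a run of consecutive correct predictions on fresh examples, is only a stand-in for the paper's sequential procedures):

```python
import random

def sequential_pac_learn(example_stream, fit, consec_needed):
    """Observe examples one at a time; stop once the current hypothesis
    has correctly predicted `consec_needed` consecutive fresh examples.
    (A toy stopping rule, not the paper's sprt-based procedure.)"""
    sample, streak, h = [], 0, None
    for x, y in example_stream:
        if h is not None and h(x) == y:
            streak += 1
            if streak >= consec_needed:
                return h, len(sample)   # converged: halt early
        else:
            streak = 0
        sample.append((x, y))
        h = fit(sample)

def fit_threshold(sample):
    """Consistent hypothesizer for threshold concepts on [0, 1]:
    place the threshold midway between the observed classes."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    lo = max(neg) if neg else 0.0
    hi = min(pos) if pos else 1.0
    t = (lo + hi) / 2
    return lambda x, t=t: 1 if x >= t else 0

# Toy target concept: c(x) = 1 iff x >= 0.6, uniform examples on [0, 1].
random.seed(0)
stream = ((x, 1 if x >= 0.6 else 0) for x in iter(random.random, None))
h, n = sequential_pac_learn(stream, fit_threshold, consec_needed=50)
```

Unlike a fixed-sample-size bound computed a priori, the number of examples `n` consumed here adapts to how quickly the hypothesizer happens to converge on this target and distribution.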

### Citations

1754 | A theory of the learnable - Valiant - 1984

Citation Context: ... 1−δ. Of course, the difficulty of achieving this criterion depends on our prior knowledge of c and P. Here we will consider two distinct models of prior knowledge: the distribution-free model [Val84], where the target concept c is known to belong to some class C, but nothing is known about the domain distribution P; and the distribution-specific model [BI88a, Kul91], where the domain distribution ...

1144 | Instance-based learning algorithms - Aha, Kibler, et al. - 1991

976 | On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications - Vapnik, Chervonenkis - 1971

Citation Context: ... T_finite(C; ε, δ) = (1/ε) ln(|C|/δ) random training examples are sufficient to ensure F pac(ε, δ)-learns C. For infinite concept classes, Blumer et al. [BEHW89] use the results of Vapnik and Chervonenkis [VC71] to show that for any (well behaved¹) concept class C with vc(C) = d, T_BEHW(C; ε, δ) = max{ (8d/ε) log₂(13/ε), (4/ε) log₂(2/δ) } random examples are sufficient for Procedure F to solv...

701 | The weighted majority algorithm - Littlestone, Warmuth - 1994

687 | Learning quickly when irrelevant attributes abound: A new linear threshold algorithm - Littlestone - 1988

Citation Context: ... a concept from a finite class can always be learned while making a finite number of mistakes, in an on-line model where the learner produces a hypothesis after each example and tests it on the next [Lit88]. In later work [Lit89] he showed how a hypothesizer H with a small mistake bound could be converted into a data-efficient pac-learner. Littlestone develops a "two phase" conversion procedure Li that, ...

640 | Learnability and the Vapnik-Chervonenkis dimension - Blumer, Ehrenfeucht, et al. - 1989

Citation Context: ... E.g., for finite concept classes T_finite(C; ε, δ) = (1/ε) ln(|C|/δ) random training examples are sufficient to ensure F pac(ε, δ)-learns C. For infinite concept classes, Blumer et al. [BEHW89] use the results of Vapnik and Chervonenkis [VC71] to show that for any (well behaved¹) concept class C with vc(C) = d, T_BEHW(C; ε, δ) = max{ (8d/ε) log₂(13/ε), (4/ε) log₂(2/δ) } rand...
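
For concreteness, the fixed-sample-size quantities quoted in these contexts can be evaluated numerically (a sketch; the function names are mine, and the formulas are transcribed from the surrounding text):

```python
import math

def t_finite(card_C, eps, delta):
    """(1/eps) * ln(|C|/delta): sufficient sample size for a finite class."""
    return math.ceil((1 / eps) * math.log(card_C / delta))

def t_behw(d, eps, delta):
    """max{(8d/eps) log2(13/eps), (4/eps) log2(2/delta)} [BEHW89], vc(C) = d."""
    return math.ceil(max((8 * d / eps) * math.log2(13 / eps),
                         (4 / eps) * math.log2(2 / delta)))

def t_ehkv(d, eps, delta):
    """Lower bound max{(d-1)/(32 eps), ((1-eps)/eps) ln(1/delta)} [EHKV89]."""
    return math.floor(max((d - 1) / (32 * eps),
                          ((1 - eps) / eps) * math.log(1 / delta)))

# e.g. d = 10, eps = 0.1, delta = 0.05: the sufficient and necessary
# sample sizes differ by a large factor.
gap = t_behw(10, 0.1, 0.05) / t_ehkv(10, 0.1, 0.05)
```

The wide gap between the upper bound T_BEHW and the lower bound t_EHKV is precisely the room that data-efficient on-line stopping rules can exploit.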

532 | Sequential Analysis - Wald - 1947

Citation Context: ... sequential probability ratio test (sprt) [Wal47] to see whether any has sufficiently small error. We show (Theorem 1) that S correctly solves any pac-learning problem (C; ε, δ) for which d = vc(C) < ∞, ε > 0, δ > 0. ... pac-learned (unless standard cryptographic assumptions are false) [KV89]; Schapire [Sch92] has demonstrated a polytime learning procedure for (…formulae; uniform). ... An analysis of S's data ...
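
A minimal version of Wald's sprt, specialized to testing whether a hypothesis's error rate is small, can be sketched as follows (a sketch under the usual Bernoulli model; the names and the particular thresholds are my illustration, not the paper's Procedure S):

```python
import math

def sprt_error_test(mistakes, eps0, eps1, alpha, beta):
    """Wald sequential probability ratio test on a stream of 0/1
    mistake indicators: decide between 'error <= eps0' (accept)
    and 'error >= eps1' (reject), with error probabilities
    alpha (false reject) and beta (false accept)."""
    upper = math.log((1 - beta) / alpha)   # cross -> reject
    lower = math.log(beta / (1 - alpha))   # cross -> accept
    llr = 0.0
    for m in mistakes:
        p1 = eps1 if m else 1 - eps1
        p0 = eps0 if m else 1 - eps0
        llr += math.log(p1 / p0)
        if llr >= upper:
            return "reject"    # error looks large
        if llr <= lower:
            return "accept"    # error looks small
    return "undecided"
```

A run of error-free examples drives the statistic down to the acceptance threshold, while each mistake pushes it sharply toward rejection; the test therefore stops early exactly when the evidence is one-sided, which is the source of the data savings.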

322 | What size net gives valid generalization - Baum, Haussler - 1989

Citation Context: ... than T_BEHW! Moreover, this average was only 3 times larger than the empirical "rule of thumb" that w/ε training examples are needed to achieve ε error, for a concept class defined by w free weights [BH89]. Not only do these results scale up well for harder problems (Figure 5), they are also robust to changes in the target concept, domain distribution, and concept class (with the same VC-dimension) [SG9...

312 | Cryptographic limitations on learning boolean formulae and finite automata - Kearns, Valiant - 1989

Citation Context: ... by keeping a list of hypotheses (produced by some consistent hypothesizer), testing each one "on-line" with a sequential probability ratio test (sprt) [Wal47] to see whether any has sufficiently small error. ... pac-learned (unless standard cryptographic assumptions are false) [KV89]; Schapire [Sch92] has demonstrated a polytime learning procedure for (…formulae; uniform). ... We show (Theorem 1) that S corre...

196 | A general lower bound on the number of examples needed for learning - Ehrenfeucht, Haussler, et al. - 1989

Citation Context: ... concept class C with vc(C) = d, T_BEHW(C; ε, δ) = max{ (8d/ε) log₂(13/ε), (4/ε) log₂(2/δ) } random examples are sufficient for Procedure F to solve (C; ε, δ).² In addition, Ehrenfeucht et al. [EHKV89] have shown that no learning procedure can observe fewer than t_EHKV(C; ε, δ) = max{ (d−1)/(32ε), ((1−ε)/ε) ln(1/δ) } random training examples and still meet the pac(ε, δ)-...

177 | Real Analysis and Probability - Ash - 1972

Citation Context: ... The only catch now is that E[T_sprt] contains a problematic E[ln T_H] term. However, this can be bounded by E[ln T_H] ≤ ln E[T_H], using Jensen's inequality and the fact that ln is concave; see e.g., [Ash72]. The rest follows from algebraic manipulation. Although this is a crude bound, it is interesting to note that it scales the same as T_BEHW and T_STAB. Moreover, this bound actually beats T_BEHW and T ...
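
The Jensen step invoked in this context, written out (a standard fact, stated here for reference rather than taken from the paper):

```latex
% ln is concave, so Jensen's inequality E[f(X)] <= f(E[X]) for concave f
% gives, for any positive random variable T_H with finite mean,
\mathbb{E}[\ln T_H] \;\le\; \ln \mathbb{E}[T_H].
```

This is what allows the problematic E[ln T_H] term in E[T_sprt] to be replaced by the tractable ln E[T_H].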

90 | From on-line to batch learning - Littlestone - 1989

Citation Context: ... Here we show (Proposition 5) that a variant of Procedure S can perform "mistake bounded to pac" conversion while using strictly fewer training examples (on average) than the procedure proposed in [Lit89]. In fact, our procedure uses substantially fewer training examples in empirical tests. Finally, in Section 3 we address the distribution-specific model of pac-learning. Here we introduce a variant of ...

90 | Optimal Stopping Rules - Shiryayev - 1978

61 | The Design and Analysis of Efficient Learning Algorithms - Schapire - 1992

Citation Context: ... of hypotheses (produced by some consistent hypothesizer), testing each one "on-line" with a sequential probability ratio test (sprt) [Wal47] to see whether any has sufficiently small error. ... pac-learned (unless standard cryptographic assumptions are false) [KV89]; Schapire [Sch92] has demonstrated a polytime learning procedure for (…formulae; uniform). ... We show (Theorem 1) that S correctly solves any pa...

40 | Learnability by fixed distributions - Benedek, Itai - 1988

Citation Context: ... (C; ε, δ) for which C has a finite "ε/2-cover" under dP. We show (Theorem 7) that S_cov uses about 5 times fewer training examples (on average) than the fixed-sample-size procedure introduced in [BI88a]. However, a lower bound result (Theorem 8) shows that sequential learning does not increase the range of pac-learnable concept spaces. 1.4 Significance and related work. Overall, these results show ho...

28 | Results on learnability and the Vapnik-Chervonenkis dimension - Linial, Mansour, et al. - 1988

25 | Bounding Sample Size with the Vapnik-Chervonenkis Dimension - Shawe-Taylor, Anthony, et al. - 1989

Citation Context: ... satisfies certain benign measurability restrictions. All concept classes we consider are assumed to be suitably "well behaved" in this manner. ² This result has since been improved by Shawe-Taylor et al. [STAB93] to T_STAB(C; ε, δ) = (1/(ε(1−√ε))) (2d ln(6/ε) + ln(2/δ)). ³ This is a different motivation from using distributional assumptions to reduce the computational complexity of pa...

23 | The perceptron algorithm is fast for nonmalicious distributions - Baum - 1990

13 | Efficient learning of continuous neural networks - Koiran - 1994

12 | Nonuniform learnability - Benedek, Itai - 1988

12 | Investigating the distribution assumptions in the PAC learning model - Bartlett, Williamson - 1991

6 | Apple tasting and nearly one-sided learning - Helmbold, Littlestone, et al. - 1992

5 | Practical PAC learning - Schuurmans, Greiner - 1995

Citation Context: ... and T_STAB for extremely small values of δ (Proposition 3). However, we note that S's true data-efficiency is decoupled from any precise bounds we can prove about its performance, and empirical tests [SG95] show that S actually uses many times fewer training examples in practice. Finally, we prove (Theorem 4) that these results cannot be substantially improved upon, as any learner must always observe an...

4 | Effective Classification Learning - Schuurmans - 1995

Citation Context: ... it can be shown that, given P defined as above, there must be some c′ ∈ C for which P{ d_P(H_T, c′) > ε | T_{c′} < U } ≥ 1/7. (This involves generalizing the proof of [EHKV89, Lemma 2]; see [Sch95] for complete details.) Combining (1)–(4) shows that, for any k > 1, if E[T_c] ≤ (d−1)/(32kε) for all c ∈ C, then there must be some c′ ∈ C for which P{ d_P(H_T, c′) > ε } ≥ (1/7)(1 − 1/k) ...