### Citations

3667 | Adaptation in Natural and Artificial Systems
- Holland
- 1975
Citation Context ...applications including advertisement [21, 52], economics [29, 85], games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large state space Markovian Decision Problems [79]. The name “bandit” comes from imagining a gambler in a casino playing with K slot machines, where...

2781 | Information theory and an extension of the maximum likelihood principle
- Akaike
- 1973
Citation Context ...guarantees presented here are also stronger than the ones associated with L0-regularization (penalization proportional to the number of nonzero coefficients), whatever criterion (Mallows’ Cp [96], AIC [3] or BIC [116]) is used to tune the penalty constant. Recent advances on theoretical guarantees of L0-regularization can be found in the works of Bunea, Tsybakov and Wegkamp [36] and of Birgé and Massa...

2166 | Probability inequalities for sums of bounded random variables
- Hoeffding
- 1963
Citation Context ... (E.1) PROOF. Let Λ(λ) = log E e^{λ(U−EU)} be the log-Laplace transform of the random variable U − EU. Let S_t = ∑_{i=1}^t (U_i − EU_i) with the convention S_0 = 0. From Inequality (2.17) of [67], we have P( max_{1≤t≤n} S_t ≥ s ) ≤ inf_{λ>0} e^{−λs + nΛ(λ)}. Let V = Var U. Hoeffding’s inequality and Bennett’s inequality imply Λ(λ) ≤ min( λ²/8, (e^λ − 1 − λ)V ), which by standard computatio...
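The two bounds quoted in this context combine in a standard Chernoff-type step; as a sketch in the snippet's notation (assuming the sub-Gaussian branch Λ(λ) ≤ λ²/8, which holds for variables in an interval of length 1):

```latex
% Optimizing the quoted maximal inequality over \lambda, with
% \Lambda(\lambda) \le \lambda^2/8, at \lambda^* = 4s/n:
\mathbb{P}\Big( \max_{1 \le t \le n} S_t \ge s \Big)
  \le \inf_{\lambda > 0} \exp\Big( -\lambda s + \tfrac{n\lambda^2}{8} \Big)
  = \exp\Big( -\tfrac{2 s^2}{n} \Big).
```

This recovers a Hoeffding-type maximal inequality; the Bennett branch (e^λ − 1 − λ)V gives the variance-sensitive version instead.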

1309 | A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics - Devroye, Györfi, et al. - 1996

789 | Finite-time analysis of the multiarmed bandit problem
- Auer, Cesa-Bianchi, et al.
Citation Context ...thms in a more general setting that also have logarithmic expected regret (at the price of a higher numerical constant in the upper bound on the regret). More recently, Auer, Cesa-Bianchi and Fischer [18] have proposed even simpler policies achieving logarithmic regret uniformly over time rather than just for a fixed number n of rounds known in advance by the agent. Besides, unlike previous works, the...
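The UCB-style policy this context refers to can be sketched as follows. This is a minimal illustration, not the paper's exact pseudocode; the reward oracle `pull` and the constant in the exploration bonus are assumptions:

```python
import math
import random

def ucb1(pull, K, n):
    """Sketch of a UCB1-style policy: play each arm once, then always pull
    the arm with the highest empirical mean plus sqrt(2 ln t / pulls) bonus."""
    counts = [0] * K          # times each arm was pulled
    sums = [0.0] * K          # cumulative reward per arm
    for t in range(n):
        if t < K:
            arm = t           # initialization round: pull each arm once
        else:
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t + 1) / counts[i]),
            )
        r = pull(arm)         # reward in [0, 1] from the (hypothetical) environment
        counts[arm] += 1
        sums[arm] += r
    return counts

# Usage: two Bernoulli arms with means 0.2 and 0.8; the better arm should
# dominate the pull counts at this horizon.
random.seed(0)
means = [0.2, 0.8]
counts = ucb1(lambda i: 1.0 if random.random() < means[i] else 0.0, K=2, n=2000)
```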

764 | Learning the kernel matrix with semi-definite programming
- Lanckriet, Cristianini, et al.
- 2004
Citation Context ...dditive models, in which linear combinations of a fixed number of functions are replaced by functional spaces [104], such as reproducing kernel Hilbert spaces in the cases of multiple kernel learning [86, 23, 111, 108, 22, 81]. Finally, the most important limitation, which is often encountered when using classical model selection approach, is its computational intractability. So this leaves open the following fundamental p...

500 | Asymptotically efficient adaptive allocation rules
- Lai, Robbins
- 1985
Citation Context ...ehaves. 4.2.3. INTRODUCTION TO UPPER CONFIDENCE BOUNDS POLICIES. Early papers have studied stochastic bandit problems under Bayesian assumptions (e.g., Gittins [61]). On the contrary, Lai and Robbins [84] have considered a parametric minimax framework. They have introduced an algorithm that follows what is now called the “optimism in the face of uncertainty” principle. At time t ≡ kt (mod K) with kt ∈...

478 | The nonstochastic multiarmed bandit problem
- Auer, Cesa-Bianchi, et al.
Citation Context ...an arm are independent and identically distributed random variables that are also independent from the rewards obtained from the other arms. Since the work of Auer, Cesa-Bianchi, Freund and Schapire [19], it was also studied in an adversarial setting. To set the notation, let K ≥ 2 be the number of actions (or arms) and n ≥ K be the time horizon. A K-armed bandit problem is a game between an agent an...

441 | Multiple kernel learning, conic duality, and the SMO algorithm
- Bach, Lanckriet, et al.
- 2004
Citation Context ...dditive models, in which linear combinations of a fixed number of functions are replaced by functional spaces [104], such as reproducing kernel Hilbert spaces in the cases of multiple kernel learning [86, 23, 111, 108, 22, 81]. Finally, the most important limitation, which is often encountered when using classical model selection approach, is its computational intractability. So this leaves open the following fundamental p...

424 | Bandit based Monte-Carlo planning
- Kocsis, Szepesvári
- 2006
Citation Context ..., 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large state space Markovian Decision Problems [79]. The name “bandit” comes from imagining a gambler in a casino playing with K slot machines, where at each round, the gambler pulls the arm of any of the machines and gets a payoff as a result. The se...

399 | Some comments on Cp
- Mallows
- 1973
Citation Context ...tions. The guarantees presented here are also stronger than the ones associated with L0-regularization (penalization proportional to the number of nonzero coefficients), whatever criterion (Mallows’ Cp [96], AIC [3] or BIC [116]) is used to tune the penalty constant. Recent advances on theoretical guarantees of L0-regularization can be found in the works of Bunea, Tsybakov and Wegkamp [36] and of Birgé...

369 | How to use expert advice
- Cesa-Bianchi, Freund, et al.
- 1997
Citation Context ...5, 131] and for general loss in [72]. Here my first contribution was to take ideas coming from the field of sequential prediction of nonrandom sequences (see e.g. [107, 46] for a general overview and [65, 44, 45, 134] for more specific results with sharp constants) and propose a slight generalization of progressive mixture rules, that I called progressive indirect mixture rules. The progressive indirect mixture ru...

318 | Exponentiated gradient versus gradient descent for linear predictors
- Kivinen, Warmuth
- 1997
Citation Context ...lts is to apply the progressive mixture rule on an appropriate grid of the simplex [123]. Another solution is to use the exponentiated gradient algorithm introduced and studied by Kivinen and Warmuth [74] in the context of sequential prediction for the quadratic loss, and then extended to general loss functions by Cesa-Bianchi [43]. Lemma 7 has to be invoked to convert these algorithms and the bounds...

316 | A Distribution-Free Theory of Nonparametric Regression - Györfi, Kohler, et al. - 2002

272 | Consistency of the group lasso and multiple kernel learning
- Bach
- 2008
Citation Context ...dditive models, in which linear combinations of a fixed number of functions are replaced by functional spaces [104], such as reproducing kernel Hilbert spaces in the cases of multiple kernel learning [86, 23, 111, 108, 22, 81]. Finally, the most important limitation, which is often encountered when using classical model selection approach, is its computational intractability. So this leaves open the following fundamental p...

243 | Lasso-type recovery of sparse representations for high-dimensional data
- MEINSHAUSEN, YU
- 2009
Citation Context ...near combination of only s ≪ d variables among {g1, . . . , gd}, the typical result is to prove that the expected excess risk of the Lasso estimator for λ of order √((log d)/n) is of order (s log d)/n [36, 124, 105, 93]. Since this quantity is much smaller than d/n, this makes a huge improvement (provided that the sparsity assumption is true). This kind of results usually requires strong conditions on the eigenvalue...

186 | Structured variable selection with sparsity-inducing norms, Arxiv preprint arXiv:0904.3523
- Jenatton, Audibert, et al.
- 2009
Citation Context ...ronger type of results would require strong assumptions on the input vector distribution, that are often not met in practice. In the fixed design setting, for overlapping groups, Jenatton, Bach and I [70] have proved a high dimensional variable consistency result extending the corresponding result for the Lasso [138, 128]. Second, the approach does not extend easily to the case of generalized additive...

179 | Universal prediction
- Merhav, Feder
- 1998
Citation Context ...squares and entropy losses [38, 39, 25, 131] and for general loss in [72]. Here my first contribution was to take ideas coming from the field of sequential prediction of nonrandom sequences (see e.g. [107, 46] for a general overview and [65, 44, 45, 134] for more specific results with sharp constants) and propose a slight generalization of progressive mixture rules, that I called progressive indirect mixtu...

172 | Using confidence bounds for exploitation-exploration trade-offs
- Auer
Citation Context ...P3 is the one proposed in Section 6.8 of [46] (and the one presented in Section 4.3.2). It (slightly) overestimates the rewards since we have E_{It∼pt} g̃_{i,t} = g_{i,t} + β/p_{i,t}. This idea was introduced in [17] for tracking the best expert. In [12], we have introduced the tightly biased version of EXP3 to achieve regret bounds depending on the performance of the optimal arm. Contrarily to the reward-magnify...
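The biased estimate E g̃_{i,t} = g_{i,t} + β/p_{i,t} mentioned in this context can be illustrated with a small sketch (the function name and argument layout are assumptions, not the paper's code):

```python
def biased_estimate(K, probs, chosen, reward, beta):
    """Sketch of a biased importance-weighted reward estimate in the spirit
    of the EXP3 variant discussed above (an assumption, not the exact paper
    code).  Only the chosen arm's reward is observed; every arm receives an
    extra beta / p_i term, so over the random arm choice I_t ~ probs the
    estimate satisfies E[g_tilde_i] = g_i + beta / p_i."""
    g_tilde = [beta / probs[i] for i in range(K)]    # bias term, all arms
    g_tilde[chosen] += reward / probs[chosen]        # importance weighting
    return g_tilde

# Usage: two arms played uniformly; chosen arm 0 paid reward 1.
g = biased_estimate(2, [0.5, 0.5], chosen=0, reward=1.0, beta=0.1)
```

Averaging the estimate over the draw of the chosen arm recovers exactly the g_i + β/p_i overestimate quoted in the context.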

167 | Central limit theorems for empirical measures
- Dudley
- 1978
Citation Context ...way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. The interest in generic chaining rather than just Dudley’s chaining [55] comes from the fact that it better captures the behaviour of the supremum of a Gaussian process [120]. In statistical learning theory, the process of interest, which is asymptotically Gaussian, is g ↦→ R(...

164 | Convexity, classification, and risk bounds
- Bartlett, Jordan, et al.
- 2006
Citation Context ...ty of this plug-in estimator is directly linked to the quality of the least squares regression estimator (see [53, Section 6.2], [16] and specifically the comparison lemmas of its section 5, and also [95, 27, 28] for consistency results in classification using other surrogate loss functions). Boosting type classification methods usually aggregate simple functions, but the aggregation is also of interest when...

163 | Local rademacher complexities
- Bartlett, Bousquet, et al.
Citation Context ...ificity of PM is that its proof of optimality is not achieved by the most prominent tool in statistical learning theory: bounds on the supremum of empirical processes (see [125], and refined works as [26, 80, 99, 34] and references within). The idea of the proof, which comes back to Barron [24], is based on a chain rule and appeared to be successful for least squares and entropy losses [38, 39, 25, 131] and for g...

151 | Learning the kernel function via regularization
- Micchelli, Pontil

148 | Information-theoretic determination of minimax rates of convergence
- Yang, Barron
- 1999
Citation Context ...d works as [26, 80, 99, 34] and references within). The idea of the proof, which comes back to Barron [24], is based on a chain rule and appeared to be successful for least squares and entropy losses [38, 39, 25, 131] and for general loss in [72]. Here my first contribution was to take ideas coming from the field of sequential prediction of nonrandom sequences (see e.g. [107, 46] for a general overview and [65, 44...

146 | Aggregation for Gaussian regression
- Bunea, Tsybakov, et al.
- 2007
Citation Context ...where the infimum is taken over all estimators. The three aggregation tasks have also been studied in least squares regression with fixed design, where similar rates are obtained [36, 50, 51]. This chapter will provide my contributions to the aggregation problems (in the random design setting), summarized as follows. • The expected excess risk ER(ĝ) − R(g∗_MS) of the empirical risk minim...

146 | Smooth discrimination analysis
- Mammen, Tsybakov
- 1999
Citation Context ... To illustrate this last theoretical guarantee, let us consider complexity and margin assumptions similar to the ones used in the pioneering work of Mammen and Tsybakov [97]. To detail these assumptions, let d be the (pseudo-)distance on G(X; Y) defined by d(g1, g2) = P[g1(X) ≠ g2(X)]. Let G ⊂ G(X; Y). For u > 0, the set N ⊂ G(X; Y) is called a u-covering net of...

141 | Minimum contrast estimators on sieves: exponential bounds and rates of convergence
- Birgé, Massart
- 1998
Citation Context ...g(X)g(X)^T]. Both results require at least exponential moments on the conditional distribution of the output Y knowing the input vector g(X). It can be derived from the work of Birgé and Massart [31] an excess risk bound for the empirical risk minimizer of order at worst (d log n)/n, and asymptotically of order d/n. It holds with high probability, for a bounded set Θ and requires bounded input vec...

131 | Hoeffding races: Accelerating model selection search for classification and function approximation
- Maron, Moore
- 1994
Citation Context ...bound in the context of racing algorithms. Racing algorithms aim to reduce the computational burden of performing tasks such as model selection using a hold-out set by discarding poor models quickly [98, 112]. The context of racing algorithms is the one of multi-armed bandit problems. Let ε > 0 be the confidence level parameter. A racing algorithm either terminates when it runs out of time (i.e. at the en...
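The model-discarding scheme described in this context can be sketched as a simple Hoeffding race (illustrative names and loop structure; the [0, 1] loss range and the union bound over models and rounds are assumptions, not the source's exact algorithm):

```python
import math

def hoeffding_race(models, evaluate, n_max, eps):
    """Sketch of a racing loop in the spirit of Hoeffding races: a model is
    discarded once its lower confidence bound on the mean loss exceeds some
    other model's upper confidence bound.  Losses assumed in [0, 1]."""
    alive = set(models)
    sums = {m: 0.0 for m in models}
    for t in range(1, n_max + 1):
        for m in alive:
            sums[m] += evaluate(m, t)     # loss of model m on hold-out point t
        # Hoeffding radius, union bound over all models and all rounds
        radius = math.sqrt(math.log(2 * len(models) * n_max / eps) / (2 * t))
        best_ucb = min(sums[m] / t + radius for m in alive)
        alive = {m for m in alive if sums[m] / t - radius <= best_ucb}
        if len(alive) == 1:
            break                          # a single model survived the race
    return alive

# Usage: two models with constant losses 0.1 and 0.9; the bad one is
# discarded as soon as the confidence radius drops below half the gap.
survivors = hoeffding_race(
    ["a", "b"], lambda m, t: 0.1 if m == "a" else 0.9, n_max=1000, eps=0.05)
```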

126 | Minimal penalties for Gaussian model selection. Probab. Theory Related Fields 138
- Birgé, Massart
- 2007
Citation Context ...BIC [116]) is used to tune the penalty constant. Recent advances on theoretical guarantees of L0-regularization can be found in the works of Bunea, Tsybakov and Wegkamp [36] and of Birgé and Massart [32] for... ⁴ The functions g1, . . . , gd can be called the explanatory variables of the output. Note also that we can consider without loss of generality that the input space is R^d and that the functions...

126 | Local Rademacher complexities and oracle inequalities in risk minimization
- Koltchinskii
- 2006
Citation Context ...ificity of PM is that its proof of optimality is not achieved by the most prominent tool in statistical learning theory: bounds on the supremum of empirical processes (see [125], and refined works as [26, 80, 99, 34] and references within). The idea of the proof, which comes back to Barron [24], is based on a chain rule and appeared to be successful for least squares and entropy losses [38, 39, 25, 131] and for g...

123 | Sample mean based index policies with O(log(n)) regret for the multi-armed bandit problem
- Agrawal
- 1995
Citation Context ...logarithmic rate with the number of trials and that the algorithm achieves the smallest possible regret up to some sub-logarithmic additive term (for the considered family of distributions). Agrawal [2] proposed computationally easier UCB algorithms in a more general setting that also have logarithmic expected regret (at the price of a higher numerical constant in the upper bound on the regret). Mor...

115 | Nearly tight bounds for the continuum-armed bandit problem
- Kleinberg
- 2004
Citation Context ...s the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement [21, 52], economics [29, 85], games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large state space Markovian Decision Problems [79]. The...

114 | Some applications of concentration inequalities to statistics
- Massart
- 2000
Citation Context ...ificity of PM is that its proof of optimality is not achieved by the most prominent tool in statistical learning theory: bounds on the supremum of empirical processes (see [125], and refined works as [26, 80, 99, 34] and references within). The idea of the proof, which comes back to Barron [24], is based on a chain rule and appeared to be successful for least squares and entropy losses [38, 39, 25, 131] and for g...

112 | Learning the kernel with hyperkernels
- Ong, Smola, et al.

98 | On tail probabilities for martingales
- Freedman
- 1975
Citation Context ... This bound can be seen as an improvement of Inequality (5.27) of Blanchard [33]. For t = n ≥ 2, i.e. without the stopping time argument due to Freedman [57] allowing to have the inequality uniformly over time, Maurer and Pontil [101] improve on the constants of the above inequality when the empirical variance is close to 0. Considering the unbiased vari...

94 | PAC-Bayesian Model Averaging
- McAllester
- 1999
Citation Context ...where the second inequality uses Jensen’s inequality and Shannon’s entropy: H(ρ) = − ∑_{g∈G} ρ(g) log ρ(g). This is to be compared to the first PAC-Bayesian bound from the pioneering work of McAllester [102], which states that with probability at least 1 − ε, for any distribution ρ ∈ M, we have E_{g∼ρ}R(g) − E_{g∼ρ}r(g) ≤ √( (K(ρ, π) + log(n) + 2 + log(ε−1)) / (2n − 1) ). The main difference is that the Shannon entro...

86 | Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators
- Lounici
- 2008
Citation Context ...near combination of only s ≪ d variables among {g1, . . . , gd}, the typical result is to prove that the expected excess risk of the Lasso estimator for λ of order √((log d)/n) is of order (s log d)/n [36, 124, 105, 93]. Since this quantity is much smaller than d/n, this makes a huge improvement (provided that the sparsity assumption is true). This kind of results usually requires strong conditions on the eigenvalue...

85 | Sequential prediction of individual sequences under general loss functions
- Haussler, Kivinen, et al.
- 1998
Citation Context ...5, 131] and for general loss in [72]. Here my first contribution was to take ideas coming from the field of sequential prediction of nonrandom sequences (see e.g. [107, 46] for a general overview and [65, 44, 45, 134] for more specific results with sharp constants) and propose a slight generalization of progressive mixture rules, that I called progressive indirect mixture rules. The progressive indirect mixture ru...

85 | Multi-armed bandits in metric spaces
- Kleinberg, Slivkins, et al.
- 2008
Citation Context ...s the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement [21, 52], economics [29, 85], games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large state space Markovian Decision Problems [79]. The...

81 | The continuum-armed bandit problem
- Agrawal
- 1995
Citation Context ...sidering product distributions on G×G, i.e. ρ = ρ1 ⊗ ρ2 with ρ1 and ρ2 distributions on G(X; Y). This standard argument transforms (A) into the following assertion holding for losses taking values in [0, 1]. For any λ > 0 and (prior) distributions π1 and π2 in M, with probability at least 1 − ε, for any ρ1 ∈ M and ρ2 ∈ M, E_{g2∼ρ2}R(g2) − E_{g1∼ρ1}R(g1) ≤ E_{g2∼ρ2}r(g2) − E_{g1∼ρ1}r(g1) + (λ/n) Ψ(λ/n) ( E_{g2∼ρ2}E_{g1∼ρ1}E_Z...

79 | Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems
- Even-Dar, Mannor, et al.
Citation Context ...igure 4.5. Another variant of the best arm identification task is the problem of minimal sampling times required to identify an ϵ-optimal arm with a given confidence level, see in particular [54] and [56]. In [62], Steffen Grünewälder, Manfred Opper, John Shawe-Taylor and I also study a non-cumulative regret notion, but in the context of a continuum of arms. Precisely, we consider the scenario in whic...

78 | Taking advantage of sparsity in multi-task learning
- Lounici, Pontil, et al.
- 2009
Citation Context ...combination defining g^(group). This type of results has not been obtained yet for the group Lasso [135], even when assuming low correlation between the variables, except for the fixed design setting [69, 94]. We have presented in this section an example of theoretical results easily obtainable from the estimators solving problems (MS) and (L). The results are expressed in terms of sub-exponential excess...

75 | High-dimensional additive models
- Meier, Geer, et al.
Citation Context ...Lasso [138, 128]. Second, the approach does not extend easily to the case of generalized additive models, in which linear combinations of a fixed number of functions are replaced by functional spaces [104], such as reproducing kernel Hilbert spaces in the cases of multiple kernel learning [86, 23, 111, 108, 22, 81]. Finally, the most important limitation, which is often encountered when using classical...

74 | On the bayes-risk consistency of regularized boosting methods
- Lugosi, Vayatis
Citation Context ...ty of this plug-in estimator is directly linked to the quality of the least squares regression estimator (see [53, Section 6.2], [16] and specifically the comparison lemmas of its section 5, and also [95, 27, 28] for consistency results in classification using other surrogate loss functions). Boosting type classification methods usually aggregate simple functions, but the aggregation is also of interest when...

69 | Bandit algorithms for tree search
- Coquelin, Munos
- 2007
Citation Context ...s the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement [21, 52], economics [29, 85], games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large state space Markovian Decision Problems [79]. The...

64 | PAC-Bayesian supervised classification: the thermodynamics of statistical learning
- Catoni
- 2007
Citation Context ...ssumption of the loss function and states that for any λ > 0, with probability at least 1 − ε, for any ρ ∈ M, −(n/λ) E_{g∼ρ} log E_Z e^{−(λ/n) ℓ(Y,g(X))} ≤ E_{g∼ρ}r(g) + (K(ρ, π) + log(ε−1))/λ. (Z) Catoni’s book [41] concentrates on the classification task. Instead of using −log E e^{−(λ/n) ℓ(Y,g(X))} ≤ (λ/n) R(g) + (λ²/n²) Ψ(λ/n) R(g), which would give (C1) from (Z), Catoni used the equality −log E e^{−(λ/n) ℓ(Y,g(X))} =...

64 | An optimal algorithm for monte carlo estimation
- Dagum, Karp, et al.
- 1995
Citation Context ...the algorithm is upper bounded by T ≤ C · max( σ²/(δ²µ²), 1/(δ|µ|) ) · ( log(2/ε) + log log(3/(δ|µ|)) ). Up to the log log term, this is optimal according to the work of Dagum, Karp, Luby and Ross [49]. Besides, our experimental simulations show that it significantly outperforms previously known stopping rules, in particular AA [49] and the Nonmonotonic Adaptive Sampling (NAS) algorithm due to Domi...

64 | Exploration exploitation in go: Uct for montecarlo go
- Gelly, Wang
- 2006
Citation Context ...armed bandit problem is the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement [21, 52], economics [29, 85], games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large state space Mar...

63 | Empirical bernstein stopping
- Mnih, Szepesvári, et al.
- 2008
Citation Context ...icular, for any ε > 0, with probability at least 1 − ε, for any t ∈ {1, . . . , n}, we have |Ūt − EU| < √( 2n V̄t log(3ε−1) )/t + 3n log(3ε−1)/t². (4.2.13) Inequality (4.2.13) is the one used in [15, 109], but its tighter version (4.2.12) should be preferred. The proof of this lemma is given in Appendix E. For t = n, the lemma is an empirical version of Bernstein’s inequality, which differs from the l...
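An inequality of the (4.2.13) kind lends itself to the stopping rules this entry discusses. The sketch below assumes [0, 1]-valued samples and the constants as quoted in the context (so treat it as an approximation of the source, not its exact algorithm; the function names are illustrative):

```python
import math
import random

def eb_radius(var, t, n, eps):
    """Width of a uniform-over-time empirical Bernstein confidence interval
    of the quoted form: sqrt(2 n V_t log(3/eps))/t + 3 n log(3/eps)/t**2.
    The exact constants are an assumption; check the source before use."""
    L = math.log(3.0 / eps)
    return math.sqrt(2.0 * n * var * L) / t + 3.0 * n * L / t ** 2

def stop_when_accurate(sample, n, eps, target_width):
    """Illustrative stopping rule: draw samples until the empirical
    Bernstein radius falls below target_width, then return the running mean
    and the number of samples used."""
    s = s2 = 0.0
    for t in range(1, n + 1):
        x = sample()
        s += x
        s2 += x * x
        mean = s / t
        var = max(s2 / t - mean * mean, 0.0)   # empirical variance V_t
        if eb_radius(var, t, n, eps) < target_width:
            return mean, t
    return s / n, n

# Usage: estimate the mean of a Bernoulli(0.5) variable; the variance term
# lets the rule stop well before exhausting the budget n.
random.seed(1)
mean, t = stop_when_accurate(
    lambda: 1.0 if random.random() < 0.5 else 0.0,
    n=10000, eps=0.05, target_width=0.15)
```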

62 | The importance of convexity in learning with squared loss
- Lee, Bartlett, et al.
- 1980
Citation Context ...any of its penalized variants are really poor algorithms in this task since their expected convergence rate cannot be uniformly faster than √((log d)/n). The following lower bound comes from [8] (see [92], [39, p.14], [90, 72, 106] for similar results and variants). THEOREM 5 For any training set size n, there exist d prediction functions g1, . . . , gd taking their values in [−1, 1] such that for any...

62 | Simplified PAC-bayesian margin bounds
- McAllester
- 2003
Citation Context ...ave shown that the approach is indeed useful, and that PAC-Bayesian bounds lead to tight bounds, which are often representative of the risk behaviour even for relatively small training sets (see e.g. [88, 103, 82] for margin-based bounds from Gaussian prior distributions, [83] for an Adaboost setting, that is, majority vote of weak learners, [118] in a clustering setting, [7, Chap.2], [89] for compression scheme...

55 | Adaptive sampling methods for scaling up knowledge discovery algorithms
- Domingo, Gavaldà, et al.
Citation Context ...l simulations show that it significantly outperforms previously known stopping rules, in particular AA [49] and the Nonmonotonic Adaptive Sampling (NAS) algorithm due to Domingo, Gavaldà and Watanabe [130, 54]. Figure 4.3 shows the results of running different stopping rules for the distribution ν of the average of 10 uniform random variables on [µ−1/2, µ+1/2] with varying µ and also on Bernoulli distribut...

55 | PAC-Bayesian Learning of Linear Classifiers
- Lacasse, Laviolette, et al.
- 2009
Citation Context ...tations using Stirling’s approximation. The same procedure can be used to prove the other PAC-Bayesian bounds of Chapter 2, Section 2.2. A similar way of approaching PAC-Bayesian theorems is given in [60]. Appendix D: Proof of the learning rate of the progressive mixture rule. Here is the proof in a concise form, under the boundedness assumptions of Theorem 6, that the expected excess risk of the pr...

54 | Minimizing regret with label efficient prediction - Cesa-Bianchi, Lugosi, et al.

52 | Best Arm Identification in Multi-Armed Bandits - Audibert, Bubeck, et al. - 2010

52 | Fast learning rates for plug-in classifiers
- Audibert, Tsybakov
Citation Context ...leads by thresholding to a classification decision rule, and the quality of this plug-in estimator is directly linked to the quality of the least squares regression estimator (see [53, Section 6.2], [16] and specifically the comparison lemmas of its section 5, and also [95, 27, 28] for consistency results in classification using other surrogate loss functions). Boosting type classification methods us...

52 | Learning by mirror averaging
- Juditsky, Rigollet, Tsybakov
- 2005
Citation Context ...zed variants are really poor algorithms in this task since their expected convergence rate cannot be uniformly faster than √((log d)/n). The following lower bound comes from [8] (see [92], [39, p.14], [90, 72, 106] for similar results and variants). THEOREM 5 For any training set size n, there exist d prediction functions g1, . . . , gd taking their values in [−1, 1] such that for any learning algorithm ĝ prod...

48 | Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization
- Gittins
- 1989
Citation Context ...erstand how the expected pseudo-regret behaves. 4.2.3. INTRODUCTION TO UPPER CONFIDENCE BOUNDS POLICIES. Early papers have studied stochastic bandit problems under Bayesian assumptions (e.g., Gittins [61]). On the contrary, Lai and Robbins [84] have considered a parametric minimax framework. They have introduced an algorithm that follows what is now called the “optimism in the face of uncertainty prin...

46 | Analysis of two gradient-based algorithms for on-line regression
- Cesa-Bianchi
Citation Context ...iated gradient algorithm introduced and studied by Kivinen and Warmuth [74] in the context of sequential prediction for the quadratic loss, and then extended to general loss functions by Cesa-Bianchi [43]. Lemma 7 has to be invoked to convert these algorithms and the bounds to our statistical framework. Juditsky, Nazin, Tsybakov and Vayatis [73] have viewed the resulting algorithm as a stochastic versi...

44 | Online optimization in X-armed bandits
- Bubeck, Munos, et al.
- 2009

42 | Mixture approach to universal model selection. Technical report, Ecole Normale Supérieure
- Catoni
- 1997
Citation Context ...d works as [26, 80, 99, 34] and references within). The idea of the proof, which comes back to Barron [24], is based on a chain rule and appeared to be successful for least squares and entropy losses [38, 39, 25, 131] and for general loss in [72]. Here my first contribution was to take ideas coming from the field of sequential prediction of nonrandom sequences (see e.g. [107, 46] for a general overview and [65, 44...

39 | Regret bounds and minimax policies under partial monitoring
- Audibert, Bubeck
Citation Context ...tein’s bound with estimated variances to have better stopping rules (Section 4.2.8), • provide a policy to identify the best arm at the end of the n time steps (Section 4.2.9). Sébastien Bubeck and I [12] contribute to the adversarial setting by designing a new type of weighted average forecaster characterized by an implicit normalization of the weights, and for which a new type of analysis can be dev...

39 | Functional aggregation for nonparametric estimation
- Juditsky, Nemirovski
- 2000
Citation Context ...each model, or on the contrary, be the same but using different values of a tuning parameter. ...of this scheme. The idea of mixing (or combining or aggregating) the estimators originally appears in [110, 71, 132, 133]. We hereafter treat the initial estimators as fixed functions, which means that the results hold conditionally on the data set on which they have been obtained, this data set being independent of the...

39 | Sparse recovery in large ensembles of kernel machines
- Koltchinskii, Yuan
- 2008

38 | “Universal” aggregation rules with exact bias bounds. Preprint n.510, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7. Available at http://www.proba.jussieu.fr/mathdoc/preprints
- Catoni
- 1999
Citation Context ...s) cannot be uniformly smaller than C√((log d)/n). Since the minimax optimal rate is (log d)/n, this shows that these estimators are inappropriate for the model selection task (Section 3.2.1). • Catoni [39] and Yang [131] have independently shown that the optimal rate (log d)/n in the model selection problem is achieved by the progressive mixture rule. In [9], I provide a variant of this estimator coming...

38 | A PAC-Bayesian approach to adaptive classification
- Catoni
- 2003
(Show Context)
Citation Context ...nε−1 ). In particular, when the empirical risk of the randomized estimator is zero, this last bound is of order 1/n, while (McA) only gives order 1/√n. Still in the classification setting, Catoni =-=[40]-=- proposed a different bound (S’): for any ε > 0 and λ > 0 with (λ/n)Ψ(λ/n) < 1, with probability at least 1 − ε, for any ρ ∈ M, Eg∼ρR(g) ≤ Eg∼ρr(g) / (1 − (λ/n)Ψ(λ/n)) + (K(ρ, π) + log(ε−1)) / (λ[1 − ...

37 | Fast learning rates in statistical inference through aggregation
- Audibert
(Show Context)
Citation Context ...del selection task (Section 3.2.1). • Catoni [39] and Yang [131] have independently shown that the optimal rate log d in the model selection problem is achieved for the progressive mixture n rule. In =-=[9]-=-, I provide a variant of this estimator coming from the field of sequential prediction of nonrandom sequences, and called the progressive indirect mixture rule. It has the benefit of satisfying a tigh... |
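For concreteness, the plain progressive mixture rule mentioned above can be sketched in a few lines: it averages, over t = 0, …, n, the exponentially weighted combinations of the d base predictors computed from the first t examples. This is an illustrative sketch under squared loss; the function names and the temperature λ are mine, and the indirect variant of [9] is not reproduced here.

```python
import math

def progressive_mixture(models, data, lam):
    """Progressive mixture rule (sketch): average over t = 0..n of the
    exponentially weighted combinations of the base predictors, where the
    weights at step t use the cumulative squared loss on the first t examples."""
    cum_loss = [0.0] * len(models)
    weight_history = []
    for t in range(len(data) + 1):
        m = max(-lam * L for L in cum_loss)
        w = [math.exp(-lam * L - m) for L in cum_loss]  # stabilized Gibbs weights
        z = sum(w)
        weight_history.append([wi / z for wi in w])
        if t < len(data):
            x, y = data[t]
            for j, g in enumerate(models):
                cum_loss[j] += (g(x) - y) ** 2
    def g_hat(x):
        preds = [g(x) for g in models]
        return sum(
            sum(wj * pj for wj, pj in zip(wt, preds)) for wt in weight_history
        ) / len(weight_history)
    return g_hat

# Two base predictors; the weights progressively concentrate on the
# one with smaller cumulative loss, so g_hat leans toward it.
g = progressive_mixture([lambda x: x, lambda x: 0.0],
                        [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)], lam=1.0)
assert 0.7 < g(1.0) < 1.0
```

Averaging over the whole weight trajectory, rather than using only the final weights, is what distinguishes the progressive mixture from plain exponential weighting.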

36 | Adaboost is consistent
- Bartlett, Traskin
(Show Context)
Citation Context ...ty of this plug-in estimator is directly linked to the quality of the least squares regression estimator (see [53, Section 6.2], [16] and specifically the comparison lemmas of its section 5, and also =-=[95, 27, 28]-=- for consistency results in classification using other surrogate loss functions). Boosting type classification methods usually aggregate simple functions, but the aggregation is also of interest when ... |

35 | On prediction of individual sequences
- Cesa-Bianchi, Lugosi
- 1999
(Show Context)
Citation Context ...5, 131] and for general loss in [72]. Here my first contribution was to take ideas coming from the field of sequential prediction of nonrandom sequences (see e.g. [107, 46] for a general overview and =-=[65, 44, 45, 134]-=- for more specific results with sharp constants) and propose a slight generalization of progressive mixture rules, that I called progressive indirect mixture rules. The progressive indirect mixture ru... |

34 | The price of truthfulness for pay-perclick auctions
- Devanur, Kakade
- 2009
(Show Context)
Citation Context ...ngly most rewarding arms. The multi-armed bandit problem is the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement =-=[21, 52]-=-, economics [29, 85], games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in par...

33 | Sparse regression learning by aggregation and langevin monte-carlo
- Dalalyan, Tsybakov
- 2009
(Show Context)
Citation Context ...r margin-based bounds from Gaussian prior distributions, [83] for an Adaboost setting, that is majority vote of weak learners, [118] in a clustering setting, [7, Chap.2],[89] for compression schemes, =-=[50, 51]-=- for PAC bounds with sparsity-inducing prior distributions). My contributions to the PAC-Bayesian approach are the use of relative PACBayesian bounds to design estimators with minimax rates (Section 2... |

28 | Characterizing truthful multiarmed bandit mechanisms: extended abstract
- Babaioff, Sharma, et al.
- 2009
(Show Context)
Citation Context ...ngly most rewarding arms. The multi-armed bandit problem is the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement =-=[21, 52]-=-, economics [29, 85], games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in par...

27 |
PAC-Bayesian bounds for randomized empirical risk minimizers
- Alquier
(Show Context)
Citation Context ...first part the prior distribution to be used on the second part of the training set. Catoni [41] uses π_{−n log[1+(e^{β/n}−1)R]} to obtain tighter localized bounds in the classification setting. Alquier =-=[4, 5]-=- uses π_{−βR} for general unbounded losses with application to regression and density estimation. 2.3. COMPARISON OF THE RISK OF TWO RANDOMIZED ESTIMATORS 2.3.1. RELATIVE PAC-BAYESIAN BOUNDS. My PhD (its...

27 | PAC-Bayesian statistical learning theory - Audibert - 2004 |

26 | Progressive mixture rules are deviation suboptimal
- Audibert
(Show Context)
Citation Context ...icker than exponential tails), and show how the noise influences the minimax optimal convergence rate. I also provide refined lower bounds of Assouad’s type with tight constants (Section 3.2.2). • In =-=[8]-=-, I show a limitation of the algorithms known to satisfy (3.1.1): despite having an expected excess risk of order 1/n (if we drop the dependence on d), the excess risk of the progressive (indirect or ...

25 | Some label efficient learning results
- Helmbold, Panizza
- 1997
(Show Context)
Citation Context ...ility distribution pt+1 = (p1,t+1, . . . , pK,t+1) where pi,t+1 = ψ(G̃i,t − Ct). Figure 4.8: The proposed policy for the four prediction games. The label efficient game. This game was introduced by =-=[66]-=-: as explained in Figure 4.7, the forecaster observes the reward vector only if he asks for it, and he is not allowed to ask for it more than m times for some fixed integer 1 ≤ m ≤ n. Following the...
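The normalization step pi,t+1 = ψ(G̃i,t − Ct) above can be implemented by a simple root search: Ct is the constant making the ψ-values sum to one. This is an illustrative sketch only — the exponential ψ, the bracketing choices, and all names are my assumptions, not the paper's exact construction.

```python
import math

def inf_weights(G, psi, iters=200):
    """Implicit normalization: find C with sum_i psi(G_i - C) = 1 by
    bisection (psi positive and increasing, so the sum decreases in C),
    then return p_i = psi(G_i - C).
    Assumes the sum exceeds 1 at C = min(G), which holds for the
    exponential psi below as soon as there are at least two arms."""
    lo, hi = min(G), max(G) + 1.0
    while sum(psi(g - hi) for g in G) > 1.0:  # widen until the sum drops below 1
        hi += 1.0
    for _ in range(iters):
        C = 0.5 * (lo + hi)
        if sum(psi(g - C) for g in G) > 1.0:
            lo = C
        else:
            hi = C
    C = 0.5 * (lo + hi)
    return [psi(g - C) for g in G]

# With an exponential psi this recovers exponentially weighted averages.
p = inf_weights([3.0, 1.0, 2.0], lambda x: math.exp(0.5 * x))
assert abs(sum(p) - 1.0) < 1e-9
assert p[0] > p[2] > p[1]  # larger cumulative gain, larger probability
```

The point of the implicit normalization is that ψ need not be exponential; other increasing functions ψ yield forecasters that plain weight-normalization cannot express.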

24 | When can the two-armed bandit algorithm be trusted? The Annals of Applied Probability
- Lamberton, Pagès, et al.
(Show Context)
Citation Context ...arms. The multi-armed bandit problem is the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement [21, 52], economics =-=[29, 85]-=-, games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large sta...

22 | 2005b) Recursive aggregation of estimators by a mirror descent method with averaging
- Juditsky, Nazin, et al.
(Show Context)
Citation Context ... then extended to general loss functions by Cesa-Bianchi [43]. Lemma 7 has to be invoked to convert these algorithms and their bounds to our statistical framework. Juditsky, Nazin, Tsybakov and Vayatis =-=[73]-=- have viewed the resulting algorithm as a stochastic version of the mirror descent algorithm, and proposed a different choice of the temperature parameter, while still reaching the optimal convergence ...

21 |
Exploration-exploitation trade-off using variance estimates in multi-armed bandits
- Szepesvári
(Show Context)
Citation Context ...it problem was to provide a theoretical justification of these empirical findings, as described in the following section. 4.2.4. UCB POLICY WITH VARIANCE ESTIMATES. Rémi Munos, Csaba Szepesvári and I =-=[15]-=- have proposed the following slight modification of the arm indexes given by (4.2.1): Bi,s,t = X̄i,s + √(2ζV̄i,s log t / s) + 3ζ log t / s, (4.2.2) with ζ > 1. The associated policy achieves a logarithmic...
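Read as code, the index (4.2.2) combines the empirical mean, a variance-driven confidence term, and a second-order log t / s correction. A minimal sketch (function and variable names are mine; ζ > 1 as stated in the context):

```python
import math

def ucbv_index(mean, var, s, t, zeta=1.2):
    """Index B_{i,s,t} of (4.2.2): empirical mean of arm i after s pulls,
    plus a Bernstein-style bonus driven by the empirical variance,
    plus the 3*zeta*log(t)/s term."""
    return (mean
            + math.sqrt(2.0 * zeta * var * math.log(t) / s)
            + 3.0 * zeta * math.log(t) / s)

# A low-variance arm receives a smaller exploration bonus than a
# high-variance arm with the same empirical mean and pull count,
# which is the point of using variance estimates in the index.
low = ucbv_index(mean=0.5, var=0.01, s=10, t=100)
high = ucbv_index(mean=0.5, var=0.25, s=10, t=100)
assert low < high
```

At each round the policy would pull the arm maximizing this index, exactly as in UCB1 but with the variance-adapted bonus.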

18 | Suboptimality of penalized empirical risk minimization in classification
- Lecué
- 2007
(Show Context)
Citation Context ...zed variants are really poor algorithms in this task since their expected convergence rate cannot be uniformly faster than √((log d)/n). The following lower bound comes from [8] (see [92], [39, p.14], =-=[90, 72, 106]-=- for similar results and variants). THEOREM 5 For any training set size n, there exist d prediction functions g1, . . . , gd taking their values in [−1, 1] such that for any learning algorithm ˆg prod...

17 |
Lower bounds for the empirical minimization algorithm
- Mendelson
(Show Context)
Citation Context ...zed variants are really poor algorithms in this task since their expected convergence rate cannot be uniformly faster than √((log d)/n). The following lower bound comes from [8] (see [92], [39, p.14], =-=[90, 72, 106]-=- for similar results and variants). THEOREM 5 For any training set size n, there exist d prediction functions g1, . . . , gd taking their values in [−1, 1] such that for any learning algorithm ˆg prod...

16 | Tighter PACbayes bounds
- Ambroladze, Parrado-Hernández, et al.
- 2006
(Show Context)
Citation Context ... sufficient to prove tight theoretical bounds for this estimator in different contexts: density estimation, classification and least squares regression. Ambroladze, Parrado-Hernández and Shawe-Taylor =-=[6]-=- proposed a different way to reduce the influence of a “flat” prior distribution. Their localization scheme is based on cutting the training set into two parts and learning from the first part the prior ...

15 |
Theory of classification: some recent advances
- Boucheron, Bousquet, et al.
- 2005
(Show Context)
Citation Context |

14 |
Bandit problems with infinitely many arms
- Berry, Chen, et al.
- 1997
(Show Context)
Citation Context ...er of the mean-reward distribution, the probability that a new arm is δ-optimal is of order δ^β for small δ, i.e. P(µk ≥ µ∗ − δ) = Θ(δ^β) for δ → 0. In contrast with the previous many-armed bandits =-=[30, 121]-=-, our setting allows general reward distributions for the arms, under a simple assumption on the mean-reward. When there are more arms than the available number of experiments, the exploration takes tw...

14 | Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740
- Maurer, Pontil
- 2009
(Show Context)
Citation Context ... improvement of Inequality (5.27) of Blanchard [33]. For t = n ≥ 2, i.e. without the stopping time argument due to Freedman [57] allowing the inequality to hold uniformly over time, Maurer and Pontil =-=[101]-=- improve on the constants of the above inequality when the empirical variance is close to 0. Considering the unbiased variance estimator V̄′t = (1/(t−1)) Σ_{i=1}^t (Ui − Ūt)² = (t/(t−1)) V̄t, they obtain that with probability at...
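The relation between the two variance estimators quoted above is a one-line rescaling: multiplying the denominator-t empirical variance by t/(t−1) gives the unbiased, denominator-(t−1) estimator. A quick check (function names are mine):

```python
import statistics

def biased_var(xs):
    """Empirical variance V_t with denominator t (the biased estimator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def unbiased_var(xs):
    """V'_t = t/(t-1) * V_t, the unbiased (denominator t-1) estimator."""
    t = len(xs)
    return t / (t - 1) * biased_var(xs)

xs = [0.2, 0.4, 0.9, 0.5, 0.1]
# agrees with the standard library's sample variance (denominator n - 1)
assert abs(unbiased_var(xs) - statistics.variance(xs)) < 1e-9
```

The distinction matters for small pull counts t, which is exactly the regime where empirical Bernstein bounds are applied per arm.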

13 | Improved rates for the stochastic continuum-armed bandit problem. Learning Theory - Szepesvári - 2007 |

12 |
Transductive and inductive adaptative inference for regression and density estimation
- Alquier
- 2006
(Show Context)
Citation Context ...first part the prior distribution to be used on the second part of the training set. Catoni [41] uses π_{−n log[1+(e^{β/n}−1)R]} to obtain tighter localized bounds in the classification setting. Alquier =-=[4, 5]-=- uses π_{−βR} for general unbounded losses with application to regression and density estimation. 2.3. COMPARISON OF THE RISK OF TWO RANDOMIZED ESTIMATORS 2.3.1. RELATIVE PAC-BAYESIAN BOUNDS. My PhD (its...

9 |
Risk bounds in linear regression through PAC-bayesian truncation
- Audibert, Catoni
- 2009
(Show Context)
Citation Context ...has the minimax optimal rate of task (C), and is adaptive in the sense that it also has the minimax optimal rate of task (MS) when R(g∗MS) = R(g∗C) (Section 3.3). • Finally, Olivier Catoni and I =-=[14]-=- provide minimax results for (L), and consequently also for (C) when d ≤ √n. The strong point of these results is that they do not require knowledge of the input distribution, nor uniformly bound...

9 |
A decision–theoretic generalization of on– line learning and an application to boosting
- Freund, Schapire
- 1997
(Show Context)
Citation Context ...ecome popular and has been intensively studied over the last two decades, partly thanks to the success of boosting algorithms, principally the AdaBoost algorithm, introduced by Freund and Schapire =-=[58]-=-. These algorithms use linear combinations of a large number of simple functions to provide a classification decision rule. In this chapter, we focus on the least squares setting, in which the outputs ...

9 | PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers
- Laviolette, Marchand
(Show Context)
Citation Context ...ets (see e.g. [88, 103, 82] for margin-based bounds from Gaussian prior distributions, [83] for an Adaboost setting, that is majority vote of weak learners, [118] in a clustering setting, [7, Chap.2],=-=[89]-=- for compression schemes, [50, 51] for PAC bounds with sparsity-inducing prior distributions). My contributions to the PAC-Bayesian approach are the use of relative PACBayesian bounds to design estima... |

8 | Combining PAC-bayesian and generic chaining bounds
- Audibert, Bousquet
(Show Context)
Citation Context ...GENERIC CHAINING BOUNDS There exist many different risk bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. In =-=[10]-=-, Olivier Bousquet and I underline the links between these bounds, and combine several different improvements into a single bound. In particular, we combine the PAC-Bayes approach with the optimal uni...

8 |
Bandit problems. 2008
- Bergemann, Valimaki
(Show Context)
Citation Context ...arms. The multi-armed bandit problem is the simplest setting where one encounters the exploration-exploitation dilemma. It has a wide range of applications including advertisement [21, 52], economics =-=[29, 85]-=-, games [59] and optimization [77, 48, 76, 35]. It can be a central building block of larger systems, like in evolutionary programming [68] and reinforcement learning [119], in particular in large sta...

7 | Adaptive routing using expert advice
- György, Ottucsák
(Show Context)
Citation Context ...og K factor and that the high probability bound is valid for the same policy at any confidence level. Label efficient and bandit game (LE bandit). In this game, first considered by György and Ottucsák =-=[64]-=- and combining two previously seen games, the forecaster observes the reward of the arm he selected only if he asks for it, and he is not allowed to request it more than m times for so...

6 |
Lectures on probability theory and statistics. Part II: topics in Non-parametric statistics. Springer-Verlag. Probability summer school, Saint Flour
- Nemirovski
- 1998
(Show Context)
Citation Context ... each model, or on the contrary, be the same but using different values of a tuning parameter. ...of this scheme. The idea of mixing (or combining or aggregating) the estimators originally appears in =-=[110, 71, 132, 133]-=-. We hereafter treat the initial estimators as fixed functions, which means that the results hold conditionally on the data set on which they have been obtained, this data set being independent of the...

5 |
Aggregation via empirical risk minimization. Probability Theory and Related
- Lecué, Mendelson
- 2009
(Show Context)
Citation Context ...mptions, the rate cannot be better than n^{−2/3} for an adequate choice of the functions and the distribution (proof omitted for lack of interest in negative results). Interestingly, Lecué and Mendelson =-=[91]-=- proposed a variant of the empirical star algorithm, which also uses the empirical risk minimizer ˆg (erm) to define a set of functions on which the empirical risk is minimized. Precisely, for a confi...

4 | High confidence estimates of the mean of heavy-tailed real random variables. 2009. Available on Arxiv
- Catoni
(Show Context)
Citation Context ...e under such weak hypotheses, and this shows already in the simplest case of the estimation of the mean of a real-valued random variable by its empirical mean, which is the case when d = 1 and g(X) ≡ 1 =-=[42]-=-. Typically, the proof of Theorem 14 shows that nε is of order 1/ε. To avoid this limitation, we were led to consider more involved algorithms, as described in the following two sections. 3.4.2. ...

4 |
PAC-Bayes & margins. Advances in neural information processing systems
- Langford, Shawe-Taylor
- 2003
(Show Context)
Citation Context ...ave shown that the approach is indeed useful, and that PAC-Bayesian bounds lead to tight bounds, which are often representative of the risk behaviour even for relatively small training sets (see e.g. =-=[88, 103, 82]-=- for margin-based bounds from Gaussian prior distributions, [83] for an Adaboost setting, that is majority vote of weak learners, [118] in a clustering setting, [7, Chap.2],[89] for compression scheme... |

3 |
Méthodes de mélange et d’agrégation d’estimateurs en reconnaissance de formes. Application aux arbres de décision
- Blanchard
- 2001
(Show Context)
Citation Context ...ity at least 1 − ε: for any t ∈ {1, . . . , n}, we have V ≤ (√V̄t + √(n log(3ε−1)/(2t²)))². This bound can be seen as an improvement of Inequality (5.27) of Blanchard =-=[33]-=-. For t = n ≥ 2, i.e. without the stopping time argument due to Freedman [57] allowing the inequality to hold uniformly over time, Maurer and Pontil [101] improve on the constants of the above inequa...

2 |
The benefit of group sparsity. 2009. Available on Arxiv
- Huang, Zhang
(Show Context)
Citation Context ...combination defining g (group). This type of result has not yet been obtained for the group Lasso [135], even when assuming low correlation between the variables, except in the fixed-design setting =-=[69, 94]-=-. We have presented in this section an example of theoretical results easily obtainable from the estimators solving problems (MS) and (L). The results are expressed in terms of sub-exponential excess ...

2 |
PACBayes Bounds for the Risk
- Lacasse, Laviolette, et al.
- 2007
(Show Context)
Citation Context ...s lead to tight bounds, which are often representative of the risk behaviour even for relatively small training sets (see e.g. [88, 103, 82] for margin-based bounds from Gaussian prior distributions, =-=[83]-=- for an Adaboost setting, that is majority vote of weak learners, [118] in a clustering setting, [7, Chap.2],[89] for compression schemes, [50, 51] for PAC bounds with sparsity-inducing prior distribu... |