Spectral bandits for smooth graph functions, in Proc. Intern. Conf. Mach. Learning (ICML), 2014
Abstract

Cited by 8 (5 self)
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to that of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
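The smoothness assumption in this abstract is commonly quantified by the graph Laplacian quadratic form f^T L f, which is small when a payoff vector varies slowly along edges. The sketch below is only an illustration of that quantity (the path graph and payoff vectors are made up, and this is not the paper's algorithm):

```python
import numpy as np

def laplacian(adjacency):
    """Unnormalized graph Laplacian L = D - W."""
    degrees = adjacency.sum(axis=1)
    return np.diag(degrees) - adjacency

def smoothness(f, adjacency):
    """Laplacian quadratic form f^T L f = sum over edges w_ij (f_i - f_j)^2."""
    return float(f @ laplacian(adjacency) @ f)

# Toy path graph on 4 nodes: 0-1-2-3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

smooth_f = np.array([1.0, 1.1, 1.2, 1.3])    # varies slowly along edges
rough_f  = np.array([1.0, -1.0, 1.0, -1.0])  # alternates sharply

print(smoothness(smooth_f, W))  # small (~0.03)
print(smoothness(rough_f, W))   # large (12.0)
```

A "smooth graph function" in the paper's sense is one for which this quadratic form is small, so neighboring items have similar expected ratings.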
Learning to Optimize via Information-Directed Sampling, 2014
Abstract

Cited by 1 (0 self)
We propose information-directed sampling, a new algorithm for online optimization problems in which a decision-maker must balance exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and the knowledge gradient algorithm. Further, we present simple analytic examples illustrating that, due to the way it measures information gain, information-directed sampling can dramatically outperform upper confidence bound algorithms and Thompson sampling.
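The sampling rule described in this abstract, choosing the action that minimizes squared expected single-period regret divided by information gain, can be sketched for a toy Bernoulli bandit with a discrete posterior. Everything below (the two hypotheses over arm means and the posterior weights) is an assumed toy setup for illustration, not the authors' implementation:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def bernoulli_entropy(q):
    return entropy([q, 1.0 - q])

def ids_action(hypotheses, posterior):
    """Pick the arm minimizing (expected regret)^2 / information gain."""
    hypotheses = np.asarray(hypotheses, dtype=float)  # (n_hypotheses, n_arms)
    posterior = np.asarray(posterior, dtype=float)
    best_arm = hypotheses.argmax(axis=1)              # optimal arm per hypothesis
    n_arms = hypotheses.shape[1]

    # Expected single-period regret of each action under the posterior.
    best_mean = hypotheses.max(axis=1)
    regret = np.array([posterior @ (best_mean - hypotheses[:, a])
                       for a in range(n_arms)])

    # g(a) = I(A*; Y_a): mutual information between the optimal action A*
    # and the Bernoulli observation Y_a from playing arm a.
    info = np.zeros(n_arms)
    for a in range(n_arms):
        info[a] = bernoulli_entropy(posterior @ hypotheses[:, a])
        for astar in np.unique(best_arm):
            mask = best_arm == astar
            p_astar = posterior[mask].sum()
            q_cond = posterior[mask] @ hypotheses[mask, a] / p_astar
            info[a] -= p_astar * bernoulli_entropy(q_cond)

    ratio = regret ** 2 / np.maximum(info, 1e-12)
    return int(ratio.argmin()), ratio

# Two hypotheses about two Bernoulli arms; the posterior favors the first.
theta = [[0.7, 0.3],
         [0.3, 0.7]]
action, ratio = ids_action(theta, posterior=[0.6, 0.4])
print(action, ratio)
```

In this symmetric example both arms carry the same information gain, so the rule reduces to picking the arm with the smaller expected regret; the ratio only favors a higher-regret arm when that arm reveals substantially more about which action is optimal.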