Results 1 - 2 of 2
Spectral bandits for smooth graph functions - in Proc. Intern. Conf. Mach. Learning (ICML), 2014
"... Abstract Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as contentbased re ..."
Abstract - Cited by 8 (5 self)
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node, and its expected rating is similar to its neighbors'. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
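The setup in this abstract (arms are graph nodes, payoffs smooth on the graph, regret controlled by an effective dimension) maps naturally onto a linear bandit run in the graph Laplacian's eigenbasis. Below is a minimal Python sketch of that idea in a SpectralUCB style; the function names, the confidence-width constant `c`, and the `pull(arm)` reward oracle are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def spectral_ucb(adjacency, pull, horizon, lam=0.01, c=2.0):
    """Hedged sketch of a SpectralUCB-style bandit (not the paper's exact code).

    Arms are graph nodes. The unknown payoff vector is assumed smooth on the
    graph, i.e. dominated by low-eigenvalue eigenvectors of the Laplacian, so
    we run linear UCB in the Laplacian eigenbasis with the eigenvalues acting
    as a per-coordinate regularizer. `pull(arm)` is an assumed reward oracle.
    """
    n = adjacency.shape[0]
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, features = np.linalg.eigh(laplacian)  # node i's feature = features[i]
    gram = np.diag(eigvals + lam)                  # spectral regularizer Lambda + lam*I
    b = np.zeros(n)
    for _ in range(horizon):
        gram_inv = np.linalg.inv(gram)
        theta = gram_inv @ b                        # ridge estimate of eigen-coefficients
        means = features @ theta
        # per-node confidence width x_i^T A^{-1} x_i in the eigenbasis
        widths = np.sqrt(np.einsum('ij,jk,ik->i', features, gram_inv, features))
        arm = int(np.argmax(means + c * widths))    # optimistic arm choice
        reward = pull(arm)
        x = features[arm]
        gram += np.outer(x, x)                      # rank-one design update
        b += reward * x
    return features @ np.linalg.solve(gram, b)      # final payoff estimates
```

The point of penalizing by the eigenvalues is that a payoff vector smooth on the graph puts most of its energy on low-frequency eigenvectors, so the effective number of coefficients to learn is far smaller than the number of nodes.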
Learning to Optimize Via Information-Directed Sampling, 2014
"... We propose information-directed sampling – a new algorithm for online optimization prob-lems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-p ..."
Abstract - Cited by 1 (0 self)
We propose information-directed sampling, a new algorithm for online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and the knowledge gradient algorithm. Further, we present simple analytic examples illustrating that, due to the way it measures information gain, information-directed sampling can dramatically outperform upper confidence bound algorithms and Thompson sampling.
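As a concrete illustration of the ratio this abstract describes, here is a hedged Python sketch of one IDS decision for a Bernoulli bandit with per-arm Beta posteriors. The expected regret and the mutual information between the optimal arm's identity and the next observation are both estimated from posterior samples; the deterministic argmin over arms is a simplification of the paper's optimization over randomized action distributions, and all names are illustrative.

```python
import numpy as np

def _kl_bernoulli(p, q, eps=1e-12):
    # KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def ids_choose_arm(alpha, beta, n_samples=10_000, rng=None):
    """Hedged sketch of one information-directed sampling decision.

    `alpha`, `beta` are arrays of Beta posterior parameters, one pair per
    Bernoulli arm. The arm minimizing (expected regret)^2 / information gain
    is returned; picking a single arm deterministically is a simplification
    of the randomized policy analyzed in the paper.
    """
    rng = rng or np.random.default_rng()
    k = len(alpha)
    theta = rng.beta(alpha, beta, size=(n_samples, k))   # posterior draws
    best = theta.argmax(axis=1)                          # sampled optimal arm
    p_star = np.bincount(best, minlength=k) / n_samples  # P(a* = j)
    means = theta.mean(axis=0)                           # E[theta_a]
    regret = theta.max(axis=1).mean() - means            # E[theta_{a*}] - E[theta_a]

    # I(a*; Y_a) = sum_j P(a*=j) * KL( P(Y_a | a*=j) || P(Y_a) )
    info = np.zeros(k)
    for a in range(k):
        for j in np.flatnonzero(p_star):
            cond_mean = theta[best == j, a].mean()       # E[theta_a | a* = j]
            info[a] += p_star[j] * _kl_bernoulli(cond_mean, means[a])

    ratio = regret ** 2 / np.maximum(info, 1e-12)        # information ratio
    return int(ratio.argmin())
```

Squaring the regret in the numerator is what lets IDS trade a little immediate regret for a lot of information: an arm with moderate regret but high information gain can achieve a lower ratio than a greedy or purely optimistic choice.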