Results 1–10 of 61
The Knowledge Gradient Algorithm for a General Class of Online Learning Problems
2012
Abstract

Cited by 36 (21 self)
We derive a one-period look-ahead policy for finite- and infinite-horizon online optimal learning problems with Gaussian rewards. Our approach is able to handle the case where our prior beliefs about the rewards are correlated, which is not handled by traditional multi-armed bandit methods. Experiments show that our KG policy performs competitively against the best-known approximation to the optimal policy in the classic bandit problem, and it outperforms many learning policies in the correlated case.
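As a concrete illustration of the idea, here is a minimal sketch of the online knowledge-gradient rule for the independent-beliefs special case only (the paper's main contribution is the correlated-prior version, which this sketch does not cover; function and parameter names are ours):

```python
import math

def kg_choose(mu, sigma2, noise2, horizon):
    """Online knowledge-gradient arm choice for independent Gaussian beliefs:
    argmax_i mu_i + horizon * nu_i, where nu_i is the expected one-period
    value of the information gained by sampling arm i once more."""
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # N(0,1) pdf
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # N(0,1) cdf
    best_idx, best_val = None, None
    for i in range(len(mu)):
        # Predictive standard deviation of the change in mu_i after one sample.
        s = sigma2[i] / math.sqrt(sigma2[i] + noise2)
        # Normalized distance to the best competing arm's mean.
        others = max(mu[j] for j in range(len(mu)) if j != i)
        z = -abs(mu[i] - others) / s
        nu = s * (z * Phi(z) + phi(z))   # expected one-period improvement
        val = mu[i] + horizon * nu       # online KG index
        if best_val is None or val > best_val:
            best_idx, best_val = i, val
    return best_idx
```

With equal means the rule prefers the arm we know least about, and with `horizon=0` it reduces to the greedy choice.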
An asymptotically optimal bandit algorithm for bounded support models
In Proceedings of the Twenty-third Conference on Learning Theory (COLT 2010), 2010
Abstract

Cited by 23 (3 self)
The multi-armed bandit problem is a typical example of the dilemma between exploration and exploitation in reinforcement learning. The problem is expressed as a model of a gambler playing a slot machine with multiple arms. We study the stochastic bandit problem where each arm has a reward distribution supported on a known bounded interval, e.g. [0, 1]. For this model, Auer et al. (2002) proposed practical policies called UCB policies and derived their finite-time regret. However, policies achieving the asymptotic bound given by Burnetas and Katehakis (1996) have remained unknown for the model. We propose the Deterministic Minimum Empirical Divergence (DMED) policy and prove that DMED achieves the asymptotic bound. Furthermore, the index used in DMED for choosing an arm can be computed easily by a convex optimization technique. Although we do not derive a finite-time regret bound, we confirm by simulations that DMED achieves a regret close to the asymptotic bound in finite time.
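For reference, the UCB1 policy of Auer et al. (2002) that DMED is compared against is simple to state: pull each arm once, then always pull the arm maximizing its empirical mean plus a sqrt(2 ln t / n_i) confidence bonus. A minimal sketch (the reward-sampling interface is ours):

```python
import math, random

def ucb1(num_arms, horizon, sample):
    """UCB1: play each arm once, then the arm maximizing
    mean_i + sqrt(2 ln t / n_i). `sample(i)` returns a reward in [0, 1]."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # initialization: one pull of each arm
        else:
            arm = max(range(num_arms),
                      key=lambda i: sums[i] / counts[i]
                                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        counts[arm] += 1
        sums[arm] += sample(arm)
    return counts
```

On two Bernoulli arms with means 0.9 and 0.1, the pull counts concentrate on the better arm.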
Efficient crowdsourcing of unknown experts using multi-armed bandits
In ECAI, 2012
Abstract

Cited by 21 (2 self)
Abstract. We address the expert crowdsourcing problem, in which an employer wishes to assign tasks to a set of available workers with heterogeneous working costs. Critically, as workers produce results of varying quality, the utility of each assigned task is unknown and can vary both between workers and individual tasks. Furthermore, in realistic settings, workers are likely to have limits on the number of tasks they can perform and the employer will have a fixed budget to spend on hiring workers. Given these constraints, the objective of the employer is to assign tasks to workers in order to maximise the overall utility achieved. To achieve this, we introduce a novel multi–armed bandit (MAB) model, the bounded MAB, that naturally captures the problem of expert crowdsourcing. We also propose an algorithm to solve it efficiently, called bounded ε–first, which uses the first εB of its total budget B to derive estimates of the workers’ quality characteristics (exploration), while the remaining (1 − ε) B is used to maximise the total utility based on those estimates (exploitation). We show that using this technique allows us to derive an
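The two phases can be sketched as follows. This is a toy version: it ignores the per-worker task limits that the bounded MAB adds, and the interface names are ours, not the paper's:

```python
def bounded_eps_first(costs, budget, eps, sample):
    """epsilon-first sketch: spend roughly eps * budget exploring workers
    uniformly to estimate mean utility (exploration), then spend the rest
    on the worker with the best estimated utility-per-cost (exploitation)."""
    n = len(costs)
    counts, sums = [0] * n, [0.0] * n
    spent, i = 0.0, 0
    # Exploration: round-robin until the exploration budget eps*B is used up.
    while spent + costs[i % n] <= eps * budget:
        w = i % n
        sums[w] += sample(w)
        counts[w] += 1
        spent += costs[w]
        i += 1
    # Exploitation: repeatedly hire the highest-density worker.
    density = [(sums[w] / counts[w]) / costs[w] if counts[w] else 0.0
               for w in range(n)]
    best = max(range(n), key=lambda w: density[w])
    while spent + costs[best] <= budget:
        sample(best)
        counts[best] += 1
        spent += costs[best]
    return counts
```

With two unit-cost workers of deterministic utilities 1.0 and 0.1, a budget of 100 and eps = 0.1, exploration splits 10 tasks evenly and exploitation gives the remaining 90 to the better worker.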
Crowd Mining
Abstract

Cited by 17 (9 self)
Harnessing a crowd of Web users for data collection has recently become a widespread phenomenon. A key challenge is that human knowledge forms an open world, and it is thus difficult to know what kind of information we should be looking for. Classic databases have addressed this problem with data mining techniques that identify interesting data patterns. These techniques, however, are not suitable for the crowd. This is mainly due to properties of human memory, such as the tendency to remember simple trends and summaries rather than exact details. Following these observations, we develop here for the first time the foundations of crowd mining. We first define the formal settings. Based on these, we design a framework of generic components, used for choosing the best questions to ask the crowd and mining significant patterns from the answers. We suggest general implementations for these components, and test the resulting algorithm’s performance on benchmarks that we designed for this purpose. Our algorithm consistently outperforms alternative baseline algorithms.
Strategic Advice Provision in Repeated Human-Agent Interactions
Abstract

Cited by 15 (13 self)
This paper addresses the problem of automated advice provision in settings that involve repeated interactions between people and computer agents. This problem arises in many real-world applications such as route selection systems and office assistants. To succeed in such settings agents must reason about how their actions in the present influence people’s future actions. This work models such settings as a family of repeated bilateral games of incomplete information called “choice selection processes”, in which players may share certain goals, but are essentially self-interested. The paper describes several possible models of human behavior that were inspired by behavioral economic theories of people’s play in repeated interactions. These models were incorporated into several agent designs to repeatedly generate offers to people playing the game. These agents were evaluated in extensive empirical investigations including hundreds of subjects that interacted with computers in different choice selection processes. The results revealed that an agent that combined a hyperbolic discounting model of human behavior with a social utility function was able to outperform alternative agent designs, including an agent that approximated the optimal strategy using continuous MDPs and an agent using epsilon-greedy strategies to describe people’s behavior. We show that this approach was able to generalize to new people as well as to choice selection processes that were not used for training. Our results demonstrate that combining computational approaches with behavioral economics models of people in repeated interactions facilitates the design of advice provision strategies for a large class of real-world settings.
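The hyperbolic discounting model mentioned above has a one-line core: a reward of value v at delay d is perceived as worth v / (1 + k·d), which, unlike exponential discounting, produces preference reversals as delays shrink. A sketch (the impulsivity parameter k and the example numbers are ours):

```python
def hyperbolic(value, delay, k=1.0):
    """Perceived value of a reward `value` received after `delay` time steps
    under hyperbolic discounting with impulsivity parameter k."""
    return value / (1.0 + k * delay)

# Classic preference reversal: up close, the small immediate reward wins;
# add a common front-end delay and the larger later reward wins.
near_small, near_large = hyperbolic(50, 0), hyperbolic(100, 10)
far_small, far_large = hyperbolic(50, 100), hyperbolic(100, 110)
```

Here `near_small` exceeds `near_large`, while `far_large` exceeds `far_small`: the same pair of options flips preference when both are pushed into the future.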
The Knowledge Gradient Algorithm For Online Subset Selection
Abstract

Cited by 8 (5 self)
Abstract — We derive a one-period look-ahead policy for online subset selection problems, where learning about one subset also gives us information about other subsets. We show that the resulting decision rule is easily computable, and present experimental evidence that the policy is competitive against other online learning policies.
S.: MCTS based on simple regret
In: Proc. Assoc. Adv. Artif. Intell., 2012
Abstract

Cited by 8 (1 self)
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the multi-armed bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final “arm pull” (the actual move selection) that collects a reward, rather than all “arm pulls”. Therefore, it makes more sense to minimize the simple regret, as opposed to the cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, using it to develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. Optimizing the sampling process is itself a meta-reasoning problem, a solution of which can use value of information (VOI) techniques. Although the theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is “aware” of VOI, achieving an algorithm that in empirical evaluation outperforms both UCT and the other proposed algorithms.
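The simple-regret intuition can be shown on a plain bandit, stripped of the tree: sample with an exploration-heavy rule (here a 0.5-greedy rule, used purely as an illustration, not as the paper's exact scheme) and then *recommend* the empirically best arm, since only that final choice collects a reward. Interface names are ours:

```python
import random

def recommend(sample, num_arms, budget, eps=0.5):
    """Simple-regret-oriented sampling sketch: with probability eps explore a
    uniformly random arm, otherwise sample the current empirical best; the
    final recommendation (the one 'arm pull' that matters) is the arm with
    the highest empirical mean."""
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for _ in range(budget):
        if random.random() < eps or 0 in counts:
            arm = random.randrange(num_arms)   # explore
        else:
            arm = max(range(num_arms), key=lambda i: sums[i] / counts[i])
        sums[arm] += sample(arm)
        counts[arm] += 1
    return max(range(num_arms),
               key=lambda i: sums[i] / counts[i] if counts[i] else 0.0)
```

Unlike UCB's cumulative-regret objective, this rule never stops exploring the losing arms, which is exactly what helps identify the best final move.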
Click shaping to optimize multiple objectives
In KDD, 132–140, 2011
Abstract

Cited by 7 (1 self)
Recommending interesting content to engage users is important for web portals (e.g. AOL, MSN, Yahoo!, and many others). Existing approaches typically recommend articles to optimize for a single objective, i.e., the number of clicks. However, a click is only the starting point of a user’s journey, and subsequent downstream utilities such as time-spent and revenue are important. In this paper, we call the problem of recommending links to jointly optimize for clicks and post-click downstream utilities click shaping. We propose a multi-objective programming approach in which multiple objectives are modeled in a constrained optimization framework. Such a formulation can naturally incorporate various application-driven requirements. We study several variants that model different requirements as constraints and discuss some of the subtleties involved. We conduct our experiments on a large dataset from a real system by using a newly proposed unbiased evaluation methodology [17]. Through extensive experiments we quantify the trade-off between different objectives under various constraints. Our experimental results show interesting characteristics of different formulations, and our findings may provide valuable guidance for the design of recommendation engines for web portals.
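A toy instance of such a constrained formulation, for two articles with a grid search standing in for a real LP solver (all numbers, names, and the two-article restriction are illustrative, not from the paper):

```python
def click_shape_two(ctr, util, slack=0.1, grid=100):
    """Choose the serving probability p of article 0 (article 1 gets 1 - p)
    to maximize expected clicks, subject to expected post-click utility
    staying within a `slack` fraction of the best achievable utility."""
    def clicks(p):
        return p * ctr[0] + (1 - p) * ctr[1]
    def utility(p):
        return p * ctr[0] * util[0] + (1 - p) * ctr[1] * util[1]
    ps = [i / grid for i in range(grid + 1)]
    best_u = max(utility(p) for p in ps)           # utility-optimal point
    feasible = [p for p in ps if utility(p) >= (1 - slack) * best_u]
    return max(feasible, key=clicks)               # most clicks within slack
```

With a high-CTR/low-utility article versus a low-CTR/high-utility one, the constraint caps how far the solution can tilt toward pure click maximization.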
Robot Task Switching under Diminishing Returns
Abstract

Cited by 7 (4 self)
Abstract — We investigate the problem of a robot maximizing its long-term average rate of return on work. We present a means to obtain an estimate of the instantaneous rate of return when work is rewarded in discrete atoms, and a method that uses this to recursively maximize the long-term average return when work is available in localized patches, each with locally diminishing returns. We examine a puck-foraging scenario, and test our method in simulation under a variety of conditions. However, the analysis and approach apply to the general case. Should I stay or should I go?
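The stay-or-go rule underlying this kind of rate maximization can be stated in a few lines (this is Charnov's marginal-value intuition in our own naming, not necessarily the paper's exact estimator):

```python
import math

def should_switch(patch_rate, reward_so_far, time_so_far):
    """Leave the current patch once its instantaneous rate of return falls
    below the long-run average rate earned so far (reward / time)."""
    return patch_rate < reward_so_far / time_so_far

def leave_time(r0, lam, avg_rate, dt=0.01, horizon=100.0):
    """With locally diminishing returns rate(t) = r0 * exp(-lam * t), find
    the first time the falling patch rate crosses a given long-run average
    rate (toy numeric search; analytically t* = ln(r0 / avg_rate) / lam)."""
    t = 0.0
    # Express the average as avg_rate reward per 1.0 unit of time.
    while t < horizon and not should_switch(r0 * math.exp(-lam * t),
                                            avg_rate, 1.0):
        t += dt
    return t
```

The richer the environment (higher long-run average), the earlier the robot should abandon a depleting patch.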
Optimal Crowd-Powered Rating and Filtering Algorithms
Abstract

Cited by 7 (3 self)
We focus on crowd-powered filtering, i.e., filtering a large set of items using humans. Filtering is one of the most commonly used building blocks in crowdsourcing applications and systems. While solutions for crowd-powered filtering exist, they make a range of implicit assumptions and restrictions, ultimately rendering them not powerful enough for real-world applications. We describe two approaches to discard these implicit assumptions and restrictions: one that carefully generalizes prior work, leading to an optimal but oftentimes intractable solution, and another that provides a novel way of reasoning about filtering strategies, leading to a sometimes suboptimal but efficiently computable solution (that is asymptotically close to optimal). We demonstrate that our techniques lead to significant reductions in error of up to 30% for fixed cost over prior work in a novel crowdsourcing application: peer evaluation in online courses.
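One way to make the setting concrete (this is our simplified worker model, not the paper's strategy machinery): assume each human vote is independently correct with probability 1 − e, and keep asking until the posterior that the item passes the filter leaves an indifference band.

```python
def pass_posterior(yes, no, error=0.2, prior=0.5):
    """Posterior P(item passes | votes), assuming each yes/no vote is
    independently correct with probability 1 - error."""
    p_pass = prior * (1 - error) ** yes * error ** no
    p_fail = (1 - prior) * error ** yes * (1 - error) ** no
    return p_pass / (p_pass + p_fail)

def decide(votes, threshold=0.9, error=0.2, prior=0.5):
    """Sequential filtering sketch: stop as soon as the posterior leaves
    [1 - threshold, threshold]; otherwise ask for another vote."""
    yes = no = 0
    for v in votes:
        yes, no = (yes + 1, no) if v else (yes, no + 1)
        post = pass_posterior(yes, no, error, prior)
        if post >= threshold:
            return "accept", yes + no
        if post <= 1 - threshold:
            return "reject", yes + no
    return "undecided", yes + no
```

Agreeing votes terminate quickly, while conflicting votes keep the item in play, which is the cost/accuracy trade-off a filtering strategy must optimize.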