Results 1–10 of 68
A Contextual-Bandit Approach to Personalized News Article Recommendation
Cited by 170 (16 self)
Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web services feature dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are fast in both learning and computation. In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks. The contributions of this work are threefold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data is scarcer.
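The approach described here can be sketched in a few lines: a linear-UCB-style contextual bandit plus the replay-style offline evaluator that keeps only logged events matching the algorithm's choice. This is a minimal sketch assuming uniformly random logged traffic; the class and function names are illustrative, not the paper's code.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression state for a linear-UCB-style bandit."""
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)     # d x d ridge-regularized design matrix
        self.b = np.zeros(d)   # reward-weighted feature sums

    def ucb(self, x):
        """Point estimate plus an upper-confidence bonus for context x."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def replay_evaluate(arms, logged):
    """Offline evaluation on uniformly random logged traffic: an event
    counts only when the algorithm picks the same arm the log served."""
    matched, clicks = 0, 0
    for x, logged_arm, reward in logged:
        chosen = max(range(len(arms)), key=lambda a: arms[a].ucb(x))
        if chosen == logged_arm:
            matched += 1
            clicks += reward
            arms[chosen].update(x, reward)
    return clicks / max(matched, 1)
```

Because the logged actions are uniformly random, the matched subset is an unbiased sample of how the evaluated policy would have performed online.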
Off-policy temporal-difference learning with function approximation
 Proceedings of the 18th International Conference on Machine Learning
, 2001
Cited by 59 (12 self)
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state–action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p. 1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy evaluation problem. Our current results are limited to episodic tasks with episodes of bounded length.

1 Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 that it is unsound with linear function approximation (see Gordon, 1995; Bertsekas and Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors, for which an exact solution exists, yet …

¹ This is a retypeset version of an article published in the Proceedings.
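The core update can be illustrated with a per-decision importance-weighted linear TD(0) step. The paper's algorithm uses TD(λ) with eligibility traces over state–action pairs; this sketch strips it to λ = 0, and the function name and signature are mine.

```python
import numpy as np

def off_policy_td0_update(theta, phi_sa, phi_next_sa, reward,
                          pi_prob, b_prob, gamma=0.9, step=0.1):
    """One importance-weighted linear TD(0) update toward the target
    policy's action-value function, Q(s, a) ~= theta . phi(s, a).

    pi_prob : target-policy probability of the action actually taken
    b_prob  : behavior-policy probability of that action (> 0, e.g. eps-soft)
    """
    rho = pi_prob / b_prob                                    # likelihood ratio
    td_error = reward + gamma * (theta @ phi_next_sa) - theta @ phi_sa
    return theta + step * rho * td_error * phi_sa
```

The ε-soft requirement on the behavior policy guarantees `b_prob > 0` for every action the target policy might take, keeping the ratio well defined.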
Building portable options: Skill transfer in reinforcement learning
 Proceedings of the 20th International Joint Conference on Artificial Intelligence
, 2007
Cited by 55 (12 self)
The options framework provides methods for reinforcement learning agents to build new high-level skills. However, since options are usually learned in the same state space as the problem the agent is solving, they cannot be used in other tasks that are similar but have different state spaces. We introduce the notion of learning options in agent-space, the space generated by a feature set that is present and retains the same semantics across successive problem instances, rather than in problem-space. Agent-space options can be reused in later tasks that share the same agent-space but have different problem-spaces. We present experimental results demonstrating the use of agent-space options in building transferable skills, and show that they perform best when used in conjunction with problem-space options.
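The agent-space idea can be sketched as an option whose policy reads only agent-space features, with a task-specific mapping from each problem-space state. All names here are illustrative, not the paper's code.

```python
class AgentSpaceOption:
    """An option whose policy and termination test read agent-space
    features only, so the same skill transfers across tasks whose
    problem-space state sets differ."""
    def __init__(self, policy, should_stop):
        self.policy = policy            # agent-space features -> action
        self.should_stop = should_stop  # agent-space features -> bool

    def act(self, problem_state, to_agent_space):
        """to_agent_space is the task-specific sensor mapping from a
        problem-space state to agent-space features."""
        return self.policy(to_agent_space(problem_state))
```

The same option can then be invoked in two tasks with different state representations simply by supplying each task's own `to_agent_space` mapping.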
A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation
 Advances in Neural Information Processing Systems 21 (to appear)
Cited by 41 (9 self)
We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

1 Off-policy learning methods

Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., …
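The GTD(0) update can be sketched directly from the description above: an auxiliary vector w tracks the expected TD(0) update E[δφ], and θ follows the gradient of its squared norm. This is a minimal per-step sketch with step sizes and names chosen for illustration.

```python
import numpy as np

def gtd0_step(theta, w, phi, phi_next, reward,
              gamma=0.9, alpha=0.05, beta=0.05):
    """One GTD(0) step with linear features. The auxiliary vector w
    tracks the expected TD(0) update E[delta * phi]; theta moves along
    the gradient of ||E[delta * phi]||^2. Cost is O(n) per step."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi
    w_new = w + beta * (delta * phi - w)
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    return theta_new, w_new
```

Note that every operation is a vector add or inner product, which is what gives the linear (rather than LSTD's quadratic) per-step complexity.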
Importance sampling for reinforcement learning with multiple objectives
, 2001
Learning from scarce experience
 Proceedings of the Nineteenth International Conference on Machine Learning
, 2002
Cited by 30 (0 self)
Searching the space of policies directly for the optimal policy has been one popular method for solving partially observable reinforcement learning problems. Typically, with each change of the target policy, its value is estimated from the results of following that very policy. This requires a large number of interactions with the environment as different policies are considered. We present a family of algorithms based on likelihood ratio estimation that use data gathered when executing one policy (or collection of policies) to estimate the value of a different policy. The algorithms combine estimation and optimization stages. The former utilizes experience to build a non-parametric representation of an optimized function. The latter performs optimization on this estimate. We show positive empirical results and provide a sample complexity bound.
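The estimation stage can be illustrated with a basic trajectory-level likelihood-ratio estimator. The paper builds a reusable non-parametric estimate over many candidate policies; this sketch shows only the single-policy weighting, with names of my choosing.

```python
import math

def is_policy_value(trajectories, target_log_prob, behavior_log_prob):
    """Likelihood-ratio estimate of a target policy's value from
    trajectories gathered under a behavior policy. Each trajectory is a
    list of (observation, action, reward) triples."""
    total = 0.0
    for traj in trajectories:
        log_ratio = sum(target_log_prob(o, a) - behavior_log_prob(o, a)
                        for o, a, _ in traj)
        ret = sum(r for _, _, r in traj)          # undiscounted return
        total += math.exp(log_ratio) * ret        # weight return by the ratio
    return total / len(trajectories)
```

Working in log-probabilities keeps the per-step products numerically stable over long trajectories.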
Learning from logged implicit exploration data
 In Proceedings of the 24th Annual Conference on Neural Information Processing Systems
, 2010
Cited by 23 (8 self)
We provide a sound and consistent foundation for the use of non-random exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, under which the “offline” data was logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
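One simplified reading of this setting: estimate the unknown logging policy empirically and use it as a clipped inverse-propensity weight when evaluating a new policy. This is a sketch under strong assumptions (discrete contexts, enough repeats per context), not the paper's actual estimator; all names are mine.

```python
from collections import Counter

def estimate_propensities(logged):
    """Empirical estimate of the unknown logging policy p(a | x) from
    logged (context, action, reward) triples with discrete contexts."""
    ctx_counts = Counter(x for x, _, _ in logged)
    pair_counts = Counter((x, a) for x, a, _ in logged)
    return {(x, a): c / ctx_counts[x] for (x, a), c in pair_counts.items()}

def offline_value(policy, logged, propensity, tau=0.05):
    """Clipped inverse-propensity estimate of a new policy's value;
    clipping at tau bounds the variance from rarely logged actions."""
    total = 0.0
    for x, a, r in logged:
        if policy(x) == a:
            total += r / max(propensity[(x, a)], tau)
    return total / len(logged)
```

The clipping threshold trades a small bias for bounded variance, which matters exactly when the logging policy rarely tried the actions the new policy prefers.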
Competitive-Cooperative-Concurrent Reinforcement Learning with Importance Sampling
 In Proc. of International Conference on Simulation of Adaptive Behavior: From Animals to Animats
, 2004
Cited by 18 (3 self)
The speed and performance of learning depend on the complexity of the learner. A simple learner with few parameters and no internal states can quickly obtain a reactive policy, but its performance is limited. A learner with many parameters and internal states may finally achieve high performance, but it may take enormous time to learn. Therefore, it is difficult to decide in advance which architecture and algorithm should be used for a new task. In this paper, we propose a new framework for selecting an appropriate policy out of a set of heterogeneous reinforcement learning modules and for correctly improving the policies of all learning modules, including those not selected, using the method of importance sampling. In this framework, multiple heterogeneous learning modules sharing the same sensory-motor system can compete to act and cooperate to learn, allowing the overall learning system to obtain good performance faster. We show, in a simulation of a partially observable pole-balancing task and in robotic experiments on battery-pack foraging and partially observable T-maze tasks, that a complex learning module trained with the proposed method can actually learn faster than when trained alone, by exploiting task-relevant episodes generated by suboptimal but fast-learning modules.
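The compete-to-act, cooperate-to-learn loop can be sketched with toy bandit-style modules: whichever module acts, every module learns from the outcome, importance-weighted by its own action probability relative to the acting module's. The module class and function names are illustrative, not the paper's code.

```python
import math

class SoftmaxModule:
    """Toy two-action learner standing in for one heterogeneous module."""
    def __init__(self, temp=1.0, step=0.1):
        self.q = [0.0, 0.0]
        self.temp, self.step = temp, step

    def prob(self, action):
        exps = [math.exp(v / self.temp) for v in self.q]
        return exps[action] / sum(exps)

    def learn(self, action, reward, weight=1.0):
        self.q[action] += self.step * weight * (reward - self.q[action])

def shared_update(modules, behavior_idx, action, reward):
    """Every module learns from the one executed action, weighted by
    rho = (its own action probability) / (the acting module's)."""
    b_prob = modules[behavior_idx].prob(action)
    for m in modules:
        m.learn(action, reward, weight=m.prob(action) / b_prob)
```

The weighting lets slow, complex modules learn from episodes generated by fast but suboptimal ones without biasing their policies.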
Adaptive importance sampling technique for Markov chains using stochastic approximation
 Operations Research
, 2004
Cited by 18 (7 self)
For a discrete-time, finite-state Markov chain, we develop an adaptive importance sampling scheme to estimate the expected total cost before hitting a set of terminal states. This scheme updates the change of measure at every transition using constant or decreasing step-size stochastic approximation. The updates are shown to concentrate asymptotically in a neighborhood of the desired zero-variance estimator. Through simulation experiments on simple Markovian queues, we observe that the proposed technique performs very well in estimating performance measures related to rare events associated with queue lengths exceeding prescribed thresholds. We include performance comparisons of the proposed algorithm with existing adaptive importance sampling algorithms on a small example. We also discuss the extension of the technique to estimate the infinite-horizon expected discounted cost and the expected average cost.
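A static change of measure illustrates the underlying importance-sampling idea on a toy chain: a biased random walk that must hit a high threshold before 0, the rare event standing in for a queue-length exceedance. The paper's actual contribution, adapting the measure online via stochastic approximation, is not shown; parameters and names here are mine.

```python
import random

def rare_hit_prob(p=0.3, start=3, top=12, n_samples=20000, seed=0):
    """Importance-sampled estimate of the probability that a random walk
    (up w.p. p, down w.p. 1-p) hits `top` before 0. Sampling runs under
    the tilted measure q = 1 - p, which drives the walk upward; each
    transition contributes a per-step likelihood ratio, as in
    change-of-measure methods for Markov chains."""
    rng = random.Random(seed)
    q = 1.0 - p
    total = 0.0
    for _ in range(n_samples):
        x, ratio = start, 1.0
        while 0 < x < top:
            if rng.random() < q:          # step up under the tilted chain
                x += 1
                ratio *= p / q
            else:                          # step down
                x -= 1
                ratio *= (1.0 - p) / (1.0 - q)
        if x == top:                       # rare event under the original chain
            total += ratio
    return total / n_samples
```

Naive simulation would see this event only a handful of times in 20,000 runs; under the tilted chain nearly every sample hits the threshold, and the accumulated ratio corrects the estimate back to the original measure.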