Reinforcement Learning

The document discusses Markov Decision Processes (MDPs) and the exploration vs. exploitation dilemma in decision-making, highlighting the importance of balancing information gathering with optimizing immediate rewards. It introduces the Multi-Armed Bandit (MAB) problem and various strategies for maximizing cumulative rewards, including epsilon-greedy, Upper Confidence Bound (UCB), and Thompson Sampling. Additionally, it covers reinforcement learning concepts, including policy-based and value-based methods, and notable breakthroughs in deep reinforcement learning applications.

Uploaded by

mishranavneet1611

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF or read online on Scribd

0% found this document useful (0 votes)

9 views54 pages

Reinforcement Learning

Uploaded by

mishranavneet1611

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF or read online on Scribd

° e e Markov De Cismemsmsmmsms (IVIDP) An MDP is defined by 5-tuple : (S,A,R, P,y) * Aset of states S + Agent's knowledge ofthe environment * Aset of actions A * Set of transition Probabilities P Py(s,8") = Pri Siva = 5°15, = 6 00 = a) probability thot bind from Sis at tne wil ede dts tine * Set of rewards R + R(@,s): reward received from state softer taking action a Select action to maximize E(Ro + y-Ry+y?-Rp+-) * y € (0,1] is the discount factor. + _Represents the importance of future rewards in comparison to mearate rewardDiscount Factor * Weare uncertain about future (e.g. predicting stock price after 2 years) * Prefer to collect more rewards sooner (Be happy now than 5 years later) + Mathematical convenienceExploration vs Exploitation Dilemma * Online decision-making involves a fundamental choice + Exploration: Gather more information + Procure a new item + Exploitation: Make the best decision given current information Procure the most profitable item * The best long-term strategy might require short-term sacrifices * Gather enough information to make the best overall decisions°e e be Exploration vs Exploitation Dilemma + Restaurant Selection + Exploitation: Go to your favorite restaurant, + Exploration: Try a new restaurant * Music * Exploitation: Listen to your favorite song * Exploration: Try a different song * Display advertisement * Exploitation: Show the most successful ad + Exploration: Show a new adMulti-armed bandits (MAB) eh wp “Fi #° e be The Multi-Arni@a Bandit (MAB) Problem TS “ne ‘A A multi-armed bandit is a tuple (A, R) * Ais a known and finite set of actions (or ‘arms') 4,2," dy} * Ra(r) = p(r | a) isan unknown probability distribution over rewards At each step t = 1, 2, Agent selects action A; € A Environment selects reward R; ~ Ra, Goal: Maximize expected cumulative reward E lo 100=k Multi-armed Bandits and RL°e e cs Multi-Armed Bancre=Reereeas * Goal is to collect as much reward as possible. + Sum of rewards is not very informative - doesn’t tell us how far we are from the optimum reward. * Mean reward for action A, Va = ie 1A] Total regret T*Vy — DV * Optimal arm [ ot. At = argmax V4 "A ‘Maximum achievable Reward collected by reword a bandit olgorithm Minimize total regret = Maximize cumulative reward° e ts Application : Advts Coe maces en on the Beach - Lowest price quaranee - booking com ¥ Of Hotel Bookings - [Link] ~[Link] Beach Palace Resert Al Inclusive, Cancun - Hotel Bockna°e e e Multi-armed banessmamzsmommGreedy - Random arm with probability ¢ 7 action = A 1 © Best arm with probability 1~ ¢ wf iN 66-8 candéate for exploration Production friendly: instantly convert a production algorithm to bandit! How to set the value of ¢? Standard Practice ind reduce its value over timeUpper Confidence Bound (UCB) Initialization + Pull each arm at least once and observe the reward Fou iteration t: Let Na(t) > 1 be the number of times arm a has been pulled Let Vq,z be the mean rewards collected from arm a Pullthe arm a* = argmax (Vax + FEES) and collect the reward Update Na-(t) and VaseUpper Con fitemmssamsxcreel (UCB) Three arms: ay, 43,43 ‘Arm a, (blue arm) has the . highest UCB Pe) After pulling a; (blue arm) + Weare more certain about its value° e - Th0 Nemec | 8 Bayesian approach to Multi-armed bandits * Model for predicting rewards: p( Ra | 9q),Qq is unknown * Posits prior distribution on parameters : p(6,) * Compute posterior distribution of rewards p(8q | De) * Dy = 4,1, 42,72°** Ap, % is the sequence of arms and rewards collected in the first ¢ trials * Use the posterior distribution to guide exploration * Improved performance if distributional assumptions are correct lo soo° e oe Thompson Samp reg * Apply Bayes’ theorem to compute posterior distribution p(@q | Dt) + Sample a reward distribution 8, from posterior + Compute value of arm a, Vz = E[Ra] * Select action to maximize value on sample, a; = argmax Va a° e x Multi-Armed Baneres==rerere” Recommendations for You,e e Multi-Armed Barl@te"Exariple Modeling product recommendation as MAB * Each product is an arm + Rewards are binary * Purchase (labeled as 1) / no-purchase (labeled as 0)°e e % Multi-Armed Bandits: Example Use Beta distribution as prior°e e Multi-Armed Bandits: Example * Posterior distribution of rewards P(A | De) © p(De | )=P(Ox) Bors’ theorem likelihood prior [Ne : Number of impressions of product x in the first ¢ time steps Pz :Number of times x has been purchased in the first ¢ time steps PC De | Ox) & 88 CL — 0) Nae PotMulti-Armed Bandit: Example Initialization + Initialize counters Nz = Pr = 0, product x For iteration t > 1: + Forall x, sample 0,,¢ ~ Beta(P, +a, Nx — Py +B) + Recommend the product x* = argmax (8,,.) and collect the reward + Update Ny and PyePersonalized Recommendation + One bancit problem per user! + Hard to train - need several rounds of experience with same user * Make use of context information ‘+ User demographic + Purchase and browsing history * Context: day, time, location ete. * Products share some features too! + Price, discount X= (Xu%Xa) + Genre Bau eos fetes fetes°e e is Contextual Bandit === A contextual bandit is a tuple (S,A, R) P[s] is an unknown distribution over states (or “‘contexts’) + Aisa known set of actions (or ‘arms’) * Rea(r) = P[r | s,a] isan unknown probability distribution over rewards At each step t + Environment generates context/state s- ~ S + Agent selects action a € A + Environment offers reward 7 ~ Repay Goal: Maximize cumulative reward Y7_Epsilon-Greedy * Good news: e-Greedy works without any change! Select arma that maximizes FSB) FO: your favorite reese)Upper Confidenc@' sear" * Learn @ through Ordinary Least Squares (OLS). + Training data= (Se, 42),Teeenzr Mean reward from arma, G(s,a) = $(s,a)’ 6° O° isthe OLS estimate How do we transform it into a bandit algorithm? * Which arm to explore? Explore arms that we are least confident about. * Capture confidence through variance of G(s,a) Pull the arm a” = argmax (a0 t+c* |Var(qs, ®))Thompson Sampling? Receive context x, Draw 6¢ according to P(@|D) Select a; = argmax, E,.(r|x1, a, 6°) Observe reward r; D=DU (21, a1,7%) end for°e Thompson Sampling? D=0 Posterior inference can be Receive context x; Draw 0 according tXE(@]D) > intractable Select a; = arg max, purchase/no purchase + P(r |5,,8 )isoften modeled as logistic Observe reward 7 mers D=DU (2, a1,%) + pOO1D,) doesthave any clsedform end for Use Gibbs sampling°e e os Contextual Bandigem=aummerey Epsilon-greedy + Easy to implement + Gives explicit control on what to explore + Tuning ¢ is more of an art UCB + Computationally expensive : 0(k®) computation, k: feature dimension + May suffer from delayed feedback Thompson Sampling Drawing samples is computationally expensive but using them in decision making is less expensive than UCB : O(c) computation More resilient to delayed feedback No need to tune exploration budget unlike epsilon-greedyReinforcement LearningReinforcement Learning 1. RL Concepts 2. Approaches to RL 3. Deep RLReinforcement Learning State, Reward, Action, Stimulus, Gain, Payoff, Response, Situation Cost Control (world)Policy * Deterministic Policy * Mapping of states to actions: a = (s) * Agent takes an action: A, = m(S;) * Stochastic Policy + w(a| s) isa probability distribution + Ar ~m(a|S,=s) * Agent’s goal is to choose each action so as to maximize the discounted sum of future rewards * Resa + Resa + ¥7Re43t --- (y: discount factor) + Biko Revise »Action-Value Function Expected long term reward: = * From state s * With action a * Under policy 1 Revs Rees * With discount factor y an(S,4) = ElResa + YReaz + Y?Resat ---|S¢ = 5,Ae = @Arsrco~ M1] rAction-Value Function PSIs.a) —x(a'ts") ‘ Expected long term reward: At * From state s ® * With action a . wN * Under policy Rea Reva * With discount factor Qu (S,a) = E[Revi + VRev2 +? Reagt «++ | Se = 8, Ap = @ Actr:co~ T] =2 p(s'ls.a) fr + y Dala'ts'Yae(s'.a')]Optimal Value Functions * An optimal value function is the maximum achievable value q.(s,a) = maxq,(s,a) for alls €S,a€ A(s) © * Once we have q. we can act optimally to derive optimal policy m*(s) = argmaxq. (s,a) aBellman Optimality Equation Bellman Expectation Equation mets.) an(s,a) = p(s'|s,a) [r +7 Laa'ls') qn(s',a’)] 6..o ~ (a's!) qu(s,a) =% pG'ls,a) fr +y mbox q.(s',a'))Value-based RL * Tabular Methods: In tasks with small, finite state sets value functions can be approximated using tables (entry for each state/state-action pair). * Function Approximation: Take examples from a desired function (e.g., value function) and generalize frommthem to construct approximation of the entire function.Q-Learning (Watkins & Dayan, 92] + A form of model-free method * Agent doesn’t need to have model of the environment * Iteratively approximates the optimal action-value function, q. * Estimates are updated as agent gathers more experience * Off-policy learning method * Learns the value of its determin igedy (target) policy from experience collected using exploration (behavior) policy * Target policy is the policy being learned about * Behavior policy is the policy generating the trajectory data * Convergence: If all pairs (s, a) are continuously updated, q converges to qG Learning Experience: Sample sequences of states, actions and rewards from on-line or simulated interaction with an environment! Learned action-value function Q directly approximates the optimal action-value function q. Actual Gy bes Pe nG) amiuany toGembar Oo » + Repeat fogeach episode: + Initialize $ + Repeat (for each step of the episode) + Choose A from S using policy derived from Q (eg., e- greedy) + Take action A, observe R, S’ + Q(S,A) = a Q(S,A) + (1 a) (R + ymax Q(s',a)) easy * Until S is terminalQ-Learning Update + What we want to compute (Bellman Optimality Eq.): Q(S, A) = Ey [r + y max Q(s’,a’)] 2 * Using the sampling approach, we take a sample of the quantity under expectation r+ymax Q(s',a’) " * Consider a decayed approach where we combine old value with target Q(S,A):= @ Q(S, A) + (1 — a) (7 + y max Q(S',a)) Q(S,A):= Q(S,A) + Aa) [r+ ymax Q(S',a) - QS,A)]Policy-based RL Episode: t, ay mu Policy: we © To © n What is the expected return for an episode? J(g) = E, (R()) P(E) = (50) - odo | $0) PCS4 | da) Moy I Si) PCSr | Spar r-1) RO =n tht = [cri 0)- RG) ae Learn policy m* that maximizes the expected return n* = argmax (m9) toPolicy-based RL Policy update equation momat+aA-Vo (me) How to calculate the policy gradient Vo J(79)? Policy gradient theorem T Vo J(m9) = Ex > Va loge (at Ist) - R(t) =Policy based RL : REINFORCE Monte Carlo approximation Vo Jae) = Ex y Vo logs (ae Is) - weo| ~ D1 Vo logmee(ar Ise) - RC) REINFORCE algorithm: » 1. sample {7‘} from 79(a;|s;) (run the policy) 2. Vod (8) © YY; (Oy Vo log mo(aj|s})) (2, (st, a4) 3.0 04+ aVoJ(0)Q-learning vs. REINFORCE Uses Bellman optimality condition to learn optimalQ Uses Monte Carlo approximation to learn the policy function. Policy is derived implicitly from the Q function. directly. Computational inefficient for large action spaces and _Difficult to converge due to slow rate of learning. continuous actions. Off-policy learning, On-policy learning High bias due to the TD approximation step. High variance due to MC samplingBreakthroughs using Deep RL (DeepMind) Playing Atari with Deep Reinforcement Learning (2013) * Play Atari 2600 video games by just observing screen pixel + Receiving reward when games score increased! * Same model architecture, 7 different games Human-level control through Deep Reinforcement Learning (2015) + Featured on cover of Nature! * Same model architecture, 49 different games, expert human-level performanceBreakthroughs using Deep RL (DeepMind) chess Shos! AlphaZero (2017) * Single system that taught itself from scratch to master Chess, shogi (Japanese chess) and Go. AlphaStar (2019) + First Al to reach top league of StarCraft Il, a real- time strategy video game. + Achieved a Grandmaster level for all three StarCraft II races: Protoss, Terran, and Zerg. * Neural networks, self-play via RL, multi-agent learning, imitation learning* reward | 1,4Why Deep Learning? * Take last 4 screen images, resize to 84 X 84 pixels * Convert to grayscale with 256 gray levels * Possible game states? 25684x84x4 1067970 1 * * Represent the value function by a parameterized function approximator (Deep Neural Network), with parameter w € R”™ * G(s,a; w) Value Function ApproximationDL in a nutshell! * A general purpose framework for representation learning: * Given an objective + Learn representation required to achieve objective * Directly from raw inputs * Using minimal domain knowledgeQ-Learning Revisited * Q-Learning update Q(S, A) = E, Ir +y max Q(s', a a * Treat right-hand side r + y max Q(s',a';w) asa tatget * Minimize MSE loss by stochastic gradient descent l= (" +ymax Q(s',a'3w) — Q(s,a; w) * Converges to q,.using table look-up representationDeep Q-Learning: Q-network Represent value function by Q-network with weights w Q(s,a;w) = Q*(s,a) [Link]) —Q{s,a,,w)Deep Q-Learning: Experience Replay * To remove correlations, build data-set from agent’s own experience 51, a1, 12,52 92, a2, 13,53 53,83, 14, 54 Se, de, Fess Sea1 > [Se Aes Petty Sepa * Sample experiences from data-set and apply update Ls (r+ ymgxQ(s',asw-) — Qs,a; wy + For faster convergence, target parameters v~ held fixed.Deep Q-Learning: Algorithm Initialize replay memory D Initialize action-value function Q with random weight Initialize target action-value function Q with weights @ for episode = I to M do Initialize sequence s; = {-r1} and preprocessed sequence dy = 6(s1) for t= 110 T do ‘random action with probability € argmax, Q(¢(s;),a;0) otherwise Execute action a; in emulator and observe reward ry and image 2141 Set 8141 = 845 4,211 and preprocess Ge-1 = O(8¢41) Store transition (1, 2,7, 011) in D // experience replay Sample random minibatch of transitions (¢;,3,73,@)41) from D sety, {7 if episode terminates at step j +1 Ye Vrs + -ymaxe O(dj41,0%507) otherwise Perform a gradient descent step on (y; — Q(¢j, aj; ))? [Link]. the network parameter @ 7 per update of target network Every C steps reset Q = Q, ice., set 0~ = 0 Following greed pty, select ox ~ {DQN in AtariConclusion * Multi-armed bandits (and contextual bandits) * Epsilon-greedy + UCB + Thompson sampling * Reinforcement learning + Qlearning + Deep QL

Multi-Armed Bandits in Reinforcement Learning
No ratings yet
Multi-Armed Bandits in Reinforcement Learning
8 pages
N-Armed Bandit Problem in RL
No ratings yet
N-Armed Bandit Problem in RL
18 pages
Reinforcement Learning Overview and Techniques
No ratings yet
Reinforcement Learning Overview and Techniques
101 pages
Understanding Multi-Armed Bandits in RL
No ratings yet
Understanding Multi-Armed Bandits in RL
68 pages
Reinforcement Learning Cheatsheet
No ratings yet
Reinforcement Learning Cheatsheet
16 pages
Stochastic Bandit Models in Online Learning
No ratings yet
Stochastic Bandit Models in Online Learning
61 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
111 pages
Understanding Multi-Armed Bandits
No ratings yet
Understanding Multi-Armed Bandits
16 pages
Reinforcement Learning: Epsilon Greedy Methods
No ratings yet
Reinforcement Learning: Epsilon Greedy Methods
17 pages
K-Armed Bandit Problem Analysis
No ratings yet
K-Armed Bandit Problem Analysis
11 pages
Module 2 RFL
No ratings yet
Module 2 RFL
34 pages
Multi-Armed Bandit Strategies Explained
No ratings yet
Multi-Armed Bandit Strategies Explained
25 pages
Reinforcement Learning in Networking
No ratings yet
Reinforcement Learning in Networking
64 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
59 pages
Multi-Armed Bandit Strategies Explained
No ratings yet
Multi-Armed Bandit Strategies Explained
34 pages
Material 3 (Theory Based On Module 1)
No ratings yet
Material 3 (Theory Based On Module 1)
13 pages
Multi-Armed Bandits in Reinforcement Learning
No ratings yet
Multi-Armed Bandits in Reinforcement Learning
7 pages
Finite-Time Analysis of The Multi-Armed Bandit Problem With Known Trend
No ratings yet
Finite-Time Analysis of The Multi-Armed Bandit Problem With Known Trend
7 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
167 pages
Module1 B
No ratings yet
Module1 B
25 pages
Reinforcement Learning in Control Systems
No ratings yet
Reinforcement Learning in Control Systems
97 pages
Reinforcement Learning: Bandits Overview
No ratings yet
Reinforcement Learning: Bandits Overview
41 pages
Lecture 2 Annotated
No ratings yet
Lecture 2 Annotated
42 pages
Exploration vs. Exploitation in RL
No ratings yet
Exploration vs. Exploitation in RL
56 pages
n-Armed Bandits in Reinforcement Learning
No ratings yet
n-Armed Bandits in Reinforcement Learning
4 pages
Multi-Armed Bandit Problem Explained
No ratings yet
Multi-Armed Bandit Problem Explained
9 pages
RL Unit2
No ratings yet
RL Unit2
51 pages
Understanding Reinforcement Learning Techniques
No ratings yet
Understanding Reinforcement Learning Techniques
9 pages
Understanding Reinforcement Learning Basics
No ratings yet
Understanding Reinforcement Learning Basics
32 pages
Reinforcement Learning Detailed Solutions
No ratings yet
Reinforcement Learning Detailed Solutions
4 pages
Contextual Bandits in Personalization
No ratings yet
Contextual Bandits in Personalization
34 pages
Thompson Sampling and MDP Explained
No ratings yet
Thompson Sampling and MDP Explained
8 pages
Understanding Bandit Problems in RL
No ratings yet
Understanding Bandit Problems in RL
47 pages
Exploration vs Exploitation in RL
No ratings yet
Exploration vs Exploitation in RL
40 pages
Optimizing Exploration in Bandits Algorithm
No ratings yet
Optimizing Exploration in Bandits Algorithm
19 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
81 pages
Statistical Inference in Contextual Bandits
No ratings yet
Statistical Inference in Contextual Bandits
44 pages
Applications of Multi-Armed Bandits
No ratings yet
Applications of Multi-Armed Bandits
8 pages
Online Learning Algorithms Overview
No ratings yet
Online Learning Algorithms Overview
68 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
8 pages
Advanced Reinforcement Learning Course
No ratings yet
Advanced Reinforcement Learning Course
33 pages
Sub-linear Regret in Multi-armed Bandits
No ratings yet
Sub-linear Regret in Multi-armed Bandits
9 pages
Reinforcement Learning Concepts Overview
No ratings yet
Reinforcement Learning Concepts Overview
69 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
62 pages
K-Armed Bandit Problem with Epsilon-Greedy
100% (1)
K-Armed Bandit Problem with Epsilon-Greedy
10 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
142 pages
Reinforcement Learning Blueprints
No ratings yet
Reinforcement Learning Blueprints
15 pages
Linear and Contextual Bandits Overview
No ratings yet
Linear and Contextual Bandits Overview
21 pages
AI and Machine Learning Overview
No ratings yet
AI and Machine Learning Overview
4 pages
Optimistic Values in 10-Armed Bandits
No ratings yet
Optimistic Values in 10-Armed Bandits
44 pages
Multi-Armed Bandit Problem Overview
No ratings yet
Multi-Armed Bandit Problem Overview
57 pages
Understanding Multi-Armed Bandits
No ratings yet
Understanding Multi-Armed Bandits
32 pages
Multi-Arm Bandits
No ratings yet
Multi-Arm Bandits
30 pages
Understanding the K-Bandit Problem
No ratings yet
Understanding the K-Bandit Problem
67 pages
Multi-Armed Bandit Problem Explained
No ratings yet
Multi-Armed Bandit Problem Explained
14 pages
Reinforcement Learning Overview and Concepts
No ratings yet
Reinforcement Learning Overview and Concepts
4 pages
Multi-Armed Bandit Algorithms Overview
No ratings yet
Multi-Armed Bandit Algorithms Overview
12 pages
Thompson Sampling for MNL-Bandit
No ratings yet
Thompson Sampling for MNL-Bandit
3 pages
Scientific Writing Assignemnts
No ratings yet
Scientific Writing Assignemnts
4 pages

Reinforcement Learning

Uploaded by

Reinforcement Learning

Uploaded by

You might also like