0% found this document useful (0 votes)
9 views54 pages

Reinforcement Learning

The document discusses Markov Decision Processes (MDPs) and the exploration vs. exploitation dilemma in decision-making, highlighting the importance of balancing information gathering with optimizing immediate rewards. It introduces the Multi-Armed Bandit (MAB) problem and various strategies for maximizing cumulative rewards, including epsilon-greedy, Upper Confidence Bound (UCB), and Thompson Sampling. Additionally, it covers reinforcement learning concepts, including policy-based and value-based methods, and notable breakthroughs in deep reinforcement learning applications.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
9 views54 pages

Reinforcement Learning

The document discusses Markov Decision Processes (MDPs) and the exploration vs. exploitation dilemma in decision-making, highlighting the importance of balancing information gathering with optimizing immediate rewards. It introduces the Multi-Armed Bandit (MAB) problem and various strategies for maximizing cumulative rewards, including epsilon-greedy, Upper Confidence Bound (UCB), and Thompson Sampling. Additionally, it covers reinforcement learning concepts, including policy-based and value-based methods, and notable breakthroughs in deep reinforcement learning applications.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
° e e Markov De Cismemsmsmmsms (IVIDP) An MDP is defined by 5-tuple : (S,A,R, P,y) * Aset of states S + Agent's knowledge ofthe environment * Aset of actions A * Set of transition Probabilities P Py(s,8") = Pri Siva = 5°15, = 6 00 = a) probability thot bind from Sis at tne wil ede dts tine * Set of rewards R + R(@,s): reward received from state softer taking action a Select action to maximize E(Ro + y-Ry+y?-Rp+-) * y € (0,1] is the discount factor. + _Represents the importance of future rewards in comparison to mearate reward Discount Factor * Weare uncertain about future (e.g. predicting stock price after 2 years) * Prefer to collect more rewards sooner (Be happy now than 5 years later) + Mathematical convenience Exploration vs Exploitation Dilemma * Online decision-making involves a fundamental choice + Exploration: Gather more information + Procure a new item + Exploitation: Make the best decision given current information Procure the most profitable item * The best long-term strategy might require short-term sacrifices * Gather enough information to make the best overall decisions °e e be Exploration vs Exploitation Dilemma + Restaurant Selection + Exploitation: Go to your favorite restaurant, + Exploration: Try a new restaurant * Music * Exploitation: Listen to your favorite song * Exploration: Try a different song * Display advertisement * Exploitation: Show the most successful ad + Exploration: Show a new ad Multi-armed bandits (MAB) eh wp “Fi # ° e be The Multi-Arni@a Bandit (MAB) Problem TS “ne ‘A A multi-armed bandit is a tuple (A, R) * Ais a known and finite set of actions (or ‘arms') 4,2," dy} * Ra(r) = p(r | a) isan unknown probability distribution over rewards At each step t = 1, 2, Agent selects action A; € A Environment selects reward R; ~ Ra, Goal: Maximize expected cumulative reward E lo 100 =k Multi-armed Bandits and RL °e e cs Multi-Armed Bancre=Reereeas * Goal is to collect as much reward as possible. + Sum of rewards is not very informative - doesn’t tell us how far we are from the optimum reward. * Mean reward for action A, Va = ie 1A] Total regret T*Vy — DV * Optimal arm [ ot. At = argmax V4 "A ‘Maximum achievable Reward collected by reword a bandit olgorithm Minimize total regret = Maximize cumulative reward ° e ts Application : Advts Coe maces en on the Beach - Lowest price quaranee - booking com ¥ Of Hotel Bookings - [Link] ~[Link] Beach Palace Resert Al Inclusive, Cancun - Hotel Bockna °e e e Multi-armed banessmamzsmommGreedy - Random arm with probability ¢ 7 action = A 1 © Best arm with probability 1~ ¢ wf iN 66-8 candéate for exploration Production friendly: instantly convert a production algorithm to bandit! How to set the value of ¢? Standard Practice ind reduce its value over time Upper Confidence Bound (UCB) Initialization + Pull each arm at least once and observe the reward Fou iteration t: Let Na(t) > 1 be the number of times arm a has been pulled Let Vq,z be the mean rewards collected from arm a Pullthe arm a* = argmax (Vax + FEES) and collect the reward Update Na-(t) and Vase Upper Con fitemmssamsxcreel (UCB) Three arms: ay, 43,43 ‘Arm a, (blue arm) has the . highest UCB Pe) After pulling a; (blue arm) + Weare more certain about its value ° e - Th0 Nemec | 8 Bayesian approach to Multi-armed bandits * Model for predicting rewards: p( Ra | 9q),Qq is unknown * Posits prior distribution on parameters : p(6,) * Compute posterior distribution of rewards p(8q | De) * Dy = 4,1, 42,72°** Ap, % is the sequence of arms and rewards collected in the first ¢ trials * Use the posterior distribution to guide exploration * Improved performance if distributional assumptions are correct lo soo ° e oe Thompson Samp reg * Apply Bayes’ theorem to compute posterior distribution p(@q | Dt) + Sample a reward distribution 8, from posterior + Compute value of arm a, Vz = E[Ra] * Select action to maximize value on sample, a; = argmax Va a ° e x Multi-Armed Baneres==rerere” Recommendations for You, e e Multi-Armed Barl@te"Exariple Modeling product recommendation as MAB * Each product is an arm + Rewards are binary * Purchase (labeled as 1) / no-purchase (labeled as 0) °e e % Multi-Armed Bandits: Example Use Beta distribution as prior °e e Multi-Armed Bandits: Example * Posterior distribution of rewards P(A | De) © p(De | )=P(Ox) Bors’ theorem likelihood prior [Ne : Number of impressions of product x in the first ¢ time steps Pz :Number of times x has been purchased in the first ¢ time steps PC De | Ox) & 88 CL — 0) Nae Pot Multi-Armed Bandit: Example Initialization + Initialize counters Nz = Pr = 0, product x For iteration t > 1: + Forall x, sample 0,,¢ ~ Beta(P, +a, Nx — Py +B) + Recommend the product x* = argmax (8,,.) and collect the reward + Update Ny and Pye Personalized Recommendation + One bancit problem per user! + Hard to train - need several rounds of experience with same user * Make use of context information ‘+ User demographic + Purchase and browsing history * Context: day, time, location ete. * Products share some features too! + Price, discount X= (Xu%Xa) + Genre Bau eos fetes fetes °e e is Contextual Bandit === A contextual bandit is a tuple (S,A, R) P[s] is an unknown distribution over states (or “‘contexts’) + Aisa known set of actions (or ‘arms’) * Rea(r) = P[r | s,a] isan unknown probability distribution over rewards At each step t + Environment generates context/state s- ~ S + Agent selects action a € A + Environment offers reward 7 ~ Repay Goal: Maximize cumulative reward Y7_ Epsilon-Greedy * Good news: e-Greedy works without any change! Select arma that maximizes FSB) FO: your favorite reese) Upper Confidenc@' sear" * Learn @ through Ordinary Least Squares (OLS). + Training data= (Se, 42),Teeenzr Mean reward from arma, G(s,a) = $(s,a)’ 6° O° isthe OLS estimate How do we transform it into a bandit algorithm? * Which arm to explore? Explore arms that we are least confident about. * Capture confidence through variance of G(s,a) Pull the arm a” = argmax (a0 t+c* |Var(qs, ®)) Thompson Sampling? Receive context x, Draw 6¢ according to P(@|D) Select a; = argmax, E,.(r|x1, a, 6°) Observe reward r; D=DU (21, a1,7%) end for °e Thompson Sampling? D=0 Posterior inference can be Receive context x; Draw 0 according tXE(@]D) > intractable Select a; = arg max, purchase/no purchase + P(r |5,,8 )isoften modeled as logistic Observe reward 7 mers D=DU (2, a1,%) + pOO1D,) doesthave any clsedform end for Use Gibbs sampling °e e os Contextual Bandigem=aummerey Epsilon-greedy + Easy to implement + Gives explicit control on what to explore + Tuning ¢ is more of an art UCB + Computationally expensive : 0(k®) computation, k: feature dimension + May suffer from delayed feedback Thompson Sampling Drawing samples is computationally expensive but using them in decision making is less expensive than UCB : O(c) computation More resilient to delayed feedback No need to tune exploration budget unlike epsilon-greedy Reinforcement Learning Reinforcement Learning 1. RL Concepts 2. Approaches to RL 3. Deep RL Reinforcement Learning State, Reward, Action, Stimulus, Gain, Payoff, Response, Situation Cost Control (world) Policy * Deterministic Policy * Mapping of states to actions: a = (s) * Agent takes an action: A, = m(S;) * Stochastic Policy + w(a| s) isa probability distribution + Ar ~m(a|S,=s) * Agent’s goal is to choose each action so as to maximize the discounted sum of future rewards * Resa + Resa + ¥7Re43t --- (y: discount factor) + Biko Revise » Action-Value Function Expected long term reward: = * From state s * With action a * Under policy 1 Revs Rees * With discount factor y an(S,4) = ElResa + YReaz + Y?Resat ---|S¢ = 5,Ae = @Arsrco~ M1] r Action-Value Function PSIs.a) —x(a'ts") ‘ Expected long term reward: At * From state s ® * With action a . wN * Under policy Rea Reva * With discount factor Qu (S,a) = E[Revi + VRev2 +? Reagt «++ | Se = 8, Ap = @ Actr:co~ T] =2 p(s'ls.a) fr + y Dala'ts'Yae(s'.a')] Optimal Value Functions * An optimal value function is the maximum achievable value q.(s,a) = maxq,(s,a) for alls €S,a€ A(s) © * Once we have q. we can act optimally to derive optimal policy m*(s) = argmaxq. (s,a) a Bellman Optimality Equation Bellman Expectation Equation mets.) an(s,a) = p(s'|s,a) [r +7 Laa'ls') qn(s',a’)] 6..o ~ (a's!) qu(s,a) =% pG'ls,a) fr +y mbox q.(s',a')) Value-based RL * Tabular Methods: In tasks with small, finite state sets value functions can be approximated using tables (entry for each state/state-action pair). * Function Approximation: Take examples from a desired function (e.g., value function) and generalize frommthem to construct approximation of the entire function. Q-Learning (Watkins & Dayan, 92] + A form of model-free method * Agent doesn’t need to have model of the environment * Iteratively approximates the optimal action-value function, q. * Estimates are updated as agent gathers more experience * Off-policy learning method * Learns the value of its determin igedy (target) policy from experience collected using exploration (behavior) policy * Target policy is the policy being learned about * Behavior policy is the policy generating the trajectory data * Convergence: If all pairs (s, a) are continuously updated, q converges to q G Learning Experience: Sample sequences of states, actions and rewards from on-line or simulated interaction with an environment! Learned action-value function Q directly approximates the optimal action-value function q. Actual Gy bes Pe nG) amiuany toGembar Oo » + Repeat fogeach episode: + Initialize $ + Repeat (for each step of the episode) + Choose A from S using policy derived from Q (eg., e- greedy) + Take action A, observe R, S’ + Q(S,A) = a Q(S,A) + (1 a) (R + ymax Q(s',a)) easy * Until S is terminal Q-Learning Update + What we want to compute (Bellman Optimality Eq.): Q(S, A) = Ey [r + y max Q(s’,a’)] 2 * Using the sampling approach, we take a sample of the quantity under expectation r+ymax Q(s',a’) " * Consider a decayed approach where we combine old value with target Q(S,A):= @ Q(S, A) + (1 — a) (7 + y max Q(S',a)) Q(S,A):= Q(S,A) + Aa) [r+ ymax Q(S',a) - QS,A)] Policy-based RL Episode: t, ay mu Policy: we © To © n What is the expected return for an episode? J(g) = E, (R()) P(E) = (50) - odo | $0) PCS4 | da) Moy I Si) PCSr | Spar r-1) RO =n tht = [cri 0)- RG) ae Learn policy m* that maximizes the expected return n* = argmax (m9) to Policy-based RL Policy update equation momat+aA-Vo (me) How to calculate the policy gradient Vo J(79)? Policy gradient theorem T Vo J(m9) = Ex > Va loge (at Ist) - R(t) = Policy based RL : REINFORCE Monte Carlo approximation Vo Jae) = Ex y Vo logs (ae Is) - weo| ~ D1 Vo logmee(ar Ise) - RC) REINFORCE algorithm: » 1. sample {7‘} from 79(a;|s;) (run the policy) 2. Vod (8) © YY; (Oy Vo log mo(aj|s})) (2, (st, a4) 3.0 04+ aVoJ(0) Q-learning vs. REINFORCE Uses Bellman optimality condition to learn optimalQ Uses Monte Carlo approximation to learn the policy function. Policy is derived implicitly from the Q function. directly. Computational inefficient for large action spaces and _Difficult to converge due to slow rate of learning. continuous actions. Off-policy learning, On-policy learning High bias due to the TD approximation step. High variance due to MC sampling Breakthroughs using Deep RL (DeepMind) Playing Atari with Deep Reinforcement Learning (2013) * Play Atari 2600 video games by just observing screen pixel + Receiving reward when games score increased! * Same model architecture, 7 different games Human-level control through Deep Reinforcement Learning (2015) + Featured on cover of Nature! * Same model architecture, 49 different games, expert human-level performance Breakthroughs using Deep RL (DeepMind) chess Shos! AlphaZero (2017) * Single system that taught itself from scratch to master Chess, shogi (Japanese chess) and Go. AlphaStar (2019) + First Al to reach top league of StarCraft Il, a real- time strategy video game. + Achieved a Grandmaster level for all three StarCraft II races: Protoss, Terran, and Zerg. * Neural networks, self-play via RL, multi-agent learning, imitation learning * reward | 1,4 Why Deep Learning? * Take last 4 screen images, resize to 84 X 84 pixels * Convert to grayscale with 256 gray levels * Possible game states? 25684x84x4 1067970 1 * * Represent the value function by a parameterized function approximator (Deep Neural Network), with parameter w € R”™ * G(s,a; w) Value Function Approximation DL in a nutshell! * A general purpose framework for representation learning: * Given an objective + Learn representation required to achieve objective * Directly from raw inputs * Using minimal domain knowledge Q-Learning Revisited * Q-Learning update Q(S, A) = E, Ir +y max Q(s', a a * Treat right-hand side r + y max Q(s',a';w) asa tatget * Minimize MSE loss by stochastic gradient descent l= (" +ymax Q(s',a'3w) — Q(s,a; w) * Converges to q,.using table look-up representation Deep Q-Learning: Q-network Represent value function by Q-network with weights w Q(s,a;w) = Q*(s,a) [Link]) —Q{s,a,,w) Deep Q-Learning: Experience Replay * To remove correlations, build data-set from agent’s own experience 51, a1, 12,52 92, a2, 13,53 53,83, 14, 54 Se, de, Fess Sea1 > [Se Aes Petty Sepa * Sample experiences from data-set and apply update Ls (r+ ymgxQ(s',asw-) — Qs,a; wy + For faster convergence, target parameters v~ held fixed. Deep Q-Learning: Algorithm Initialize replay memory D Initialize action-value function Q with random weight Initialize target action-value function Q with weights @ for episode = I to M do Initialize sequence s; = {-r1} and preprocessed sequence dy = 6(s1) for t= 110 T do ‘random action with probability € argmax, Q(¢(s;),a;0) otherwise Execute action a; in emulator and observe reward ry and image 2141 Set 8141 = 845 4,211 and preprocess Ge-1 = O(8¢41) Store transition (1, 2,7, 011) in D // experience replay Sample random minibatch of transitions (¢;,3,73,@)41) from D sety, {7 if episode terminates at step j +1 Ye Vrs + -ymaxe O(dj41,0%507) otherwise Perform a gradient descent step on (y; — Q(¢j, aj; ))? [Link]. the network parameter @ 7 per update of target network Every C steps reset Q = Q, ice., set 0~ = 0 Following greed pty, select ox ~ { DQN in Atari Conclusion * Multi-armed bandits (and contextual bandits) * Epsilon-greedy + UCB + Thompson sampling * Reinforcement learning + Qlearning + Deep QL

You might also like