Code No.: 20CAE01
CHAITANYA BHARATHI INSTITUTE OF TECHNOLOGY (Autonomous)
B.E. (CSE-AI&ML) V Sem (Main) Examination Dec 2022 / Jan 2023
Reinforcement Learning
Time: 3 Hours Max Marks: 60
Note: Answer ALL questions from Part-A & Part-B (Internal Choice) at one place in the
same order
Part - A
(5Q X 3M = 15 Marks)
M CO BT
1 Explain how reinforcement learning differs from supervised learning (3) 1 2
and unsupervised learning.
2 What is Gradient Descent (GD), and what are its variants? (3) 2 1
3 Compare the first-visit MC method and the every-visit MC method. (3) 3 2
4 What is State-space planning? What are the uses of it? (3) 4 1
5 What is function approximation? (3) 5 2
Part - B
(5Q X 9M = 45 Marks)
M CO BT
6 (a) Explain some examples and possible applications of (5) 1 2
reinforcement learning that guide us in understanding its development.
(b) Suppose the reinforcement learning player was greedy, that is, it (4) 1 3
always played the move that brought it to the position that it rated the
best. Might it learn to play better, or worse, than a nongreedy player?
What problems might occur?
(OR)
7 (a) What is a k-armed Bandit Problem? Explain the role of Action-value (5) 1 3
methods in dealing with the Bandit Problem.
(b) Explain how to tackle a Nonstationary Problem. (4) 1 2
8 (a) Is the MDP framework adequate to usefully represent all goal-directed (5) 2 3
learning tasks? Can you think of any clear exceptions?
(b) Draw and describe the optimal state-value function for the golf (4) 2 2
example.
(OR)
9 (a) Describe the Efficiency of Dynamic Programming. (5) 2 2
(b) Write the Policy Iteration algorithm for estimating π ≈ π∗ using iterative policy (4) 2 3
evaluation.
10 (a) Describe Monte Carlo Control without Exploring Starts (ES). (5) 3 2
(b) Illustrate the backup diagram for Monte Carlo estimation of qπ. (4) 3 3
(OR)
11 (a) What is the optimality of TD(0)? Explain the random walk example under (5) 3 3
batch updating.
(b) Re-solve the windy grid world assuming eight possible actions, (4) 3 2
including diagonal moves, rather than four. How much better can you
do with the extra actions?
12 (a) Illustrate the backup diagrams of n-step TD Prediction methods. (5) 4 3
Describe the formation of spectrum ranging from one-step TD
methods to Monte Carlo methods.
(b) Is off-policy learning possible without importance sampling? Explain (4) 4 3
how the n-step tree-backup algorithm is used to achieve this.
(OR)
13 (a) Why did the Dyna agent with exploration bonus, Dyna-Q+, perform (5) 4 3
better in the first phase as well as in the second phase of the blocking
and shortcut experiments?
(b) Write the steps in Prioritized sweeping for a deterministic (4) 4 1
environment.
14 (a) Describe the on-policy distribution in episodic tasks. (5) 5 2
(b) Explain eligibility traces for action centric methods. (4) 5 3
(OR)
15 (a) Explain policy approximation in action centric methods. (5) 5 2
(b) Explain value prediction with function approximation. (4) 5 2
******