<!-- Exercises from Reinforcement Learning: An Introduction, Sutton & Barto, 2018 -->
# Table of contents
- [Table of contents](#table-of-contents)
- [Exercise 1.1: Self-Play](#exercise-11-self-play)
- [Exercise 1.2: Symmetries](#exercise-12-symmetries)
- [Exercise 1.3: Greedy Play](#exercise-13-greedy-play)
- [Exercise 1.4: Learning from Exploration](#exercise-14-learning-from-exploration)
- [Exercise 2.1](#exercise-21)
- [Exercise 2.2](#exercise-22)
- [Exercise 2.3](#exercise-23)
- [Exercise 2.4](#exercise-24)
- [Exercise 2.5 (programming)](#exercise-25-programming)
- [Exercise 2.6: Mysterious Spikes](#exercise-26-mysterious-spikes)
- [Exercise 2.7: Unbiased Constant-Step-Size Trick](#exercise-27-unbiased-constant-step-size-trick)
- [Exercise 2.8: UCB Spikes](#exercise-28-ucb-spikes)
- [Exercise 2.9](#exercise-29)
- [Exercise 2.10](#exercise-210)
- [Exercise 2.11 (programming)](#exercise-211-programming)
- [Exercise 3.1](#exercise-31)
- [Exercise 3.2](#exercise-32)
- [Exercise 3.3](#exercise-33)
- [Exercise 3.4](#exercise-34)
- [Exercise 3.5](#exercise-35)
- [Exercise 3.6](#exercise-36)
- [Exercise 3.7](#exercise-37)
- [Exercise 3.8](#exercise-38)
- [Exercise 3.9](#exercise-39)
- [Exercise 3.10](#exercise-310)
- [Exercise 3.11](#exercise-311)
- [Exercise 3.12](#exercise-312)
- [Exercise 3.13](#exercise-313)
- [Exercise 3.14](#exercise-314)
- [Exercise 3.15](#exercise-315)
- [Exercise 3.16](#exercise-316)
- [Exercise 3.17](#exercise-317)
- [Exercise 3.18](#exercise-318)
- [Exercise 3.19](#exercise-319)
- [Exercise 3.20](#exercise-320)
# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
It would almost certainly learn a different policy for selecting moves, since its opponent is no longer the same. As it slowly learns the values of game states $V(S_t)$, it also ends up playing against an ever tougher opponent (itself). Because learning is gradual, I expect the update rule
$$ V(S_t) \leftarrow V(S_t) + \alpha \left[V(S_{t+1}) - V(S_t)\right] $$
to accommodate the changing opponent, so the player would slowly but surely become very good at tic-tac-toe.
# Exercise 1.2: Symmetries
*Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?*
By taking the symmetry of the game into account, we can greatly reduce the set of game states $S$. Because the set is smaller, the learning algorithm would learn the values $v_*(S_t)$ faster. (Fewer choices to make, and the same states would receive more value updates.)
However: if the opponent is not a perfect player and does not take symmetries into account itself, certain game states with a high value could end up with a lower value estimate because they are lumped together with symmetric but low-value game states. Suppose we have a symmetric state $S_s$ that actually comprises 4 differently rotated game states:
$$ S_s = \{S_1, S_2, S_3, S_4\} $$
With values :
$$v_*(S_1) = 0.5,\, v_*(S_2) = 0,\, v_*(S_3) = 0,\, v_*(S_4) = 0$$
With symmetries, the estimated value $V(S_s)$ would be
$$\frac{\sum_{i=1}^4{v_*(S_i)}}{|S_s|}=\frac{0.5}{4}=0.125$$
Instead of focusing on state $S_1$ with a high value, the algorithm might prefer other states with scores higher than $0.125$.
When playing against imperfect players it is not necessarily advantageous to take symmetries into account.
# Exercise 1.3: Greedy Play
*Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?*
A greedy player would not explore the complete state space, so its value estimates $V(S_i)$ could be far off. It would always pick the move leading to the state with the highest current estimate (breaking ties at random) and never explore alternatives.
A greedy player would therefore quite certainly play worse than a nongreedy player that does a *small* amount of exploration.
# Exercise 1.4: Learning from Exploration
*Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?*
If we learn from exploratory moves, the state values converge to the win probabilities under the exploratory ($\varepsilon$-greedy) policy we actually follow; if we do not, they converge to the win probabilities under purely greedy play. How much lower the above-average states are valued in the first case depends on the step-size parameter $\alpha$, the tendency to explore, and the values of the lower-valued states reached during exploration. As long as we keep making exploratory moves, the first set of probabilities describes our actual behaviour and should result in more wins.
# Exercise 2.1
*In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon = 0.5$, what is the probability that the greedy action is selected?*
Let $A = \{A_1,\,A_2\}$ be the set of actions, with $Q(A_1) > Q(A_2)$. The probability $P(A_1)$ of selecting the greedy action $A_1$ is then:
$$P(A_1)=(1-\varepsilon) + \varepsilon \frac{1}{|A|}=\left(1-0.5\right)+0.5 \cdot \frac{1}{2}=0.75$$
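A minimal Monte Carlo sketch of this calculation (the action indices and trial count are arbitrary choices for illustration):
```python
import random

eps, n_actions, greedy_action = 0.5, 2, 0
trials = 100_000
hits = 0
for _ in range(trials):
    if random.random() < eps:
        action = random.randrange(n_actions)   # explore: uniform over both actions
    else:
        action = greedy_action                 # exploit: pick the greedy action
    hits += action == greedy_action
print(hits / trials)                           # ~0.75
```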
# Exercise 2.2
*Bandit example. Consider a $k$-armed bandit problem with $k=4$ actions, denoted $1, 2, 3, 4$. Consider applying to this problem a bandit algorithm using $\varepsilon$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a)=0$ for all $a$. Suppose the initial sequence of actions and rewards is $A_1=1, R_1=-1, A_2=2, R_2=1, A_3=2, R_3=-2, A_4=2, R_4=2, A_5=3, R_5=0$. On some of these time steps the $\varepsilon$ case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?*
Let's go over the estimates $Q_t(a)$ step by step, using the rule
$$Q_t(a)=\frac{\sum_{i=1}^{t-1}{R_i\cdot\mathbb{1}_{A_i=a}}}{\sum_{i=1}^{t-1}\mathbb{1}_{A_i=a}}$$
>$t=0$
>$$Q_0(1)=0, Q_0(2)=0, Q_0(3)=0, Q_0(4)=0$$
>$t=1$
>$A_1=1, R_1=-1$
>$$Q_1(1)=-1, Q_1(2)=0, Q_1(3)=0, Q_1(4)=0$$
>$t=2$
>$A_2=2, R_2=1$
>$$Q_2(1)=-1, Q_2(2)=1, Q_2(3)=0, Q_2(4)=0$$
Action 2 is now the greedy action and should be selected at the next time step unless the $\varepsilon$ case occurs.
>$t=3$
>$A_3=2, R_3=-2$
>$$Q_3(1)=-1, Q_3(2)=-0.5, Q_3(3)=0, Q_3(4)=0$$
As expected, action $2$ was selected. This might also have been an $\varepsilon$ case, which by chance selected the highest valued action $2$. The best actions to take in the greedy case are now $3$ and $4$.
>$t=4$
>$A_4=2, R_4=2$
>$$Q_4(1)=-1, Q_4(2)=\frac{1}{3}, Q_4(3)=0, Q_4(4)=0$$
Since action $2$ was selected while it did not have the highest estimated value, this must have been an $\varepsilon$ case!
>$t=5$
>$A_5=3, R_5=0$
>$$Q_5(1)=-1, Q_5(2)=\frac{1}{3}, Q_5(3)=0, Q_5(4)=0$$
Since action $3$ was selected while it did not have the highest estimated value, this must have been an $\varepsilon$ case!
We can say with certainty that timesteps $t=4$ and $t=5$ were $\varepsilon$ cases. In an exploratory case, it's also possible that the highest valued action is chosen. This means that it's never possible to prove that a case was not exploratory. All of the timesteps $1,2,3,4,5$ might have been exploratory cases.
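A small sketch that replays the given action/reward sequence with incremental sample-average estimates and flags the steps at which the selected action could not have been greedy:
```python
history = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]   # (A_t, R_t) from the exercise
Q = {a: 0.0 for a in (1, 2, 3, 4)}
N = {a: 0 for a in (1, 2, 3, 4)}

for t, (a, r) in enumerate(history, start=1):
    greedy = {b for b in Q if Q[b] == max(Q.values())}
    definitely_exploratory = a not in greedy
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                           # incremental sample-average update
    print(f"t={t}: chose {a}, greedy set before update: {sorted(greedy)}, "
          f"definitely exploratory: {definitely_exploratory}")
```
Running this flags exactly $t=4$ and $t=5$ as definitely exploratory.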
# Exercise 2.3
*In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.*
In Figure 2.1 we can see that action $3$ has the highest average reward: $q_*(3)\approxeq1.55$.
Any $\varepsilon$-greedy method with $\varepsilon>0$ will in the long run learn the mean rewards $q_*(a)$:
$$
\lim_{t \to \infty} Q_t(a) = q_*(a)\;|\;\varepsilon>0
$$
This means that when step $t$ goes to infinity, the chance to select the most rewarding action $3$ only depends on $\varepsilon$:
$$
\lim_{t \to \infty} P(A_t=3)=(1-\varepsilon)+\varepsilon\cdot\frac{1}{|A|}\;|\;\varepsilon>0
$$
The average reward $R_t$ as $t$ goes to infinity would then be approximately (neglecting the rewards collected from non-optimal actions during exploration, whose true values are drawn from a zero-mean normal distribution and therefore contribute roughly zero on average):
$$
\lim_{t \to \infty} R_t =
\lim_{t \to \infty} P(A_t=3)\cdot q_*(3)=
\left((1-\varepsilon)+\varepsilon\cdot\frac{1}{|A|}\right)\cdot q_*(3)\;|\;\varepsilon>0
$$
Methods with a lower $\varepsilon$ will therefore obtain a higher average reward in the long run. We can calculate the limiting average reward $R_t$ for the two values of $\varepsilon$ given in Figure 2.2:
$$
\varepsilon = 0.1 \Rightarrow
\lim_{t \to \infty}R_t = 1.55\cdot \left((1-0.1)+ 0.1 \cdot \frac{1}{10} \right)=
1.55\cdot0.91\approxeq1.41
$$
$$
\varepsilon = 0.01 \Rightarrow
\lim_{t \to \infty}R_t = 1.55\cdot \left((1-0.01)+ 0.01 \cdot \frac{1}{10} \right)=
1.55\cdot0.991\approxeq1.536
$$
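The same arithmetic as a quick script (the value $q_*(3) \approxeq 1.55$ is read off Figure 2.1):
```python
q_star_best, n_actions = 1.55, 10
for eps in (0.1, 0.01):
    p_best = (1 - eps) + eps / n_actions           # limiting probability of the best action
    print(f"eps={eps}: P(best action) = {p_best:.3f}, "
          f"limiting average reward = {p_best * q_star_best:.3f}")
```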
# Exercise 2.4
*If the step-size parameters, $\alpha_n$, are not constant, then the estimate $Q_n$ is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?*
$$\begin{aligned}
Q_{n+1} = & \; Q_n + \alpha_n \left[R_n - Q_n\right] \\
= & \; \alpha_n R_n +\left(1-\alpha_n\right)Q_n \\
= & \; \alpha_n R_n+ (1-\alpha_n)\left[\alpha_{n-1}R_{n-1}+(1-\alpha_{n-1})Q_{n-1}\right] \\
= & \; \alpha_n R_n+(1-\alpha_n)\alpha_{n-1}R_{n-1}+(1-\alpha_n)(1-\alpha_{n-1})Q_{n-1} \\
= & \; \alpha_n R_n+(1-\alpha_n)\alpha_{n-1}R_{n-1}+(1-\alpha_n)(1-\alpha_{n-1})\left[\alpha_{n-2}R_{n-2}+(1-\alpha_{n-2})Q_{n-2}\right] \\
= & \; \alpha_n R_n+(1-\alpha_n)\alpha_{n-1}R_{n-1}+(1-\alpha_n)(1-\alpha_{n-1})\alpha_{n-2} R_{n-2} \\
& + (1-\alpha_n)(1-\alpha_{n-1})(1-\alpha_{n-2})Q_{n-2} \\
= & \; \sum_{i=1}^{n}\left[R_i\alpha_i\prod_{j=i+1}^{n}(1-\alpha_{j})\right] + Q_1\prod_{i=1}^n(1-\alpha_i)
\end{aligned}$$
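A quick numerical check of this closed form against the incremental update, using an arbitrary made-up step-size sequence:
```python
import random
from math import prod

random.seed(0)
n = 8
alphas = [random.uniform(0.1, 0.9) for _ in range(n)]   # alpha_1 .. alpha_n
rewards = [random.gauss(0, 1) for _ in range(n)]        # R_1 .. R_n
Q1 = 5.0

# Incremental form: Q_{k+1} = Q_k + alpha_k [R_k - Q_k]
Q = Q1
for a, r in zip(alphas, rewards):
    Q += a * (r - Q)

# Closed form derived above
closed = sum(rewards[i] * alphas[i] * prod(1 - alphas[j] for j in range(i + 1, n))
             for i in range(n)) + Q1 * prod(1 - a for a in alphas)

print(Q, closed)   # the two agree up to floating-point error
```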
# Exercise 2.5 (programming)
See [exercise_2-5.py](./exercise_2-5.py) for code.
![](./exercise_2-5.png)
# Exercise 2.6: Mysterious Spikes
*The results shown in Figure 2.3 should be quite reliable
because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks.
Why, then, are there oscillations and spikes in the early part of the curve for the optimistic
method? In other words, what might make this method perform particularly better or
worse, on average, on particular early steps?*
While the optimistic initial estimates are all being pushed down and the algorithm is still exploring, the estimate of the best action decreases the least, causing it to be chosen more often than the others. (This causes the peak.) Right after that, the estimate of the best action drops below some of the other, still overly optimistic estimates, so less optimal actions are chosen more often again. (This causes the dip after the peak.) The effect fades out quite quickly.
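A rough sketch along the lines of Figure 2.3 (purely greedy selection, $Q_1=5$, $\alpha=0.1$; the run and step counts here are arbitrary) reproduces the early spike and dip in the fraction of optimal actions:
```python
import random

runs, steps, k, alpha, q0 = 2000, 20, 10, 0.1, 5.0
optimal_counts = [0] * steps

for _ in range(runs):
    q_star = [random.gauss(0, 1) for _ in range(k)]   # true action values of one bandit task
    best = q_star.index(max(q_star))
    Q = [q0] * k                                      # optimistic initial estimates
    for t in range(steps):
        a = Q.index(max(Q))                           # purely greedy selection
        optimal_counts[t] += (a == best)
        reward = random.gauss(q_star[a], 1)
        Q[a] += alpha * (reward - Q[a])               # constant step-size update

for t, count in enumerate(optimal_counts, start=1):
    print(t, round(count / runs, 2))                  # fraction of runs picking the optimal action
```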
# Exercise 2.7: Unbiased Constant-Step-Size Trick
*Carry out an analysis like that in (2.6) to show that $Q_n$ is an exponential recency-weighted average without initial bias.*
<!-- $$\begin{aligned}
\bar{o}_n = & \; \bar{o}_{n-1} + \alpha(1-\bar{o}_{n-1}) \\
= & \; \alpha + (1-\alpha)\bar{o}_{n-1} \\
= & \; \alpha + (1-\alpha)\left[\alpha + (1-\alpha)\bar{o}_{n-2}\right] \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\bar{o}_{n-2} \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\left[\alpha + (1-\alpha)\bar{o}_{n-3}\right] \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\alpha+(1-\alpha)^3\bar{o}_{n-3} \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\alpha+\dots+
(1-\alpha)^{n-1}\alpha+(1-\alpha)^{n}\bar{o}_{0} \\
= & \; \sum_{i=0}^{n-1}{\alpha(1-\alpha)^i} \\
= & \; \alpha\sum_{i=0}^{n-1}{(1-\alpha)^i} \\
\end{aligned}$$
$$\begin{aligned}
\beta_n\doteq & \; \frac{\alpha}{\bar{o}_n} \\
= & \; \frac{\alpha}{\alpha\sum_{i=0}^{n-1}{(1-\alpha)^i}} \\
= & \; \frac{1}{\sum_{i=0}^{n-1}{(1-\alpha)^i}} \\
\end{aligned}$$ -->
$$\begin{aligned}
Q_{n+1} = & \; Q_n + \beta_n \left[R_n - Q_n\right] \\
= & \; \beta_n R_n+\left(1-\beta_n\right)Q_n \\
= & \; \frac{\alpha}{\bar{o}_n} R_n+\left(1-\frac{\alpha}{\bar{o}_n}\right)Q_n \\
\Rightarrow \bar{o}_n Q_{n+1} = & \; \alpha R_n+\left(\bar{o}_n-\alpha\right)Q_n \\
= & \; \alpha R_n+\left(\alpha + (1-\alpha)\bar{o}_{n-1}-\alpha\right)Q_n \\
= & \; \alpha R_n+\left(1-\alpha\right)\bar{o}_{n-1}Q_n \\
= & \; \alpha R_n+\left(1-\alpha\right)\left[\alpha R_{n-1}+(1-\alpha)\bar{o}_{n-2}Q_{n-1}\right] \\
= & \; \alpha R_n+\left(1-\alpha\right)\alpha R_{n-1}+(1-\alpha)^2\bar{o}_{n-2}Q_{n-1} \\
= & \; \left[\sum_{i=0}^{n-1}{\alpha R_{n-i}(1-\alpha)^i}\right]+(1-\alpha)^n\bar{o}_0Q_1 \\
= & \; \sum_{i=0}^{n-1}{\alpha R_{n-i}(1-\alpha)^i} \qquad \text{(since } \bar{o}_0 = 0\text{)} \\
\Rightarrow Q_{n+1} = & \; \beta_n \sum_{i=0}^{n-1}{ R_{n-i}(1-\alpha)^i}
\end{aligned}$$
As this expression no longer depends on $Q_1$ (the $Q_1$ term vanishes because $\bar{o}_0=0$), there is no initial bias, and the rewards are weighted by the exponentially decaying factors $(1-\alpha)^i$.
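A short numerical check that the $\beta_n$ step sizes remove the dependence on $Q_1$ (the constant $\alpha$ and the reward sequence are arbitrary):
```python
import random

random.seed(1)
alpha = 0.1
rewards = [random.gauss(0, 1) for _ in range(50)]

def run(Q1):
    Q, o_bar = Q1, 0.0                        # o_bar_0 = 0, as in the exercise
    for r in rewards:
        o_bar = o_bar + alpha * (1 - o_bar)   # trace of the step sizes
        beta = alpha / o_bar
        Q += beta * (r - Q)
    return Q

print(run(0.0), run(100.0))   # agree up to floating-point rounding: Q_1 has no influence
```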
# Exercise 2.8: UCB Spikes
*In Figure 2.4 the UCB algorithm shows a distinct spike
in performance on the 11th step. Why is this? Note that for your answer to be fully
satisfactory it must explain both why the reward increases on the 11th step and why it
decreases on the subsequent steps.
Hint: If c = 1, then the spike is less prominent.*
The graph was generated with $c=2$; that value is also used in this example.
The number of actions is $|A| = 10$, and the highest mean action reward is $q_* \approxeq 1.55$ (Gaussian reward distributions).
As written on page 36: *If $N_t(a) = 0$, then $a$ is considered to be a maximizing action.*
This means that up to timestep 10 each of the actions will be explored once; there won't be any "greedy" selection of one of the better actions.
We can calculate the second term in equation $2.10$ for different timesteps.
$$\begin{aligned}
& A_t \doteq \argmax_a\left[ Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}} \; \right] \; (2.10) \\
& t = 10 \Rightarrow c\sqrt{\frac{\ln t}{N_t(a)}} = 2 \sqrt{\ln{10}} \approxeq 3.03
\end{aligned}$$
At timestep 10 all the actions will have been explored exactly once. Their second terms $c\sqrt{\frac{\ln t}{N_t(a)}}$ will thus all have the same value. Since the action selection maximizes equation 2.10, only the term $Q_t(a)$ will decide what action is chosen. The action with the highest single reward will thus always be chosen at timestep 11.
For the action that has been chosen twice, the second term will now be:
$$
t = 11 \Rightarrow c\sqrt{\frac{\ln t}{N_t(a)}} = 2 \sqrt{\frac{\ln{11}}{2}} \approxeq 2.19
$$
While for the other actions that have been chosen only once, the term will be:
$$
t = 11 \Rightarrow c\sqrt{\frac{\ln t}{N_t(a)}} = 2 \sqrt{\ln{11}} \approxeq 3.10
$$
The actions that have been selected only once will thus be preferred from the 12th timestep up until around the 20th, because of their larger exploration bonus. (This causes the dip.)
We can conclude that with high values of $c$ the exploration term will dictate the actions chosen during the earlier timesteps. If the value of $c$ was chosen lower in the previous example, we would expect the values of $Q_t(a)$ to be the same order of magnitude as the exploration term during the earliest timesteps, thus making the peak and dip less prominent.
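The exploration bonuses quoted above, computed for both $c=2$ and the hinted $c=1$:
```python
from math import log, sqrt

for c in (2, 1):
    print(f"c={c}:",
          f"t=10, N=1 -> {c * sqrt(log(10) / 1):.2f};",   # bonus while every action has one pull
          f"t=11, N=2 -> {c * sqrt(log(11) / 2):.2f};",   # bonus of the action picked twice
          f"t=11, N=1 -> {c * sqrt(log(11) / 1):.2f}")    # bonus of the actions picked once
```
For $c=1$ the gap between the bonuses is roughly halved, which is consistent with the spike being less prominent.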
# Exercise 2.9
*Show that in the case of two actions, the soft-max distribution is the same
as that given by the logistic, or sigmoid, function often used in statistics and artificial
neural networks.*
The sigmoid function:
$$
S(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}
$$
In the case of two actions, $a$ and $b$:
$$\begin{aligned}
\pi_t(a)&=\frac{e^{H_t(a)}}{e^{H_t(a)}+e^{H_t(b)}} \\
&= \frac{e^{H_t(a)-H_t(b)}}{e^{H_t(a)-H_t(b)}+1} \\
\Rightarrow \pi_t(x)&= \frac{e^{x}}{e^{x}+1} \; | \; x=H_t(a)-H_t(b) \\
\end{aligned}$$
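A quick numeric confirmation with a few arbitrary preference pairs:
```python
from math import exp

def softmax_prob_a(h_a, h_b):
    # Two-action soft-max probability of action a
    return exp(h_a) / (exp(h_a) + exp(h_b))

def sigmoid(x):
    return 1 / (1 + exp(-x))

for h_a, h_b in [(0.0, 0.0), (2.0, -1.0), (-0.5, 1.5)]:
    print(softmax_prob_a(h_a, h_b), sigmoid(h_a - h_b))   # the pairs match
```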
# Exercise 2.10
*Suppose you face a 2-armed bandit task whose true action values change
randomly from time step to time step. Specifically, suppose that, for any time step, the
true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A),
and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you
face at any step, what is the best expectation of success you can achieve and how should
you behave to achieve it? Now suppose that on each step you are told whether you are
facing case A or case B (although you still don't know the true action values). This is an
associative search task. What is the best expectation of success you can achieve in this
task, and how should you behave to achieve it?*
The average reward is the same for both arms (0.5, with both cases combined). Without knowing which case we face, the best we can do is select either action (at random or not) and obtain an expected reward of 0.5.
If we know which case we are facing, we can split the problem into two separate multi-armed bandit problems. For case A we'd choose action 2, with value 0.2; in case B we'd choose action 1, with value 0.9. Optimally, we would then obtain an expected reward of $0.5 \cdot(0.2+0.9)=0.55$.
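The same expectations as a tiny script (the dictionary layout is just for illustration):
```python
cases = {"A": {1: 0.1, 2: 0.2}, "B": {1: 0.9, 2: 0.8}}   # true action values per case
p_case = 0.5

# Case unknown: any fixed (or random) choice of action has the same expectation.
blind = p_case * cases["A"][1] + p_case * cases["B"][1]
# Case known: pick the better action in each case.
informed = p_case * max(cases["A"].values()) + p_case * max(cases["B"].values())
print(blind, informed)   # 0.5 and 0.55
```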
# Exercise 2.11 (programming)
#TODO
# Exercise 3.1
*Devise three example tasks of your own that fit into the MDP framework,
identifying for each its states, actions, and rewards. Make the three examples as different
from each other as possible. The framework is abstract and flexible and can be applied in
many different ways. Stretch its limits in some way in at least one of your examples.*
1) A little quadruped robot that needs to run around.
    * Actions: joint torques or positions.
    * State: robot position and joint positions.
    * Reward: how far the robot has run.
2) A stock market trading machine.
    * Actions: buying and selling stock.
    * State: current stock market prices, indexes, etc.
    * Reward: increase in value of the portfolio.
3) A roof window controller.
    * Actions: opening or closing the window by a certain number of degrees.
    * State: current room temperature, humidity, rain sensor readings.
    * Reward: negative reward proportional to the deviation from the temperature setpoint, plus a strong negative reward if the window is open more than x degrees when it rains.
# Exercise 3.2
*Is the MDP framework adequate to usefully represent all goal-directed
learning tasks? Can you think of any clear exceptions?*
#TODO
# Exercise 3.3
*Consider the problem of driving. You could define the actions in terms of
the accelerator, steering wheel, and brake, that is, where your body meets the machine.
Or you could define them farther out—say, where the rubber meets the road, considering
your actions to be tire torques. Or you could define them farther in—say, where your
brain meets your body, the actions being muscle twitches to control your limbs. Or you
could go to a really high level and say that your actions are your choices of where to drive.
What is the right level, the right place to draw the line between agent and environment?
On what basis is one location of the line to be preferred over another? Is there any
fundamental reason for preferring one location over another, or is it a free choice?*
It depends on what kind of agent you want. If you want an agent that can physically drive a car using muscles, you will have to go fairly low-level, whereas if you want an agent that plans how to get from point A to point B, the actions will be at a higher level.
# Exercise 3.4
*Give a table analogous to that in Example 3.3, but for $p(s', r|s, a)$. It should have columns for $s$, $a$, $s'$, $r$, and $p(s', r|s, a)$, and a row for every 4-tuple for which $p(s', r|s, a) > 0$.*
| $s$ | $a$ | $s'$ | $r$ | $p(s', r\vert s,a)$ |
| ---- | -------- | ---- | ------------ | ------------------- |
| high | search | high | $r_{search}$ | $\alpha$ |
| high | search | low | $r_{search}$ | $1-\alpha$ |
| high | wait | high | $r_{wait}$ | $1$ |
| low | search | high | $-3$ | $1-\beta$ |
| low | search | low | $r_{search}$ | $\beta$ |
| low | wait | low | $r_{wait}$ | $1$ |
| low | recharge | high | $0$ | $1$ |
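The same table as a Python mapping, with a sanity check that the probabilities sum to one for each state-action pair; $\alpha$, $\beta$, $r_{search}$, and $r_{wait}$ are given arbitrary placeholder values:
```python
alpha, beta, r_search, r_wait = 0.7, 0.6, 1.0, 0.2   # placeholder values, not from the book

p = {  # (s, a) -> list of (s', r, probability)
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low",  "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
    ("low",  "wait"):     [("low", r_wait, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}

for (s, a), outcomes in p.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-12, (s, a)
print("all rows of p(s', r | s, a) sum to 1")
```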
# Exercise 3.5
*The equations in Section 3.1 are for the continuing case and need to be
modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).*
Original:
$$\sum_{s'\in S}\sum_{r\in R}p(s', r \vert s, a)=1, \quad \text{for all } s\in S, a\in A(s)$$
Since we now need to distinguish between the set of nonterminal states $S$ and the set including the terminal states, $S^+$, this becomes:
$$
\sum_{s'\in S^+}\sum_{r\in R}p(s', r \vert s, a)=1, \quad \text{for all } s\in S, a\in A(s)
$$
This is needed because the next state $s'$ can now also be a terminal state.
# Exercise 3.6
*Suppose you treated pole-balancing as an episodic task but also used
discounting, with all rewards zero except for -1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing
formulation of this task?*
Episodic, discounted:
$$
G_t=-\gamma^{T-t-1}
$$
Continuing, discounted. Suppose failure happens at time step $K$, recovery never happens, and a reward of $-1$ is received at every subsequent step.
Given $t \geq K$:
$$\begin{aligned}
G_t & = - 1 - \gamma - \gamma^2 \dots \\
&= -1\sum_{k=0}^{\infty}\gamma^k \\
&= \frac{-1}{1-\gamma}
\end{aligned}$$
Given $t < K$:
$$\begin{aligned}
G_t & = -\gamma^{K-t}(1+\gamma+\gamma^2+\dots) \\
& = -\frac{\gamma^{K-t}}{1-\gamma} \\
\end{aligned}$$
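Plugging arbitrary numbers into the two expressions above ($\gamma$, $T$, $K$, and $t$ are made up for illustration):
```python
gamma, T, K, t = 0.9, 10, 10, 0

episodic = -gamma ** (T - t - 1)               # a single -1 reward at the end of the episode
continuing = -gamma ** (K - t) / (1 - gamma)   # -1 on every step once the pole has fallen
print(f"episodic return:   {episodic:.3f}")
print(f"continuing return: {continuing:.3f}")
```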
# Exercise 3.7
*Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes (the successive runs through the maze), so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing
no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?*
Does the episode terminate after a certain max amount of timesteps?
Even if that's the case, the reward is very sparse: the agent only receives any reward at all once it manages to complete the maze, and gets 0 in every other case. In other words, the agent doesn't learn to complete mazes, because it is only rewarded once it can already complete the maze. Moreover, since the return (3.7) is undiscounted, an agent that eventually escapes receives the same return of +1 no matter how long it wanders, so there is no pressure to escape quickly. A negative reward for every time step spent in the maze would probably help the agent learn.
# Exercise 3.8
*Suppose $\gamma = 0.5$ and the following sequence of rewards is received $R_1=-1, R_2=2, R_3=6, R_4=3$, and $R_5=2$, with $T=5$. What are $G_0, G_1, \dots, G_5$? Hint: Work backwards.*
$$\begin{aligned}
G_5 &= 0 \\
G_4 &= R_5 = 2 \\
G_3 &= R_4 + \gamma G_4 = 4 \\
G_2 &= R_3 + \gamma G_3 = 8 \\
G_1 &= R_2 + \gamma G_2 = 6 \\
G_0 &= R_1 + \gamma G_1 = 2 \\
\end{aligned}$$
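The same backward recursion as a short script:
```python
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]           # R_1 .. R_5, with T = 5

G = 0.0
returns = [0.0]                      # G_5 = 0
for r in reversed(rewards):
    G = r + gamma * G                # G_t = R_{t+1} + gamma * G_{t+1}
    returns.append(G)
returns.reverse()                    # G_0 .. G_5
print(returns)                       # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```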
# Exercise 3.9
*Suppose $\gamma=0.9$ and the reward sequence is $R_1=2$ followed by an infinite sequence of 7s. What are $G_1$ and $G_0$?*
$$\begin{aligned}
G_1 &= 7 + 7\gamma + 7\gamma^2 \dots \\
&= 7(1+\gamma+\gamma^2\dots) \\
&= \frac{7}{1-\gamma} = 70
\end{aligned}$$
$$\begin{aligned}
G_0 &= 2 + \gamma G_1 \\
&= 65
\end{aligned}$$
# Exercise 3.10
*Prove the second equality in (3.10).*
Proving:
$$
\sum_{k=0}^{\infty}\gamma^k = \frac{1}{1-\gamma}
$$
$$\begin{aligned}
G_t & = 1 + \gamma + \gamma^2 + \dots \\
& = 1 + \gamma(1+\gamma+\gamma^2+\dots) \\
& = 1 + \gamma(G_t) \\
&\Rightarrow G_t - \gamma G_t = 1 \\
&\Rightarrow (1-\gamma)G_t = 1 \\
\Rightarrow G_t &= \frac{1}{1-\gamma}
\end{aligned}$$
# Exercise 3.11
*If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?*
$$
\mathbb{E}_{\pi}[R_{t+1} \vert S_t] = \sum_{a\in A(S_t)}\pi(a\vert S_t)\sum_{s'\in S}\sum_{r\in R}r\,p(s',\, r\vert S_t, a)
$$
# Exercise 3.12
*Give an equation for $v_{\pi}$ in terms of $q_{\pi}$ and $\pi$.*
$$
v_{\pi}(s) = \sum_{a \in A(s)}\pi(a\vert s)\, q_{\pi}(s, a)
$$
# Exercise 3.13
*Give an equation for $q_{\pi}$ in terms of $v_{\pi}$ and the four-argument $p$.*
$$
q_{\pi}(s,a) = \sum_{s', r}p(s', r\vert s,a)\left[r + \gamma\, v_{\pi}(s')\right]
$$
# Exercise 3.14
*The Bellman equation (3.14) must hold for each state for the value function
$v_{\pi}$ shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, -0.4, and +0.7. (These numbers are accurate only to one decimal place.)*
With $s$ the center state, $\gamma = 0.9$ (as in Example 3.5), and $r = 0$ for every move from the center, each row below computes one term $\pi(a\vert s)\,p(s',r\vert s,a)\left[r+\gamma\, v_{\pi}(s')\right]$ of the Bellman equation; the rows sum to $0.675 \approx 0.7$, matching the figure:
| $a$ | $\pi(a\vert s)$ | $p(s',r\vert s,a)$ | $r$ | $v_{\pi}(s')$ | Result |
| ----- | --------------- | ------------------ | --- | ------------- | --------- |
| north | 0.25 | 1 | 0 | 2.3 | 0.5175 |
| south | 0.25 | 1 | 0 | -0.4 | -0.09 |
| east | 0.25 | 1 | 0 | 0.4 | 0.09 |
| west | 0.25 | 1 | 0 | 0.7 | 0.1575 |
| | | | | **Sum:** | **0.675** |
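The same check in a few lines of Python (neighbour values read from Figure 3.2):
```python
gamma = 0.9
neighbours = {"north": 2.3, "south": -0.4, "east": 0.4, "west": 0.7}

# Bellman equation at the center state: equiprobable policy, deterministic moves, zero reward
v_center = sum(0.25 * 1.0 * (0 + gamma * v) for v in neighbours.values())
print(round(v_center, 4))   # 0.675, which rounds to the +0.7 shown in the figure
```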
# Exercise 3.15
1) *In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies.*
$$\begin{aligned}
G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\\
\Rightarrow \; G_{tc} &= R_{t+1}+c+\gamma (R_{t+2}+c)+\gamma^2 (R_{t+3}+c)+\dots \\
&= \sum_{k=0}^{\infty} \gamma^k(R_{t+k+1}+c) \\
&= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} + \sum_{k=0}^{\infty} \gamma^kc \\
&= G_t + v_c \\
\Rightarrow v_{\pi c}(s) &= \mathbb{E}_{\pi}\left[G_{tc} \vert S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[G_{t}+v_c\, \vert \, S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[G_{t}\, \vert \, S_t = s\right] + v_c
\end{aligned}$$
2) *What is $v_c$ in terms of $c$ and $\gamma$?*
$$
v_c = \sum_{k=0}^{\infty} \gamma^kc = \frac{c}{1-\gamma}
$$
# Exercise 3.16
*Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.*
At first guess: since the episodes do not necessarily have the same length, the offset might have a disproportionate effect.
$$\begin{aligned}
G_t &= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} \\
G_{tc} &= \sum_{k=0}^{T-t-1} \gamma^{k} (R_{t+k+1}+c) \\
&= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}
+ \sum_{k=0}^{T-t-1} \gamma^{k}c \\
&= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} + v_c(t,T)
\end{aligned}$$
Since $v_c$ now depends on the time step $t$ and on the episode's final time step $T$ (which varies from episode to episode), it cannot be separated out of the expectation in the definition of the state-value function:
$$\begin{aligned}
v_{\pi c}(s) &= \mathbb{E}_{\pi}\left[G_{tc} \vert S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[G_{t}+v_c(t,T)\, \vert \, S_t = s\right] \\
\end{aligned}$$
In other words: the offset on the expected value depends on the episode length! In an episodic task, adding a constant can therefore change which policies look best: in maze running, for example, a positive $c$ makes longer episodes accumulate more total reward, encouraging the agent to wander rather than escape.
# Exercise 3.17
*What is the Bellman equation for action values, that is, for $q_{\pi}$? It must give the action value $q_{\pi}(s, a)$ in terms of the action values, $q_{\pi}(s', a')$, of possible successors to the state-action pair $(s, a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.*
$$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[G_t \vert S_t=s, A_t=a] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1}\vert S_t=s, A_t=a] \\
&= \sum_{s',r}p(s',r\vert s,a)\left(r + \gamma \sum_{a'} \pi(a'\vert s')q_{\pi}(s', a')\right) \\
\end{aligned}$$
# Exercise 3.18
$$\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t\vert S_t=s] \\
&= \sum_{a\in A(s)} \pi(a\vert s)\, q_{\pi}(s,a) \\
\end{aligned}$$
# Exercise 3.19
$$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[G_t \vert S_t=s, A_t=a] \\
&= \mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \vert S_t=s, A_t=a] \\
&= \sum_{s',r} p(s',r\vert s,a)\left[r + \gamma v_{\pi}(s') \right] \\
\end{aligned}$$
# Exercise 3.20