# Table of contents

- [Table of contents](#table-of-contents)
- [Exercise 1.1: Self-Play](#exercise-11-self-play)
- [Exercise 1.2: Symmetries](#exercise-12-symmetries)
- [Exercise 1.3: Greedy Play](#exercise-13-greedy-play)
- [Exercise 1.4: Learning from Exploration](#exercise-14-learning-from-exploration)
- [Exercise 2.1](#exercise-21)
- [Exercise 2.2](#exercise-22)
- [Exercise 2.3](#exercise-23)
- [Exercise 2.4](#exercise-24)
- [Exercise 2.5](#exercise-25)

# Exercise 1.1: Self-Play

*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*

It would most certainly learn a different policy for selecting moves, since its opponent is no longer the same. It would slowly learn the values of game states $V(S_t)$, which in turn means it plays against an ever tougher opponent (itself). Since it learns slowly, I expect the update rule

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right] $$

will accommodate the changing opponent, and the player would slowly but surely become really good at tic-tac-toe.

# Exercise 1.2: Symmetries

*Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the learning process described above to take advantage of this? In what ways would this change improve the learning process? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?*

By taking the symmetry of the game into account, we can greatly reduce the set of game states $S$. Because the set is smaller, the learning algorithm would learn the values $v_*(S_t)$ faster (fewer choices to make, and the same states would receive more value updates).

However: if the opponent is not a perfect player and does not take symmetries into account itself, certain high-value game states may end up with a lower value estimate because they are lumped together with symmetric, low-value game states. Suppose we have a symmetric state $S_s$ that actually comprises 4 different rotated game states:

$$ S_s = \{S_1, S_2, S_3, S_4\} $$

with values

$$ v_*(S_1) = 0.5,\quad v_*(S_2) = 0,\quad v_*(S_3) = 0,\quad v_*(S_4) = 0 $$

With symmetries, the estimated value $V(S_s)$ would be

$$ \frac{\sum_{i=1}^4 v_*(S_i)}{|S_s|} = \frac{0.5}{4} = 0.125 $$

Instead of focusing on state $S_1$ with its high value, the algorithm might prefer other states with estimates higher than $0.125$. So when playing against imperfect players, it is not necessarily advantageous to take symmetries into account.

# Exercise 1.3: Greedy Play

*Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Might it learn to play better, or worse, than a nongreedy player? What problems might occur?*

A greedy player would not explore the complete state space, so its value estimates $V(S_i)$ could be way off. It would always pick the move leading to the state with the highest current value estimate (breaking ties at random) and never explore the alternatives, so early estimation errors would never be corrected. A greedy player would quite certainly be worse than a nongreedy player that does a *small* amount of exploration.
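To make the difference concrete, here is a minimal sketch (written for this write-up, not code from the book) of how such a player could pick its next state from the current value estimates. The names `V`, `candidate_states`, and `epsilon`, the assumption that states are hashable (e.g. board tuples), and the default value of `0.5` for unseen states are all choices made for this sketch; setting `epsilon=0` gives the purely greedy player discussed above.

```python
import random

def select_state(candidate_states, V, epsilon=0.1):
    """Pick the next board state from those reachable in one move.

    With probability epsilon, choose a random state (exploratory move);
    otherwise choose the state with the highest current value estimate.
    epsilon=0 reproduces the purely greedy player.
    """
    if random.random() < epsilon:
        return random.choice(candidate_states)                  # exploratory move
    return max(candidate_states, key=lambda s: V.get(s, 0.5))   # greedy move
```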
# Exercise 1.4: Learning from Exploration

*Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time (but not the tendency to explore), then the state values would converge to a different set of probabilities. What (conceptually) are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?*

If we learn from exploratory moves, the state values of above-average states would be lower; how much lower depends on the step-size parameter $\alpha$, the tendency to explore, and the average value of the lower-valued states reached by exploring. Conceptually, learning from exploratory moves estimates the winning probabilities of states given that we keep exploring, whereas not learning from them estimates the winning probabilities under purely greedy play from that state onward. Assuming we do continue to make exploratory moves, the first set describes the games we actually play, so learning it should lead to better move choices and more wins.

# Exercise 2.1

*In $\varepsilon$-greedy action selection, for the case of two actions and $\varepsilon = 0.5$, what is the probability that the greedy action is selected?*

Let $A = \{A_1,\, A_2\}$ be the set of actions, where $Q(A_1) > Q(A_2)$. The probability $P(A_1)$ of selecting the greedy action $A_1$ is then

$$ P(A_1) = (1-\varepsilon) + \varepsilon \cdot \frac{1}{|A|} = (1-0.5) + 0.5 \cdot \frac{1}{2} = 0.75 $$

# Exercise 2.2

*Bandit example. Consider a $k$-armed bandit problem with $k=4$ actions, denoted $1, 2, 3, 4$. Consider applying to this problem a bandit algorithm using $\varepsilon$-greedy action selection, sample-average action-value estimates, and initial estimates of $Q_1(a)=0$ for all $a$. Suppose the initial sequence of actions and rewards is $A_1=1,\, R_1=-1,\, A_2=2,\, R_2=1,\, A_3=2,\, R_3=-2,\, A_4=2,\, R_4=2,\, A_5=3,\, R_5=0$. On some of these time steps the $\varepsilon$ case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur? On which time steps could this possibly have occurred?*

Let's go over the estimates step by step, using the sample-average rule

$$ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}} $$

so that $Q_{t+1}(a)$ is the estimate after the reward at step $t$ has been observed, and $Q_1(a)=0$ are the initial estimates.

> Initial estimates
> $$Q_1(1)=0,\quad Q_1(2)=0,\quad Q_1(3)=0,\quad Q_1(4)=0$$

> $t=1$: $A_1=1,\; R_1=-1$
> $$Q_2(1)=-1,\quad Q_2(2)=0,\quad Q_2(3)=0,\quad Q_2(4)=0$$

> $t=2$: $A_2=2,\; R_2=1$
> $$Q_3(1)=-1,\quad Q_3(2)=1,\quad Q_3(3)=0,\quad Q_3(4)=0$$

Action $2$ is now the single greedy action and should be taken next unless the $\varepsilon$ case occurs.

> $t=3$: $A_3=2,\; R_3=-2$
> $$Q_4(1)=-1,\quad Q_4(2)=-0.5,\quad Q_4(3)=0,\quad Q_4(4)=0$$

As expected, action $2$ was selected. This might also have been an $\varepsilon$ case that happened to pick the highest-valued action $2$. The greedy actions are now $3$ and $4$.

> $t=4$: $A_4=2,\; R_4=2$
> $$Q_5(1)=-1,\quad Q_5(2)=\tfrac{1}{3},\quad Q_5(3)=0,\quad Q_5(4)=0$$

Since action $2$ was selected while it did not have the highest estimated value, this must have been an $\varepsilon$ case!

> $t=5$: $A_5=3,\; R_5=0$
> $$Q_6(1)=-1,\quad Q_6(2)=\tfrac{1}{3},\quad Q_6(3)=0,\quad Q_6(4)=0$$

Since action $3$ was selected while it did not have the highest estimated value, this must have been an $\varepsilon$ case!

We can say with certainty that time steps $t=4$ and $t=5$ were $\varepsilon$ cases. In an exploratory case it is also possible that the highest-valued action is chosen by chance, so it is never possible to prove that a step was *not* exploratory: all of the time steps $1, 2, 3, 4, 5$ might have been exploratory.
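To double-check the table above, here is a small script (a sketch written for this write-up, not code from the book) that recomputes the sample-average estimates and flags the steps on which the chosen action was not among the greedy actions, and which therefore must have been exploratory. The names `history`, `Q`, and `counts` are choices made for this sketch.

```python
from collections import defaultdict

# Observed sequence from the exercise: (action, reward) for t = 1..5.
history = [(1, -1), (2, 1), (2, -2), (2, 2), (3, 0)]
actions = [1, 2, 3, 4]

Q = defaultdict(float)      # sample-average estimates, initially 0 for all actions
counts = defaultdict(int)   # number of times each action has been taken so far

for t, (a, r) in enumerate(history, start=1):
    best = max(Q[b] for b in actions)
    greedy = [b for b in actions if Q[b] == best]
    if a not in greedy:
        print(f"t={t}: action {a} chosen, greedy set {greedy} -> definitely exploratory")
    else:
        print(f"t={t}: action {a} chosen, greedy set {greedy} -> could be either case")
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]   # incremental sample-average update
```

Running it reports steps $t=4$ and $t=5$ as definitely exploratory, matching the analysis above.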
# Exercise 2.3

*In the comparison shown in Figure 2.2, which method will perform best in the long run in terms of cumulative reward and probability of selecting the best action? How much better will it be? Express your answer quantitatively.*

In Figure 2.1 we can see that action $3$ has the highest mean reward: $q_*(3) \approx 1.55$. Any $\varepsilon$-greedy method with $\varepsilon > 0$ will in the long run learn the mean rewards $q_*(a)$:

$$ \lim_{t \to \infty} Q_t(a) = q_*(a) \quad (\varepsilon > 0) $$

This means that as $t$ goes to infinity, the probability of selecting the most rewarding action $3$ depends only on $\varepsilon$:

$$ \lim_{t \to \infty} P(A_t = 3) = (1-\varepsilon) + \varepsilon \cdot \frac{1}{|A|} $$

Ignoring the (small) contribution of the exploratory picks of the other arms, whose mean rewards roughly cancel out, the long-run expected reward per step is approximately

$$ \lim_{t \to \infty} \mathbb{E}[R_t] \approx \lim_{t \to \infty} P(A_t = 3) \cdot q_*(3) = \left( (1-\varepsilon) + \varepsilon \cdot \frac{1}{|A|} \right) \cdot q_*(3) $$

We can see that methods with a lower $\varepsilon$ will get a higher average reward in the distant future. For the two values of $\varepsilon$ given in Figure 2.2 (with $|A| = 10$):

$$ \varepsilon = 0.1 \;\Rightarrow\; \lim_{t \to \infty} P(A_t=3) = (1-0.1) + 0.1 \cdot \frac{1}{10} = 0.91, \qquad \lim_{t \to \infty} \mathbb{E}[R_t] \approx 1.55 \cdot 0.91 \approx 1.41 $$

$$ \varepsilon = 0.01 \;\Rightarrow\; \lim_{t \to \infty} P(A_t=3) = (1-0.01) + 0.01 \cdot \frac{1}{10} = 0.991, \qquad \lim_{t \to \infty} \mathbb{E}[R_t] \approx 1.55 \cdot 0.991 \approx 1.536 $$

# Exercise 2.4

*If the step-size parameters, $\alpha_n$, are not constant, then the estimate $Q_n$ is a weighted average of previously received rewards with a weighting different from that given by (2.6). What is the weighting on each prior reward for the general case, analogous to (2.6), in terms of the sequence of step-size parameters?*

$$\begin{aligned}
Q_{n+1} = & \; Q_n + \alpha_n \left[R_n - Q_n\right] \\
        = & \; \alpha_n R_n + (1-\alpha_n) Q_n \\
        = & \; \alpha_n R_n + (1-\alpha_n)\left[\alpha_{n-1} R_{n-1} + (1-\alpha_{n-1}) Q_{n-1}\right] \\
        = & \; \alpha_n R_n + (1-\alpha_n)\alpha_{n-1} R_{n-1} + (1-\alpha_n)(1-\alpha_{n-1}) Q_{n-1} \\
        = & \; \alpha_n R_n + (1-\alpha_n)\alpha_{n-1} R_{n-1} + (1-\alpha_n)(1-\alpha_{n-1})\left[\alpha_{n-2} R_{n-2} + (1-\alpha_{n-2}) Q_{n-2}\right] \\
        = & \; \alpha_n R_n + (1-\alpha_n)\alpha_{n-1} R_{n-1} + (1-\alpha_n)(1-\alpha_{n-1})\alpha_{n-2} R_{n-2} \\
          & + (1-\alpha_n)(1-\alpha_{n-1})(1-\alpha_{n-2}) Q_{n-2} \\
        = & \; \sum_{i=1}^{n}\left[R_i \, \alpha_i \prod_{j=i+1}^{n}(1-\alpha_j)\right] + Q_1 \prod_{i=1}^{n}(1-\alpha_i)
\end{aligned}$$

So the weight on each prior reward $R_i$ is $\alpha_i \prod_{j=i+1}^{n}(1-\alpha_j)$.

# Exercise 2.5

See [exercise_2-5.py](./exercise_2-5.py) for the code.

![Results of exercise 2.5](./exercise_2-5.png)
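The linked [exercise_2-5.py](./exercise_2-5.py) contains the actual experiment. For orientation only, here is a minimal, independent sketch of the kind of comparison the exercise asks for: an $\varepsilon$-greedy agent using sample averages versus one using a constant step size on a 10-armed testbed whose true values $q_*(a)$ start out equal and then take independent random walks. The function name, parameter defaults (number of runs, random-walk standard deviation, seed), and printed summary are choices made for this sketch and may differ from the linked file.

```python
import numpy as np

def run(steps=10_000, runs=100, k=10, eps=0.1, alpha=None, seed=0):
    """Average reward per step of an eps-greedy agent on a nonstationary
    k-armed bandit. alpha=None uses sample averages; otherwise a constant step size."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_true = np.zeros(k)                 # all q*(a) start out equal
        Q = np.zeros(k)                      # action-value estimates
        N = np.zeros(k)                      # action counts
        for t in range(steps):
            a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
            r = rng.normal(q_true[a], 1.0)
            N[a] += 1
            step = (1.0 / N[a]) if alpha is None else alpha
            Q[a] += step * (r - Q[a])
            q_true += rng.normal(0.0, 0.01, k)   # independent random walks
            avg_reward[t] += r
    return avg_reward / runs

sample_avg = run(alpha=None)
const_step = run(alpha=0.1)
print("mean reward over the last 1000 steps:")
print("  sample averages     :", sample_avg[-1000:].mean())
print("  constant alpha = 0.1:", const_step[-1000:].mean())
```

Because the true values keep drifting, the constant step size (which weights recent rewards more heavily) should end up with the higher reward in the second half of the run, which is the effect the plot above illustrates.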