add more exercises

Bart Moyaers
2020-02-24 16:09:49 +01:00
parent 63e85d2a60
commit 6ecca40771


@@ -10,6 +10,12 @@
- [Exercise 2.3](#exercise-23)
- [Exercise 2.4](#exercise-24)
- [Exercise 2.5](#exercise-25)
- [Exercise 2.6: Mysterious Spikes](#exercise-26-mysterious-spikes)
- [Exercise 2.7: Unbiased Constant-Step-Size Trick](#exercise-27-unbiased-constant-step-size-trick)
- [Exercise 2.8: UCB Spikes](#exercise-28-ucb-spikes)
- [Exercise 2.9](#exercise-29)
- [Exercise 2.10](#exercise-210)
- [Exercise 2.11 (programming)](#exercise-211-programming)
# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -144,3 +150,116 @@ Q_{n+1} = & \; Q_n + \alpha_n \left[R_n - Q_n\right] \\
See [exercise_2-5.py](./exercise_2-5.py) for code.
![](./exercise_2-5.png)
# Exercise 2.6: Mysterious Spikes
*The results shown in Figure 2.3 should be quite reliable
because they are averages over 2000 individual, randomly chosen 10-armed bandit tasks.
Why, then, are there oscillations and spikes in the early part of the curve for the optimistic
method? In other words, what might make this method perform particularly better or
worse, on average, on particular early steps?*
While the optimistic initial estimates are all being driven down and the algorithm is still exploring, the estimate of the best action decreases the least, so it gets chosen more often than the others. (This causes the peak.) Right after that, the estimate of the best action drops below some of the other, still overly optimistic estimates, and less optimal actions are chosen more often. (This causes the dip after the peak.) The effect fades out quite fast as the estimates approach the true values.
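A minimal simulation sketch of this effect (assuming the book's 10-armed testbed: true values drawn from a standard normal, unit-variance rewards, $Q_1 = 5$, $\alpha = 0.1$, purely greedy selection; the function name and parameters are illustrative):
```python
import numpy as np

def optimistic_greedy_runs(n_runs=2000, n_steps=100, k=10, q0=5.0, alpha=0.1, seed=0):
    """Fraction of runs choosing the optimal action at each step, greedy with Q1 = q0."""
    rng = np.random.default_rng(seed)
    optimal_chosen = np.zeros(n_steps)
    for _ in range(n_runs):
        q_true = rng.normal(0.0, 1.0, k)       # true action values q*(a)
        best = np.argmax(q_true)
        Q = np.full(k, q0)                     # optimistic initial estimates
        for t in range(n_steps):
            a = np.argmax(Q)                   # purely greedy selection (ties -> lowest index)
            r = rng.normal(q_true[a], 1.0)     # unit-variance reward
            Q[a] += alpha * (r - Q[a])         # constant-step-size update
            optimal_chosen[t] += (a == best)
    return optimal_chosen / n_runs

print(optimistic_greedy_runs()[:15])  # the spike and dip show up within the first ~15 steps
```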
# Exercise 2.7: Unbiased Constant-Step-Size Trick
*Carry out an analysis like that in (2.6) to show that $Q_n$ is an exponential recency-weighted
average without initial bias.*
Start from the recursion for $\bar{o}_n$ given in the exercise, with $\bar{o}_0 \doteq 0$:
$$\begin{aligned}
\bar{o}_n = & \; \bar{o}_{n-1} + \alpha(1-\bar{o}_{n-1}) \\
= & \; \alpha + (1-\alpha)\bar{o}_{n-1} \\
= & \; \alpha + (1-\alpha)\left[\alpha + (1-\alpha)\bar{o}_{n-2}\right] \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\bar{o}_{n-2} \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\left[\alpha + (1-\alpha)\bar{o}_{n-3}\right] \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\alpha+(1-\alpha)^3\bar{o}_{n-3} \\
= & \; \alpha + (1-\alpha)\alpha+(1-\alpha)^2\alpha+\dots+
(1-\alpha)^{n-1}\alpha+(1-\alpha)^{n}\bar{o}_{0} \\
= & \; \sum_{i=0}^{n-1}{\alpha(1-\alpha)^i} \\
= & \; \alpha\sum_{i=0}^{n-1}{(1-\alpha)^i} \\
\end{aligned}$$
$$\begin{aligned}
\beta_n\doteq & \; \frac{\alpha}{\bar{o}_n} \\
= & \; \frac{\alpha}{\alpha\sum_{i=0}^{n-1}{(1-\alpha)^i}} \\
= & \; \frac{1}{\sum_{i=0}^{n-1}{(1-\alpha)^i}} \\
\end{aligned}$$
Substituting $\beta_n$ into the update rule:
$$\begin{aligned}
Q_{n+1} = & \; Q_n + \beta_n \left[R_n - Q_n\right] \\
= & \; \beta_n R_n+\left(1-\beta_n\right)Q_n \\
= & \; \frac{\alpha}{\bar{o}_n} R_n+\left(1-\frac{\alpha}{\bar{o}_n}\right)Q_n \\
\Rightarrow \bar{o}_n Q_{n+1} = & \; \alpha R_n+\left(\bar{o}_n-\alpha\right)Q_n \\
= & \; \alpha R_n+\left(\alpha + (1-\alpha)\bar{o}_{n-1}-\alpha\right)Q_n \\
= & \; \alpha R_n+\left(1-\alpha\right)\bar{o}_{n-1}Q_n \\
= & \; \alpha R_n+\left(1-\alpha\right)\left[\alpha R_{n-1}+(1-\alpha)\bar{o}_{n-2}Q_{n-1}\right] \\
= & \; \alpha R_n+\left(1-\alpha\right)\alpha R_{n-1}+(1-\alpha)^2\bar{o}_{n-2}Q_{n-1} \\
= & \; \left[\sum_{i=0}^{n-1}{\alpha(1-\alpha)^i R_{n-i}}\right]+(1-\alpha)^n\bar{o}_0Q_1 \\
= & \; \sum_{i=0}^{n-1}{\alpha(1-\alpha)^i R_{n-i}} \\
\Rightarrow Q_{n+1} = & \; \beta_n \sum_{i=0}^{n-1}{(1-\alpha)^i R_{n-i}}
\end{aligned}$$
Since this expression no longer depends on $Q_1$, there is no initial bias. Moreover, the weights $\beta_n(1-\alpha)^i = \frac{(1-\alpha)^i}{\sum_{j=0}^{n-1}(1-\alpha)^j}$ sum to 1 and decay exponentially with $i$, so $Q_{n+1}$ is an exponential recency-weighted average of the rewards.
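A small sketch of the resulting update rule in code (the class and attribute names are illustrative, not from the book's pseudocode):
```python
class UnbiasedConstantStepSize:
    """Action-value estimate updated with beta_n = alpha / o_bar_n (no initial bias)."""

    def __init__(self, alpha=0.1, q_init=0.0):
        self.alpha = alpha
        self.o_bar = 0.0          # o_bar_0 = 0
        self.Q = q_init           # Q_1; its value cannot influence later estimates

    def update(self, reward):
        # o_bar_n = o_bar_{n-1} + alpha * (1 - o_bar_{n-1})
        self.o_bar += self.alpha * (1.0 - self.o_bar)
        beta = self.alpha / self.o_bar          # beta_1 = 1, so Q_2 = R_1 regardless of Q_1
        self.Q += beta * (reward - self.Q)
        return self.Q

# The first update fully overwrites the initial estimate:
a = UnbiasedConstantStepSize(q_init=100.0)
b = UnbiasedConstantStepSize(q_init=-100.0)
print(a.update(0.5), b.update(0.5))  # both print 0.5
```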
# Exercise 2.8: UCB Spikes
*In Figure 2.4 the UCB algorithm shows a distinct spike
in performance on the 11th step. Why is this? Note that for your answer to be fully
satisfactory it must explain both why the reward increases on the 11th step and why it
decreases on the subsequent steps.
Hint: If c = 1, then the spike is less prominent.*
The graph was generated with $c=2$; that value is also used in this example.
The number of actions is $|A| = 10$, and the highest true action value is around $q_* \approxeq 1.55$ (the true values are drawn from a Gaussian distribution).
As written on page 36: *If $N_t(a) = 0$, then $a$ is considered to be a maximizing action.*
This means that during the first 10 timesteps every action is explored once; there is no "greedy" selection of one of the better actions yet.
We can calculate the second term in equation $2.10$ for different timesteps.
$$\begin{aligned}
& A_t \doteq \argmax_a\left[ Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}} \; \right] \; (2.10) \\
& t = 10 \Rightarrow c\sqrt{\frac{\ln t}{N_t(a)}} = 2 \sqrt{\ln{10}} \approxeq 3.03
\end{aligned}$$
After the first 10 timesteps, every action has been selected exactly once, so their exploration terms $c\sqrt{\frac{\ln t}{N_t(a)}}$ all have the same value. Since the action selection maximizes equation 2.10, only $Q_t(a)$ decides which action is chosen: the action that happened to yield the highest single reward is always picked at timestep 11, which causes the spike in average reward.
For the action that has been chosen twice, the second term will now be:
$$
t = 11 \Rightarrow c\sqrt{\frac{\ln t}{N_t(a)}} = 2 \sqrt{\frac{\ln{11}}{2}} \approxeq 2.19
$$
For the other actions, which have been chosen only once, the term will be:
$$
t = 11 \Rightarrow c\sqrt{\frac{\ln t}{N_t(a)}} = 2 \sqrt{\ln{11}} \approxeq 3.10
$$
The exploration bonus of the nine other actions is thus about $0.9$ higher, which usually outweighs the differences in $Q_t(a)$, so the actions that have been chosen only once (mostly non-optimal ones) are preferred from the 12th timestep up until around the 20th. (This causes the dip.)
We can conclude that with high values of $c$ the exploration term dictates which actions are chosen during the earliest timesteps. If $c$ were chosen lower in the previous example, the values of $Q_t(a)$ would be of the same order of magnitude as the exploration term during those early timesteps, making the peak and dip less prominent.
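The bonus values above can be double-checked with a short standalone calculation (the helper name is made up for this example):
```python
import math

def ucb_bonus(c, t, n_a):
    """Exploration term c * sqrt(ln(t) / N_t(a)) from equation 2.10."""
    return c * math.sqrt(math.log(t) / n_a)

print(ucb_bonus(2, 10, 1))                        # ~3.03: all actions tied after one pull each
print(ucb_bonus(2, 11, 2))                        # ~2.19: the action just pulled a second time
print(ucb_bonus(2, 11, 1))                        # ~3.10: the other nine actions
print(ucb_bonus(1, 11, 1) - ucb_bonus(1, 11, 2))  # with c = 1 the gap shrinks to ~0.45
```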
# Exercise 2.9
*Show that in the case of two actions, the soft-max distribution is the same
as that given by the logistic, or sigmoid, function often used in statistics and artificial
neural networks.*
The sigmoid function:
$$
S(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}
$$
In the case of two actions, $a$ and $b$:
$$\begin{aligned}
\pi_t(a)&=\frac{e^{H_t(a)}}{e^{H_t(a)}+e^{H_t(b)}} \\
&= \frac{e^{H_t(a)-H_t(b)}}{e^{H_t(a)-H_t(b)}+1} \\
&= \frac{e^{x}}{e^{x}+1} = S(x), \quad \text{where } x=H_t(a)-H_t(b) \\
\end{aligned}$$
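A tiny numerical check of this identity, using arbitrary example preference values:
```python
import math

def softmax_two_actions(h_a, h_b):
    """Soft-max probability of choosing action a out of {a, b}."""
    return math.exp(h_a) / (math.exp(h_a) + math.exp(h_b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for h_a, h_b in [(0.3, -1.2), (2.0, 2.0), (-0.5, 1.7)]:
    assert math.isclose(softmax_two_actions(h_a, h_b), sigmoid(h_a - h_b))
print("soft-max over two actions equals the sigmoid of the preference difference")
```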
# Exercise 2.10
*Suppose you face a 2-armed bandit task whose true action values change
randomly from time step to time step. Specifically, suppose that, for any time step, the
true values of actions 1 and 2 are respectively 0.1 and 0.2 with probability 0.5 (case A),
and 0.9 and 0.8 with probability 0.5 (case B). If you are not able to tell which case you
face at any step, what is the best expectation of success you can achieve and how should
you behave to achieve it? Now suppose that on each step you are told whether you are
facing case A or case B (although you still don't know the true action values). This is an
associative search task. What is the best expectation of success you can achieve in this
task, and how should you behave to achieve it?*
The average reward is the same for both arms (0.5, averaged over both cases). While we cannot tell which case we are facing, any action choice (for example a random one) yields an expected reward of 0.5, which is the best expectation we can achieve.
If we know which case we are facing, we can split the problem into two separate multi-armed bandit problems. For case A we'd choose action 2, with value 0.2; in case B we'd go for action 1, with value 0.9. Optimally, we can thus achieve an expected reward of $0.5 \cdot(0.2+0.9)=0.55$.
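These expectations can be verified with a short calculation (the dictionary layout is just for illustration):
```python
# True action values per case; each case occurs with probability 0.5.
cases = {"A": {1: 0.1, 2: 0.2}, "B": {1: 0.9, 2: 0.8}}
p_case = 0.5

# Not knowing the case: any fixed choice of arm averages out to 0.5.
for action in (1, 2):
    print(action, p_case * cases["A"][action] + p_case * cases["B"][action])  # 0.5 and 0.5

# Knowing the case: pick the better arm in each case.
print(sum(p_case * max(values.values()) for values in cases.values()))        # 0.55
```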
# Exercise 2.11 (programming)