add more exercises

2020-03-06 15:56:17 +01:00
parent 57e132deea
commit 8a4666d008

@@ -32,6 +32,10 @@
- [Exercise 3.14](#exercise-314)
- [Exercise 3.15](#exercise-315)
- [Exercise 3.16](#exercise-316)
- [Exercise 3.17](#exercise-317)
- [Exercise 3.18](#exercise-318)
- [Exercise 3.19](#exercise-319)
- [Exercise 3.20](#exercise-320)
# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -486,3 +490,48 @@ $$
# Exercise 3.16
*Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.*
At first guess: since episodes do not all have the same length, the offset should affect different episodes, and different states within an episode, by different amounts.
$$\begin{aligned}
G_t &= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} \\
G_{tc} &= \sum_{k=0}^{T-t-1} \gamma^{k} (R_{t+k+1}+c) \\
&= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} + c\sum_{k=0}^{T-t-1} \gamma^{k} \\
&= G_t + v_c(t,T)
\end{aligned}$$
Here $v_c(t,T) := c\sum_{k=0}^{T-t-1} \gamma^{k}$, which equals $c\,(T-t)$ for $\gamma = 1$ and $c\,\frac{1-\gamma^{T-t}}{1-\gamma}$ otherwise. Since $v_c$ depends on the remaining episode length $T-t$, and $T$ is random given $S_t = s$, it cannot be pulled out of the expectation in the state-value function as a single constant:
$$\begin{aligned}
v_{\pi c}(s) &= \mathbb{E}_{\pi}\left[G_{tc} \vert S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[G_{t} \vert S_t = s\right] + \mathbb{E}_{\pi}\left[v_c(t,T) \vert S_t = s\right] \\
&= v_{\pi}(s) + \mathbb{E}_{\pi}\left[v_c(t,T) \vert S_t = s\right]
\end{aligned}$$
In other words: the offset added to a state's value depends on the expected remaining episode length from that state, so the relative values of states (and hence the task) can change. Example: in maze running with a reward of $-1$ per step, adding $c = 1$ turns every reward into $0$, so every policy has return $0$ and the incentive to reach the goal quickly disappears.
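A quick numeric sketch of this effect (Python; not part of the original solution, all numbers are made up): the offset added to the return grows with the remaining episode length $T-t$.
```python
# Illustrative check: the offset a constant c adds to the return depends
# on the remaining episode length T - t (all numbers are made up).
gamma, c = 0.9, 1.0

def discounted_return(rewards, gamma):
    """Discounted sum of the reward sequence R_{t+1}, ..., R_T."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

for remaining in (2, 5, 50):                 # short vs. long remaining episode
    rewards = [-1.0] * remaining             # e.g. -1 per step in a maze
    g = discounted_return(rewards, gamma)
    g_c = discounted_return([r + c for r in rewards], gamma)
    v_c = c * (1 - gamma ** remaining) / (1 - gamma)   # closed form of v_c(t, T)
    print(f"T-t={remaining:2d}  G_t={g:7.3f}  G_tc={g_c:7.3f}  offset={g_c - g:6.3f}  v_c={v_c:6.3f}")
```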
# Exercise 3.17
*What is the Bellman equation for action values, that is, for $q_{\pi}$? It must give the action value $q_{\pi}(s, a)$ in terms of the action values, $q_{\pi}(s', a')$, of possible successors to the state-action pair $(s, a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.*
$$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[G_t \vert S_t=s, A_t=a] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1}\vert S_t=s, A_t=a] \\
&= \sum_{s',r}p(s',r\vert s,a)\left(r + \gamma\, \mathbb{E}_{\pi}[G_{t+1} \vert S_{t+1}=s']\right) \\
&= \sum_{s',r}p(s',r\vert s,a)\left(r + \gamma \sum_{a'} \pi(a'\vert s')q_{\pi}(s', a')\right)
\end{aligned}$$
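As a sanity check, the last equation can be iterated to a fixed point on a small made-up MDP (a sketch, not part of the original solution; the transition and reward tables below are invented for illustration):
```python
import numpy as np

# Fixed-point iteration of the Bellman equation for q_pi on a tiny,
# made-up two-state / two-action MDP (all numbers are illustrative).
gamma = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],      # p[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[[1.0, 0.0], [0.0, 2.0]],      # r[s, a, s']: expected rewards
              [[0.0, 1.0], [3.0, 0.0]]])
pi = np.full((2, 2), 0.5)                    # pi[s, a]: equiprobable policy

q = np.zeros((2, 2))
for _ in range(500):
    v_succ = (pi * q).sum(axis=1)                # sum_{a'} pi(a'|s') q(s', a'), per s'
    q = (p * (r + gamma * v_succ)).sum(axis=2)   # sum over s' of p * (r + gamma * ...)
print(q)                                         # converged q_pi(s, a) table
```
Here the sum over $(s',r)$ collapses to a sum over $s'$ because `r` stores the expected reward for each transition.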
# Exercise 3.18
$$\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t\vert S_t=s] \\
&= \sum_{a\in A(s)} \pi(a\vert s)\, q_{\pi}(s,a)
\end{aligned}$$
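In code this relation is just a policy-weighted average over actions (a minimal sketch with made-up numbers for a single state $s$):
```python
import numpy as np

# v_pi(s) as the policy-weighted average of the action values (made-up numbers).
pi_s = np.array([0.3, 0.7])   # pi(a|s) for two actions
q_s = np.array([1.0, 2.0])    # q_pi(s, a) for the same two actions
v_s = float(pi_s @ q_s)       # sum_a pi(a|s) * q_pi(s, a)
print(v_s)                    # 0.3*1.0 + 0.7*2.0 = 1.7
```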
# Exercise 3.19
$$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[G_t \vert S_t=s, A_t=a] \\
&= \mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \vert S_t=s, A_t=a] \\
&= \sum_{s',r} p(s',r\vert s,a)\left[r + \gamma v_{\pi}(s') \right]
\end{aligned}$$
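Analogously, this is a one-step look-ahead over the possible $(s', r)$ outcomes (a minimal sketch with made-up numbers for a single state-action pair):
```python
import numpy as np

# q_pi(s, a) as a one-step look-ahead over (s', r) outcomes (made-up numbers).
gamma = 0.9
p_sr = np.array([0.8, 0.2])      # p(s', r | s, a) for two possible outcomes
r_sr = np.array([1.0, 0.0])      # reward r of each outcome
v_succ = np.array([2.0, 5.0])    # v_pi(s') of each successor state
q_sa = float(p_sr @ (r_sr + gamma * v_succ))
print(q_sa)                      # 0.8*(1 + 0.9*2) + 0.2*(0 + 0.9*5) = 3.14
```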
# Exercise 3.20