add exercise 3.11

Bart Moyaers
2020-03-04 11:16:17 +01:00
parent d2e2f3b94e
commit cdd613622d


@@ -26,6 +26,7 @@
- [Exercise 3.8](#exercise-38)
- [Exercise 3.9](#exercise-39)
- [Exercise 3.10](#exercise-310)
- [Exercise 3.11](#exercise-311)
# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -417,9 +418,15 @@ $$\begin{aligned}
G_t & = 1 + \gamma + \gamma^2 + \dots \\
& = 1 + \gamma(1+\gamma+\gamma^2+\dots) \\
& = 1 + \gamma(G_t) \\
\Rightarrow G_t - \gamma G_t & = 1 \\
\Rightarrow (1-\gamma)G_t & = 1 \\
\Rightarrow G_t & = \frac{1}{1-\gamma}
\end{aligned}$$
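
As a quick numerical sanity check of this closed form (a sketch, not part of the original solution, with $\gamma = 0.9$ chosen arbitrarily), a truncated version of the series should approach $\frac{1}{1-\gamma} = 10$:

```python
# Numerical check of G_t = 1 / (1 - gamma) for a constant reward of 1 per step.
# gamma and the number of terms are arbitrary illustrative choices.
gamma = 0.9
partial_sum = sum(gamma**k for k in range(10_000))  # truncated geometric series
closed_form = 1 / (1 - gamma)
print(partial_sum, closed_form)  # both approximately 10.0
```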
# Exercise 3.11
*If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?*
$$
\mathbb{E}[R_{t+1} \vert S_t] = \sum_{a\in A(S_t)} \pi(a\vert S_t) \sum_{s'\in S} \sum_{r\in R} r\, p(s',\, r\vert S_t, a)
$$
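
A minimal numerical sketch of this expectation, assuming a made-up two-action MDP whose policy and dynamics values are purely illustrative (none of the names or numbers come from the book):

```python
# Expected next reward under a stochastic policy pi and four-argument dynamics p.
# All probabilities and rewards below are arbitrary example values.

# pi[a] = pi(a | S_t)
pi = {"left": 0.4, "right": 0.6}

# p[(s_next, r, a)] = p(s_next, r | S_t, a); entries for each action sum to 1.
p = {
    ("s1", 0.0, "left"): 0.7, ("s2", 1.0, "left"): 0.3,
    ("s1", 0.0, "right"): 0.2, ("s2", 1.0, "right"): 0.8,
}

# E[R_{t+1} | S_t] = sum_a pi(a|S_t) * sum_{s', r} r * p(s', r | S_t, a)
expected_reward = sum(
    pi[a] * r * prob for (s_next, r, a), prob in p.items()
)
print(expected_reward)  # 0.4*0.3 + 0.6*0.8 = 0.6
```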