add exercise 3.11
exercises.md
@@ -26,6 +26,7 @@
- [Exercise 3.8](#exercise-38)
- [Exercise 3.9](#exercise-39)
- [Exercise 3.10](#exercise-310)
- [Exercise 3.11](#exercise-311)
# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -417,9 +418,15 @@ $$\begin{aligned}
G_t & = 1 + \gamma + \gamma^2 + \dots \\
& = 1 + \gamma(1 + \gamma + \gamma^2 + \dots) \\
& = 1 + \gamma G_t \\
\Rightarrow G_t - \gamma G_t & = 1 \\
\Rightarrow (1 - \gamma) G_t & = 1 \\
\Rightarrow G_t & = \frac{1}{1 - \gamma}
\end{aligned}$$
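
As a quick sanity check on this closed form, the partial sums of the discounted return can be computed numerically. The sketch below (Python; the choices of $\gamma = 0.9$ and a 1000-term truncation are arbitrary) converges to $1/(1-\gamma) = 10$.

```python
# Numerical check of G_t = 1 / (1 - gamma) when every reward is +1.
# gamma = 0.9 and the 1000-term truncation are arbitrary illustrative choices.
gamma = 0.9
partial_sum = sum(gamma ** k for k in range(1000))  # 1 + gamma + gamma^2 + ...
print(partial_sum)       # ~10.0
print(1 / (1 - gamma))   # 10.0
```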
# Exercise 3.11
*If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?*
$$
\mathbb{E}[R_{t+1} \vert S_t] = \sum_{a \in \mathcal{A}(S_t)} \pi(a \vert S_t) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \, p(s', r \vert S_t, a)
$$
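
To check this formula numerically, the sketch below builds a tiny made-up MDP fragment (two actions, two next states, two reward values; all probabilities are invented for illustration), evaluates the sum directly, and compares it against a Monte Carlo estimate of $\mathbb{E}[R_{t+1} \vert S_t]$.

```python
import numpy as np

# A made-up MDP fragment for one fixed state S_t: 2 actions, 2 next states,
# 2 possible rewards. Every number here is invented for illustration only.
rewards = np.array([0.0, 1.0])          # the reward values r
pi = np.array([0.3, 0.7])               # pi(a | S_t) for a = 0, 1
# p[a, s', r_idx] = p(s', r | S_t, a); each slice p[a] sums to 1
p = np.array([[[0.1, 0.4],
               [0.2, 0.3]],
              [[0.5, 0.1],
               [0.3, 0.1]]])

# Direct evaluation: sum_a pi(a|S_t) * sum_{s', r} r * p(s', r | S_t, a)
expected_reward = sum(
    pi[a] * p[a, s_next, r_idx] * rewards[r_idx]
    for a in range(2) for s_next in range(2) for r_idx in range(2)
)
print(expected_reward)                  # 0.35 for the numbers above

# Monte Carlo check: sample a from pi, then (s', r) from p(. | S_t, a)
rng = np.random.default_rng(0)
total = 0.0
n = 100_000
for _ in range(n):
    a = rng.choice(2, p=pi)
    flat = rng.choice(4, p=p[a].ravel())
    s_next, r_idx = divmod(flat, 2)     # unflatten into (s', r) indices
    total += rewards[r_idx]
print(total / n)                        # should be close to 0.35
```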