add exercise 3.11
exercises.md
@@ -26,6 +26,7 @@
- [Exercise 3.8](#exercise-38)
- [Exercise 3.9](#exercise-39)
- [Exercise 3.10](#exercise-310)
- [Exercise 3.11](#exercise-311)
# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -417,9 +418,15 @@ $$\begin{aligned}
G_t & = 1 + \gamma + \gamma^2 + \dots \\
& = 1 + \gamma(1 + \gamma + \gamma^2 + \dots) \\
& = 1 + \gamma G_t \\
\Rightarrow G_t - \gamma G_t & = 1 \\
\Rightarrow (1 - \gamma) G_t & = 1 \\
\Rightarrow G_t & = \frac{1}{1 - \gamma}
\end{aligned}$$
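
As a quick sanity check on this closed form, the partial sums of the discounted return can be computed numerically. The sketch below (Python; the choices of $\gamma = 0.9$ and a 1000-term truncation are arbitrary) converges to $1/(1-\gamma) = 10$.

```python
# Numerical check of G_t = 1 / (1 - gamma) when every reward is +1.
# gamma = 0.9 and the 1000-term truncation are arbitrary illustrative choices.
gamma = 0.9
partial_sum = sum(gamma ** k for k in range(1000))  # 1 + gamma + gamma^2 + ...
print(partial_sum)       # ~10.0
print(1 / (1 - gamma))   # 10.0
```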
# Exercise 3.11
*If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?*
$$
\mathbb{E}[R_{t+1} \vert S_t] = \sum_{a \in \mathcal{A}(S_t)} \pi(a \vert S_t) \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} r \, p(s', r \vert S_t, a)
$$
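
To check this formula numerically, the sketch below builds a tiny made-up MDP fragment (two actions, two next states, two reward values; all probabilities are invented for illustration), evaluates the sum directly, and compares it against a Monte Carlo estimate of $\mathbb{E}[R_{t+1} \vert S_t]$.

```python
import numpy as np

# A made-up MDP fragment for one fixed state S_t: 2 actions, 2 next states,
# 2 possible rewards. Every number here is invented for illustration only.
rewards = np.array([0.0, 1.0])          # the reward values r
pi = np.array([0.3, 0.7])               # pi(a | S_t) for a = 0, 1
# p[a, s', r_idx] = p(s', r | S_t, a); each slice p[a] sums to 1
p = np.array([[[0.1, 0.4],
               [0.2, 0.3]],
              [[0.5, 0.1],
               [0.3, 0.1]]])

# Direct evaluation: sum_a pi(a|S_t) * sum_{s', r} r * p(s', r | S_t, a)
expected_reward = sum(
    pi[a] * p[a, s_next, r_idx] * rewards[r_idx]
    for a in range(2) for s_next in range(2) for r_idx in range(2)
)
print(expected_reward)                  # 0.35 for the numbers above

# Monte Carlo check: sample a from pi, then (s', r) from p(. | S_t, a)
rng = np.random.default_rng(0)
total = 0.0
n = 100_000
for _ in range(n):
    a = rng.choice(2, p=pi)
    flat = rng.choice(4, p=p[a].ravel())
    s_next, r_idx = divmod(flat, 2)     # unflatten into (s', r) indices
    total += rewards[r_idx]
print(total / n)                        # should be close to 0.35
```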