From cdd613622dbeae5046b4a7ebd60058a2aad28b41 Mon Sep 17 00:00:00 2001
From: Bart Moyaers
Date: Wed, 4 Mar 2020 11:16:17 +0100
Subject: [PATCH] add exercise 3.11

---
 exercises.md | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/exercises.md b/exercises.md
index b8a95b9..1e15a6e 100644
--- a/exercises.md
+++ b/exercises.md
@@ -26,6 +26,7 @@
 - [Exercise 3.8](#exercise-38)
 - [Exercise 3.9](#exercise-39)
 - [Exercise 3.10](#exercise-310)
+- [Exercise 3.11](#exercise-311)
 
 # Exercise 1.1: Self-Play
 *Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -417,9 +418,15 @@
 $$\begin{aligned}
 G_t & = 1 + \gamma + \gamma^2 + \dots \\
 & = 1 + \gamma(1+\gamma+\gamma^2+\dots) \\
 & = 1 + \gamma(G_t) \\
-\Rightarrow G_t - \gamma G_t & = 1 \\
-\Rightarrow (1-\gamma)G_t &= 1 \\
+&\Rightarrow G_t - \gamma G_t = 1 \\
+&\Rightarrow (1-\gamma)G_t = 1 \\
 \Rightarrow G_t &= \frac{1}{1-\gamma}
 \end{aligned}$$
 
+# Exercise 3.11
+*If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?*
+
+$$
+\mathbb{E}_\pi[R_{t+1} \vert S_t] = \sum_{a\in \mathcal{A}(S_t)}\pi(a\vert S_t)\sum_{s'\in \mathcal{S}}\sum_{r\in \mathcal{R}} r\,p(s',\, r\vert S_t, a)
+$$
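A minimal numerical sketch of the Exercise 3.11 expression, assuming a made-up two-state MDP: the dynamics table `p`, the policy `pi`, and the state/action names below are illustrative only, not part of the exercise. It evaluates the closed-form expectation of $R_{t+1}$ and compares it with a Monte Carlo estimate, and also checks the Exercise 3.10 geometric-series sum against $1/(1-\gamma)$.

```python
# Sanity check for the Exercise 3.11 expression on a toy two-state MDP.
# The MDP, policy, and names are assumptions made up for illustration.
import random

# p[(s, a)] is a list of (s_next, reward, probability) triples, i.e. the
# four-argument dynamics p(s', r | s, a) for this toy MDP.
p = {
    ("s0", "left"):  [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "right"): [("s1", 2.0, 0.5), ("s0", -1.0, 0.5)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 1.0, 1.0)],
}
# pi[s][a] = pi(a | s), a stochastic policy.
pi = {"s0": {"left": 0.4, "right": 0.6}, "s1": {"left": 0.5, "right": 0.5}}

def expected_reward(s):
    """E[R_{t+1} | S_t = s] = sum_a pi(a|s) sum_{s', r} r * p(s', r | s, a)."""
    return sum(
        pi[s][a] * sum(prob * r for (_, r, prob) in p[(s, a)])
        for a in pi[s]
    )

def sample_reward(s):
    """Draw one R_{t+1}: sample a ~ pi(.|s), then (s', r) ~ p(., .|s, a)."""
    a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
    outcomes = p[(s, a)]
    _, r, _ = random.choices(outcomes, weights=[q for (_, _, q) in outcomes])[0]
    return r

random.seed(0)
n = 200_000
mc = sum(sample_reward("s0") for _ in range(n)) / n
print(f"closed form: {expected_reward('s0'):.4f}, Monte Carlo: {mc:.4f}")

# Exercise 3.10 check: the partial sums of gamma^k approach 1 / (1 - gamma).
gamma = 0.9
print(sum(gamma**k for k in range(1000)), 1 / (1 - gamma))
```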