From cdd613622dbeae5046b4a7ebd60058a2aad28b41 Mon Sep 17 00:00:00 2001
From: Bart Moyaers
Date: Wed, 4 Mar 2020 11:16:17 +0100
Subject: [PATCH] add exercise 3.11

---
 exercises.md | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/exercises.md b/exercises.md
index b8a95b9..1e15a6e 100644
--- a/exercises.md
+++ b/exercises.md
@@ -26,6 +26,7 @@
 - [Exercise 3.8](#exercise-38)
 - [Exercise 3.9](#exercise-39)
 - [Exercise 3.10](#exercise-310)
+- [Exercise 3.11](#exercise-311)
 
 # Exercise 1.1: Self-Play
 *Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -417,9 +418,15 @@
 $$\begin{aligned}
 G_t & = 1 + \gamma + \gamma^2 + \dots \\
 & = 1 + \gamma(1+\gamma+\gamma^2+\dots) \\
 & = 1 + \gamma(G_t) \\
-\Rightarrow G_t - \gamma G_t & = 1 \\
-\Rightarrow (1-\gamma)G_t &= 1 \\
+&\Rightarrow G_t - \gamma G_t = 1 \\
+&\Rightarrow (1-\gamma)G_t = 1 \\
 \Rightarrow G_t &= \frac{1}{1-\gamma}
 \end{aligned}$$
 
+# Exercise 3.11
+*If the current state is $S_t$, and actions are selected according to stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?*
+
+$$
+\mathbb{E}_\pi[R_{t+1} \vert S_t] = \sum_{a\in \mathcal{A}(S_t)}\pi(a\vert S_t)\sum_{s'\in \mathcal{S}}\sum_{r\in \mathcal{R}} r\,p(s',\, r\vert S_t, a)
+$$
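A minimal numerical sketch of the Exercise 3.11 expression, assuming a made-up two-state MDP: the dynamics table `p`, the policy `pi`, and the state/action names below are illustrative only, not part of the exercise. It evaluates the closed-form expectation of $R_{t+1}$ and compares it with a Monte Carlo estimate, and also checks the Exercise 3.10 geometric-series sum against $1/(1-\gamma)$.

```python
# Sanity check for the Exercise 3.11 expression on a toy two-state MDP.
# The MDP, policy, and names are assumptions made up for illustration.
import random

# p[(s, a)] is a list of (s_next, reward, probability) triples, i.e. the
# four-argument dynamics p(s', r | s, a) for this toy MDP.
p = {
    ("s0", "left"):  [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "right"): [("s1", 2.0, 0.5), ("s0", -1.0, 0.5)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 1.0, 1.0)],
}
# pi[s][a] = pi(a | s), a stochastic policy.
pi = {"s0": {"left": 0.4, "right": 0.6}, "s1": {"left": 0.5, "right": 0.5}}

def expected_reward(s):
    """E[R_{t+1} | S_t = s] = sum_a pi(a|s) sum_{s', r} r * p(s', r | s, a)."""
    return sum(
        pi[s][a] * sum(prob * r for (_, r, prob) in p[(s, a)])
        for a in pi[s]
    )

def sample_reward(s):
    """Draw one R_{t+1}: sample a ~ pi(.|s), then (s', r) ~ p(., .|s, a)."""
    a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
    outcomes = p[(s, a)]
    _, r, _ = random.choices(outcomes, weights=[q for (_, _, q) in outcomes])[0]
    return r

random.seed(0)
n = 200_000
mc = sum(sample_reward("s0") for _ in range(n)) / n
print(f"closed form: {expected_reward('s0'):.4f}, Monte Carlo: {mc:.4f}")

# Exercise 3.10 check: the partial sums of gamma^k approach 1 / (1 - gamma).
gamma = 0.9
print(sum(gamma**k for k in range(1000)), 1 / (1 - gamma))
```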