From 57e132deeab1f950a587d30f8c57d0928147dfc9 Mon Sep 17 00:00:00 2001
From: Bart Moyaers
Date: Thu, 5 Mar 2020 17:42:42 +0100
Subject: [PATCH] add more exercises

---
 exercises.md | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/exercises.md b/exercises.md
index 1e15a6e..8feb47f 100644
--- a/exercises.md
+++ b/exercises.md
@@ -27,6 +27,11 @@

- [Exercise 3.9](#exercise-39)
- [Exercise 3.10](#exercise-310)
- [Exercise 3.11](#exercise-311)
- [Exercise 3.12](#exercise-312)
- [Exercise 3.13](#exercise-313)
- [Exercise 3.14](#exercise-314)
- [Exercise 3.15](#exercise-315)
- [Exercise 3.16](#exercise-316)

# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*

@@ -430,3 +435,54 @@

$$
\mathbb{E}[R_{t+1}] = \sum_{r\in R(S_t)} \sum_{a\in A(S_t)} \sum_{s'\in S} r\,\pi(a\vert S_t)\,p(s',\, r\vert S_t, a)
$$

# Exercise 3.12
*Give an equation for $v_{\pi}$ in terms of $q_{\pi}$ and $\pi$.*

$$
v_{\pi}(s) = \sum_{a \in A(s)}\pi(a\vert s)\, q_{\pi}(s, a)
$$

# Exercise 3.13
*Give an equation for $q_{\pi}$ in terms of $v_{\pi}$ and the four-argument $p$.*

$$
q_{\pi}(s,a) = \sum_{s',\, r}p(s', r\vert s,a)\left[r + \gamma\, v_{\pi}(s')\right]
$$

(Both identities are checked numerically at the end of this file.)

# Exercise 3.14
*The Bellman equation (3.14) must hold for each state for the value function $v_{\pi}$ shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, -0.4, and +0.7. (These numbers are accurate only to one decimal place.)*

With $s$ the center state, the equiprobable policy, deterministic transitions, $r = 0$, and $\gamma = 0.9$ (as in Example 3.5), each row below is one term $\pi(a\vert s)\,p(s',r\vert s,a)\left[r + \gamma\, v_{\pi}(s')\right]$ of the Bellman equation:

| $a$   | $\pi(a\vert s)$ | $p(s',r\vert s,a)$ | $r$ | $v_{\pi}(s')$ | Term      |
| ----- | --------------- | ------------------ | --- | ------------- | --------- |
| north | 0.25            | 1                  | 0   | 2.3           | 0.5175    |
| south | 0.25            | 1                  | 0   | -0.4          | -0.09     |
| east  | 0.25            | 1                  | 0   | 0.4           | 0.09      |
| west  | 0.25            | 1                  | 0   | 0.7           | 0.1575    |
|       |                 |                    |     | **Sum:**      | **0.675** |

The sum, 0.675, rounds to the tabulated value of +0.7, so the Bellman equation holds to the stated one-decimal accuracy. (A short Python check of this computation also appears at the end of this file.)

# Exercise 3.15
1) *In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant $c$ to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies.*

Writing $G_{tc}$ and $v_{\pi c}$ for the return and value function under the shifted rewards $R_{t+k+1} + c$:

$$\begin{aligned}
G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\\
\Rightarrow \; G_{tc} &= R_{t+1}+c+\gamma (R_{t+2}+c)+\gamma^2 (R_{t+3}+c)+\dots \\
&= \sum_{k=0}^{\infty} \gamma^k(R_{t+k+1}+c) \\
&= \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} + \sum_{k=0}^{\infty} \gamma^k c \\
&= G_t + v_c \\
\Rightarrow v_{\pi c}(s) &= \mathbb{E}_{\pi}\left[G_{tc} \vert S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[G_{t}+v_c\, \vert \, S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[G_{t}\, \vert \, S_t = s\right] + v_c \\
&= v_{\pi}(s) + v_c
\end{aligned}$$

Every state value is shifted by the same constant $v_c$, so the relative values of states under any policy are unchanged: only the intervals between rewards matter, not their signs.

2) *What is $v_c$ in terms of $c$ and $\gamma$?*

$$
v_c = \sum_{k=0}^{\infty} \gamma^k c = \frac{c}{1-\gamma} \qquad (\gamma < 1)
$$

(This closed form is also checked numerically at the end of this file.)

# Exercise 3.16
*Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.*
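
# Numerical checks (Exercises 3.12-3.15)

The identities in Exercises 3.12 and 3.13 can be sanity-checked on any small MDP. The sketch below uses a made-up two-state, two-action MDP (the transition probabilities `P`, rewards `R`, policy `pi`, and `gamma` are all invented for illustration, and the expected-reward form $r(s,a,s')$ stands in for the full four-argument $p$, which does not change either identity): it solves the Bellman system for $v_{\pi}$, builds $q_{\pi}$ from $v_{\pi}$ as in Exercise 3.13, and confirms that averaging $q_{\pi}$ under $\pi$ recovers $v_{\pi}$ as in Exercise 3.12.

```python
import numpy as np

# Hypothetical two-state, two-action MDP, invented purely for illustration.
# P[s, a, s'] is the transition probability; R[s, a, s'] the expected reward
# for that transition (the expected-reward form of the four-argument p).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[ 1.0, 0.0], [0.0, 2.0]],
              [[-1.0, 1.0], [0.0, 0.5]]])
pi = np.array([[0.6, 0.4],   # pi(a|s) for s = 0
               [0.3, 0.7]])  # pi(a|s) for s = 1

# Policy-averaged transition matrix and expected one-step reward.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sat,sat->s', pi, P, R)

# v_pi solves the linear Bellman system (I - gamma * P_pi) v = r_pi.
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Exercise 3.13: q_pi(s,a) = sum_{s'} p(s'|s,a) [r(s,a,s') + gamma * v_pi(s')]
q = np.einsum('sat,sat->sa', P, R + gamma * v[None, None, :])

# Exercise 3.12: v_pi(s) = sum_a pi(a|s) q_pi(s,a)
v_from_q = np.einsum('sa,sa->s', pi, q)

print(v)          # state values from the Bellman solve
print(v_from_q)   # the same values recovered through q_pi
assert np.allclose(v, v_from_q)
```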
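
A similarly minimal sketch of the Exercise 3.14 backup for the center state, assuming the Example 3.5 setup (equiprobable policy, deterministic transitions, zero reward, $\gamma = 0.9$) and the one-decimal neighbor values read from Figure 3.2 (right):

```python
# Bellman backup for the center state of Example 3.5 (Exercise 3.14),
# assuming the equiprobable policy, deterministic moves, r = 0, gamma = 0.9.
gamma = 0.9
neighbors = {'north': 2.3, 'south': -0.4, 'east': 0.4, 'west': 0.7}

v_center = sum(0.25 * (0 + gamma * v_next) for v_next in neighbors.values())
print(round(v_center, 4))  # 0.675, which rounds to the tabulated +0.7
```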
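
Finally, the closed form $v_c = c/(1-\gamma)$ from Exercise 3.15 can be checked by truncating the geometric series; the particular $c$ and $\gamma$ below are arbitrary illustrative choices.

```python
# Truncated check of v_c = sum_k gamma^k * c = c / (1 - gamma)  (Exercise 3.15).
# The values of c and gamma are arbitrary illustrative choices.
c, gamma = 2.0, 0.9
partial_sum = sum(c * gamma**k for k in range(10_000))
print(partial_sum, c / (1 - gamma))  # both print (approximately) 20.0
```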