From d2e2f3b94e301eeff8ba6761309fa1b0004c0014 Mon Sep 17 00:00:00 2001
From: Bart Moyaers
Date: Tue, 3 Mar 2020 16:39:22 +0100
Subject: [PATCH] more exercises

---
 exercises.md | 166 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 163 insertions(+), 3 deletions(-)

diff --git a/exercises.md b/exercises.md
index 6185906..b8a95b9 100644
--- a/exercises.md
+++ b/exercises.md
@@ -9,13 +9,23 @@
 - [Exercise 2.2](#exercise-22)
 - [Exercise 2.3](#exercise-23)
 - [Exercise 2.4](#exercise-24)
-- [Exercise 2.5](#exercise-25)
+- [Exercise 2.5 (programming)](#exercise-25-programming)
 - [Exercise 2.6: Mysterious Spikes](#exercise-26-mysterious-spikes)
 - [Exercise 2.7: Unbiased Constant-Step-Size Trick](#exercise-27-unbiased-constant-step-size-trick)
 - [Exercise 2.8: UCB Spikes](#exercise-28-ucb-spikes)
 - [Exercise 2.9](#exercise-29)
 - [Exercise 2.10](#exercise-210)
 - [Exercise 2.11 (programming)](#exercise-211-programming)
+- [Exercise 3.1](#exercise-31)
+- [Exercise 3.2](#exercise-32)
+- [Exercise 3.3](#exercise-33)
+- [Exercise 3.4](#exercise-34)
+- [Exercise 3.5](#exercise-35)
+- [Exercise 3.6](#exercise-36)
+- [Exercise 3.7](#exercise-37)
+- [Exercise 3.8](#exercise-38)
+- [Exercise 3.9](#exercise-39)
+- [Exercise 3.10](#exercise-310)

 # Exercise 1.1: Self-Play
 *Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -146,7 +156,7 @@ Q_{n+1} = & \; Q_n + \alpha_n \left[R_n - Q_n\right] \\
 = & \; \sum_{i=1}^{n}\left[R_i\alpha_i\prod_{j=i+1}^{n}(1-\alpha_{j})\right] + Q_1\prod_{i=1}^n(1-\alpha_i)
 \end{aligned}$$

-# Exercise 2.5
+# Exercise 2.5 (programming)

 See [exercise_2-5.py](./exercise_2-5.py) for code.
 ![](./exercise_2-5.png)
@@ -193,7 +203,7 @@ Q_{n+1} = & \; Q_n + \beta_n \left[R_n - Q_n\right] \\
 = & \; \alpha R_n+\left(1-\alpha\right)\alpha R_n+(1-\alpha)^2\bar{o}_{n-2}Q_{n-1} \\
 = & \; \left[\sum_{i=0}^{n-1}{\alpha R_n(1-\alpha)^i}\right]+(1-\alpha)^n\bar{o}_0Q_1 \\
 = & \; \sum_{i=0}^{n-1}{\alpha R_n(1-\alpha)^i} \\
-\Rightarrow Q_{n+1} = & \; \beta_n \sum_{i=0}^{n-1}{\alpha R_n(1-\alpha)^i}
+\Rightarrow Q_{n+1} = & \; \beta_n \sum_{i=0}^{n-1}{R_n(1-\alpha)^i}
 \end{aligned}$$

 As this expression does not depend on $Q_1$ there should be no initial bias.
@@ -263,3 +273,153 @@ The average reward is the same for both arms (0.5, both cases combined). While n
 If we know what case we are facing, it's possible to split the problem into two separate multi-armed bandit problems. This means that for case A we'd choose action 2, resulting in a value of 0.2. In case B we'd go for action 1, resulting in a value of 0.9. Optimally, we would be able to get an expected return of $0.5 \cdot(0.2+0.9)=0.55$.

 # Exercise 2.11 (programming)
+#TODO
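+
+As a possible starting point (not the full parameter study the exercise asks for), here is a minimal sketch. It assumes the nonstationary testbed from Exercise 2.5 (all $q_*(a)$ start out equal and take independent random walks with standard deviation 0.01) and uses the average reward over the last half of the run as the performance measure; the step count, `eps`, and `walk_std` values below are arbitrary choices, and sweeping the parameters and adding the other algorithms would give a figure analogous to Figure 2.6.
+
+```python
+import numpy as np
+
+
+def run_bandit(steps=200_000, k=10, eps=0.1, alpha=None, walk_std=0.01, seed=0):
+    """eps-greedy on a nonstationary k-armed bandit.
+
+    alpha=None uses sample averages; a float uses that constant step size.
+    Returns the average reward over the last half of the run.
+    """
+    rng = np.random.default_rng(seed)
+    q_true = np.zeros(k)   # all true action values start out equal
+    Q = np.zeros(k)        # action-value estimates
+    N = np.zeros(k)        # action counts (for the sample-average case)
+    rewards = np.zeros(steps)
+    for t in range(steps):
+        a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
+        r = rng.normal(q_true[a], 1.0)
+        rewards[t] = r
+        N[a] += 1
+        step = 1.0 / N[a] if alpha is None else alpha
+        Q[a] += step * (r - Q[a])
+        q_true += rng.normal(0.0, walk_std, size=k)  # random-walk drift
+    return rewards[steps // 2:].mean()
+
+
+if __name__ == "__main__":
+    print("sample averages     :", run_bandit(alpha=None))
+    print("constant alpha = 0.1:", run_bandit(alpha=0.1))
+```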
+
+# Exercise 3.1
+*Devise three example tasks of your own that fit into the MDP framework,
+identifying for each its states, actions, and rewards. Make the three examples as different
+from each other as possible. The framework is abstract and flexible and can be applied in
+many different ways. Stretch its limits in some way in at least one of your examples.*
+
+1) Little quadruped robot that needs to run around.
+   * Actions: joint torques or positions.
+   * State: robot position and joint positions.
+   * Reward: how far the robot has run.
+2) Stock market trading machine.
+   * Actions: buying and selling stocks.
+   * State: current stock market prices, indexes, etc.
+   * Reward: increase in value of the portfolio.
+3) Roof window controller.
+   * Actions: opening or closing the window by a certain number of degrees.
+   * State: current room temperature, humidity, and rain sensor reading.
+   * Reward: the negative of the deviation from the temperature setpoint, plus a strong negative reward if the window is open more than x degrees while it is raining.
+
+# Exercise 3.2
+*Is the MDP framework adequate to usefully represent all goal-directed
+learning tasks? Can you think of any clear exceptions?*
+
+#TODO
+
+# Exercise 3.3
+*Consider the problem of driving. You could define the actions in terms of
+the accelerator, steering wheel, and brake, that is, where your body meets the machine.
+Or you could define them farther out—say, where the rubber meets the road, considering
+your actions to be tire torques. Or you could define them farther in—say, where your
+brain meets your body, the actions being muscle twitches to control your limbs. Or you
+could go to a really high level and say that your actions are your choices of where to drive.
+What is the right level, the right place to draw the line between agent and environment?
+On what basis is one location of the line to be preferred over another? Is there any
+fundamental reason for preferring one location over another, or is it a free choice?*
+
+It depends on what kind of agent you want to obtain. If you want to create an agent that can physically drive a car using muscles, you will have to go fairly low-level. If, on the other hand, you want an agent that is able to go from point A to point B, the actions will sit at a much higher level. There is no single fundamental dividing line; the choice depends on the purpose of the agent and on what is under its direct control.
+
+# Exercise 3.4
+*Give a table analogous to that in Example 3.3, but for $p(s', r \vert s, a)$. It
+should have columns for $s$, $a$, $s'$, $r$, and $p(s', r \vert s, a)$, and a row for every
+4-tuple for which $p(s', r \vert s, a) > 0$.*
+
+| $s$  | $a$      | $s'$ | $r$          | $p(s', r \vert s, a)$ |
+| ---- | -------- | ---- | ------------ | --------------------- |
+| high | search   | high | $r_{search}$ | $\alpha$              |
+| high | search   | low  | $r_{search}$ | $1-\alpha$            |
+| high | wait     | high | $r_{wait}$   | $1$                   |
+| low  | search   | high | $-3$         | $1-\beta$             |
+| low  | search   | low  | $r_{search}$ | $\beta$               |
+| low  | wait     | low  | $r_{wait}$   | $1$                   |
+| low  | recharge | high | $0$          | $1$                   |
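+
+The table can be checked mechanically. Below is a small sketch that encodes the dynamics above as a dictionary and verifies that the probabilities sum to one for every state-action pair, as (3.3) requires; the numeric values of $\alpha$, $\beta$, $r_{search}$ and $r_{wait}$ are arbitrary placeholders, since Example 3.3 keeps them symbolic.
+
+```python
+# Placeholder values; Example 3.3 keeps alpha, beta, r_search and r_wait symbolic.
+alpha, beta = 0.8, 0.6
+r_search, r_wait = 1.5, 0.5
+
+# p[(s, a)] is a list of (s', r, probability) triples, matching the table above.
+p = {
+    ("high", "search"):  [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
+    ("high", "wait"):    [("high", r_wait, 1.0)],
+    ("low", "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
+    ("low", "wait"):     [("low", r_wait, 1.0)],
+    ("low", "recharge"): [("high", 0.0, 1.0)],
+}
+
+# Equation (3.3): the probabilities must sum to 1 for every (s, a).
+for (s, a), outcomes in p.items():
+    total = sum(prob for _, _, prob in outcomes)
+    assert abs(total - 1.0) < 1e-12, (s, a, total)
+    print(f"p(. | {s}, {a}) sums to {total}")
+```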
+
+# Exercise 3.5
+*The equations in Section 3.1 are for the continuing case and need to be
+modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).*
+
+Original:
+$$\sum_{s'\in S}\sum_{r\in R}p(s', r \vert s, a)=1, \quad \text{for all } s\in S, a\in A(s)$$
+
+Since we now need to distinguish between the set of nonterminal states $S$ and the set $S^+$ that also includes the terminal states, this becomes:
+
+$$
+\sum_{s'\in S^+}\sum_{r\in R}p(s', r \vert s, a)=1, \quad \text{for all } s\in S, a\in A(s)
+$$
+
+The only change is that the sum runs over $S^+$, because the next state $s'$ can now also be a terminal state.
+
+# Exercise 3.6
+*Suppose you treated pole-balancing as an episodic task but also used
+discounting, with all rewards zero except for -1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing
+formulation of this task?*
+
+Episodic, discounted, with failure (and termination) at time $T$:
+$$
+G_t=-\gamma^{T-t-1}
+$$
+
+Continuing, discounted. Suppose failure first occurs at time step $K$ and recovery never happens, so every step from then on yields a reward of $-1$.
+
+Given $t \geq K$:
+$$\begin{aligned}
+G_t & = -1 - \gamma - \gamma^2 - \dots \\
+&= -\sum_{k=0}^{\infty}\gamma^k \\
+&= \frac{-1}{1-\gamma}
+\end{aligned}$$
+
+Given $t < K$:
+$$\begin{aligned}
+G_t & = -\gamma^{K-t}(1+\gamma+\gamma^2+\dots) \\
+& = -\frac{\gamma^{K-t}}{1-\gamma}
+\end{aligned}$$
+
+So in the episodic formulation only the single failure that ends the current episode is counted, whereas in the continuing formulation every failure after time $K$ keeps contributing to a (much more negative) return.
+
+# Exercise 3.7
+*Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for
+escaping from the maze and a reward of zero at all other times. The task seems to break down
+naturally into episodes—the successive runs through the maze—so you decide to treat it as an
+episodic task, where the goal is to maximize expected total reward (3.7). After running the
+learning agent for a while, you find that it is showing no improvement in escaping from the maze.
+What is going wrong? Have you effectively communicated to the agent what you want it to achieve?*
+
+Does the episode terminate after a certain maximum number of time steps?
+Even if that is the case, the reward is very sparse: the agent only receives a reward once it completes the maze, and 0 in every other situation. In other words, the agent doesn't learn to complete mazes, because it only gets a reward if it can already complete the maze. And even once it does escape, every successful episode yields the same total reward of $+1$ no matter how long it took, so there is no pressure to escape faster. Giving a reward of $-1$ for every time step spent in the maze would communicate the goal much better.
+
+# Exercise 3.8
+*Suppose $\gamma = 0.5$ and the following sequence of rewards is received $R_1=-1, R_2=2, R_3=6, R_4=3$, and $R_5=2$, with $T=5$. What are $G_0, G_1, \dots, G_5$? Hint: Work backwards.*
+
+$$\begin{aligned}
+G_5 &= 0 \\
+G_4 &= R_5 = 2 \\
+G_3 &= R_4 + \gamma G_4 = 4 \\
+G_2 &= R_3 + \gamma G_3 = 8 \\
+G_1 &= R_2 + \gamma G_2 = 6 \\
+G_0 &= R_1 + \gamma G_1 = 2 \\
+\end{aligned}$$
+
+# Exercise 3.9
+*Suppose $\gamma=0.9$ and the reward sequence is $R_1=2$ followed by an infinite sequence of 7s. What are $G_1$ and $G_0$?*
+
+$$\begin{aligned}
+G_1 &= 7 + 7\gamma + 7\gamma^2 + \dots \\
+&= 7(1+\gamma+\gamma^2+\dots) \\
+&= \frac{7}{1-\gamma} = 70
+\end{aligned}$$
+
+$$\begin{aligned}
+G_0 &= 2 + \gamma G_1 \\
+&= 65
+\end{aligned}$$
+
+# Exercise 3.10
+*Prove the second equality in (3.10).*
+
+We want to prove, for $0 \leq \gamma < 1$ (so that the series converges):
+$$
+\sum_{k=0}^{\infty}\gamma^k = \frac{1}{1-\gamma}
+$$
+
+$$\begin{aligned}
+G_t & = 1 + \gamma + \gamma^2 + \dots \\
+& = 1 + \gamma(1+\gamma+\gamma^2+\dots) \\
+& = 1 + \gamma G_t \\
+\Rightarrow G_t - \gamma G_t & = 1 \\
+\Rightarrow (1-\gamma)G_t &= 1 \\
+\Rightarrow G_t &= \frac{1}{1-\gamma}
+\end{aligned}$$
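+
+As a quick sanity check on Exercises 3.8, 3.9 and 3.10, here is a small sketch that computes the returns backwards via $G_t = R_{t+1} + \gamma G_{t+1}$ and compares the geometric-series closed form against a truncated sum; the function name and the truncation length are arbitrary choices, not anything from the book.
+
+```python
+def returns_backwards(rewards, gamma):
+    """Compute G_0, ..., G_T from the reward sequence R_1, ..., R_T (Exercise 3.8)."""
+    G = [0.0] * (len(rewards) + 1)          # G_T = 0
+    for t in range(len(rewards) - 1, -1, -1):
+        G[t] = rewards[t] + gamma * G[t + 1]
+    return G
+
+
+# Exercise 3.8: gamma = 0.5 and R_1, ..., R_5 = -1, 2, 6, 3, 2.
+print(returns_backwards([-1, 2, 6, 3, 2], 0.5))   # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
+
+# Exercises 3.9 and 3.10: gamma = 0.9, R_1 = 2 followed by an endless stream of 7s.
+gamma = 0.9
+G1 = 7 / (1 - gamma)                              # closed form from Exercise 3.10: 70
+G0 = 2 + gamma * G1                               # 65
+G1_truncated = sum(7 * gamma**k for k in range(1000))
+print(G0, G1, round(G1_truncated, 6))             # the truncated sum agrees with 7/(1-gamma)
+```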