more exercises
@@ -36,6 +36,15 @@
- [Exercise 3.18](#exercise-318)
- [Exercise 3.19](#exercise-319)
- [Exercise 3.20](#exercise-320)
- [Exercise 3.21](#exercise-321)
- [Exercise 3.22](#exercise-322)
- [Exercise 3.23](#exercise-323)
- [Exercise 3.24](#exercise-324)
- [Exercise 3.25](#exercise-325)
- [Exercise 3.26](#exercise-326)
- [Exercise 3.27](#exercise-327)
- [Exercise 3.28](#exercise-328)
- [Exercise 3.29](#exercise-329)

# Exercise 1.1: Self-Play

*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*

@@ -423,6 +432,7 @@ $$
\sum_{k=0}^{\infty}\gamma^k = \frac{1}{1-\gamma}
$$

Proof:

$$\begin{aligned}
G_t & = 1 + \gamma + \gamma^2 + \dots \\
& = 1 + \gamma(1+\gamma+\gamma^2+\dots) \\
@@ -530,8 +540,253 @@ $$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[G_t \vert S_t=s, A_t=a] \\
&= \mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \vert S_t=s, A_t=a] \\
&= \sum_{s',r} p(s',r\vert s,a)\left[r + \gamma v_{\pi}(s') \right] \\
&= \sum_{s',r} p(s',r\vert s,a)\left[r + \gamma \sum_{a'\in A(s')} \pi(a'\vert s')\, q_{\pi}(s',a') \right] \\
\end{aligned}$$

# Exercise 3.20

*Draw or describe the optimal state-value function for the golf example.*

The best course of action is:

$driver \rightarrow driver \rightarrow putter$

The optimal state-value function $v_*(s)$ would then look like the lower part of figure 3.3 for regions outside of the green area. Within the green area, it would look like the upper part of figure 3.3.

# Exercise 3.21

*Draw or describe the contours of the optimal action-value function for putting, $q_*(s, putter)$, for the golf example.*

The first contour, with value -4, would be the same as the -6 contour in the upper part of figure 3.3. Then there would be two larger regions with values -3 and -2, where the driver would be used, until the green area is reached with value -1. From there on the putter is used again.

# Exercise 3.22

*See book.*

Let's calculate the values $v_{\pi left}$ and $v_{\pi right}$ for the different values of $\gamma$ (using 3.12):

For $v_{\pi left}$ and $v_{\pi right}$:

$$
\pi (a|s) = 1
$$

and

$$
p(s',r|s,a) = 1
$$

So:

$$
v_{\pi}(s) = \sum_{a,s',r} [r+\gamma v_{\pi}(s')]
$$

Because there's only one $(a,s',r)$ triple that is valid for each policy:

$$\begin{aligned}
v_{\pi}(s) &= r+\gamma v_{\pi}(s') \\
&= r+\gamma (r' + \gamma v_{\pi}(s'')) \\
&= \sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \\
\end{aligned}$$

(Identical to 3.8.)

>$\gamma = 0$

>$v_{\pi left}(s) = 1$

>$v_{\pi right}(s) = 0$

>$\Rightarrow v_*(s) = v_{\pi left}(s) = 1$

>$\gamma = 0.9$

>$v_{\pi left}(s) = 1 + 0.9 \cdot 0 + 0.9^2 \cdot 1 + 0.9^3\cdot 0 + \dots$

>$v_{\pi right}(s) = 0 + 0.9 \cdot 2 + 0.9^2 \cdot 0 + 0.9^3 \cdot 2 + \dots$

> Comparing the series term by term (since $2 \cdot 0.9 > 1$):

> $0.9^k < 2\cdot 0.9^{k+1}$

>$\Rightarrow v_*(s) = v_{\pi right}(s)$

>$\gamma = 0.5$

>$v_{\pi left}(s) = 1 + 0.5 \cdot 0 + 0.5^2 \cdot 1 + 0.5^3\cdot 0 + \dots$

>$v_{\pi right}(s) = 0 + 0.5 \cdot 2 + 0.5^2 \cdot 0 + 0.5^3 \cdot 2 + \dots$

> Comparing the series term by term (since $2 \cdot 0.5 = 1$):

> $0.5^k = 2\cdot 0.5^{k+1}$

>$\Rightarrow v_*(s) = v_{\pi right}(s) = v_{\pi left}(s)$

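A quick numerical check of the three cases above. This is a minimal sketch: the reward sequences $1, 0, 1, 0, \dots$ (left) and $0, 2, 0, 2, \dots$ (right) are read off the series above, and the truncation horizon is an arbitrary choice.

```python
# Sketch: compare v_left and v_right for the top state by truncating the
# reward series 1, 0, 1, 0, ... (left) and 0, 2, 0, 2, ... (right).
def discounted_sum(rewards, gamma):
    return sum(r * gamma**k for k, r in enumerate(rewards))

horizon = 1000  # arbitrary truncation; more than enough for gamma <= 0.9
left_rewards = [1 if k % 2 == 0 else 0 for k in range(horizon)]
right_rewards = [0 if k % 2 == 0 else 2 for k in range(horizon)]

for gamma in (0.0, 0.5, 0.9):
    v_left = discounted_sum(left_rewards, gamma)
    v_right = discounted_sum(right_rewards, gamma)
    print(f"gamma={gamma}: v_left={v_left:.4f}, v_right={v_right:.4f}")
```
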
# Exercise 3.23
|
||||||
|
*Give the Bellman equation for $q_*$ for the recycling robot.*
|
||||||
|
|
||||||
|
General equation:
|
||||||
|
$$
|
||||||
|
q_*(s,a) = \sum_{s',r}p(s',r|s,a)[r+\gamma \max_{a'} q_*(s',a')]
|
||||||
|
$$
|
||||||
|
|
||||||
|
Possible state-action set (from [exercise 3.4](#exercise-34)):
|
||||||
|
| $s$ | $a$ | $s'$ | $r$ | $p(s', r\vert s,a)$ |
|
||||||
|
| --- | --- | ---- | ------------ | ------------------- |
|
||||||
|
| h | s | h | $r_{search}$ | $\alpha$ |
|
||||||
|
| h | s | l | $r_{search}$ | $1-\alpha$ |
|
||||||
|
| h | w | h | $r_{wait}$ | $1$ |
|
||||||
|
| l | s | h | $-3$ | $1-\beta$ |
|
||||||
|
| l | s | l | $r_{search}$ | $\beta$ |
|
||||||
|
| l | w | l | $r_{wait}$ | $1$ |
|
||||||
|
| l | r | h | $0$ | $1$ |
|
||||||
|
|
||||||
$$
\begin{aligned}
q_*(\textrm{h, s}) &= \alpha [r_{search} + \gamma \max_{a'}q_*(\textrm{h},a')] \\
& + (1-\alpha)[r_{search} + \gamma \max_{a'}q_*(\textrm{h},a')]\\
&= \alpha \left[r_{search} + \gamma \max
\begin{cases}
q_*(\textrm{h, s})\\
q_*(\textrm{h, w})
\end{cases}
\right] \\
& + (1-\alpha)\left[r_{search} + \gamma \max
\begin{cases}
q_*(\textrm{h, s})\\
q_*(\textrm{h, w})
\end{cases}
\right] \\
&= r_{search} + \gamma \max
\begin{cases}
q_*(\textrm{h, s})\\
q_*(\textrm{h, w})
\end{cases} & (1) \\
q_*(\textrm{h, w}) &= r_{wait} + \gamma \max
\begin{cases}
q_*(\textrm{h, s})\\
q_*(\textrm{h, w})
\end{cases} & (2) \\
q_*(\textrm{l, s}) &= \beta \left[r_{search} + \gamma \max
\begin{cases}
q_*(\textrm{l, s})\\
q_*(\textrm{l, w})\\
q_*(\textrm{l, r})
\end{cases}
\right] \\
& + (1-\beta)\left[-3 + \gamma \max
\begin{cases}
q_*(\textrm{h, s})\\
q_*(\textrm{h, w})
\end{cases}
\right] & (3)\\
q_*(\textrm{l, w}) &= r_{wait} + \gamma \max
\begin{cases}
q_*(\textrm{l, s})\\
q_*(\textrm{l, w})\\
q_*(\textrm{l, r})\\
\end{cases} & (4) \\
q_*(\textrm{l, r}) &= \gamma \max
\begin{cases}
q_*(\textrm{h, s})\\
q_*(\textrm{h, w})\\
\end{cases} & (5)\\
\end{aligned}
$$

$$
\begin{aligned}
\begin{cases}
(1) \\
(5) \\
\end{cases}
& \Rightarrow q_*(\textrm{h, s}) = r_{search} + q_*(\textrm{l,r}) & (6)\\
\begin{cases}
(2) \\
(5) \\
\end{cases}
& \Rightarrow q_*(\textrm{h, w}) = r_{wait} + q_*(\textrm{l,r}) & (7) \\
\begin{cases}
(6) \\
(7) \\
\end{cases}
& \Rightarrow
q_*(\textrm{l, r}) = \gamma
\left(
\max
\begin{cases}
r_{search} \\
r_{wait} \\
\end{cases}
+ q_*(\textrm{l, r})
\right) \\
& \Rightarrow
q_*(\textrm{l, r}) = \frac{\gamma
\left(
\max
\begin{cases}
r_{search} \\
r_{wait} \\
\end{cases}
\right)}
{1-\gamma} \\
& \Rightarrow
q_*(\textrm{l, r}) = \frac{\gamma}{1-\gamma} r_{search}
\quad (\textrm{since } r_{search} > r_{wait}) & (8)\\
\begin{cases}
(6) \\
(8) \\
\end{cases}
& \Rightarrow q_*(\textrm{h, s}) = r_{search}\frac{1}{1-\gamma} & (9) \\
\begin{cases}
(7) \\
(8) \\
\end{cases}
& \Rightarrow q_*(\textrm{h, w}) = r_{wait} + \frac{\gamma}{1-\gamma} r_{search} & (10) \\
\end{aligned}
$$

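The general Bellman optimality equation can also be iterated numerically over the transition table above. This is a minimal sketch: the values of $\alpha$, $\beta$, $r_{search}$, $r_{wait}$, and $\gamma$ are not specified in the example, so the numbers below are arbitrary assumptions.

```python
# Sketch: solve the recycling robot's q_* by Q-value iteration, using the
# (s, a, s', r, p) table above. Parameter values are arbitrary assumptions.
alpha, beta = 0.8, 0.6
r_search, r_wait = 2.0, 1.0
gamma = 0.9

# transitions[(s, a)] = list of (probability, reward, next_state)
transitions = {
    ("h", "s"): [(alpha, r_search, "h"), (1 - alpha, r_search, "l")],
    ("h", "w"): [(1.0, r_wait, "h")],
    ("l", "s"): [(beta, r_search, "l"), (1 - beta, -3.0, "h")],
    ("l", "w"): [(1.0, r_wait, "l")],
    ("l", "r"): [(1.0, 0.0, "h")],
}
actions = {"h": ["s", "w"], "l": ["s", "w", "r"]}

q = {sa: 0.0 for sa in transitions}
for _ in range(1000):  # q(s,a) <- sum_{s',r} p [r + gamma * max_a' q(s',a')]
    q = {
        (s, a): sum(p * (r + gamma * max(q[(s2, a2)] for a2 in actions[s2]))
                    for p, r, s2 in transitions[(s, a)])
        for (s, a) in transitions
    }

for (s, a), value in sorted(q.items()):
    print(f"q*({s}, {a}) = {value:.3f}")
```
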
# Exercise 3.24

*Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.*

Under the optimal policy, the rewards received from that state are of the form ($t$ increasing):

$$
(10, 0, 0, 0, 0), (10, 0, 0, 0, 0), \dots
$$

$$\begin{aligned}
G_t &= \sum_{k=0}^{\infty} 10\gamma^{5k} \\
&= 10 \sum_{k=0}^{\infty} \left(\gamma^5\right)^k \\
&= 10 \sum_{k=0}^{\infty} \delta^k \quad (\delta := \gamma^5) \\
&= 10 \frac{1}{1-\delta} \\
&= 10 \frac{1}{1-\gamma^5}
= 10 \frac{1}{1-0.9^5} \approx 24.419428
\end{aligned}$$

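A quick numerical check of the closed form. This is a minimal sketch; the truncation of the series is arbitrary.

```python
# Sketch: compare the truncated series sum_k 10 * gamma^(5k) with 10 / (1 - gamma^5).
gamma = 0.9
series = sum(10 * gamma ** (5 * k) for k in range(1000))  # arbitrary truncation
closed_form = 10 / (1 - gamma ** 5)
print(f"series ~ {series:.6f}, closed form ~ {closed_form:.6f}")  # both ~ 24.419428
```
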
# Exercise 3.25

*Give an equation for $v_*$ in terms of $q_*$.*

$$
v_*(s) = \max_a q_*(s,a)
$$

# Exercise 3.26

*Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.*

$$
q_*(s,a) = \sum_{s',r} p(s',r|s,a) \left[r+\gamma v_*(s')\right]
$$

# Exercise 3.27

*Give an equation for $\pi_*$ in terms of $q_*$.*

$$
\pi_*(a|s) =
\begin{cases}
1 & \textrm{if } a = \argmax_{a'} q_*(s,a') \\
0 & \textrm{otherwise}
\end{cases}
$$

# Exercise 3.28

*Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.*

(See Exercise 3.26.)

$$
\pi_*(a|s) =
\begin{cases}
1 & \textrm{if } a = \argmax_{a'} \sum_{s',r} p(s',r|s,a') \left[r+\gamma v_*(s')\right] \\
0 & \textrm{otherwise}
\end{cases}
$$

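The relations in Exercises 3.25–3.28 translate directly into code for a tabular MDP. This is a minimal sketch: the dictionary layout `p[(s, a)] = [(prob, reward, next_state), ...]` and the tiny two-state MDP are made up for illustration.

```python
# Sketch: v_* from q_* (3.25), q_* from v_* and p (3.26), and the greedy
# optimal policy from q_* (3.27/3.28) on a tabular MDP.
gamma = 0.9

# Hypothetical two-state MDP: p[(s, a)] = [(prob, reward, next_state), ...]
p = {
    ("s0", "a0"): [(1.0, 1.0, "s0")],
    ("s0", "a1"): [(1.0, 0.0, "s1")],
    ("s1", "a0"): [(1.0, 2.0, "s0")],
}
actions = {"s0": ["a0", "a1"], "s1": ["a0"]}

def q_from_v(v):
    # Exercise 3.26: q_*(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma * v_*(s')]
    return {(s, a): sum(prob * (r + gamma * v[s2]) for prob, r, s2 in p[(s, a)])
            for (s, a) in p}

def v_from_q(q):
    # Exercise 3.25: v_*(s) = max_a q_*(s,a)
    return {s: max(q[(s, a)] for a in acts) for s, acts in actions.items()}

def greedy_policy(q):
    # Exercises 3.27/3.28: pi_* puts all probability on an argmax action
    return {s: max(acts, key=lambda a: q[(s, a)]) for s, acts in actions.items()}

# Value iteration on v, then read off q_* and pi_*.
v = {s: 0.0 for s in actions}
for _ in range(1000):
    v = v_from_q(q_from_v(v))
q_star = q_from_v(v)
print(v, greedy_policy(q_star))
```
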
# Exercise 3.29

*Rewrite the four Bellman equations for the four value functions $(v_{\pi}, v_*, q_{\pi}, q_*)$ in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).*

Using the three-argument $p(s'|s,a)$ and the two-argument $r(s,a)$:

$$\begin{aligned}
v_{\pi}(s) &= \sum_a \pi(a|s) \sum_{s'} p(s'|s,a)[r(s,a)+\gamma v_{\pi}(s')] \\
v_*(s) &= \max_a \sum_{s'} p(s'|s,a)[r(s,a)+\gamma v_{*}(s')] \\
q_{\pi}(s,a) &= \sum_{s'} p(s'|s,a)\left[r(s,a)+\gamma \sum_{a'} \pi(a'|s')\, q_{\pi}(s',a')\right] \\
q_*(s,a) &= \sum_{s'} p(s'|s,a)\left[r(s,a)+\gamma \max_{a'} q_*(s',a')\right] \\
\end{aligned}$$

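The three-argument $p$ (3.4) and the two-argument $r$ (3.5) are just marginalizations of the four-argument $p$, which is easy to check numerically. This is a minimal sketch; the dictionary layout `p4[(s, a)] = [(prob, reward, next_state), ...]` and the example values are made up.

```python
# Sketch: derive the three-argument p(s'|s,a) (3.4) and two-argument r(s,a) (3.5)
# from a tabular four-argument p(s',r|s,a).
from collections import defaultdict

# Hypothetical four-argument dynamics: p4[(s, a)] = [(prob, reward, next_state), ...]
p4 = {
    ("s0", "a0"): [(0.7, 1.0, "s0"), (0.3, -1.0, "s1")],
    ("s1", "a0"): [(1.0, 0.0, "s0")],
}

p3 = defaultdict(float)  # p3[(s, a, s')] = sum_r p(s', r | s, a)
r2 = defaultdict(float)  # r2[(s, a)]     = sum_{s', r} r * p(s', r | s, a)
for (s, a), outcomes in p4.items():
    for prob, reward, s2 in outcomes:
        p3[(s, a, s2)] += prob
        r2[(s, a)] += prob * reward

print(dict(p3))
print(dict(r2))  # e.g. r(s0, a0) = 0.7*1.0 + 0.3*(-1.0) = 0.4
```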