more exercises

2020-03-26 08:55:49 +01:00
parent 04d7b9775d
commit d6b617ed64


@@ -45,6 +45,11 @@
- [Exercise 3.27](#exercise-327)
- [Exercise 3.28](#exercise-328)
- [Exercise 3.29](#exercise-329)
- [Exercise 4.1](#exercise-41)
- [Exercise 4.2](#exercise-42)
  - [Rest of dynamics unchanged](#rest-of-dynamics-unchanged)
  - [Dynamics of 13 changed](#dynamics-of-13-changed)
- [Exercise 4.3](#exercise-43)
# Exercise 1.1: Self-Play
*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -790,3 +795,50 @@ v_*(s) = \max_a \sum_{s'} p(s'|s,a)[r(s,a)+\gamma v_{*}(s')] \\
TODO
$$
# Exercise 4.1
*In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_{\pi}(11, \textrm{down})$?
What is $q_{\pi}(7, \textrm{down})$?*
$$
q_{\pi}(11, \textrm{down}) = -1
$$
$$
q_{\pi}(7, \textrm{down}) = -1 + v_{\pi}(11) = -1 - 14 = -15
$$
(See Figure 4.1: the value of state 11 is $-14$, and the task is undiscounted, so $\gamma = 1$.)
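As a quick sanity check, here is a minimal Python sketch of the two one-step backups (the helper `q` is hypothetical; $\gamma = 1$ and $v_{\pi}(11) = -14$ are taken from the example):

```python
# Minimal check of the two action values above, assuming gamma = 1
# (the task in Example 4.1 is undiscounted) and v(11) = -14 from Figure 4.1.
GAMMA = 1.0

def q(reward, v_next):
    # q_pi(s, a) = r + gamma * v_pi(s') for a deterministic transition
    return reward + GAMMA * v_next

print(q(-1, 0))    # q_pi(11, down): down reaches the terminal (value 0) -> -1.0
print(q(-1, -14))  # q_pi(7, down): down reaches state 11 -> -15.0
```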
# Exercise 4.2
*In Example 4.1, suppose a new state 15 is added to the gridworld just below
state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is $v_{\pi}(15)$ for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is $v_{\pi}(15)$ for the equiprobable random policy in this case?*
## Rest of dynamics unchanged
From Figure 4.1 we can read off:
$$\begin{aligned}
v_{\pi}(12) &= -22 \\
v_{\pi}(13) &= -20 \\
v_{\pi}(14) &= -14
\end{aligned}$$
Using (4.4) we can now calculate $v_{\pi}(15)$. The policy is equiprobable, every transition is deterministic with reward $-1$, and the task is undiscounted, so $\gamma = 1$:
$$ \begin{aligned}
&\pi(\textrm{left}|15) = \pi(\textrm{right}|15) = \pi(\textrm{up}|15) = \pi(\textrm{down}|15) = 0.25 \\
&p(s', r|s,a) = 1 \; \textrm{for the single successor state and reward of each action}
\end{aligned}$$
$$\begin{aligned}
v_{\pi}(15) = &\; 0.25 (-1+v_{\pi}(12)) \\
& + 0.25 (-1+v_{\pi}(13)) \\
& + 0.25 (-1+v_{\pi}(14)) \\
& + 0.25 (-1+v_{\pi}(15)) \\
= &\; 0.25(-23) + 0.25(-21) + 0.25(-15) + 0.25(-1+v_{\pi}(15)) \\
\Rightarrow \tfrac{3}{4}\, v_{\pi}(15) = &\; -15 \\
\Rightarrow v_{\pi}(15) = &\; -20
\end{aligned}$$
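The same fixed point is easy to check numerically; a small sketch, assuming $\gamma = 1$ and the Figure 4.1 values for states 12, 13, and 14:

```python
# Fixed-point iteration of the Bellman equation for v_pi(15) in the
# unchanged-dynamics case (gamma = 1, reward -1 per step).
v12, v13, v14 = -22.0, -20.0, -14.0

v15 = 0.0
for _ in range(1000):  # the update is a contraction, so this converges
    v15 = 0.25 * sum(-1 + v for v in (v12, v13, v14, v15))
print(v15)  # -> -20.0
```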
## Dynamics of 13 changed
At first glance this seems to require re-solving the values of the whole grid, since state 13 now has a different successor. But guess $v_{\pi}(15) = -20$. The Bellman equation for state 13,
$$
v_{\pi}(13) = 0.25(-1+v_{\pi}(12)) + 0.25(-1+v_{\pi}(9)) + 0.25(-1+v_{\pi}(14)) + 0.25(-1+v_{\pi}(15)),
$$
is then still satisfied: the down term merely changed from $v_{\pi}(13) = -20$ to $v_{\pi}(15) = -20$. The equations for all other original states are untouched, and the equation for state 15 gives $-20$ exactly as above. Since the system of Bellman equations has a unique solution, none of the original values change and
$$
v_{\pi}(15) = -20
$$
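To double-check the uniqueness argument, here is a sketch of iterative policy evaluation over all fifteen states of the modified grid (the state numbering, the state-0 stand-in for both terminal corners, and the stay-in-place behaviour at the walls are assumptions matching Example 4.1):

```python
# Policy evaluation for the modified gridworld of Exercise 4.2:
# states 1-14 as in Example 4.1, state 15 below state 13, and
# down from state 13 redirected to 15. Undiscounted, reward -1 per step.
pos = {}                                  # nonterminal state -> (row, col)
n = 1
for r in range(4):
    for c in range(4):
        if (r, c) not in [(0, 0), (3, 3)]:
            pos[n] = (r, c)
            n += 1
cell = {rc: s for s, rc in pos.items()}
cell[(0, 0)] = cell[(3, 3)] = 0           # both terminal corners -> state 0

LEFT, UP, RIGHT, DOWN = (0, -1), (-1, 0), (0, 1), (1, 0)
MOVES = [LEFT, UP, RIGHT, DOWN]

def step(s, move):
    """Deterministic successor of nonterminal state s under move."""
    if s == 15:                           # left, up, right, down from 15
        return {LEFT: 12, UP: 13, RIGHT: 14, DOWN: 15}[move]
    if s == 13 and move == DOWN:          # the changed dynamics of state 13
        return 15
    r, c = pos[s]
    r2, c2 = r + move[0], c + move[1]
    if not (0 <= r2 < 4 and 0 <= c2 < 4):
        return s                          # off-grid moves leave the state unchanged
    return cell[(r2, c2)]

v = {s: 0.0 for s in range(16)}           # v[0] is the terminal, fixed at 0
for _ in range(2000):                     # sweep until convergence
    for s in range(1, 16):
        v[s] = sum(0.25 * (-1 + v[step(s, m)]) for m in MOVES)
print(round(v[13], 4), round(v[15], 4))   # -20.0 -20.0
```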
# Exercise 4.3
*What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_{\pi}$ and its successive approximation by a sequence of functions $q_0, q_1, q_2,\dots$?*
Mirroring (4.3) and (4.4), the Bellman equation for $q_{\pi}$ is
$$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}_{\pi}[R_{t+1} + \gamma \, q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a] \\
&= \sum_{s',r} p(s',r|s,a)\Big[r + \gamma \sum_{a'} \pi(a'|s') \, q_{\pi}(s',a')\Big]
\end{aligned}$$
and, analogous to (4.5), the successive approximation is
$$
q_{k+1}(s,a) = \sum_{s',r} p(s',r|s,a)\Big[r + \gamma \sum_{a'} \pi(a'|s') \, q_k(s',a')\Big]
$$
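Since this turns (4.5) into an update over state-action pairs, here is a short sketch of one synchronous sweep $q_k \rightarrow q_{k+1}$, assuming deterministic dynamics given by a hypothetical `step(s, a) -> (next_state, reward)` function and a policy table `pi[s][a]`:

```python
# One synchronous sweep of iterative policy evaluation for q_pi,
# specialized to deterministic dynamics; terminal states are simply
# absent from `states`, so their action values count as 0.
def q_sweep(q, states, actions, step, pi, gamma=1.0):
    q_new = {}
    for s in states:
        for a in actions:
            s2, r = step(s, a)
            # expected action value of the successor under pi
            v2 = sum(pi[s2][a2] * q[(s2, a2)] for a2 in actions) if s2 in states else 0.0
            q_new[(s, a)] = r + gamma * v2
    return q_new
```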