diff --git a/exercises.md b/exercises.md
index f891082..3b4ea51 100644
--- a/exercises.md
+++ b/exercises.md
@@ -45,6 +45,11 @@
 - [Exercise 3.27](#exercise-327)
 - [Exercise 3.28](#exercise-328)
 - [Exercise 3.29](#exercise-329)
+- [Exercise 4.1](#exercise-41)
+- [Exercise 4.2](#exercise-42)
+  - [Rest of dynamics unchanged](#rest-of-dynamics-unchanged)
+  - [Dynamics of 13 changed](#dynamics-of-13-changed)
+- [Exercise 4.3](#exercise-43)
 
 # Exercise 1.1: Self-Play
 *Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*
@@ -790,3 +795,50 @@
 v_*(s) = \max_a \sum_{s'} p(s'|s,a)[r(s,a)+\gamma v_{*}(s')] \\
 TODO
 $$
+# Exercise 4.1
+*In Example 4.1, if $\pi$ is the equiprobable random policy, what is $q_{\pi}(11, \textrm{down})$?
+What is $q_{\pi}(7, \textrm{down})$?*
+
+Taking down from state 11 moves straight to the terminal state, so only the immediate reward is received:
+$$
+q_{\pi}(11, \textrm{down}) = -1
+$$
+
+Taking down from state 7 moves to state 11, whose value under $\pi$ is $-14$ (see Figure 4.1):
+$$
+q_{\pi}(7, \textrm{down}) = -1 - 14 = -15
+$$
+
+# Exercise 4.2
+*In Example 4.1, suppose a new state 15 is added to the gridworld just below
+state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is $v_{\pi}(15)$ for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is $v_{\pi}(15)$ for the equiprobable random policy in this case?*
+
+## Rest of dynamics unchanged
+From Figure 4.1 we can read off:
+$$
+v_{\pi}(12) = -22 \\
+v_{\pi}(13) = -20 \\
+v_{\pi}(14) = -14
+$$
+
+Using (4.4) we can now calculate $v_{\pi}(15)$. The task in Example 4.1 is undiscounted ($\gamma = 1$), every transition yields reward $-1$, and each action from state 15 leads deterministically to a single successor state:
+$$ \begin{aligned}
+&\pi(\textrm{left}|15) = \pi(\textrm{right}|15) = \pi(\textrm{up}|15) = \pi(\textrm{down}|15) = 0.25 \\
+&p(s', r|15, a) = 1 \quad \textrm{for the single successor state and reward of each action } a
+\end{aligned}$$
+$$\begin{aligned}
+v_{\pi}(15) = &\; 0.25 \, (-1 - 22) \\
+& + 0.25 \, (-1 - 20) \\
+& + 0.25 \, (-1 - 14) \\
+& + 0.25 \, (-1 + v_{\pi}(15)) \\
+= &\; -15 + 0.25 \, v_{\pi}(15) \\
+\Rightarrow v_{\pi}(15) = & \; -20
+\end{aligned}$$
+(A short policy-evaluation script at the end of this file cross-checks this value numerically.)
+
+## Dynamics of 13 changed
+TODO: I guess this would require updating the values of the whole grid.
+
+# Exercise 4.3
+*What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function $q_{\pi}$ and its successive approximation by a sequence of functions $q_0, q_1, q_2, \dots$?*
+
+TODO
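+
+As an aside to Exercises 4.1 and 4.2 above: the hand-computed values can be cross-checked with a few lines of iterative policy evaluation for the equiprobable random policy. This is only a sketch, not code from the book — the cell indexing, the `EXTRA` label for the new state 15, and the `next_state`/`successors` helpers are my own bookkeeping.
+
+```python
+# Gridworld of Example 4.1: cells 0..15 row-major, cells 0 and 15 are the
+# terminal corners, cells 1..14 are the book's states 1..14.
+GAMMA = 1.0      # Example 4.1 is undiscounted
+THETA = 1e-10    # convergence threshold for policy evaluation
+ACTIONS = ("up", "down", "left", "right")
+MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
+EXTRA = "new15"  # the state added in Exercise 4.2
+
+
+def next_state(cell, action):
+    """Deterministic move on the 4x4 grid; off-grid moves leave the cell unchanged."""
+    row, col = divmod(cell, 4)
+    dr, dc = MOVES[action]
+    nr, nc = row + dr, col + dc
+    return nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else cell
+
+
+def successors(state):
+    """Successor cell for each action; the new state's dynamics come from Exercise 4.2."""
+    if state == EXTRA:
+        return {"left": 12, "up": 13, "right": 14, "down": EXTRA}
+    return {a: next_state(state, a) for a in ACTIONS}
+
+
+def evaluate_random_policy():
+    """In-place iterative policy evaluation for the equiprobable random policy."""
+    values = {cell: 0.0 for cell in range(16)}
+    values[EXTRA] = 0.0
+    terminals = {0, 15}
+    while True:
+        delta = 0.0
+        for s in values:
+            if s in terminals:
+                continue
+            new_v = sum(0.25 * (-1.0 + GAMMA * values[s2])
+                        for s2 in successors(s).values())
+            delta = max(delta, abs(new_v - values[s]))
+            values[s] = new_v
+        if delta < THETA:
+            return values
+
+
+v = evaluate_random_policy()
+# Exercise 4.1: q_pi(s, a) is the reward plus the value of the deterministic successor.
+print("q(11, down) =", round(-1.0 + GAMMA * v[next_state(11, "down")], 3))  # -1.0
+print("q(7, down)  =", round(-1.0 + GAMMA * v[next_state(7, "down")], 3))   # -15.0
+# Exercise 4.2, first part: value of the new state 15.
+print("v(new 15)   =", round(v[EXTRA], 3))                                  # -20.0
+```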