add more exercises
exercises.md
@@ -32,6 +32,10 @@
- [Exercise 3.14](#exercise-314)
- [Exercise 3.15](#exercise-315)
- [Exercise 3.16](#exercise-316)
- [Exercise 3.17](#exercise-317)
- [Exercise 3.18](#exercise-318)
- [Exercise 3.19](#exercise-319)
- [Exercise 3.20](#exercise-320)

# Exercise 1.1: Self-Play

*Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself, with both sides learning. What do you think would happen in this case? Would it learn a different policy for selecting moves?*

@@ -486,3 +490,48 @@ $$

# Exercise 3.16

*Now consider adding a constant $c$ to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.*

At first guess: since episodes do not all have the same length, the offset contributed by $c$ should affect different states unequally.

$$\begin{aligned}
G_t &= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} \\
G_{tc} &= \sum_{k=0}^{T-t-1} \gamma^{k} (R_{t+k+1}+c) \\
&= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} + \sum_{k=0}^{T-t-1} \gamma^{k}c \\
&= \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1} + v_c(t,T)
\end{aligned}$$
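
Spelling out the geometric series makes the dependence on the remaining horizon explicit (for $\gamma < 1$; with $\gamma = 1$ the sum is simply $c\,(T-t)$):

$$v_c(t,T) = c \sum_{k=0}^{T-t-1} \gamma^{k} = c\,\frac{1-\gamma^{T-t}}{1-\gamma},$$

which grows with the number of steps remaining until termination.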

Since $v_c(t,T)$ depends on the start time $t$ and the (random) termination time $T$, it cannot be pulled out of the expectation in the definition of the state-value function as a constant:

$$\begin{aligned}
v_{\pi c}(s) &= \mathbb{E}_{\pi}\left[G_{tc} \vert S_t = s\right] \\
&= \mathbb{E}_{\pi}\left[G_{t}+v_c(t,T)\, \vert \, S_t = s\right] \\
&= v_{\pi}(s) + \mathbb{E}_{\pi}\left[v_c(t,T)\, \vert \, S_t = s\right]
\end{aligned}$$

In other words: the offset added to each state value depends on the expected remaining episode length from that state, so adding $c$ really does change the task. For example, in maze running with a reward of $-1$ per step, adding $c = 1$ makes every reward $0$, and the agent no longer has any incentive to reach the exit quickly.
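
A tiny numerical illustration of this point (the episode lengths and per-step reward below are made up for the example, not taken from the text): with a reward of $-1$ per step and $\gamma = 1$, adding $c = 1$ erases the difference between a short and a long maze run.

```python
# Compare episodic returns with and without a constant reward offset c
# for two episodes of different length (reward -1 per step, gamma = 1).

def episodic_return(rewards, gamma=1.0):
    """G_0 = sum_k gamma^k * R_{k+1} over a finite episode."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

short_episode = [-1.0] * 5     # reaches the goal in 5 steps
long_episode = [-1.0] * 50     # wanders for 50 steps
c = 1.0

for name, rewards in [("short", short_episode), ("long", long_episode)]:
    g = episodic_return(rewards)
    g_c = episodic_return([r + c for r in rewards])
    print(f"{name}: G = {g}, G with offset = {g_c}")

# Without the offset the short episode (-5) beats the long one (-50);
# with c = 1 both returns are 0, so the preference for finishing quickly is gone.
```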

# Exercise 3.17

*What is the Bellman equation for action values, that is, for $q_{\pi}$? It must give the action value $q_{\pi}(s, a)$ in terms of the action values, $q_{\pi}(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.*

$$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[G_t \vert S_t=s, A_t=a] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1}\vert S_t=s, A_t=a] \\
&= \sum_{s',r}p(s',r\vert s,a)\left(r + \gamma \sum_{a'} \pi(a'\vert s')q_{\pi}(s', a')\right)
\end{aligned}$$
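
As a sketch of how this backup looks in code (the tabular arrays `p`, `r`, `pi`, `q` and the toy MDP below are my own assumptions, not something defined in the text), the last line above translates directly into a vectorised update:

```python
import numpy as np

# One Bellman backup for action values:
#   q(s,a) = sum_{s'} p(s'|s,a) * (r(s,a,s') + gamma * sum_{a'} pi(a'|s') q(s',a'))
# Here r[s, a, s'] is the expected reward of the transition, which is all the
# backup needs because the reward enters linearly.

def bellman_q_backup(p, r, pi, q, gamma=0.9):
    v = (pi * q).sum(axis=1)                  # v(s') = sum_a' pi(a'|s') q(s',a')
    return (p * (r + gamma * v)).sum(axis=2)  # new q(s,a)

# Arbitrary 2-state, 2-action MDP for the demo.
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])      # p[s, a, s']
r = np.ones((2, 2, 2))                        # reward 1 on every transition
pi = np.full((2, 2), 0.5)                     # equiprobable random policy
q = np.zeros((2, 2))

# Iterating the backup converges to q_pi; with every reward equal to 1 and
# gamma = 0.9, every q(s,a) approaches 1 / (1 - 0.9) = 10.
for _ in range(200):
    q = bellman_q_backup(p, r, pi, q)
print(q)
```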

# Exercise 3.18

$$\begin{aligned}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t\vert S_t=s] \\
&= \mathbb{E}_{\pi}[q_{\pi}(s, A_t)\vert S_t=s] \\
&= \sum_{a\in A(s)} \pi(a\vert s)\, q_{\pi}(s,a)
\end{aligned}$$
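
In code this is just a probability-weighted average over actions; a minimal sketch, assuming tabular arrays `pi[s, a]` and `q[s, a]` (my notation, not the book's):

```python
import numpy as np

# v(s) = sum_a pi(a|s) * q(s,a), computed for all states at once.
pi = np.array([[0.25, 0.75],
               [0.60, 0.40]])   # pi[s, a]
q = np.array([[1.0, 3.0],
              [0.0, 2.0]])      # q[s, a]
v = (pi * q).sum(axis=1)
print(v)                        # [2.5, 0.8]
```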

# Exercise 3.19

$$\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[G_t \vert S_t=s, A_t=a] \\
&= \mathbb{E}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \vert S_t=s, A_t=a] \\
&= \sum_{s',r} p(s',r\vert s,a)\left[r + \gamma v_{\pi}(s') \right]
\end{aligned}$$
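
And the corresponding one-step lookahead for action values, again as a small sketch with assumed tabular arrays `p[s, a, s']`, `r[s, a, s']` (expected transition reward) and `v[s]`:

```python
import numpy as np

# q(s,a) = sum_{s'} p(s'|s,a) * (r(s,a,s') + gamma * v(s')): expected one-step
# reward plus the discounted value of the successor state.
gamma = 0.9
p = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])   # p[s, a, s']
r = np.array([[[0.0, 0.0], [1.0, 2.0]],
              [[0.0, 5.0], [1.0, 1.0]]])   # expected reward per transition
v = np.array([10.0, 4.0])                  # state values under pi

q = (p * (r + gamma * v)).sum(axis=2)      # q[s, a]
print(q)
```
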
# Exercise 3.20