Policy Iteration


Value Functions

Value functions (of states or state-action pairs) estimate how good it is for the agent to be in a given state (or to perform a given action in a given state).

Consider \(p(r|s)\),

$$p(r|s)=\sum_a\sum_{s'}p(r,a,s'|s)=\sum_a\sum_{s'}p(r,s'|a,s)p(a|s)=\sum_a\pi(a|s)\sum_{s'}p(s',r|s,a)$$

Expected reward of a given state,

$$\sum_r rp(r|s)=\sum_a\pi(a|s)\sum_{s'}\sum_r rp(s',r|s,a)$$
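To make the formula concrete, here is a tiny numeric sketch; the single-state dynamics and the 50/50 policy below are made up purely for illustration.

```python
# A tiny numeric sketch of the expected-reward formula above.
# The dynamics p(s', r | s, a) and the policy pi(a | s) are made-up examples.

# dynamics[s][a] -> list of (next_state, reward, probability)
dynamics = {
    "s0": {
        "left":  [("s0", 0.0, 0.8), ("s1", 1.0, 0.2)],
        "right": [("s1", 1.0, 0.6), ("s0", 0.0, 0.4)],
    },
}

# pi(a | s)
policy = {"s0": {"left": 0.5, "right": 0.5}}

def expected_reward(s):
    """sum_a pi(a|s) * sum_{s', r} r * p(s', r | s, a)"""
    return sum(
        policy[s][a] * sum(prob * r for (_, r, prob) in outcomes)
        for a, outcomes in dynamics[s].items()
    )

print(expected_reward("s0"))  # 0.5*(0.2*1.0) + 0.5*(0.6*1.0) = 0.4
```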

Expected return when starting in s and following \(\pi\),

$$ \begin{align*} v_\pi(s)&=\mathbb{E}_\pi[G_t|S_t=s]=\mathbb{E}_\pi\big[\sum^\infty_{k=0}\gamma^kR_{t+k+1}|S_t=s\big] \\ &=\mathbb{E}_\pi\big[R_{t+1}+\gamma\sum^\infty_{k=0}\gamma^kR_{t+k+2}|S_t=s\big]\\ &=\sum_a\pi(a|s)\sum_{s'}\sum_rp(s',r|s,a)\bigg[r+\gamma\mathbb{E}_\pi\big[\sum^\infty_{k=0}\gamma^kR_{t+k+2}|S_{t+1}=s'\big]\bigg]\\ &=\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_\pi(s')] \end{align*} $$

Interestingly, the value of a state under a policy is written in terms of the (expected, discounted) values of its successor states: the value of the current state depends on the states that follow it, not on the states that preceded it. This is the Bellman equation for \(v_\pi\).
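Because the equation expresses \(v_\pi(s)\) in terms of successor-state values, it can also be used as an update rule and iterated to a fixed point (iterative policy evaluation). The sketch below does this on a made-up two-state MDP; the dynamics, the 50/50 policy, and the discount factor are assumptions for illustration only.

```python
# A minimal sketch of iterative policy evaluation: apply the Bellman equation
# for v_pi as an update until the values stop changing.
# The two-state MDP, the 50/50 policy, and gamma are illustrative assumptions.

GAMMA = 0.9

# dynamics[s][a] -> list of (next_state, reward, probability), i.e. p(s',r|s,a)
dynamics = {
    "s0": {"stay": [("s0", 0.0, 1.0)], "go": [("s1", 1.0, 1.0)]},
    "s1": {"stay": [("s1", 0.0, 1.0)], "go": [("s0", 0.0, 1.0)]},
}
policy = {  # pi(a | s)
    "s0": {"stay": 0.5, "go": 0.5},
    "s1": {"stay": 0.5, "go": 0.5},
}

v = {s: 0.0 for s in dynamics}  # initial guess v(s) = 0
for _ in range(10_000):
    delta = 0.0
    for s in dynamics:
        # v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
        new_v = sum(
            policy[s][a] * sum(p * (r + GAMMA * v[s2]) for (s2, r, p) in outs)
            for a, outs in dynamics[s].items()
        )
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-8:  # converged
        break

print(v)
```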

Similarly,

$$p(r|s,a)=\sum_{s'}p(s',r|s,a)$$ $$\sum_r rp(r|s,a) = \sum_r\sum_{s'} rp(s',r|s,a)$$

The expected return starting from s, taking the action a, and thereafter following policy \(\pi\),

$$q_\pi(s,a)=\mathbb{E}_\pi[G_t|S_t=s,A_t=a]=\mathbb{E}_\pi\big[\sum^\infty_{k=0}\gamma^kR_{t+k+1}|S_t=s,A_t=a\big]\\ =\sum_{s',r}p(s',r|s,a)\big[r+\gamma\sum_{a'}\pi(a'|s')q_\pi(s',a')\big] $$

The backup diagram (shown in the book) illustrates the relationships that form the basis of the update: it transfers value information back to a state from its successor states.
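As a concrete illustration of one such backup, the sketch below performs a single update of \(q_\pi(s,a)\) from assumed successor action values; the dynamics, the successor \(q\)-values, the policy, and the discount are all made up.

```python
# One backup of the Bellman equation for q_pi: value information flows back
# to (s, a) from its successor state-action pairs.
# The dynamics, successor q-values, policy, and gamma are made-up assumptions.

GAMMA = 0.9

# p(s', r | s, a) for a single (s, a): list of (next_state, reward, probability)
outcomes = [("s1", 1.0, 0.7), ("s2", 0.0, 0.3)]

# pretend these were already computed under the policy pi below
q_pi = {("s1", "left"): 2.0, ("s1", "right"): 1.0,
        ("s2", "left"): 0.5, ("s2", "right"): 0.0}
policy = {"s1": {"left": 0.5, "right": 0.5},
          "s2": {"left": 0.5, "right": 0.5}}

# q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma * sum_{a'} pi(a'|s') q_pi(s',a')]
q_sa = sum(
    p * (r + GAMMA * sum(policy[s2][a2] * q_pi[(s2, a2)] for a2 in policy[s2]))
    for (s2, r, p) in outcomes
)
print(q_sa)  # 0.7*(1 + 0.9*1.5) + 0.3*(0 + 0.9*0.25) = 1.7125
```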

Optimal Value Functions

A policy \(\pi\) is defined to be better than or equal to a policy \(\pi'\), written \(\pi\geq\pi'\), if and only if \(v_\pi(s)\geq v_{\pi'}(s)\) for all \(s\in\mathfrak{S}\). A policy that is better than or equal to all other policies is an optimal policy. The optimal state-value function is defined as,

$$v_*(s) = \max_\pi v_\pi(s)$$

The optimal action-value function is defined as,

$$q_*(s,a)=\max_\pi q_\pi(s,a)$$

The Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state,

$$v_*(s)=\max_{a\in\mathfrak{A}(s)}q_{\pi_*}(s,a)=\max_a\mathbb{E}_{\pi_*}[G_t|S_t=s,A_t=a]\\ =\max_{a\in\mathfrak{A}(s)}\sum_{s',r}p(s',r|s,a)[r+\gamma v_*(s')]\\ q_*(s,a) = \sum_{s',r}p(s',r|s,a)[r+\gamma\max_{a'}q_*(s',a')] $$

This solution relies on at least three assumptions that are rarely true in practice: (1) we accurately know the dynamics of the environment; (2) we have enough computational resources to complete the computation of the solution; and (3) the states satisfy the Markov property.
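Under the first two assumptions (known dynamics and enough compute for a tabular sweep), one Bellman optimality backup for a single state looks like the sketch below; the dynamics, the assumed optimal successor values, and the discount are illustrative only.

```python
# One Bellman optimality backup for a single state s: the best one-step
# expected return over the available actions.
# The dynamics, the successor values v_*, and gamma are made-up assumptions.

GAMMA = 0.9

# dynamics[a] -> list of (next_state, reward, probability) for the fixed state s
dynamics = {
    "left":  [("s1", 0.0, 1.0)],
    "right": [("s2", 1.0, 0.5), ("s1", 0.0, 0.5)],
}
v_star = {"s1": 3.0, "s2": 1.0}  # pretend the optimal successor values are known

# q_*(s, a) = sum_{s',r} p(s',r|s,a) [r + gamma * v_*(s')]
q_star = {
    a: sum(p * (r + GAMMA * v_star[s2]) for (s2, r, p) in outs)
    for a, outs in dynamics.items()
}

# v_*(s) = max_a q_*(s, a)
print(q_star)                # {'left': 2.7, 'right': 2.3}
print(max(q_star.values()))  # 2.7
```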

Dynamic Programming

Dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
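As one concrete example of such an algorithm, here is a minimal policy-iteration sketch (iterative policy evaluation followed by greedy policy improvement, repeated until the policy is stable); the two-state MDP and the discount factor are made up for illustration.

```python
# A minimal policy-iteration sketch for a tabular MDP with known dynamics:
# evaluate the current policy, then improve it greedily, until it is stable.
# The two-state MDP and gamma below are illustrative assumptions.

GAMMA = 0.9
THETA = 1e-8

# dynamics[s][a] -> list of (next_state, reward, probability), i.e. p(s',r|s,a)
dynamics = {
    "s0": {"stay": [("s0", 0.0, 1.0)], "go": [("s1", 1.0, 1.0)]},
    "s1": {"stay": [("s1", 2.0, 1.0)], "go": [("s0", 0.0, 1.0)]},
}

def backup(s, a, v):
    """One-step expected return: sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]."""
    return sum(p * (r + GAMMA * v[s2]) for (s2, r, p) in dynamics[s][a])

def evaluate(policy):
    """Iterative policy evaluation for a deterministic policy {s: a}."""
    v = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s in dynamics:
            new_v = backup(s, policy[s], v)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < THETA:
            return v

policy = {s: next(iter(dynamics[s])) for s in dynamics}  # arbitrary start
while True:
    v = evaluate(policy)
    # policy improvement: act greedily with respect to the current v
    improved = {s: max(dynamics[s], key=lambda a: backup(s, a, v))
                for s in dynamics}
    if improved == policy:  # policy stable -> optimal under this model
        break
    policy = improved

print(policy, v)
```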


Grid world

Steps