Markov Decision Processes


Agent-Environment Interface

The learner and decision-maker is the agent; the thing it interacts with is the environment. The agent takes action \(A_t\) in the environment; this changes the state \(S_t\) of the agent within the environment (for example, its location), and the environment feeds back the reward \(R_{t+1}\) to the agent, i.e. \(S_t \xrightarrow{A_t} S_{t+1}, R_{t+1}\).
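A minimal sketch of this interaction loop in Python. The `ToyEnvironment`, its `reset`/`step` methods, and the random action choice are all hypothetical illustrations (loosely following the common Gym-style convention), not anything prescribed by the text:

```python
import random

class ToyEnvironment:
    """Hypothetical 1-D corridor: the agent moves left/right until it reaches position 3."""
    def reset(self):
        self.position = 0          # S_0
        return self.position

    def step(self, action):
        self.position += action                        # S_t --A_t--> S_{t+1}
        reward = 1.0 if self.position == 3 else 0.0    # R_{t+1}
        done = self.position == 3
        return self.position, reward, done

env = ToyEnvironment()
state = env.reset()
for t in range(100):                      # cap the episode length
    action = random.choice([-1, +1])      # the agent picks A_t (here: uniformly at random)
    state, reward, done = env.step(action)
    print(f"t={t}  S={state}  R={reward}")
    if done:
        break
```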

Imagine you have a robot. The robot's actions can be plans or some other non-physical output; the environment can be its body, the floor, etc. Rewards are also something non-physical, an intrinsic value that estimates how close the goal is. The agent is the algorithm, the consciousness?

Goals and Rewards

The goal is the maximization of the expected value of the cumulative sum of a received scalar signal called the reward. Designing the reward function can be a challenge: it should tell the robot what you want it to achieve, not how you want it achieved.

Returns

This has been discussed in the introduction; just a quick recap here,

$$G_t=R_{t+1}+R_{t+2}+\cdots+R_T$$

$$G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots = \sum^\infty_{k=0}\gamma^k R_{t+k+1}$$

$$G_t=\sum^{T-t-1}_{k=0}\gamma^k R_{t+k+1}$$

The first is the undiscounted return of an episode that ends at time \(T\), the second is the discounted return for continuing tasks with discount rate \(\gamma\), and the third is the unified notation covering both cases (allowing \(T=\infty\) or \(\gamma=1\), but not both at once).
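As a quick illustrative sketch (not from the text), the discounted return can be computed from a recorded reward sequence either directly from the definition or via the backward recursion \(G_t = R_{t+1} + \gamma G_{t+1}\); the reward values below are made up:

```python
def discounted_return(rewards, gamma):
    """G_0 for a finite reward sequence R_1, ..., R_T, from the definition."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def all_returns(rewards, gamma):
    """[G_0, ..., G_{T-1}] via the backward recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

rewards = [0.0, 0.0, 1.0, 5.0]                 # R_1, R_2, R_3, R_4 (hypothetical)
print(discounted_return(rewards, gamma=0.9))   # 0.9^2 * 1 + 0.9^3 * 5 = 4.455
print(all_returns(rewards, gamma=0.9))         # [4.455, 4.95, 5.5, 5.0]
```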

Markov Property

The state is any information available to the agent. The main concern is not with designing the state signal, but with deciding what action to take as a function of whatever state signal is available. A state signal that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. It is an independence-of-path property: all that matters is contained in the current state signal.

The probability of responses (the environment dynamics),

$$Pr(R_{t+1}=r,S_{t+1}=s'|\{S_i\}_{i=0}^t,\{A_i\}_{i=0}^t,\{R_i\}_{i=1}^t)$$

If the state signal has the Markov property, the environment's response at \(t+1\) depends only on the state and action representations at \(t\). The environment's dynamics then reduce to,

$$p(s',r|s,a) = Pr(R_{t+1}=r,S_{t+1}=s'|S_t=s,A_t=a)$$

This is called the one-step dynamics of the environment.
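As a concrete illustration (a made-up example, not from the text), the one-step dynamics of a small finite MDP can be stored as a table keyed by \((s,a)\), mapping each \((s',r)\) outcome to its probability:

```python
# One-step dynamics p(s', r | s, a) for a tiny, hypothetical two-state MDP.
# States: "high", "low" (battery level); actions: "search", "wait".
dynamics = {
    ("high", "search"): {("high", 1.0): 0.7, ("low", 1.0): 0.3},
    ("high", "wait"):   {("high", 0.5): 1.0},
    ("low", "search"):  {("low", 1.0): 0.6, ("high", -3.0): 0.4},  # may run flat, needs rescue
    ("low", "wait"):    {("low", 0.5): 1.0},
}

# Sanity check: for every (s, a), the probabilities over (s', r) must sum to 1.
for (s, a), outcomes in dynamics.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, (s, a)
```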

Markov Decision Processes (MDP)

The expected rewards for state-action pairs,

$$r(s,a)=\mathbb{E}[R_{t+1}|S_t=s,A_t=a]=\sum_{r\in\mathfrak{R}}r\sum_{s'\in\mathfrak{S}}p(s',r|s,a)$$

The state-transition probabilities,

$$p(s'|s,a)=Pr(S_{t+1}=s'|S_t=s,A_t=a)=\sum_{r\in\mathfrak{R}}p(s',r|s,a)=\mathfrak{P}^a_{ss'}$$

The expected rewards for state-action-next-state triples,

$$r(s,a,s')=\mathbb{E}[R_{t+1}|S_t=s,A_t=a,S_{t+1}=s']=\frac{\sum_{r\in\mathfrak{R}}rp(s',r|s,a)}{p(s'|s,a)}=\mathfrak{R}^a_{ss'}$$
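Continuing the hypothetical `dynamics` table from the sketch above, all three quantities can be computed directly from \(p(s',r\mid s,a)\):

```python
def transition_prob(dynamics, s, a, s_next):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(dynamics, s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(r * p for (sp, r), p in dynamics[(s, a)].items())

def expected_reward_triple(dynamics, s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)."""
    num = sum(r * p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)
    return num / transition_prob(dynamics, s, a, s_next)

print(expected_reward(dynamics, "low", "search"))                 # 0.6*1.0 + 0.4*(-3.0) = -0.6
print(transition_prob(dynamics, "low", "search", "high"))         # 0.4
print(expected_reward_triple(dynamics, "low", "search", "high"))  # -3.0
```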

A transition graph is a useful way to summarize the dynamics of a finite MDP, mainly the last two quantities: the state-transition probabilities and the expected rewards for triples. It has two kinds of nodes: a state node for each possible state, and an action node for each state-action pair. (This may relate to message-passing graphical models.) A short sketch below enumerates these edges for the toy dynamics above.
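Each edge of the transition graph goes from an action node \((s,a)\) to a state node \(s'\), labeled with \(p(s'\mid s,a)\) and \(r(s,a,s')\). Reusing the hypothetical `dynamics` table and helper functions from the earlier sketches:

```python
# Enumerate the edges of the transition graph: one edge per (s, a) -> s'
# with p(s' | s, a) > 0, labeled with its probability and expected reward.
states = {s for (s, _) in dynamics}
for (s, a) in dynamics:
    for s_next in states:
        p = transition_prob(dynamics, s, a, s_next)
        if p > 0:
            r = expected_reward_triple(dynamics, s, a, s_next)
            print(f"({s}, {a}) --[p={p:.2f}, r={r:+.2f}]--> {s_next}")
```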

References