State-Action-Reward-State-Action


Cliff world

The update is applied after every transition from a non-terminal state \(s_t\), using the quintuple of events \((s_t,a_t,r_t,s_{t+1},a_{t+1})\) that gives the algorithm its name. If \(s_{t+1}\) is terminal, then \(Q(s_{t+1},a_{t+1})\) is defined as zero.

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha\left(r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\right)$$
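
As an illustration, the update rule above can be sketched in Python. This is a minimal sketch, not a reference implementation: the function name `sarsa_update`, the dictionary-based table `Q`, the `terminal` flag, and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, terminal, alpha=0.5, gamma=1.0):
    """Apply one SARSA update to Q in place and return the TD error.

    Q is a mapping from (state, action) pairs to value estimates.
    All names and defaults here are illustrative assumptions.
    """
    # Q(s_{t+1}, a_{t+1}) is defined as zero when s_{t+1} is terminal.
    next_q = 0.0 if terminal else Q[(s_next, a_next)]
    # Temporal-difference error: r_t + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
    td_error = r + gamma * next_q - Q[(s, a)]
    # Move the current estimate a fraction alpha toward the target.
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical two-state example: unseen (state, action) pairs start at 0.
Q = defaultdict(float)
Q[("B", "right")] = 2.0
sarsa_update(Q, "A", "right", -1.0, "B", "right",
             terminal=False, alpha=0.5, gamma=0.9)
```

Because the target uses the action \(a_{t+1}\) that the agent actually takes next, the caller is expected to select `a_next` with the same policy it is following (for example, an epsilon-greedy policy) before calling the update.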
