Reinforcement Learning Introduction


Reinforcement learning is learning from interaction with an environment in order to achieve long-term goals. It uses a formal framework that defines the interaction between a learning agent and its environment in terms of states, actions, and rewards. The concepts of value and value functions are key features of most RL methods.

Definitions

A policy defines the learning agent's way of behaving at a given time: a mapping from perceived states of the environment to the actions to be taken when in those states, e.g. a lookup table or a stochastic decision rule. \(\pi,\quad \pi(s),\quad \pi(a\mid s)=p(a_t=a\mid s_t=s)\). An individual action is not the policy; the policy is the mapping that selects actions.
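A minimal sketch of the two forms, with made-up state and action names for illustration: a deterministic policy as a lookup table, and a stochastic policy as a distribution \(\pi(a\mid s)\) per state.

```python
import random

# Deterministic policy: a lookup table from state to action (hypothetical states/actions).
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: each state maps to a probability distribution pi(a|s) over actions.
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Sample an action a with probability pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])              # -> "left"
print(sample_action(stochastic_policy, "s1"))  # -> "left" about 80% of the time
```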

A reward defines the goal in an RL problem: on each time step, the environment sends the RL agent a single number \(r_t \in \mathbb{R}\). The agent's sole objective is to maximize the total reward it receives over the long term.

The value of a state \(v(s)\) is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.

$$V(s)\leftarrow V(s)+\underbrace{\alpha}_{\text{step-size}}\big[\underbrace{V(s')-V(s)}_{\text{temporal-difference}}\big]$$

The current value of the earlier state \(V(s)\) is adjusted to be closer to the value of the later state \(V(s')\); here the state value estimates the probability of winning.
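A minimal sketch of this update, assuming value estimates stored in a dict and a hypothetical step size of 0.1 (no intermediate reward term, as in the formula above):

```python
# Value estimates V(s) for two hypothetical states, e.g. estimated probability of winning.
V = {"s": 0.5, "s_next": 0.7}
alpha = 0.1  # step-size parameter (assumed value)

def td_update(V, s, s_next, alpha):
    """Move V(s) a fraction alpha toward V(s'): V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[s] = V[s] + alpha * (V[s_next] - V[s])

td_update(V, "s", "s_next", alpha)
print(V["s"])  # -> 0.52, i.e. 0.5 moved 10% of the way toward 0.7
```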

A model mimics the behavior of the environment: given a state and an action, it predicts the resulting next state and reward, and is used for planning.

RL history: trial-and-error learning → temporal-difference learning → actor-critic methods

N-Armed Bandit Problem

The agent is repeatedly faced with a choice among n different options, or actions. After each choice, it receives a numerical reward drawn from a stationary probability distribution that depends on the action selected. The objective is to maximize the expected total reward over some number of time steps. Examples: a slot machine with n levers, or a doctor choosing between experimental treatments.

If the value of each action were known, the n-armed bandit problem could be solved easily by always selecting the action with the highest value. The challenge is how to estimate the action values from experience.

Greedy action: selecting the action with the highest estimated value (exploiting the current knowledge of the action values; high reward in the short term, possibly lower in the long term).

Non-greedy action: selecting one of the other actions to improve the estimates of their values (exploring; lower reward in the short term, possibly higher in the long term). A sketch of balancing the two follows.
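A minimal sketch of this exploration/exploitation trade-off, assuming sample-average action-value estimates, a hypothetical epsilon of 0.1, and made-up Gaussian reward distributions for the arms:

```python
import random

n = 5                      # number of arms
epsilon = 0.1              # probability of taking a non-greedy (exploratory) action (assumed value)
true_means = [random.gauss(0, 1) for _ in range(n)]  # hidden, stationary reward means

Q = [0.0] * n              # estimated action values
counts = [0] * n           # number of times each action has been selected

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(n)          # non-greedy: explore a random action
    else:
        a = Q.index(max(Q))              # greedy: exploit current estimates
    r = random.gauss(true_means[a], 1)   # reward drawn from a stationary distribution
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]       # incremental sample-average update

print("best true arm   :", true_means.index(max(true_means)))
print("most selected arm:", counts.index(max(counts)))
```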
