Reinforcement Learning Introduction


Reinforcement learning is learning from interaction with an environment in order to achieve long-term goals. It uses a formal framework that defines the interaction between a learning agent and its environment in terms of states, actions, and rewards. The concepts of value and value functions are key features of most RL methods.

Definitions

A policy defines the learning agent's way of behaving at a given time: a mapping from perceived states of the environment to the actions to be taken when in those states, e.g. a lookup table or a stochastic decision rule. \(\pi,\quad \pi(s),\quad \pi(a\mid s)=p(a_t=a\mid s_t=s)\). An individual action is not the policy; the policy is the mapping that selects actions.
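A minimal sketch of the two forms, with made-up state and action names for illustration: a deterministic policy as a lookup table, and a stochastic policy as a distribution \(\pi(a\mid s)\) per state.

```python
import random

# Deterministic policy: a lookup table from state to action (hypothetical states/actions).
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: each state maps to a probability distribution pi(a|s) over actions.
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Sample an action a with probability pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])              # -> "left"
print(sample_action(stochastic_policy, "s1"))  # -> "left" about 80% of the time
```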

A reward defines the goal in an RL problem: on each time step, the environment sends the RL agent a single number \(r_t \in \mathbb{R}\). The agent's sole objective is to maximize the total reward it receives over the long term.

The value of a state \(v(s)\) is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.

$$V(s)\leftarrow V(s)+\underbrace{\alpha}_{\text{step-size}}\big[\underbrace{V(s')-V(s)}_{\text{temporal-difference}}\big]$$

The current value of the earlier state \(V(s)\) is adjusted to be closer to the value of the later state \(V(s')\); here the state value estimates the probability of winning.
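A minimal sketch of this update, assuming value estimates stored in a dict and a hypothetical step size of 0.1 (no intermediate reward term, as in the formula above):

```python
# Value estimates V(s) for two hypothetical states, e.g. estimated probability of winning.
V = {"s": 0.5, "s_next": 0.7}
alpha = 0.1  # step-size parameter (assumed value)

def td_update(V, s, s_next, alpha):
    """Move V(s) a fraction alpha toward V(s'): V(s) <- V(s) + alpha * (V(s') - V(s))."""
    V[s] = V[s] + alpha * (V[s_next] - V[s])

td_update(V, "s", "s_next", alpha)
print(V["s"])  # -> 0.52, i.e. 0.5 moved 10% of the way toward 0.7
```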

A model mimics the behavior of the environment: given a state and an action, it predicts the resulting next state and reward, and is used for planning.

RL history: trial-and-error learning → temporal-difference learning → actor-critic methods

N-Armed Bandit Problem

The agent is repeatedly faced with a choice among n different options, or actions. After each choice, it receives a numerical reward drawn from a stationary probability distribution that depends on the action selected. The objective is to maximize the expected total reward over some number of time steps. Examples: a slot machine with n levers, or a doctor choosing between experimental treatments.

If the value of each action were known, the n-armed bandit problem could be solved easily by always selecting the action with the highest value. The challenge is how to estimate the action values from experience.

Greedy action: selecting the action with the highest estimated value (exploiting the current knowledge of the action values; high reward in the short term, possibly lower in the long term).

Non-greedy action: selecting one of the other actions to improve the estimates of their values (exploring; lower reward in the short term, possibly higher in the long term). A sketch of balancing the two follows.
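A minimal sketch of this exploration/exploitation trade-off, assuming sample-average action-value estimates, a hypothetical epsilon of 0.1, and made-up Gaussian reward distributions for the arms:

```python
import random

n = 5                      # number of arms
epsilon = 0.1              # probability of taking a non-greedy (exploratory) action (assumed value)
true_means = [random.gauss(0, 1) for _ in range(n)]  # hidden, stationary reward means

Q = [0.0] * n              # estimated action values
counts = [0] * n           # number of times each action has been selected

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(n)          # non-greedy: explore a random action
    else:
        a = Q.index(max(Q))              # greedy: exploit current estimates
    r = random.gauss(true_means[a], 1)   # reward drawn from a stationary distribution
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]       # incremental sample-average update

print("best true arm   :", true_means.index(max(true_means)))
print("most selected arm:", counts.index(max(counts)))
```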
