Bayesian Approach


$$CTR = \frac{\# clicks}{\# impressions}$$

where clicks is the number of clicks and impressions is the number of times the page (or ad) was shown. The CTR can be interpreted as the probability that an impression is clicked.

Bernoulli Distribution

A single impression can be modeled as a Bernoulli random variable, where the CTR is the probability of \(x=1\),

$$p(x) = \pi^x(1-\pi)^{1-x}=p(x|\pi)$$

where \(\pi\) is the probability that \(x=1\), and \(x\in\{0,1\}\).

Beta distribution

To use Bayes' rule, a prior over the probability \(\pi\) is required. The Beta distribution is a natural choice because it is a distribution over \([0,1]\), i.e., a probability of a probability (see the references for details).

$$p(\pi) = \frac{1}{B(\alpha,\beta)}\pi^{\alpha-1}(1-\pi)^{\beta-1}$$ $$B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} = \frac{(\alpha-1)!(\beta-1)!}{(\alpha+\beta-1)!}$$

Online Learning

Using Bayes' rule, the posterior \(p(\pi|X)\) can be obtained. Because the Beta prior is conjugate to the Bernoulli likelihood, the posterior is again a Beta distribution, so it can serve as the new prior. Intuitively, you start with a prior belief about the probability, and as more events are observed your belief gets updated; this is the core idea of online learning.

$$ \begin{align*} p(\pi|X) &\propto p(X|\pi)p(\pi) = p(\pi)\prod_{i=1}^N\pi^{x_i}(1-\pi)^{1-x_i} \\ &\propto \pi^{\alpha-1}(1-\pi)^{\beta-1}\pi^{\sum_ix_i}(1-\pi)^{N-\sum_ix_i}\\ &=\pi^{\alpha+\sum_ix_i-1}(1-\pi)^{\beta+N-\sum_ix_i-1}=\pi^{\alpha'-1}(1-\pi)^{\beta'-1} \end{align*} $$

Therefore, the hyperparameters can be updated iteratively with more and more data,

$$\alpha'=\alpha+\sum_ix_i$$ $$\beta'=\beta+N-\sum_ix_i$$

Consider the mean and variance,

$$\mu = \frac{\alpha}{\alpha+\beta}$$ $$\sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$

Since \(\alpha+\beta\) grows by one with every observation, the variance decreases as the number of samples increases, i.e., the estimate becomes more confident.
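The update rule above can be sketched in a few lines of code. This is a minimal illustration, assuming a hypothetical true CTR of 0.3 and a uniform Beta(1, 1) prior; the variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior hyperparameters: Beta(1, 1) is a uniform prior on the CTR.
alpha, beta = 1.0, 1.0

# Simulated click stream: 1 = click, 0 = no click (assumed true CTR = 0.3).
clicks = rng.binomial(1, 0.3, size=1000)

# Online conjugate update: alpha' = alpha + sum(x), beta' = beta + N - sum(x).
for x in clicks:
    alpha += x
    beta += 1 - x

# Posterior mean and variance of the Beta distribution.
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(mean, var)
```

After 1000 impressions the posterior mean should be close to the true CTR, and the variance is already tiny, reflecting the growing confidence discussed above.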


Similarly, a conjugate analysis applies to continuous quantities.

Gaussian distribution

A rating can be described by a Gaussian distribution. Assume that the prior, likelihood, and posterior are all normally distributed.

However, instead of the variance, it is more convenient to work with the precision \(\tau\) (the inverse variance),

$$p(x|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)=p(x|\mu,\tau)=\sqrt{\frac{\tau}{2\pi}}\exp\left(-\frac{\tau}{2}(x-\mu)^2\right)$$
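A quick sanity check that the two parameterizations agree when \(\tau = 1/\sigma^2\); the function names here are invented for the example.

```python
import math

def normal_pdf_precision(x, mu, tau):
    """Gaussian density parameterized by precision tau = 1/sigma^2."""
    return math.sqrt(tau / (2 * math.pi)) * math.exp(-0.5 * tau * (x - mu) ** 2)

def normal_pdf_variance(x, mu, sigma2):
    """Standard parameterization by variance sigma^2."""
    return (1.0 / math.sqrt(2 * math.pi * sigma2)) * math.exp(-((x - mu) ** 2) / (2 * sigma2))

# tau = 4 corresponds to sigma^2 = 0.25, so the two densities match.
print(normal_pdf_precision(1.5, 0.0, 4.0))
print(normal_pdf_variance(1.5, 0.0, 0.25))
```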

Model of prior

To model the distribution of the hyperparameters, there are three cases:

  1. Model \(p(\mu)\), \(\tau\) known, \(p(\mu)\sim\mathcal{N}\)
  2. Model \(p(\tau)\), \(\mu\) known, \(p(\tau)\sim\Gamma\)
  3. Model \(p(\mu,\tau)\), \(p(\mu)\sim\mathcal{N}\Gamma\)

Assume that the likelihood and prior are written as

$$X\sim\mathcal{N}(\mu,\tau^{-1}) \quad \mu\sim\mathcal{N}(m,\lambda^{-1})$$

Online Learning

For the first case only,

$$ \begin{align*} p(\mu|X) &\propto p(X|\mu)p(\mu)\propto \exp\left(-\frac{\lambda}{2}(\mu-m)^2\right)\prod_i \exp\left(-\frac{\tau}{2}(x_i-\mu)^2\right)\\ &\propto \exp\left(-\frac{\lambda'}{2}(\mu-m')^2\right) \end{align*} $$

Expanding and comparing terms in the exponent,

$$\begin{align*} \lambda(\mu-m)^2+\tau\sum_i(x_i-\mu)^2 &=\mu^2(\lambda+N\tau)-2\mu(\lambda m+\tau\sum_ix_i)+\cdots \\ &=\mu^2\lambda'-2\mu\lambda'm'+\cdots \end{align*}$$

Therefore,

$$\lambda' = \lambda+N\tau$$ $$m'=\frac{\lambda m+\tau\sum_ix_i}{\lambda'}=\frac{\lambda m+\tau\sum_ix_i}{\lambda+N\tau}$$
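The Gaussian conjugate update above can also be sketched directly. This is a minimal example assuming a known likelihood precision \(\tau=1\), a hypothetical prior \(\mathcal{N}(0, 1)\), and a simulated true mean of 4.0 (all invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Known likelihood precision tau; hypothetical prior N(m, 1/lam).
tau = 1.0
m, lam = 0.0, 1.0

# Simulated ratings with an assumed true mean of 4.0 and unit variance.
ratings = rng.normal(4.0, 1.0, size=500)

# Conjugate update: lam' = lam + N*tau, m' = (lam*m + tau*sum(x)) / lam'.
N = len(ratings)
lam_new = lam + N * tau
m_new = (lam * m + tau * ratings.sum()) / lam_new
print(m_new, lam_new)
```

As with the Beta case, the posterior precision \(\lambda'\) grows linearly with the number of observations, so the posterior over \(\mu\) concentrates around the sample mean.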

Explore and Exploit

Instead of using a lower bound of the distribution, we sample from the posterior distribution (Thompson sampling), which balances exploration and exploitation. If we always acted on the posterior mean alone, there would be no exploration.
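A minimal Thompson sampling sketch for the CTR setting: each candidate ad keeps its own Beta posterior, one sample is drawn per ad, and the ad with the highest sampled CTR is shown. The three true CTR values are hypothetical and hidden from the algorithm.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true CTRs for three ads; the algorithm never sees these.
true_ctr = [0.10, 0.20, 0.30]
alphas = np.ones(3)
betas = np.ones(3)

for _ in range(5000):
    # Draw one sample per ad from its Beta posterior and
    # show the ad with the highest sampled CTR.
    samples = rng.beta(alphas, betas)
    ad = int(np.argmax(samples))
    click = rng.binomial(1, true_ctr[ad])
    alphas[ad] += click
    betas[ad] += 1 - click

# Impressions served per ad (subtract the Beta(1, 1) prior counts).
counts = alphas + betas - 2
print(counts)
```

Early on, wide posteriors make every ad likely to win the sample draw (exploration); as evidence accumulates, the posteriors narrow and the best ad wins most draws (exploitation).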

References


  1. Order Statistics by Colin Rundel
  2. How Does the Funk SVD Algorithm work in Recommendation Engines
  3. Normal-gamma distribution
  4. Bayesian Inference