where clicks is the number of clicks and impressions is the number of times the page was shown, so the CTR behaves like the probability of a click.
Each impression can be modeled as a Bernoulli trial, with the CTR as the probability of the outcome being 1,
$$p(x) = \pi^x(1-\pi)^{1-x}=p(x|\pi)$$where \(\pi\) is the probability of \(x=1\), and \(x\in \{0,1\}\).
To use Bayes' rule, a prior over this probability is required. Read the link to understand why the Beta distribution can describe a probability of a probability.
$$p(\pi) = \frac{1}{B(\alpha,\beta)}\pi^{\alpha-1}(1-\pi)^{\beta-1}$$ $$B(\alpha,\beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} = \frac{(\alpha-1)!(\beta-1)!}{(\alpha+\beta-1)!}$$Using Bayes' rule, the posterior \(p(\pi|X)\) can be obtained; because it has the same Beta form as the prior, the Beta distribution is a conjugate prior, and the posterior can serve as the new prior. Intuitively, you start with a prior assumption about the probability, and as more events happen, that prior gets updated. This is the core idea of online learning.
$$ \begin{align*} p(\pi|X) &\propto p(X|\pi)p(\pi) = p(\pi)\prod_{i=1}^N\pi^{x_i}(1-\pi)^{1-x_i} \\ &\propto \pi^{\alpha-1}(1-\pi)^{\beta-1}\pi^{\sum_ix_i}(1-\pi)^{N-\sum_ix_i}\\ &=\pi^{\alpha+\sum_ix_i-1}(1-\pi)^{\beta+N-\sum_ix_i-1}=\pi^{\alpha'-1}(1-\pi)^{\beta'-1} \end{align*} $$Therefore, the hyperparameters can be updated iteratively as more data arrive,
$$\alpha'=\alpha+\sum_ix_i$$ $$\beta'=\beta+N-\sum_ix_i$$Consider the mean and variance,
$$\mu = \frac{\alpha}{\alpha+\beta}$$ $$\sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$The variance decreases as the number of samples grows.
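The Beta–Bernoulli update above can be sketched in a few lines. This is a minimal illustration with simulated data; the variable names and the true CTR value are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = 0.3  # hypothetical "true" click probability
x = (rng.random(1000) < true_ctr).astype(int)  # 1 = click, 0 = no click

# Start from a uniform Beta(1, 1) prior and apply the conjugate update.
a, b = 1.0, 1.0
a += x.sum()           # alpha' = alpha + sum_i x_i
b += len(x) - x.sum()  # beta'  = beta + N - sum_i x_i

# Closed-form posterior mean and variance of the Beta distribution.
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
```

With 1000 impressions, `mean` lands close to the true CTR and `var` is already tiny, illustrating how the posterior concentrates as data accumulate.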
Similarly, a rating can be described by a Gaussian distribution. Assume that the prior, likelihood, and posterior are all normally distributed.
However, instead of the variance, it is more convenient to work with the precision (inverse variance),
$$p(x|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)=p(x|\mu,\tau)=\sqrt{\frac{\tau}{2\pi}}\exp\left(-\frac{1}{2}\tau(x-\mu)^2\right)$$To model the distribution of the hyperparameter \(\mu\), assume that the likelihood and prior are written as
$$X\sim\mathcal{N}(\mu,\tau^{-1}) \quad \mu\sim\mathcal{N}(m,\lambda^{-1})$$For the first case, where only the mean \(\mu\) is unknown and the precision \(\tau\) is fixed,
$$ \begin{align*} p(\mu|X) &\propto p(X|\mu)p(\mu)\propto \exp\left(-\frac{1}{2}\lambda(\mu-m)^2\right)\prod_i \exp\left(-\frac{1}{2}\tau(x_i-\mu)^2\right)\\ &\propto \exp\left(-\frac{1}{2}\lambda'(\mu-m')^2\right) \end{align*} $$Expanding and comparing terms in the exponent,
$$\begin{align*} &\lambda(\mu-m)^2+\tau\sum_i(x_i-\mu)^2=\mu^2(\lambda+\tau\sum_i 1)-2\mu(\lambda m+\tau\sum_ix_i)+\cdots \\ &=\mu^2\lambda'-2\mu\lambda'm'+\cdots\\ \end{align*}$$Therefore,
$$\lambda' = \lambda+N\tau$$ $$m'=\frac{\lambda m+\tau\sum_ix_i}{\lambda'}=\frac{\lambda m+\tau\sum_ix_i}{\lambda+N\tau}$$Instead of using a confidence bound of the distribution, here we sample from the posterior, which ensures both exploration and exploitation (Thompson sampling). If we only used the mean, there would be no exploration.