Bayesian Inference


Model

A model is expected to predict a target \(y\) from an input \(x\). In practice this involves fitting (training, i.e., learning from the data), tuning the parameters, and making predictions at deployment.

In mathematical terms, given a dataset \(\mathrm{D} = \{x_i,y_i\}^N_{i=1}=(\mathbb{x},\mathbb{y})\) of \(N\) pairs of inputs \(x_i\) and targets (outputs) \(y_i\), we would like a model \(\mathcal{M}\) with function \(f\) and parameters \(\theta\) that gives accurate predictions.

For example, polynomial regression,

$$f_\mathbb{w}(x) = w_0+w_1x+w_2x^2+\cdots+w_Mx^M$$

can be interpreted as follows: the model \(\mathcal{M}\) is the polynomial regression model, the parameters are \(\theta=(w_0,w_1,\ldots,w_M,M)=(\mathbb{w},M)\), and the prediction \(f_\mathbb{w}(x)\) is used to approximate \(y\).

To be more compact, the function can be rewritten in terms of basis functions, i.e., transformations of the input \(\phi_j(x)=x^j\), so that \(f_\mathbb{w}(x_i)=\sum^M_{j=0}w_j\phi_j(x_i)\). All inputs and outputs can then be stacked in matrix form with \(\Phi_{ij}=\phi_j(x_i)\),

$$\mathbb{f}_\mathbb{w}(\mathbb{x})=\begin{pmatrix}f_\mathbb{w}(x_1)\\f_\mathbb{w}(x_2)\\\vdots\\f_\mathbb{w}(x_N)\end{pmatrix}=\begin{pmatrix}1& \cdots & x_1^M\\\vdots &\ddots & \vdots\\1 & \cdots & x_N^M\end{pmatrix}\begin{pmatrix}w_0\\w_1\\\vdots\\w_M\end{pmatrix}=\Phi\mathbb{w}$$
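To make this concrete, here is a minimal NumPy sketch of the design matrix and the stacked predictions; the helper names `design_matrix` and `f_w` are ours, chosen for illustration only.

```python
import numpy as np

def design_matrix(x, M):
    # Phi[i, j] = phi_j(x_i) = x_i**j for j = 0..M (Vandermonde matrix, shape (N, M+1)).
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=M + 1, increasing=True)

def f_w(x, w):
    # Stacked predictions f_w(x) = Phi w.
    w = np.asarray(w, dtype=float)
    return design_matrix(x, len(w) - 1) @ w
```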

Loss

To measure the performance of the model, a scalar L2 (squared-error) loss is commonly used.

$$E(\mathbb{w})=\sum^N_{i=1}e^2_i=\|\mathbb{e}\|^2=\mathbb{e}^\top\mathbb{e}=\sum^N_{i=1}\big(y_i-f(x_i)\big)^2=(\mathbb{y}-\mathbb{f})^\top(\mathbb{y}-\mathbb{f})$$
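Reusing the hypothetical `design_matrix` helper from above, the loss is a one-liner:

```python
def l2_loss(w, x, y, M):
    # E(w) = sum_i (y_i - f_w(x_i))^2 = e^T e
    e = np.asarray(y, dtype=float) - design_matrix(x, M) @ np.asarray(w, dtype=float)
    return e @ e
```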

Noise

In the ideal noise-free case, \(\mathbb{y}=\mathbb{f}_\mathbb{w}(\mathbb{x})\), and minimizing the loss gives the best-fit parameters,
\begin{align*} E(\mathbb{w}) &= (\mathbb{y}-\Phi\mathbb{w})^\top(\mathbb{y}-\Phi\mathbb{w}) \\ &= \mathbb{y}^\top\mathbb{y}-2\mathbb{y}^\top\Phi\mathbb{w}+\mathbb{w}^\top\Phi^\top\Phi\mathbb{w} \\ \frac{\partial E}{\partial\mathbb{w}} &= 2\Phi^\top\Phi\mathbb{w}-2\Phi^\top\mathbb{y} = 0\\ \mathbb{w}&=(\Phi^\top\Phi)^{-1}\Phi^\top\mathbb{y}\\ \end{align*}
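As a sketch, this normal-equation solution can be computed with a least-squares solver rather than an explicit matrix inverse (numerically safer, same result):

```python
def fit_least_squares(x, y, M):
    # Solves min_w ||y - Phi w||^2, i.e. w = (Phi^T Phi)^{-1} Phi^T y.
    Phi = design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, np.asarray(y, dtype=float), rcond=None)
    return w
```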

However, this can lead to overfitting in a noisy environment. To reduce overfitting, regularization is used to penalize the weights, \(E(\mathbb{w}) = (\mathbb{y}-\Phi\mathbb{w})^\top(\mathbb{y}-\Phi\mathbb{w})+\lambda\mathbb{w}^\top\mathbb{w}\). This choice can be justified by Bayesian inference, as shown below; a code sketch of the regularized fit follows first.
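Minimizing the regularized loss by the same derivation as above gives \(\mathbb{w}=(\Phi^\top\Phi+\lambda\mathbb{I})^{-1}\Phi^\top\mathbb{y}\); a small sketch, again with illustrative names:

```python
def fit_ridge(x, y, M, lam):
    # w = (Phi^T Phi + lam * I)^{-1} Phi^T y
    Phi = design_matrix(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ np.asarray(y, dtype=float))
```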

Posterior Regularization

Bayes' rule states that \(p(a|b)=\frac{p(b|a)p(a)}{p(b)}=\frac{p(a,b)}{p(b)}\). To find the best-fit parameters for the model, we maximize the posterior parameter distribution \(p(\mathbb{w}|\mathbb{x},\mathbb{y},\mathcal{M})\) given the training data,

\begin{align*} p(\mathbb{w}|\mathbb{x},\mathbb{y},\mathcal{M})&= \frac{p(\mathbb{w},\mathbb{x},\mathbb{y}|\mathcal{M})}{p(\mathbb{x},\mathbb{y}|\mathcal{M})}=\frac{p(\mathbb{w})p(\mathbb{x},\mathbb{y}|\mathbb{w})}{p(\mathbb{y|x})p(\mathbb{x})}\\ &= \frac{p(\mathbb{w})p(\mathbb{y|x,w})p(\mathbb{x})}{p(\mathbb{y|x})p(\mathbb{x})} = \frac{\overbrace{p(\mathbb{w}|\mathcal{M})}^{\text{prior}}\;\overbrace{p(\mathbb{y|x,w},\mathcal{M})}^{\text{likelihood}}}{\underbrace{p(\mathbb{y|x},\mathcal{M})}_{\text{marginal likelihood}}} \end{align*}

To decide which model (here, which polynomial degree) is best, the marginal likelihood is used.

\begin{align*} p(\mathbb{y|x},\mathcal{M})=\int p(\mathbb{y,w|x,\mathcal{M}})d\mathbb{w}=\int p(\mathbb{w|x,}\mathcal{M})p(\mathbb{y|x,w,}\mathcal{M})d\mathbb{w} \end{align*}

Once the model is trained and selected, predictions at a new input \(x_*\) are made with the predictive distribution.

\begin{align*} p(\mathbb{y_*|x_*,x,y,\mathcal{M}}) &= \int p(\mathbb{y_*,w|x_*,x,y,\mathcal{M}})d\mathbb{w} \\ &= \int p(\mathbb{y_*|w,x_*,\mathcal{M}})p(\mathbb{w|x,y,\mathcal{M}})d\mathbb{w} \end{align*}

More derivations

For the posterior parameter distribution, assume the noise is normally distributed, \(y=f+\epsilon\) with \(\epsilon\sim\mathcal{N}(0,\sigma^2_n)\), so that the likelihood is \(p(\mathbb{y|x,w},\mathcal{M})=\mathcal{N}(\mathbb{y};\Phi\mathbb{w},\sigma^2_n\mathbb{I})\), and place a Gaussian prior on the weights, \(p(\mathbb{w}|\mathcal{M})=\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\). Since the posterior is proportional to the product of the prior and the likelihood, it is normally distributed too, i.e., \(p(\mathbb{w|x,y},\mathcal{M})=\mathcal{N}(\mathbb{w};\mu,\Sigma)\).

\begin{align*} \text{posterior}&= \mathcal{N}(\mathbb{w};\mu,\Sigma) \propto e^{-\frac{1}{2}\mathbb{w}^\top\Sigma^{-1}\mathbb{w}+\mathbb{w}^\top\Sigma^{-1}\mu}\\ \text{prior}\times\text{likelihood}&=\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\,\mathcal{N}(\mathbb{y};\Phi\mathbb{w},\sigma^2_n\mathbb{I}) \\ &\propto e^{-\frac{1}{2}\left(\sigma^{-2}_w\mathbb{w^\top w}+\sigma^{-2}_n(\mathbb{y}-\Phi\mathbb{w})^\top(\mathbb{y}-\Phi\mathbb{w})\right)} \propto e^{-\frac{1}{2}\mathbb{w}^\top\left(\sigma^{-2}_n\Phi^\top\Phi+\sigma^{-2}_w\mathbb{I}\right)\mathbb{w}+\sigma^{-2}_n\mathbb{w}^\top\Phi^\top\mathbb{y}}\\ \end{align*}

where each proportionality drops factors that do not depend on \(\mathbb{w}\).

Comparing the quadratic and linear terms in \(\mathbb{w}\),

\begin{align*} \Sigma^{-1}&=\sigma^{-2}_n\Phi^\top\Phi+\sigma^{-2}_w\mathbb{I} \\ \mu &= (\Phi^\top\Phi+\frac{\sigma^2_n}{\sigma^{2}_w}\mathbb{I})^{-1}\Phi^\top\mathbb{y}\\ \end{align*}
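A small sketch that computes these posterior moments directly (using \(\mu=\sigma^{-2}_n\Sigma\Phi^\top\mathbb{y}\), which is algebraically the same as the expression above):

```python
def posterior(x, y, M, sigma_n, sigma_w):
    # Sigma^{-1} = sigma_n^{-2} Phi^T Phi + sigma_w^{-2} I,  mu = sigma_n^{-2} Sigma Phi^T y
    Phi = design_matrix(x, M)
    Sigma_inv = Phi.T @ Phi / sigma_n**2 + np.eye(M + 1) / sigma_w**2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ Phi.T @ np.asarray(y, dtype=float) / sigma_n**2
    return mu, Sigma
```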

For the marginal likelihood, since \(\mathbb{y}=\Phi\mathbb{w}+\epsilon\) and, for fixed \(\mathbb{w}\), \(d\mathbb{y}=d\epsilon\),

\begin{align*} &p(\mathbb{y|x},\mathcal{M})=\int p(\mathbb{w|x},\mathcal{M})p(\mathbb{y|x,w},\mathcal{M})d\mathbb{w}=\mathcal{N}(\mathbb{y};\mu_{\mathbb{y|x}},\Sigma_{\mathbb{y|x}}) \\ &\mu_{\mathbb{y|x}} = \int d\mathbb{y}\int d\mathbb{w}\,(\mathbb{y})\,p(\mathbb{w|x},\mathcal{M})p(\mathbb{y|x,w},\mathcal{M})\\ &=\int d\epsilon\int d\mathbb{w}\,(\Phi\mathbb{w}+\epsilon)\,\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\,\mathcal{N}(\epsilon;0,\sigma^2_n\mathbb{I}) = 0 \\ &\Sigma_{\mathbb{y|x}} = \int d\mathbb{y}\int d\mathbb{w}\,(\mathbb{yy^\top})\,p(\mathbb{w|x},\mathcal{M})p(\mathbb{y|x,w},\mathcal{M})\\ &= \int d\epsilon\int d\mathbb{w}\,(\Phi\mathbb{ww^\top}\Phi^\top+\epsilon\epsilon^\top+\Phi\mathbb{w}\epsilon^\top+\epsilon\mathbb{w}^\top\Phi^\top)\,\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\,\mathcal{N}(\epsilon;0,\sigma^2_n\mathbb{I})\\ &= \sigma^2_w\Phi\Phi^\top+\sigma^2_n\mathbb{I} \end{align*}
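Since the marginal likelihood is the Gaussian \(\mathcal{N}(\mathbb{y};0,\sigma^2_w\Phi\Phi^\top+\sigma^2_n\mathbb{I})\), its log density can be evaluated directly and compared across polynomial degrees; a sketch:

```python
def log_marginal_likelihood(x, y, M, sigma_n, sigma_w):
    # log N(y; 0, K) with K = sigma_w^2 Phi Phi^T + sigma_n^2 I
    Phi = design_matrix(x, M)
    y = np.asarray(y, dtype=float)
    K = sigma_w**2 * Phi @ Phi.T + sigma_n**2 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(y) * np.log(2 * np.pi))
```

The degree \(M\) with the largest value would be the preferred model under this criterion.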

For the predictive distribution, \(y_*=\phi_*^\top\mathbb{w}+\epsilon\),

\begin{align*} &p(\mathbb{y_*|x_*,x,y},\mathcal{M}) = \int p(\mathbb{y_*|w,x_*},\mathcal{M})p(\mathbb{w|x,y},\mathcal{M})d\mathbb{w} = \mathcal{N}(y_*;\mu_*,\Sigma_*)\\ &\mu_* = \int dy_*\int d\mathbb{w}\,(y_*)\,p(\mathbb{y_*|w,x_*},\mathcal{M})p(\mathbb{w|x,y},\mathcal{M})\\ &= \int d\epsilon\int d\mathbb{w}\,(\phi_*^\top\mathbb{w}+\epsilon)\,\mathcal{N}(\mathbb{w};\mu,\Sigma)\,\mathcal{N}(\epsilon;0,\sigma^2_n)=\phi_*^\top\mu\\ &\Sigma_* = \int d\epsilon\int d\mathbb{w}\,\big(\phi_*^\top(\mathbb{w}-\mu)(\mathbb{w}-\mu)^\top\phi_*+\epsilon^2+2\phi^\top_*(\mathbb{w}-\mu)\epsilon\big)\,\mathcal{N}(\mathbb{w};\mu,\Sigma)\,\mathcal{N}(\epsilon;0,\sigma^2_n)\\ &= \phi_*^\top\Sigma\phi_*+\sigma^2_n \end{align*}
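A sketch of the predictive mean and variance for a single test input, given the posterior moments from above:

```python
def predict(x_star, mu, Sigma, sigma_n):
    # Predictive mean phi_*^T mu and variance phi_*^T Sigma phi_* + sigma_n^2.
    phi_star = float(x_star) ** np.arange(len(mu))
    mean = phi_star @ mu
    var = phi_star @ Sigma @ phi_star + sigma_n**2
    return mean, var
```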

Therefore, the weights should be estimated not by the maximum likelihood but by the maximum of the posterior: the posterior mean \(\mu\) is exactly the regularized least-squares solution, with \(\lambda=\sigma^2_n/\sigma^2_w\).
