Bayesian Inference


Model

A model is expected to predict a target \(y\) from an input \(x\). In practice this involves fitting (training, i.e., learning from the data), tuning the parameters, and making predictions at deployment.

In mathematical terms, given a dataset \(\mathrm{D} = \{x_i,y_i\}^N_{i=1}=(\mathbb{x},\mathbb{y})\) of \(N\) pairs of inputs \(x_i\) and targets (outputs) \(y_i\), we would like a model \(\mathcal{M}\) with function \(f\) and parameters \(\theta\) that gives accurate predictions.

For example, polynomial regression,

$$f_\mathbb{w}(x) = w_0+w_1x+w_2x^2+\cdots+w_Mx^M$$

can be interpreted as follows: the model \(\mathcal{M}\) is the polynomial regression model, the parameters are \(\theta=(w_0,w_1,\ldots,w_M,M)=(\mathbb{w},M)\), and the prediction \(f_\mathbb{w}(x)\) is used to approximate \(y\).

To be more compact, the function can be rewritten in terms of basis functions, i.e., transformations of the input \(\phi_j(x)=x^j\), so that \(f_\mathbb{w}(x_i)=\sum^M_{j=0}w_j\phi_j(x_i)\). All inputs and outputs can then be stacked in matrix form with \(\Phi_{ij}=\phi_j(x_i)\),

$$\mathbb{f}_\mathbb{w}(\mathbb{x})=\begin{pmatrix}f_\mathbb{w}(x_1)\\f_\mathbb{w}(x_2)\\\vdots\\f_\mathbb{w}(x_N)\end{pmatrix}=\begin{pmatrix}1& \cdots & x_1^M\\\vdots &\ddots & \vdots\\1 & \cdots & x_N^M\end{pmatrix}\begin{pmatrix}w_0\\w_1\\\vdots\\w_M\end{pmatrix}=\Phi\mathbb{w}$$
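To make this concrete, here is a minimal NumPy sketch of the design matrix and the stacked predictions; the helper names `design_matrix` and `f_w` are ours, chosen for illustration only.

```python
import numpy as np

def design_matrix(x, M):
    # Phi[i, j] = phi_j(x_i) = x_i**j for j = 0..M (Vandermonde matrix, shape (N, M+1)).
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=M + 1, increasing=True)

def f_w(x, w):
    # Stacked predictions f_w(x) = Phi w.
    w = np.asarray(w, dtype=float)
    return design_matrix(x, len(w) - 1) @ w
```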

Loss

To measure the performance of the model, a scalar L2 (squared-error) loss is commonly used.

$$E(\mathbb{w})=\sum^N_{i=1}e^2_i=\|\mathbb{e}\|^2=\mathbb{e}^\top\mathbb{e}=\sum^N_{i=1}\big(y_i-f(x_i)\big)^2=(\mathbb{y}-\mathbb{f})^\top(\mathbb{y}-\mathbb{f})$$
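Reusing the hypothetical `design_matrix` helper from above, the loss is a one-liner:

```python
def l2_loss(w, x, y, M):
    # E(w) = sum_i (y_i - f_w(x_i))^2 = e^T e
    e = np.asarray(y, dtype=float) - design_matrix(x, M) @ np.asarray(w, dtype=float)
    return e @ e
```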

Noise

In the ideal noise-free case, \(\mathbb{y}=\mathbb{f}_\mathbb{w}(\mathbb{x})\), and minimizing the loss gives the best-fit parameters,
\begin{align*} E(\mathbb{w}) &= (\mathbb{y}-\Phi\mathbb{w})^\top(\mathbb{y}-\Phi\mathbb{w}) \\ &= \mathbb{y}^\top\mathbb{y}-2\mathbb{y}^\top\Phi\mathbb{w}+\mathbb{w}^\top\Phi^\top\Phi\mathbb{w} \\ \frac{\partial E}{\partial\mathbb{w}} &= 2\Phi^\top\Phi\mathbb{w}-2\Phi^\top\mathbb{y} = 0\\ \mathbb{w}&=(\Phi^\top\Phi)^{-1}\Phi^\top\mathbb{y}\\ \end{align*}
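As a sketch, this normal-equation solution can be computed with a least-squares solver rather than an explicit matrix inverse (numerically safer, same result):

```python
def fit_least_squares(x, y, M):
    # Solves min_w ||y - Phi w||^2, i.e. w = (Phi^T Phi)^{-1} Phi^T y.
    Phi = design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, np.asarray(y, dtype=float), rcond=None)
    return w
```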

However, this can lead to overfitting in a noisy environment. To reduce overfitting, regularization is used to penalize the weights, \(E(\mathbb{w}) = (\mathbb{y}-\Phi\mathbb{w})^\top(\mathbb{y}-\Phi\mathbb{w})+\lambda\mathbb{w}^\top\mathbb{w}\). This choice can be justified by Bayesian inference, as shown below; a code sketch of the regularized fit follows first.
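Minimizing the regularized loss by the same derivation as above gives \(\mathbb{w}=(\Phi^\top\Phi+\lambda\mathbb{I})^{-1}\Phi^\top\mathbb{y}\); a small sketch, again with illustrative names:

```python
def fit_ridge(x, y, M, lam):
    # w = (Phi^T Phi + lam * I)^{-1} Phi^T y
    Phi = design_matrix(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ np.asarray(y, dtype=float))
```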

Posterior Regularization

Bayes' rule states that \(p(a|b)=\frac{p(b|a)p(a)}{p(b)}=\frac{p(a,b)}{p(b)}\). To find the best-fit parameters for the model, we maximize the posterior parameter distribution \(p(\mathbb{w}|\mathbb{x},\mathbb{y},\mathcal{M})\) given the training data,

\begin{align*} p(\mathbb{w}|\mathbb{x},\mathbb{y},\mathcal{M})&= \frac{p(\mathbb{w},\mathbb{x},\mathbb{y}|\mathcal{M})}{p(\mathbb{x},\mathbb{y}|\mathcal{M})}=\frac{p(\mathbb{w})p(\mathbb{x},\mathbb{y}|\mathbb{w})}{p(\mathbb{y|x})p(\mathbb{x})}\\ &= \frac{p(\mathbb{w})p(\mathbb{y|x,w})p(\mathbb{x})}{p(\mathbb{y|x})p(\mathbb{x})} = \frac{\overbrace{p(\mathbb{w}|\mathcal{M})}^{\text{prior}}\;\overbrace{p(\mathbb{y|x,w},\mathcal{M})}^{\text{likelihood}}}{\underbrace{p(\mathbb{y|x},\mathcal{M})}_{\text{marginal likelihood}}} \end{align*}

To decide which model (here, which polynomial degree) is best, the marginal likelihood is used.

\begin{align*} p(\mathbb{y|x},\mathcal{M})=\int p(\mathbb{y,w|x,\mathcal{M}})d\mathbb{w}=\int p(\mathbb{w|x,}\mathcal{M})p(\mathbb{y|x,w,}\mathcal{M})d\mathbb{w} \end{align*}

Once the model is trained and selected, predictions at a new input \(x_*\) are made with the predictive distribution.

\begin{align*} p(\mathbb{y_*|x_*,x,y,\mathcal{M}}) &= \int p(\mathbb{y_*,w|x_*,x,y,\mathcal{M}})d\mathbb{w} \\ &= \int p(\mathbb{y_*|w,x_*,\mathcal{M}})p(\mathbb{w|x,y,\mathcal{M}})d\mathbb{w} \end{align*}

More derivations

For the posterior parameter distribution, assume the noise is normally distributed, \(y=f+\epsilon\) with \(\epsilon\sim\mathcal{N}(0,\sigma^2_n)\), so that the likelihood is \(p(\mathbb{y|x,w},\mathcal{M})=\mathcal{N}(\mathbb{y};\Phi\mathbb{w},\sigma^2_n\mathbb{I})\), and place a Gaussian prior on the weights, \(p(\mathbb{w}|\mathcal{M})=\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\). Since the posterior is proportional to the product of the prior and the likelihood, it is normally distributed too, i.e., \(p(\mathbb{w|x,y},\mathcal{M})=\mathcal{N}(\mathbb{w};\mu,\Sigma)\).

\begin{align*} \text{posterior}&= \mathcal{N}(\mathbb{w};\mu,\Sigma) \propto e^{-\frac{1}{2}\mathbb{w}^\top\Sigma^{-1}\mathbb{w}+\mathbb{w}^\top\Sigma^{-1}\mu}\\ \text{prior}\times\text{likelihood}&=\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\,\mathcal{N}(\mathbb{y};\Phi\mathbb{w},\sigma^2_n\mathbb{I}) \\ &\propto e^{-\frac{1}{2}\left(\sigma^{-2}_w\mathbb{w^\top w}+\sigma^{-2}_n(\mathbb{y}-\Phi\mathbb{w})^\top(\mathbb{y}-\Phi\mathbb{w})\right)} \propto e^{-\frac{1}{2}\mathbb{w}^\top\left(\sigma^{-2}_n\Phi^\top\Phi+\sigma^{-2}_w\mathbb{I}\right)\mathbb{w}+\sigma^{-2}_n\mathbb{w}^\top\Phi^\top\mathbb{y}}\\ \end{align*}

where each proportionality drops factors that do not depend on \(\mathbb{w}\).

Comparing the quadratic and linear terms in \(\mathbb{w}\),

\begin{align*} \Sigma^{-1}&=\sigma^{-2}_n\Phi^\top\Phi+\sigma^{-2}_w\mathbb{I} \\ \mu &= (\Phi^\top\Phi+\frac{\sigma^2_n}{\sigma^{2}_w}\mathbb{I})^{-1}\Phi^\top\mathbb{y}\\ \end{align*}
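A small sketch that computes these posterior moments directly (using \(\mu=\sigma^{-2}_n\Sigma\Phi^\top\mathbb{y}\), which is algebraically the same as the expression above):

```python
def posterior(x, y, M, sigma_n, sigma_w):
    # Sigma^{-1} = sigma_n^{-2} Phi^T Phi + sigma_w^{-2} I,  mu = sigma_n^{-2} Sigma Phi^T y
    Phi = design_matrix(x, M)
    Sigma_inv = Phi.T @ Phi / sigma_n**2 + np.eye(M + 1) / sigma_w**2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ Phi.T @ np.asarray(y, dtype=float) / sigma_n**2
    return mu, Sigma
```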

For the marginal likelihood, since \(\mathbb{y}=\Phi\mathbb{w}+\epsilon\) and, for fixed \(\mathbb{w}\), \(d\mathbb{y}=d\epsilon\),

\begin{align*} &p(\mathbb{y|x},\mathcal{M})=\int p(\mathbb{w|x},\mathcal{M})p(\mathbb{y|x,w},\mathcal{M})d\mathbb{w}=\mathcal{N}(\mathbb{y};\mu_{\mathbb{y|x}},\Sigma_{\mathbb{y|x}}) \\ &\mu_{\mathbb{y|x}} = \int d\mathbb{y}\int d\mathbb{w}\,(\mathbb{y})\,p(\mathbb{w|x},\mathcal{M})p(\mathbb{y|x,w},\mathcal{M})\\ &=\int d\epsilon\int d\mathbb{w}\,(\Phi\mathbb{w}+\epsilon)\,\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\,\mathcal{N}(\epsilon;0,\sigma^2_n\mathbb{I}) = 0 \\ &\Sigma_{\mathbb{y|x}} = \int d\mathbb{y}\int d\mathbb{w}\,(\mathbb{yy^\top})\,p(\mathbb{w|x},\mathcal{M})p(\mathbb{y|x,w},\mathcal{M})\\ &= \int d\epsilon\int d\mathbb{w}\,(\Phi\mathbb{ww^\top}\Phi^\top+\epsilon\epsilon^\top+\Phi\mathbb{w}\epsilon^\top+\epsilon\mathbb{w}^\top\Phi^\top)\,\mathcal{N}(\mathbb{w};0,\sigma^2_w\mathbb{I})\,\mathcal{N}(\epsilon;0,\sigma^2_n\mathbb{I})\\ &= \sigma^2_w\Phi\Phi^\top+\sigma^2_n\mathbb{I} \end{align*}
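Since the marginal likelihood is the Gaussian \(\mathcal{N}(\mathbb{y};0,\sigma^2_w\Phi\Phi^\top+\sigma^2_n\mathbb{I})\), its log density can be evaluated directly and compared across polynomial degrees; a sketch:

```python
def log_marginal_likelihood(x, y, M, sigma_n, sigma_w):
    # log N(y; 0, K) with K = sigma_w^2 Phi Phi^T + sigma_n^2 I
    Phi = design_matrix(x, M)
    y = np.asarray(y, dtype=float)
    K = sigma_w**2 * Phi @ Phi.T + sigma_n**2 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(y) * np.log(2 * np.pi))
```

The degree \(M\) with the largest value would be the preferred model under this criterion.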

For the predictive distribution, \(y_*=\phi_*^\top\mathbb{w}+\epsilon\),

\begin{align*} &p(\mathbb{y_*|x_*,x,y},\mathcal{M}) = \int p(\mathbb{y_*|w,x_*},\mathcal{M})p(\mathbb{w|x,y},\mathcal{M})d\mathbb{w} = \mathcal{N}(y_*;\mu_*,\Sigma_*)\\ &\mu_* = \int dy_*\int d\mathbb{w}\,(y_*)\,p(\mathbb{y_*|w,x_*},\mathcal{M})p(\mathbb{w|x,y},\mathcal{M})\\ &= \int d\epsilon\int d\mathbb{w}\,(\phi_*^\top\mathbb{w}+\epsilon)\,\mathcal{N}(\mathbb{w};\mu,\Sigma)\,\mathcal{N}(\epsilon;0,\sigma^2_n)=\phi_*^\top\mu\\ &\Sigma_* = \int d\epsilon\int d\mathbb{w}\,\big(\phi_*^\top(\mathbb{w}-\mu)(\mathbb{w}-\mu)^\top\phi_*+\epsilon^2+2\phi^\top_*(\mathbb{w}-\mu)\epsilon\big)\,\mathcal{N}(\mathbb{w};\mu,\Sigma)\,\mathcal{N}(\epsilon;0,\sigma^2_n)\\ &= \phi_*^\top\Sigma\phi_*+\sigma^2_n \end{align*}
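A sketch of the predictive mean and variance for a single test input, given the posterior moments from above:

```python
def predict(x_star, mu, Sigma, sigma_n):
    # Predictive mean phi_*^T mu and variance phi_*^T Sigma phi_* + sigma_n^2.
    phi_star = float(x_star) ** np.arange(len(mu))
    mean = phi_star @ mu
    var = phi_star @ Sigma @ phi_star + sigma_n**2
    return mean, var
```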

Therefore, the weights should be estimated not by the maximum likelihood but by the maximum of the posterior: the posterior mean \(\mu\) is exactly the regularized least-squares solution, with \(\lambda=\sigma^2_n/\sigma^2_w\).
