Introduction to Gaussian Processes


Bayesian Inference

In the weight-space view, the target is modeled as \(\mathbf{y}=\mathbf{f}+\epsilon=\Phi^\top\mathbf{w}+\epsilon\), where \(\mathbf{w}\sim\mathcal{N}(\mathbf{0},\Sigma_p)\) and \(\epsilon\sim\mathcal{N}(\mathbf{0},\sigma^2_n\mathbb{I})\). The posterior over the weights is given by,

$$p(\mathbf{w}|\mathbf{x},\mathbf{y},\mathcal{M})=\mathcal{N}(\mathbf{w};\ \sigma^{-2}_n(\sigma^{-2}_n\Phi\Phi^\top+\Sigma_p^{-1})^{-1}\Phi\mathbf{y},\ (\sigma_n^{-2}\Phi\Phi^\top+\Sigma_p^{-1})^{-1}) \\ = \mathcal{N}(\mathbf{w};\ \sigma^{-2}_nA^{-1}\Phi\mathbf{y},\ A^{-1})$$

where \(A=\sigma^{-2}_n\Phi\Phi^\top+\Sigma_p^{-1}\).
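As a sanity check, the posterior can be computed directly. Below is a minimal NumPy sketch; the quadratic feature map, the prior \(\Sigma_p=\mathbb{I}\), and all data values are illustrative assumptions, not part of the derivation.

```python
import numpy as np

# Weight-space model from the text: y = Phi^T w + eps,
# with w ~ N(0, Sigma_p) and eps ~ N(0, sigma_n^2 I).
rng = np.random.default_rng(0)

def phi(x):
    # Assumed feature map phi(x) = [1, x, x^2]; columns of Phi are phi(x_i).
    return np.vstack([np.ones_like(x), x, x**2])

x = rng.uniform(-3.0, 3.0, size=20)
w_true = np.array([1.0, -0.5, 0.25])           # illustrative ground truth
sigma_n = 0.3
Phi = phi(x)                                   # shape (d, n)
y = Phi.T @ w_true + sigma_n * rng.standard_normal(x.shape)

Sigma_p = np.eye(3)                            # assumed prior covariance
A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_cov = np.linalg.inv(A)                       # posterior covariance A^{-1}
w_mean = w_cov @ Phi @ y / sigma_n**2          # posterior mean sigma_n^{-2} A^{-1} Phi y
```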

The marginal likelihood is written as,

$$p(\mathbf{y}|\mathbf{x},\mathcal{M})=\mathcal{N}(\mathbf{y};\ \mathbf{0},\ \Phi^\top\Sigma_p\Phi+\sigma^2_n\mathbb{I})=\mathcal{N}(\mathbf{y};\ \mathbf{0},\ K+\sigma^2_n\mathbb{I})$$

where \(K=\Phi^\top\Sigma_p\Phi\) is the Gram matrix of the training inputs under the prior.
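In the sketch above, the log marginal likelihood is just a zero-mean Gaussian density evaluated at \(\mathbf{y}\); `scipy.stats.multivariate_normal` is one way to evaluate it.

```python
from scipy.stats import multivariate_normal

# Log marginal likelihood N(y; 0, K + sigma_n^2 I), with K = Phi^T Sigma_p Phi.
K = Phi.T @ Sigma_p @ Phi
log_ml = multivariate_normal.logpdf(
    y, mean=np.zeros(len(y)), cov=K + sigma_n**2 * np.eye(len(y))
)
```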

The predictive distribution is derived as follows,

$$p(\mathbf{y_*}|\mathbf{x_*},\mathbf{x},\mathbf{y},\mathcal{M})=\mathcal{N}(\mathbf{y_*};\ \sigma^{-2}_n\phi_*^\top(\sigma^{-2}_n\Phi\Phi^\top+\Sigma_p^{-1})^{-1}\Phi\mathbf{y},\ \phi^\top_*(\sigma^{-2}_n\Phi\Phi^\top+\Sigma_p^{-1})^{-1}\phi_*+\sigma^2_n\mathbb{I}) \\ = \mathcal{N}(\mathbf{y_*};\ \sigma^{-2}_n\phi_*^\top A^{-1}\Phi\mathbf{y},\ \phi^\top_*A^{-1}\phi_*+\sigma^2_n\mathbb{I})$$
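Continuing the sketch, the predictive mean and covariance at test inputs follow directly from the posterior over \(\mathbf{w}\):

```python
# Predictive distribution at assumed test inputs x_*.
x_star = np.linspace(-3.0, 3.0, 5)
Phi_star = phi(x_star)                          # shape (d, n_*)

pred_mean = Phi_star.T @ w_mean                 # sigma_n^{-2} Phi_*^T A^{-1} Phi y
pred_cov = Phi_star.T @ w_cov @ Phi_star + sigma_n**2 * np.eye(len(x_star))
```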

Using the definitions of \(A\) and \(K\), the posterior and the predictive distribution can be rewritten. First,

\begin{align*} A&=\sigma^{-2}_n\Phi\Phi^\top+\Sigma_p^{-1} \\ A\Sigma_p &= \sigma^{-2}_n\Phi\Phi^\top\Sigma_p+\mathbb{I}\\ A\Sigma_p\Phi &= \sigma^{-2}_n\Phi\Phi^\top\Sigma_p\Phi+\Phi = \sigma^{-2}_n\Phi(K+\sigma^2_n\mathbb{I})\\ \Sigma_p\Phi(K+\sigma^2_n\mathbb{I})^{-1} &= \sigma^{-2}_nA^{-1}\Phi \end{align*}

and by the Woodbury identity from The Matrix Cookbook, \((Z+UCU^\top)^{-1}=Z^{-1}-Z^{-1}U(C^{-1}+U^\top Z^{-1}U)^{-1}U^\top Z^{-1}\) (stated here with \(Z,U,C\) to avoid clashing with the \(A\) defined above),

\begin{align*} A^{-1}&=(\Sigma_p^{-1}+\Phi\,\sigma^{-2}_n\mathbb{I}\,\Phi^\top)^{-1} \\ &= \Sigma_p-\Sigma_p\Phi(\sigma_n^2\mathbb{I}+\Phi^\top\Sigma_p\Phi)^{-1}\Phi^\top\Sigma_p \\ &= \Sigma_p-\Sigma_p\Phi(K+\sigma_n^2\mathbb{I})^{-1}\Phi^\top\Sigma_p \end{align*}
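The identity is easy to verify numerically on the toy quantities from the earlier sketch; a quick check along these lines can catch transcription errors in the algebra.

```python
# Numerical check: inverting A directly agrees with the Woodbury form.
lhs = np.linalg.inv(Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p))
rhs = Sigma_p - Sigma_p @ Phi @ np.linalg.inv(
    K + sigma_n**2 * np.eye(Phi.shape[1])
) @ Phi.T @ Sigma_p
assert np.allclose(lhs, rhs)
```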

Therefore, the posterior is given by,

\begin{align*} p(\mathbf{w}|\mathbf{x},\mathbf{y},\mathcal{M}) = \mathcal{N}(\mathbf{w};\ \Sigma_p\Phi(K+\sigma^2_n\mathbb{I})^{-1}\mathbf{y},\ \Sigma_p-\Sigma_p\Phi(K+\sigma_n^2\mathbb{I})^{-1}\Phi^\top\Sigma_p) \end{align*}

The predictive becomes,

\begin{align*} p(\mathbf{y_*}|\mathbf{x_*},\mathbf{x},\mathbf{y},\mathcal{M})&=\mathcal{N}(\mathbf{y_*};\ \phi_*^\top\Sigma_p\Phi(K+\sigma^2_n\mathbb{I})^{-1}\mathbf{y},\ \phi_*^\top\Sigma_p\phi_*-\phi_*^\top\Sigma_p\Phi(K+\sigma^2_n\mathbb{I})^{-1}\Phi^\top\Sigma_p\phi_*+\sigma^2_n\mathbb{I}) \\ &=\mathcal{N}(\mathbf{y_*};\ K(\mathbf{x_*},\mathbf{x})(K+\sigma^2_n\mathbb{I})^{-1}\mathbf{y},\ K(\mathbf{x_*},\mathbf{x_*})+\sigma^2_n\mathbb{I}-K(\mathbf{x_*},\mathbf{x})(K+\sigma^2_n\mathbb{I})^{-1}K(\mathbf{x},\mathbf{x_*})) \end{align*}
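This is the key payoff: the prediction touches the features only through inner products \(\phi(x)^\top\Sigma_p\phi(x')\), so the weight vector never needs to be represented explicitly. The kernelized form below reproduces the weight-space predictive from the earlier sketch.

```python
# Kernelized predictive: everything is expressed via k(x, x') = phi(x)^T Sigma_p phi(x').
K_sx = Phi_star.T @ Sigma_p @ Phi               # K(x_*, x)
K_ss = Phi_star.T @ Sigma_p @ Phi_star          # K(x_*, x_*)
G = np.linalg.inv(K + sigma_n**2 * np.eye(len(x)))

gp_mean = K_sx @ G @ y
gp_cov = K_ss + sigma_n**2 * np.eye(len(x_star)) - K_sx @ G @ K_sx.T

# Agrees with the weight-space predictive computed earlier.
assert np.allclose(gp_mean, pred_mean) and np.allclose(gp_cov, pred_cov)
```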

Function Space View

Among all the expressions above, only the marginal likelihood depends on the inputs purely through the Gram matrix \(K\), which gives the first hint of a view over function space rather than weight space. The three distributions then divide the work: model selection is driven by the marginal likelihood \(p(\mathbf{y}|\mathbf{x},\mathcal{M})\), the parameters are tuned by maximizing the posterior \(p(\mathbf{w}|\mathbf{x},\mathbf{y},\mathcal{M})\), and predictions are made with \(p(\mathbf{y_*}|\mathbf{x_*},\mathbf{x},\mathbf{y},\mathcal{M})\).
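As an illustration of marginal-likelihood-based model selection, the following hypothetical comparison scores a degree-1 feature map against the degree-2 map used throughout the sketches; the specific competing models are assumptions for illustration.

```python
# Compare models M by log marginal likelihood p(y | x, M).
def log_marginal(Phi_m, Sigma_m):
    cov = Phi_m.T @ Sigma_m @ Phi_m + sigma_n**2 * np.eye(len(y))
    return multivariate_normal.logpdf(y, mean=np.zeros(len(y)), cov=cov)

Phi_lin = np.vstack([np.ones_like(x), x])       # competing linear model
print(log_marginal(Phi_lin, np.eye(2)))         # degree-1
print(log_marginal(Phi, Sigma_p))               # degree-2 (should score higher here)
```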

References


  1. K. B. Petersen and M. S. Pedersen, The Matrix Cookbook.
  2. C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.