Collaborative Filtering


We can further design a score for ranking personalized for each client.

Originally, a score for each item j is defined as \(s_j\), where there are M items.

$$s_j=\frac{1}{|\Omega_j|}\sum_{i\in\Omega_j}r_{ij}$$

where \(\Omega_j\) set of all users who rated item j, and \(r_{ij}\) rating user i gave item j.

To personalize the score, it is rewritten as, no modification at the moment, just introduce one more index for s,

$$s_{ij}=\frac{1}{|\Omega_j|}\sum_{i'\in\Omega_j}r_{i'j}=\hat{r}_{ij}$$

where \(\hat{r}\) is the prediction, \(r\) is the target ground truth.

Now define \(R_{N\times M}\) is an user-item ratings matrix of size N\(\times\)M. It resembles the word-document matrix, thats why it is closely related to NLP. We hope to do matrix factorization, to separate this matrix into user matrix and item matrix. You have to also concern about sparsity of the matrix.

It is a regression problem, therefore, MSE loss is used.

$$MSE = \frac{1}{\Omega}\sum_{i,j\in\Omega}(r_{ij}-\hat{r}_{ij})^2$$

User-User Collaborative Filtering

The idea behind collaborative filtering is to discover the correlation between users' preferences. To harness the power of the correlated preferences, the algorithm is designed to weight the rating more if similar user has similar common preferences.

$$s_{ij}=\frac{1}{\sum_{i'\in\Omega_j}w_{ii'}}\sum_{i'\in\Omega_j}w_{ii'}r_{i'j}$$

where \(w_{ii'}\) the weight between user i and i' indicating the correlation. Another issue with average rating is that users can be biased, like some are more optimistic giving good movies 5 bad movies 3, but pessimistic good for bad 1.

This can be resolved by deviation, this can reduce the effect of bias,

$$\delta_{ij}=r_{ij}-\bar{r}_i$$

So that the second equation can be rewritten as,

\begin{align*} s_{ij}&=\frac{1}{|\Omega_j|}\sum_{i'\in\Omega_j}r_{i'j}=\bar{r}_i+\frac{1}{|\Omega_j|}\sum_{i'\in\Omega_j}(r_{i'j}-\bar{r}_{i'})=\bar{r}_i+\hat{\delta}_{ij}\\ \end{align*}

Combine with the weight sum,

$$s_{ij}=\bar{r}_i+\frac{1}{\sum_{i'\in\Omega_j}|w_{ii'}|}\sum_{i'\in\Omega_j}w_{ii'}(r_{i'j}-\bar{r}_{i'})$$
Weights

Pearson correlation coefficient is given by,

$$\varrho_{xy}=\frac{\sum^N_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum^N_{i=1}(x_i-\bar{x})^2\sum^N_{i=1}(y_i-\bar{y})^2}}$$

Cosine similarity is given by,

$$\cos\theta = \frac{x^\top y}{|x||y|} = \frac{\sum^N_{i=1}x_iy_i}{\sqrt{\sum^N_{i=1}x_i^2}\sqrt{\sum^N_{i=1}y_i^2}}$$

Either one can be used as weight calculation, they are very similar since pearson correlation is just a centered version of cosine similarity. Therefore,

$$w_{ii'}=\frac{\sum_{j\in\Psi_{ii'}}(r_{ij}-\bar{r}_i)(r_{i'j}-\bar{r}_{i'})}{\sqrt{\sum_{j\in\Psi_{ii'}}(r_{ij}-\bar{r}_i)^2}\sqrt{\sum_{j\in\Psi_{ii'}}(r_{i'j}-\bar{r}_{i'})^2}}$$

where \(\Psi_i\) set of movies that user i has rated, \(\Psi_{ii'}\) set of movies both users have rated and \(\Psi_{ii'}=\Psi_i\cap\Psi_{i'}\). For some implementation considerations, if there are no common movies, those two users are not considered and counted into the weight. Also, can set a threshold to limit the number of neigborhoods for weighting since it can reduce the computing time. We can use KNN in this case. At the end, the personalized score is computed \(s_{ij}\).


Item-Item Collaborative Filtering

The idea of this filtering is to harness the correlation between items, instead of users. Item Correlation is formulated by.

$$w_{jj'}=\frac{\sum_{i\in\Omega_{jj'}}(r_{ij}-\bar{r}_j)(r_{ij'}-\bar{r}_{j'})}{\sqrt{\sum_{i\in\Omega_{hh'}}(r_{ij}-\bar{r}_j)^2}\sqrt{\sum_{i\in\Omega_{jj'}}(r_{ij'}-\bar{r}_{j'})^2}}$$

where \(\Omega_j\) users who rated item j, \(\Omega_{jj'}\) users who rated item j and item j', and \(\bar{r}_j\) average rating for item j.

Item score is written as,

$$s_{ij} = \bar{r}_j+\frac{1}{\sum_{j'\in\Psi_i}|w_{jj'}|}\sum_{j'\in\Psi_i}w_{jj'}(r_{ij'}-\bar{r}_{j'})$$

where \(\Psi_i\) items user i has rated.

UUCF: choose items for an user because those items have been liked by similar users. IICF: choose items because the user has liked similar items in the past.

References