implemented rating cosine similarity

zenogantner / MyMediaLite

recommender system library for the CLR (.NET)

http://mymedialite.net

499 stars 192 forks source link

implemented rating cosine similarity #453

Closed NicolasHug closed 8 years ago

NicolasHug commented 8 years ago

Hi,

I needed it so I implemented rating cosine similarity. The code is strictly based on what was done with the Pearson correlation.

Nicolas

zenogantner commented 8 years ago

Thank you for your PR. Can you look at line 29 and 30 in RatingCosine? Also, did you perform any measurements in terms of runtime and prediction quality? In particular, it would be interesting to see whether it is stronger than Pearson in some cases. Otherwise I am not sure whether it makes sense to add it.

NicolasHug commented 8 years ago

Thanks for the feedback! Sorry about the indentation issue, I fixed it. I tested it against the 5-fold cross validation set of the Movielens 100k dataset, and performances are not significantly different:

computation time is a slightly in favor of RatingCosine but by almost nothing. Training time is around 01.15s for RatingCosine and 01.50s for Pearson.
prediction accuracy is about the same as well. The mean RMSE is .937 for RatingCosine and .934 for Pearson.

I can't think of any usecase where RatingCosine would perform much better than Pearson, but it's still one of the most widely used similarity measures as far as I know (and it was marked as issue #304 so I gave it a shot :) )

Nicolas

zenogantner commented 8 years ago

You are right, there is still an issue -- I've lost sight of all those tickets ;-)

zenogantner commented 8 years ago

The thing is, except for the centering the two are essentially equal. One option would be to call the whole thing RatingCosine and add a centering option.

What do you think?

I am not quite sure yet, will sleep over it.

Let me know in case you have suggestions.

NicolasHug commented 8 years ago

It's true that they're essentially the same. Merging the two by adding a centering option would be easy, but I'm afraid using Centered RatingCosine instead of Pearson might confuse unseasoned users.

As the only real difference is the computation of the numerator and denominator, another option to keep names different and base code identical would be to subclass both from an abstract mother class with virtual methods GetNum() and GetDenom()... But that would probably be overkill?

zenogantner commented 8 years ago

GetNumerator() and GetDenominator() are a good idea. Maybe not an abstract superclass, but RatingCosine, and Pearson is then a sub-class that just overrides those methods.

As I look at the code right now, there may be some more room for making it more compact. But that can be addressed later.

NicolasHug commented 8 years ago

That would look better indeed. Maybe subclassing RatingCosine from Pearson would make more sense? Variables i_sum and j_sum are only used in Pearson so they would be computed in RatingCosine with no obvious reason.

zenogantner commented 8 years ago

Makes sense.

zenogantner commented 8 years ago

On the other hand, if we want to share the rest of the code, the variables will be computed anyway.

zenogantner commented 8 years ago

I will merge this and then see, okay?

NicolasHug commented 8 years ago

Yes they'll be computed anyway indeed, but that's not too costly I believe.

Ok then, thanks!