Closed NicolasHug closed 8 years ago
Thank you for your PR. Can you look at line 29 and 30 in RatingCosine? Also, did you perform any measurements in terms of runtime and prediction quality? In particular, it would be interesting to see whether it is stronger than Pearson in some cases. Otherwise I am not sure whether it makes sense to add it.
Thanks for the feedback! Sorry about the indentation issue, I fixed it. I tested it against the 5-fold cross validation set of the Movielens 100k dataset, and performances are not significantly different:
I can't think of any usecase where RatingCosine would perform much better than Pearson, but it's still one of the most widely used similarity measures as far as I know (and it was marked as issue #304 so I gave it a shot :) )
Nicolas
You are right, there is still an issue -- I've lost sight of all those tickets ;-)
The thing is, except for the centering the two are essentially equal. One option would be to call the whole thing RatingCosine and add a centering option.
What do you think?
I am not quite sure yet, will sleep over it.
Let me know in case you have suggestions.
It's true that they're essentially the same. Merging the two by adding a centering option would be easy, but I'm afraid using Centered RatingCosine
instead of Pearson
might confuse unseasoned users.
As the only real difference is the computation of the numerator
and denominator
, another option to keep names different and base code identical would be to subclass both from an abstract mother class with virtual methods GetNum()
and GetDenom()
... But that would probably be overkill?
GetNumerator() and GetDenominator() are a good idea. Maybe not an abstract superclass, but RatingCosine, and Pearson is then a sub-class that just overrides those methods.
As I look at the code right now, there may be some more room for making it more compact. But that can be addressed later.
That would look better indeed.
Maybe subclassing RatingCosine from Pearson would make more sense? Variables i_sum
and j_sum
are only used in Pearson so they would be computed in RatingCosine with no obvious reason.
Makes sense.
On the other hand, if we want to share the rest of the code, the variables will be computed anyway.
I will merge this and then see, okay?
Yes they'll be computed anyway indeed, but that's not too costly I believe.
Ok then, thanks!
Hi,
I needed it so I implemented rating cosine similarity. The code is strictly based on what was done with the Pearson correlation.
Nicolas