zhengruifeng / spark-libFM

An implementation of Factorization Machines (LibFM)
Apache License 2.0
248 stars, 119 forks

issue found with cross term evaluation #1

Open kgierach opened 9 years ago

kgierach commented 9 years ago

Hi Ruifeng,

I generated a dataset that contains specific cross-term relationships and ran it through your FM engine. With Rendle’s libfm project I get the expected results, but with your library I get only noise, or rather false positives. Can you provide some usage tips, or would you be willing to run through the dataset and look for possible errors in the code?

I am using the Gradient Descent optimization algorithm.

All individual weights come out as zero, and I have tried different values for the learning rate as well. However, I’m not actually concerned with the individual weights; I consider this a symptom of the underlying problem.

My concern is to locate outlier cross terms in the data. With your library I don’t get a single expected cross term, but with libfm I get all of the expected cross terms and only a handful of false positives.

Here’s my expected cross-term list:
1,7
3,9
5,10
6,12
14,15
16,17
19,20

My method of finding the cross terms is to:

1. Take the output matrix of the model, F.
2. Compute C = F * F_transpose.
3. Use C to look up the terms of interest by striping by row: compute the mean and variance of each row, assume a normal distribution, and look for the upper terms exceeding a threshold.
4. If no terms are found, gradually decrease the threshold down to a certain point, until I either find some "outliers" or find none.
5. Examine the list of "outliers" for my cross terms of interest. I don't care about the order within a pair.

Let me restate that the same method works with Rendle’s libfm engine. I have also tried replicating the algorithm parameters that I used with his library when running your code in Spark.
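
For reference, the procedure described above can be sketched roughly as follows in NumPy. This is only an illustration under several assumptions that are not stated in the post: F is available as a dense (n_features x k) array, the threshold is expressed in standard deviations above the row mean, the diagonal of C is excluded from the row statistics, and feature indices are reported 1-based. The function name and the parameters z_start, z_min and z_step are hypothetical.

```python
import numpy as np

def find_cross_term_outliers(F, z_start=3.0, z_min=1.0, z_step=0.25):
    """Flag unordered feature pairs whose interaction weight is unusually high.

    F is assumed to be the (n_features, k) factor matrix of the model.
    Returns (pairs, z): the set of 1-based (i, j) pairs with i < j, and the
    threshold (in standard deviations) at which they were found.
    """
    F = np.asarray(F, dtype=float)
    C = F @ F.T                          # C[i, j] = <v_i, v_j>
    n = C.shape[0]
    off_diag = ~np.eye(n, dtype=bool)    # assumption: ignore self-interactions

    z = z_start
    while z >= z_min:
        found = set()
        for i in range(n):
            row = C[i, off_diag[i]]      # row i without the diagonal entry
            mu, sigma = row.mean(), row.std()
            if sigma == 0:
                continue                 # constant row: nothing stands out
            for j in range(n):
                if j != i and C[i, j] > mu + z * sigma:
                    # store the pair without regard to order, 1-based
                    found.add((min(i, j) + 1, max(i, j) + 1))
        if found:
            return found, z
        z -= z_step                      # nothing found: relax the threshold
    return set(), z_min
```

The returned pairs can then be compared against the expected list regardless of order, for example with `expected <= pairs` when both are sets of (smaller, larger) tuples.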

Thank you, Karl