zjysteven / mink-plus-plus

Min-K%++: Improved baseline for detecting pre-training data of LLMs https://arxiv.org/abs/2404.02936
https://zjysteven.github.io/mink-plus-plus/
MIT License

Lower performance than paper #2

Closed tongwu2020 closed 3 months ago

tongwu2020 commented 3 months ago

Hi authors,

Congrats on this great work. I tried to run your code with "python run.py --model meta-llama/Llama-2-13b-hf", and I got the following:

method       auroc  fpr95  tpr05
loss         54.9%  91.5%   3.9%
zlib         56.1%  89.2%   5.9%
mink_0.1     51.6%  92.8%   2.3%
mink_0.2     52.4%  93.6%   4.7%
mink_0.3     53.5%  92.8%   4.4%
mink_0.4     54.1%  92.0%   4.1%
mink_0.5     54.5%  91.5%   3.9%
mink_0.6     54.7%  91.0%   3.9%
mink_0.7     54.8%  90.7%   3.9%
mink_0.8     54.9%  91.3%   3.9%
mink_0.9     54.8%  92.3%   3.9%
mink_1.0     54.9%  91.5%   3.9%
mink++_0.1   60.8%  87.4%   6.2%
mink++_0.2   61.6%  84.1%   6.5%
mink++_0.3   61.5%  84.8%   5.4%
mink++_0.4   61.7%  83.5%   4.7%
mink++_0.5   61.5%  85.3%   5.4%
mink++_0.6   61.5%  85.9%   6.5%
mink++_0.7   61.7%  84.3%   7.2%
mink++_0.8   61.8%  85.3%   6.2%
mink++_0.9   61.7%  85.6%   5.2%
mink++_1.0   60.8%  84.6%   6.2%

In the paper, the AUROC is more than 80%. I am not sure if I did something wrong. Thank you.
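(For context on what the mink_k and mink++_k rows measure: Min-K% averages the k% lowest token log-probabilities of a text, while Min-K%++ first normalizes each token log-probability by the mean and standard deviation of the next-token log-probabilities over the vocabulary, as described in the paper. Below is a minimal PyTorch sketch of both scores for illustration only; the function name, argument shapes, and ratio argument are assumptions and this is not the repo's run.py implementation.)

```python
import torch

def mink_scores(logits, input_ids, ratio=0.2):
    """Illustrative sketch of Min-K% and Min-K%++ sequence scores.

    logits:    [seq_len, vocab] next-token logits from the model (assumed shape)
    input_ids: [seq_len] token ids; logits[t] predicts input_ids[t + 1]
    ratio:     the k in Min-K% (e.g. 0.2 corresponds to mink_0.2 / mink++_0.2)
    """
    logits = logits[:-1]                      # drop last position (no target token)
    targets = input_ids[1:]                   # shift targets by one
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Min-K%: mean of the lowest-k% token log-probabilities
    k = max(1, int(ratio * token_lp.numel()))
    mink = token_lp.topk(k, largest=False).values.mean()

    # Min-K%++: normalize each token log-probability by the mean/std of
    # log-probabilities over the vocabulary at that position, then average
    # the lowest k% of the normalized scores
    probs = log_probs.exp()
    mu = (probs * log_probs).sum(-1)
    sigma = ((probs * log_probs.square()).sum(-1) - mu.square()).clamp(min=1e-8).sqrt()
    token_pp = (token_lp - mu) / sigma
    minkpp = token_pp.topk(k, largest=False).values.mean()

    return mink.item(), minkpp.item()
```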

zjysteven commented 3 months ago

Hello,

Thank you for your interest. The paper reports results for LLaMA rather than LLaMA-2; the HF paths of the evaluated LLaMA models are huggyllama/llama-13b, huggyllama/llama-30b, and huggyllama/llama-65b. We didn't try LLaMA-2 mainly because its training-data ground truth is much less clear than LLaMA's (although Wikipedia dumps are likely included).
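(For example, using the same command format as above, the evaluation can presumably be pointed at these paths directly, e.g. "python run.py --model huggyllama/llama-13b", and likewise for the 30B and 65B checkpoints.)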

The results you posted for LLaMA-2 look reasonable to me, and I don't think there's anything "wrong". The exact reason why the method achieves 70-80%+ AUROC on LLaMA but only ~60% on LLaMA-2 is hard to diagnose: the ratio of different sources in the training data mixture, the training setup, and many other factors can all affect the result.
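(For reference, the auroc / fpr95 / tpr05 columns above are standard membership-inference metrics: area under the ROC curve, false-positive rate at 95% true-positive rate, and true-positive rate at 5% false-positive rate. A minimal scikit-learn sketch of how they can be computed from per-example scores follows; the function and variable names are illustrative, it assumes higher scores indicate members, and it is not necessarily how run.py computes them.)

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def mia_metrics(member_scores, nonmember_scores):
    """AUROC, FPR@95%TPR, and TPR@5%FPR for a membership-inference attack.

    Assumes higher scores mean "more likely part of the pre-training data".
    """
    scores = np.concatenate([member_scores, nonmember_scores])
    labels = np.concatenate([np.ones(len(member_scores)),
                             np.zeros(len(nonmember_scores))])

    auroc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = fpr[np.argmax(tpr >= 0.95)]   # FPR at the first threshold reaching 95% TPR
    tpr05 = tpr[np.sum(fpr < 0.05) - 1]   # TPR at the largest FPR below 5%
    return auroc, fpr95, tpr05
```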

tongwu2020 commented 3 months ago

Thanks for your quick reply. Really appreciate it.

zjysteven commented 3 months ago

In case it's needed, I've added the HF paths of all evaluated models to the README.

tongwu2020 commented 3 months ago

Yes, I think that would be really helpful. 👍