Repro Difficulties - Githubissues

mgwillia commented 3 years ago

Hi,

I've been trying to reproduce your results as part of something I'm working on. Some have turned out almost identical (CIFAR-10), but I've had a difficultly with CIFAR-20 and STL-10. Basically, I'm unable to exactly match the reported result for SCAN, and for the self-label step it actually causes results to get slightly worse. I've tried to double-check settings; I had originally adjusted things like batch size but I reset those for the most part and there is still some discrepancy.

Note that I'm training SimCLR from scratch using the code and configs in this repository.

Do you have any insight as to what may be the cause of this?

Also, for the ImageNet-50, ImageNet-100, and ImageNet-200 experiments, are you using a MOCO that was trained on all of ImageNet, both train and val? I am not able to come close to your results, but that would make the difference (I had trained a SimCLR on the train portion only of each subset).

wvangansbeke commented 3 years ago

Hi @mgwillia,

Thank you for your interest.

We reported the average across 10 runs in the paper. Last time I ran it I got 80% on STL10 (check the log file). When you run SimCLR on CIFAR100, there might be some variation, because the train set is quite small (per class) compared to CIFAR10. The results were obtained with the config files in this repo. From my experience the variance is quite low though. The configuration is important, so make sure it's the same.

Not sure what the issue is with ImageNet. We simply used the checkpoint from the MoCo repo. Please check moco.py. Only the trainset is used in order to have fair comparisons with semi-supervised learning in the paper. Results are also averaged over multiple splits (see paper) and from our experience very robust. You will sometimes even get better results here as well. So my guess is that you're not using the correct pretrained weights. FYI, I've recently seen other papers reporting very similar numbers (using our code base) so I'm sure that they are correct.

mgwillia commented 3 years ago

Just a disclaimer: I'm not doubting the correctness of the numbers, more so a potential shortcoming in my own understanding or implementation.

STL10: My main question was why the self-label step would cause performance to degrade. My SimCLR results for STL are actually better than the checkpoint here (and the trainset there isn't small, since you use train+val), and my SCAN results are within the standard deviation (but on the lower end). It's the self-label that isn't matching, and I've checked and confirmed the configs match exactly. Did you ever observe selflabel degrade instead of improve performance?

ImageNet: When you say "we simply used the checkpoint from the MoCo repo" does that mean that the same MoCo backbone is used for every subset, rather than a separate backbone only trained on data for that subset? This is what I would assume, and it would explain the numbers reported, but it is not clear just from reading the paper or the code.

wvangansbeke commented 3 years ago

Yes for STL-10 it can drop a little during self-labeling because the train set is small (only 500 images). However, it should be close to on par from what I can see. For CIFAR20 the ground truth classes are quite ambiguous, so I would not focus on this dataset too much. ImageNet and STL-10 are more reliable.

ImageNet: See line 65 in moco.py for the link to the used pretrained weights. This is the model from MoCo trained on the ImageNet trainset for 800 epochs.

I'm just trying to help. Good luck and I hope it's clear now.

mgwillia commented 3 years ago

STL-10: I see a pretty consistent 3-4% drop across 3 different trials for STL going from SCAN to self-label. I guess the intuition would be that self-labeling can fail because the confident predictions are too noisy? I'm a little surprised and wish I knew more about how you were getting it to increase sometimes.

ImageNet: You may want to consider clarifying this detail in the paper; while it became clear to me once I started running things myself, it is easy for the reader to assume that ImageNet50 was treated as a first-class dataset, and thus had it's own SSL backbone trained when in reality it piggy-backs off of ImageNet1k.

wvangansbeke commented 3 years ago

STL10: This is plausible. Since you might have a few (over)confident samples for a certain class, it can cause the model to not generalize very well for that particular class. If you have more data and more prototypical examples, this will be less of an issue. How to solve this? Try fixing the backbone weights and using multiple heads (see ImageNet configs). This should prevent degradation and should give slight improvements. However, this makes it more complicated as well.

ImageNet: I'm happy it's clear now. While I can see your point, it should be clear from the comments in the code and from the README. Paper was published a year ago, so I can't make any changes in the paper. Keep in mind that we provide the results on the complete ImageNet dataset (1000 classes) as well. Anyway, I might add a note to the README to clarify your doubts. Thank you for the feedback.

mgwillia commented 3 years ago

Thanks for answering my questions, and I really appreciate this codebase. It has been, for the most part, very easy to read, understand, navigate, and extend.

wvangansbeke / Unsupervised-Classification

Repro Difficulties #69