xxxnell / how-do-vits-work

(ICLR 2022 Spotlight) Official PyTorch implementation of "How Do Vision Transformers Work?"
https://arxiv.org/abs/2202.06709
Apache License 2.0
798 stars 77 forks

how is robustness calculated? #5

Closed psteinb closed 2 years ago

psteinb commented 2 years ago

Hi,

thank you for this wonderful work on vision transformers and how to understand them. I have some simple questions, for which I apologize in advance. I tried to reproduce figure 12 independently of your code base, but I struggle a bit to understand the code. Is it correct that you define robustness as robustness = mean(accuracy(y_val_true, y_val_pred))? Related to this, do I understand correctly that you compute this accuracy on batches of the validation dataset? These batches are of size 256, right?

Thanks.

xxxnell commented 2 years ago

Hi, thank you for your support!

CIFAR-{10, 100}-C and ImageNet-C each consist of 75 datasets (15 corruption types × 5 intensity levels). The robustness reported in this paper is the average of the accuracies on these 75 corrupted datasets.
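For concreteness, here is a minimal sketch of that averaging, assuming the standard 15 CIFAR-C corruption types and a hypothetical `evaluate_accuracy(model, corruption, severity)` helper (not this repo's exact API):

```python
import numpy as np

# The 15 corruption types used for evaluation (standard CIFAR-C / ImageNet-C set)
CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur",
    "glass_blur", "motion_blur", "zoom_blur", "snow", "frost", "fog",
    "brightness", "contrast", "elastic_transform", "pixelate", "jpeg_compression",
]
SEVERITIES = [1, 2, 3, 4, 5]

accs = []
for corruption in CORRUPTIONS:
    for severity in SEVERITIES:
        # evaluate_accuracy is a hypothetical helper that evaluates the pretrained
        # model on one corrupted dataset and returns its top-1 accuracy
        accs.append(evaluate_accuracy(model, corruption, severity))

robustness = float(np.mean(accs))  # mean accuracy over the 15 x 5 = 75 datasets
```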

In particular, I recommend that you measure the robustness as follows:

  1. Run all cells in robustness.ipynb to get predictive performances of a pretrained model on the 75 datasets. CIFAR-{10, 100}-C will be automatically downloaded. Then, you will get a performance sheet like the sample robustness sheet.
  2. Average all accuracies for the 75 datasets. In the robustness sheet, the columns stand for "Intensity", "Type", "NLL", "Cutoff1", "Cutoff2", "Acc", "Acc-90", "Unc", "Unc-90", "IoU", "IoU-90", "Freq", "Freq-90", "Top-5", "Brier", "ECE", "ECSE", respectively. We only use the accuracy column ("Acc"); see the aggregation sketch below.

To avoid confusion: strictly speaking, we do not use the following corruption types for evaluation: "speckle_noise", "gaussian_blur", "spatter", "saturate". Another metric called mCE (which is not used in this paper) is also commonly used to measure robustness.
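A small aggregation sketch for step 2 under these assumptions: the file path is hypothetical, the column names are those listed above, and the robustness sheet is assumed to be exported as CSV:

```python
import pandas as pd

# Hypothetical path to the sheet produced by robustness.ipynb, exported as CSV
sheet = pd.read_csv("robustness_sheet.csv")

# Drop the four extra corruption types that are not part of the paper's metric
excluded = {"speckle_noise", "gaussian_blur", "spatter", "saturate"}
sheet = sheet[~sheet["Type"].isin(excluded)]

# Robustness = mean of the "Acc" column over the remaining 15 x 5 = 75 datasets
robustness = sheet["Acc"].mean()
print(f"robustness = {robustness:.4f}")
```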

The batch size is 256 by default, but I believe the robustness is independent of the batch size.
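To see why, here is a generic accuracy accumulation (a sketch, not this repo's exact evaluation code): the per-dataset accuracy is a count over all examples, and the batch size only changes how that count is chunked.

```python
import torch

@torch.no_grad()
def dataset_accuracy(model, loader, device="cuda"):
    """Top-1 accuracy over a whole dataset; independent of the loader's batch size."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred = model(x).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total  # same value for batch size 256, 128, 1, ...
```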

xxxnell commented 2 years ago

Closing this issue based on the comment above. Please feel free to reopen this issue if the problem still exists.

psteinb commented 2 years ago

Sure thing, please close the issue. I think it would be great to have access to the intermediate results to (re-)produce the robustness numbers. From the robustness notebook, I gathered that I'd have to retrain all cited models (as I cannot run models.load(name, ...) in my environment) and, to be honest, didn't want to invest the CO2 for this.
But maybe the .pth checkpoints are available for download and I misread the docs. Please accept my apologies if that is the case.

xxxnell commented 2 years ago

Thank you for your constructive feedback. I agree that releasing intermediate results would be helpful, since evaluating pretrained models on the 75 datasets can be resource-intensive. I will release robustness sheets as intermediate results for some models, and make the pretrained models easily accessible.