ys-zong / MEDFAIR

[ICLR 2023 spotlight] MEDFAIR: Benchmarking Fairness for Medical Imaging
https://ys-zong.github.io/MEDFAIR/

Papila-Age and Papila-Sex CSVs generated for three different model selections seem to be similar #5

Closed: pearlmary closed this issue 1 year ago

pearlmary commented 1 year ago

After running results_analysis.ipynb, the results generated for the PAPILA dataset (papila-age.csv, papila-sex.csv) look the same, with no changes across the "pareto", "overall_auc", and "DTO" selections. The rank and CD diagrams generated also look similar and don't seem correct.

Am I going wrong anywhere in the code? Could you please check?

Thanks

ys-zong commented 1 year ago

I guess this is expected if you have only experimented with PAPILA so far. The CD diagrams in the paper are plotted after calculating the ranks across all datasets and sensitive attributes (that's also part of where the paper's value lies: conclusions drawn from a large number of datasets). So you may not observe significance with only one dataset. You can also do more runs with random hyperparameter search to observe the differences between the model selection strategies.
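
For illustration, here is a minimal sketch (not MEDFAIR's actual code) of how per-dataset ranks could be averaged across datasets and sensitive attributes before plotting a CD diagram. The CSV file pattern and the column names ("method", "DTO") are assumptions for the example.

```python
import glob
import pandas as pd

# Hypothetical layout: one CSV per dataset/attribute pair, e.g.
# results/papila-age.csv, results/papila-sex.csv, results/ham10000-sex.csv, ...
frames = []
for path in glob.glob("results/*-*.csv"):
    df = pd.read_csv(path)
    # Lower DTO is better, so rank ascending within this dataset/attribute.
    df["rank"] = df["DTO"].rank(ascending=True)
    frames.append(df[["method", "rank"]])

# A single dataset yields only one rank per method; it is the average rank
# over many datasets/attributes that separates the selection strategies and
# feeds the CD diagram.
avg_ranks = pd.concat(frames).groupby("method")["rank"].mean().sort_values()
print(avg_ranks)
```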

pearlmary commented 1 year ago

Hi Zong,

Thanks for the reply. By more runs, what exactly do you mean, and where? Could you elaborate a little on how to observe differences between the different model selection strategies?

ys-zong commented 1 year ago

It means: 1) the CD diagrams are calculated after computing and averaging the ranks across all datasets and sensitive attributes, i.e. not only PAPILA but also HAM10000, CheXpert, etc.; 2) for each dataset, we do roughly 20-40 runs with different hyperparameters, and model selection then picks a model from these runs according to a specific criterion (DTO, Pareto-optimal, etc.). It's possible that you don't observe significant differences on a single dataset; the differences show up after running all of the experiments.
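
To make the selection criteria concrete, here is a small hypothetical sketch of how the three strategies could pick from a set of hyperparameter runs. The metric names (overall_auc, worst_group_auc) and the DTO formulation (Euclidean distance to the utopia point) are assumptions for illustration and may differ from MEDFAIR's implementation.

```python
import numpy as np

# Each run is summarized by a utility metric (overall AUC) and a fairness
# metric (here, worst-group AUC). Values are made up for the example.
runs = [
    {"id": 0, "overall_auc": 0.81, "worst_group_auc": 0.74},
    {"id": 1, "overall_auc": 0.79, "worst_group_auc": 0.77},
    {"id": 2, "overall_auc": 0.83, "worst_group_auc": 0.72},
]

# 1) overall_auc: keep the run with the best utility, ignoring fairness.
best_auc = max(runs, key=lambda r: r["overall_auc"])

# 2) DTO: keep the run closest to the utopia point (best AUC, best worst-group AUC).
u_star = max(r["overall_auc"] for r in runs)
f_star = max(r["worst_group_auc"] for r in runs)
best_dto = min(
    runs,
    key=lambda r: np.hypot(u_star - r["overall_auc"], f_star - r["worst_group_auc"]),
)

# 3) Pareto: keep the runs not dominated in both utility and fairness.
pareto = [
    r for r in runs
    if not any(
        o["overall_auc"] >= r["overall_auc"]
        and o["worst_group_auc"] >= r["worst_group_auc"]
        and (o["overall_auc"] > r["overall_auc"] or o["worst_group_auc"] > r["worst_group_auc"])
        for o in runs
    )
]

print(best_auc["id"], best_dto["id"], [r["id"] for r in pareto])
```

With only a handful of runs on a single dataset, the three criteria can easily end up picking the same (or very similar) runs, which would match the identical CSVs described above; the differences emerge over many runs and many dataset/attribute combinations.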