westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

ClinVar query #9

Closed nrafaili closed 7 months ago

nrafaili commented 7 months ago

Hello, I downloaded the ClinVar .tar.gz file from your directory and noticed that all of the fitness values are '1.0'. The ProteinGym dataset reports a range of fitness values. Is there a reason you kept only a fitness score of '1.0'?

LTEnjoy commented 7 months ago

Hi!

We set all fitness values to "1.0" because the fitness variable is not needed for the ClinVar dataset. We adopt Spearman's ρ as the evaluation metric for ProteinGym and global AUC for ClinVar, so we only have to record the evolutionary index of each mutation for the AUC calculation. The fitness values are therefore just set to an arbitrary default, i.e. 1.0.

You can calculate the AUC value with the commands below:

# Evaluate the zero-shot performance of SaProt on the ClinVar benchmark
python scripts/mutation_zeroshot.py -c config/ClinVar/saprot.yaml
python scripts/compute_clinvar_auc.py -c config/ClinVar/saprot.yaml
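
For readers wondering why a constant fitness column is harmless here, the following is a minimal sketch, not the repository's actual `compute_clinvar_auc.py`. It assumes each mutation carries a binary label (1 = pathogenic, 0 = benign, an assumed encoding) and a model score where higher means "more likely pathogenic"; the toy data and the `roc_auc` helper are hypothetical. AUC only needs those two columns, so the placeholder fitness value never enters the computation.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: probability that a random positive example
    scores higher than a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not taken from the ClinVar files).
labels = [1, 1, 0, 0, 1, 0]                # assumed: 1 = pathogenic, 0 = benign
scores = [0.9, 0.7, 0.2, 0.4, 0.3, 0.1]    # assumed: higher = more pathogenic
fitness = [1.0] * len(labels)              # placeholder column, never used

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Spearman's ρ, by contrast, correlates the model scores with measured fitness values, which is why ProteinGym needs real fitness numbers and ClinVar does not.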

nrafaili commented 7 months ago

Awesome, thank you for the prompt response!