westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

Fine-tune SaProt on the Thermostability task #38

Closed. qiyifei1 closed this issue 2 weeks ago

qiyifei1 commented 2 weeks ago

Hello, the Thermostability dataset seems to contain only the protein sequence, not the 3Di sequence. Here is one entry:

name Q9NQ94
chain A
seq MESNHKSGDGLSGTQKEAALRALVQRTGYSLVQENGQRKYGGPPPGWDAAPPERGCEIFIGKLPRDLFEDELIPLCEKIGKIYEMRMMMDFNGNNRGYAFVTFSNKVEAKNAIKQLNNYEIRNGRLLGVCASVDNCRLFVGGIPKTKKREEILSEMKKVTEGVVDVIVYPSAADKTKNRGFAFVEYESHRAAAMARRKLLPGRIQLWGHGIAVDWAEPEVEVDEDTMSSVKILYVRNLMLSTSEEMIEKEFNNIKPGAVERVKKIRDYAFVHFSNREDAVEAMKALNGKVLDGSPIEVTLAKPVDKDSYVRYTRGTGGRGTMLQGEYTYSLGQVYDPTTTYLGAPVFYAPQTYAAIPSLHFPATKGHLSNRAIIRAPSVREIYMNVPVGAAGVRGLGGRGYLAYTGLGRGYQVKGDKREDKLYDILPGMELTPMNPVTLKPQGIKLAPQILEEICQKNNWGQPVYQLHSAIGQDQRQLFLYKITIPALASQNPAIHPFTPPKLSAFVDEAKTYAAEYTLQTLGIPTDGGDGTMATAAAAATAFPGYAVPNATAPVSAAQLKQAVTLGQDLAAYTTYEVYPTFAVTARGDGYGTF
fitness 41.9455665914228

So the fine-tuning script 'python scripts/training.py -c config/Thermostability/saprot.yaml' does not use 3Di token information, right?

But in the "AlphaFold2 vs. ESMFold" table, the results apparently use structural information. Is it possible to provide the Thermostability dataset with 3Di tokens?

LTEnjoy commented 2 weeks ago

Hello,

Did you download the Thermostability dataset from here and use the lmdb data from the foldseek dir? We provide the datasets for both ESM-2 and SaProt, in directories named normal and foldseek respectively.

If you want to fine-tune SaProt, you need to load the data from the foldseek dir.
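
A quick way to tell which variant a given entry came from is to check whether its seq field is a plain amino-acid string or a structure-aware one. The sketch below is only illustrative: it assumes that SaProt's structure-aware sequences interleave each uppercase residue with a lowercase 3Di letter (or '#' where structure is masked), and the helper name `looks_structure_aware` and the sample SA-style string are hypothetical, not part of the repository.

```python
# Assumption: a structure-aware (SA) sequence pairs every residue with a
# lowercase 3Di letter, or '#' when the structure is masked, e.g. "MdEvSp".
AA_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")
FOLDSEEK_ALPHABET = set("pynwrqhgdlvtmfsaeikc#")


def looks_structure_aware(seq: str) -> bool:
    """Return True if seq alternates residue letters with 3Di letters."""
    if len(seq) == 0 or len(seq) % 2 != 0:
        return False
    residues, states = seq[0::2], seq[1::2]
    residues_ok = all(r in AA_ALPHABET or r == "#" for r in residues)
    states_ok = all(s in FOLDSEEK_ALPHABET for s in states)
    return residues_ok and states_ok


plain = "MESNHKSGDGLSGTQKEAAL"  # prefix of the entry quoted in this issue
sa_like = "MdEvSpNvHd"           # hypothetical SA-style string, for illustration

print(looks_structure_aware(plain))    # plain amino-acid sequence
print(looks_structure_aware(sa_like))  # residue/3Di pairs
```

An entry like the one quoted above fails this check, which would suggest it was read from the normal dir rather than the foldseek dir.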

qiyifei1 commented 2 weeks ago

Got it, thank you.