westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

pretraining esm2_t33_650M_UR50D #24

Closed: yufengwhy closed this issue 3 months ago

yufengwhy commented 3 months ago

Could you kindly provide the config and data preprocessing script to reproduce the pretraining of esm2_t33_650M_UR50D?

LTEnjoy commented 3 months ago

Hi!

We are sorry, but we cannot provide such a script, because all of our data preprocessing is designed to incorporate AF2-predicted structures into our training dataset, whereas ESM-2 was pre-trained on sequence data only. We recommend referring to the original ESM-2 paper for details on both the training hyperparameters and the dataset construction.

Best regards, Jin

yufengwhy commented 3 months ago

Could we use this to reproduce the pretraining of SaProt_650M_AF2?

python scripts/training.py -c config/pretrain/saprot.yaml

I got this error. Any ideas?

huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'weights/PLMs/SaProt_650M_AF2'. Use repo_type argument if needed.

LTEnjoy commented 3 months ago

I guess that's because the checkpoint isn't in the right directory. You could move the SaProt checkpoint to weights/PLMs and try again.
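
For context, this error most likely appears because a path that does not exist locally gets treated as a Hugging Face Hub repo id, and 'weights/PLMs/SaProt_650M_AF2' has too many slashes to be one. A minimal sketch of a pre-flight check, run from the repository root before starting training, could look like the following. The helper name `check_local_checkpoint` is hypothetical and not part of SaProt, and the check for `config.json` assumes the checkpoint is stored in the usual Hugging Face directory layout.

```python
from pathlib import Path


def check_local_checkpoint(path: str) -> Path:
    """Hypothetical helper: verify that a local checkpoint directory is usable.

    Raises FileNotFoundError with a readable message instead of letting the
    loader fall back to interpreting the path as a Hub repo id.
    """
    ckpt = Path(path)

    # The directory itself must exist, otherwise the path cannot be loaded locally.
    if not ckpt.is_dir():
        raise FileNotFoundError(
            f"Checkpoint directory not found: {ckpt}. "
            "Download the checkpoint and place it there first."
        )

    # Assumption: a Hugging Face-style checkpoint directory contains config.json.
    if not (ckpt / "config.json").is_file():
        raise FileNotFoundError(
            f"{ckpt} exists but has no config.json; the download may be incomplete."
        )

    return ckpt
```

Calling `check_local_checkpoint("weights/PLMs/SaProt_650M_AF2")` before the training command either returns the path or reports which part is missing.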