westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

How can I use Saprot to predict the mutational effects on my own protein? #36

Closed AaranWang closed 3 months ago

AaranWang commented 3 months ago

Should I use the mutational effect prediction script directly, or should I first fine-tune Saprot with my own mutation data and then use the fine-tuned version? Thank you.

LTEnjoy commented 3 months ago

Hi,

If you have sufficient mutation data, you can fine-tune Saprot on it and make predictions with the regression output head (that is, not in a zero-shot manner). The base model can also predict mutational effects zero-shot, which means you don't have to fine-tune Saprot at all.

We highly recommend using SaprotHub to predict mutational effects with just a few clicks, see here.
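
For readers wondering what zero-shot scoring amounts to, here is a minimal sketch. It assumes the log-odds scheme commonly used with masked protein language models (compare the probability of the mutant residue with that of the wild-type residue at each mutated position); it is not the script from this repository. The function name `zero_shot_score` and the toy probabilities are made up for illustration, and in practice the probabilities would come from the model's output at the masked positions.

```python
import math

def zero_shot_score(position_probs, mutations):
    """Sum log p(mutant) - log p(wild type) over the mutated positions.

    position_probs: dict mapping a 1-based position to a dict of
        residue probabilities predicted for that position.
    mutations: list of (wild_type, position, mutant) tuples, e.g. ("A", 42, "G").
    """
    score = 0.0
    for wt, pos, mt in mutations:
        probs = position_probs[pos]
        score += math.log(probs[mt]) - math.log(probs[wt])
    return score

# Toy probabilities standing in for model output (made-up numbers).
toy_probs = {42: {"A": 0.6, "G": 0.1, "V": 0.3}}

# A negative score means the model finds the mutant less likely than the wild type.
print(zero_shot_score(toy_probs, [("A", 42, "G")]))
```

With this convention, a higher score suggests a better-tolerated mutation, and no labelled data is needed to compute it.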

AaranWang commented 3 months ago

How can I use the regression output head to predict mutations? Is there a tutorial or reference available? Thank you.

LTEnjoy commented 3 months ago

Coding a complete pipeline can be complicated: you have to process your own data, fine-tune the model, and run inference. SaprotHub also lets you fine-tune Saprot on your private data without any machine learning background, and we provide detailed step-by-step instructions that take only a few clicks, see here.

AaranWang commented 3 months ago

Thank you, I'll try.

AaranWang commented 3 months ago

I understand the complexity of fine-tuning the model end to end on my own data. I'll try SaprotHub later, but I believe that learning how to fine-tune this model myself is essential for me to enter this field. Could you give me some advice on where to begin? Thank you very much.

LTEnjoy commented 3 months ago

Sure! We have provided a simple example of fine-tuning Saprot on the Thermostability task, see here. If you want to fine-tune Saprot on your own data, first convert it into the LMDB format (see the LMDB files for Thermostability as a reference). Then modify the config file to replace the dataset path with the path to your data.

Finally, you can run the script to start training.
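
As a rough illustration of the LMDB conversion step, here is a minimal sketch using the third-party `lmdb` package. The record layout is an assumption, not something confirmed in this thread: integer-string keys holding one JSON sample each, a `length` key holding the sample count, and the field names `seq` and `fitness`. The sequences are placeholders too. Please open the Thermostability LMDB files and copy their actual keys, fields, and sequence encoding before relying on this.

```python
import json

def build_records(samples):
    """Turn (sequence, label) pairs into LMDB-ready key/value byte pairs.

    Assumed layout (hypothetical, check the reference LMDB files):
    keys "0", "1", ... hold one JSON sample each; "length" holds the count.
    """
    records = {}
    for i, (seq, label) in enumerate(samples):
        entry = {"seq": seq, "fitness": float(label)}  # field names are assumptions
        records[str(i).encode()] = json.dumps(entry).encode()
    records[b"length"] = str(len(samples)).encode()
    return records

def write_lmdb(records, path, map_size=1 << 30):
    """Write the records to an LMDB directory (needs `pip install lmdb`)."""
    import lmdb  # imported lazily so build_records works without it

    env = lmdb.open(path, map_size=map_size)
    with env.begin(write=True) as txn:
        for key, value in records.items():
            txn.put(key, value)
    env.close()

# Placeholder samples: (sequence, measured score).
records = build_records([("MKTAYIAK", 0.73), ("MKTGYIAK", 0.41)])
print(sorted(records))
```

Calling `write_lmdb(records, "my_dataset/train")` would then create the directory that the config file's dataset path should point to; the path shown here is only an example.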

AaranWang commented 3 months ago

Thank you for your kind reply. My data is mutation data, similar to the ProteinGym format, and differs from thermostability data. Is there any additional processing I need to do? Or can I add you on WeChat? I might need to bother you with similar questions in the future. Thanks.

LTEnjoy commented 3 months ago

No problem. You can email me your WeChat account and I'll add you on WeChat for further discussion.

AaranWang commented 3 months ago

Thank you very much.