westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

Guideline for fine-tuning a pre-trained model with specific protein thermal stability data #39

Closed: biowander closed this issue 2 weeks ago

biowander commented 2 weeks ago

Hello! Great job! I'm a newcomer in this field, and I would like to ask for your advice. Is it possible to use your pre-trained model and fine-tune it with our own data, which consists of tens of thousands of diverse proteins, each with its own thermal stability value, to generate a model specifically for predicting protein thermal stability? My data is not based on mutations; it includes many diverse proteins, each with a stability value. Can you provide a general workflow?

LTEnjoy commented 2 weeks ago

Hello, thank you for your interest in our work!

Coding a complete fine-tuning pipeline can be complicated: you have to process your own data, fine-tune the model, and run inference. We highly recommend using SaprotHub, which lets you fine-tune SaProt with a few clicks; see here.

If you still want to fine-tune SaProt manually, we have provided a simple example that fine-tunes SaProt on the Thermostability task; see here. First convert your data into the LMDB format (use the LMDB files for Thermostability as a reference). Then modify the config file, replacing the dataset path with the path to your data.

Finally, you can run the script to start training.
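For the data conversion step, here is a minimal sketch of how a list of (sequence, stability value) pairs might be split and written to LMDB. It is an illustration under assumptions, not the repository's own conversion script: the key layout (string indices "0" to "N-1" plus a "length" entry) and the JSON field names "seq" and "fitness" are assumed and should be checked against the reference Thermostability LMDB files before use. The function names `build_records` and `write_lmdb` are hypothetical, and `lmdb` is the third-party py-lmdb package.

```python
import json
import random


def build_records(proteins, seed=0, valid_frac=0.1, test_frac=0.1):
    """Shuffle (sequence, stability) pairs and split them into train/valid/test.

    Returns a dict mapping split name to a list of JSON strings.
    ASSUMPTION: the field names "seq" and "fitness" are illustrative; verify
    them against the reference Thermostability LMDB files.
    """
    items = list(proteins)
    random.Random(seed).shuffle(items)  # deterministic shuffle for a given seed
    n_valid = int(len(items) * valid_frac)
    n_test = int(len(items) * test_frac)
    splits = {
        "valid": items[:n_valid],
        "test": items[n_valid:n_valid + n_test],
        "train": items[n_valid + n_test:],
    }
    return {
        name: [json.dumps({"seq": seq, "fitness": float(value)}) for seq, value in rows]
        for name, rows in splits.items()
    }


def write_lmdb(records, path, map_size=1 << 30):
    """Write JSON records to an LMDB directory at `path`.

    ASSUMPTION: keys are the string indices "0".."N-1" plus a "length" entry;
    confirm this layout against the reference files.
    """
    import lmdb  # third-party: pip install lmdb

    env = lmdb.open(path, map_size=map_size)  # map_size is the max DB size in bytes
    with env.begin(write=True) as txn:  # the transaction commits on clean exit
        for i, record in enumerate(records):
            txn.put(str(i).encode(), record.encode())
        txn.put(b"length", str(len(records)).encode())
    env.close()
```

A possible usage: call `build_records(pairs)` on your own list of pairs, then `write_lmdb(splits["train"], "my_data/train")` and likewise for the "valid" and "test" splits (the paths here are placeholders). The sequence string should be in whatever form SaProt expects as input, so compare a few of your records with entries read back from the reference files.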

biowander commented 2 weeks ago

Thank you very much for your answer!