fine-tuning with more protein sequences

avilella commented 2 months ago

Hi, I have a corpus of about 500,000 protein sequences and would like to apply them to existing models like ESM2 or this one to predict the fitness effect of substituting one amino acid for another. How could I add my sequences to the models referred to in this repo and then use the modified model for this task? Thanks.

LTEnjoy commented 2 months ago

Hi, typically this is a regression task, i.e. you input a protein sequence to the model and get an output value for the fitness effect. If your 500,000 protein sequences are derived from the same wild-type protein, then a normal pipeline to fine-tune SaProt would be:

1. Retrieve the protein structure of the wild-type protein, normally from AlphaFold2, and encode it as a structural sequence using foldseek.
2. Construct structure-aware (SA) sequences by combining each protein sequence with the structural sequence. For instance, if the protein sequence is MEEV and the structural sequence is ddvv, then the SA sequence is Md Ed Ev Vv. This gives you an SA training set for your 500,000 sequences.
3. Fine-tune SaProt. SaProt can be loaded through the Hugging Face interface: initialize it with a regression head, then fine-tune it on your dataset.
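
The second step above (building SA sequences) can be sketched in a few lines of Python. This is a minimal illustration, not code from the SaProt repo: the function name `build_sa_sequence` and the `sep` parameter are hypothetical, and the space separator is taken from the "Md Ed Ev Vv" example rather than from the tokenizer's documented input format.

```python
def build_sa_sequence(aa_seq: str, struct_seq: str, sep: str = " ") -> str:
    """Pair each residue with the foldseek structure token at the same position.

    Hypothetical helper: the separator is an assumption based on the
    'Md Ed Ev Vv' example, not a confirmed tokenizer requirement.
    """
    if len(aa_seq) != len(struct_seq):
        raise ValueError("amino-acid and structural sequences must have equal length")
    # Upper-case residue followed by lower-case structure token, e.g. "M" + "d" -> "Md".
    pairs = [aa.upper() + st.lower() for aa, st in zip(aa_seq, struct_seq)]
    return sep.join(pairs)


# Variants of one wild type reuse the wild type's structural sequence.
wild_type_struct = "ddvv"
variants = ["MEEV", "MEAV"]
sa_training_set = [build_sa_sequence(v, wild_type_struct) for v in variants]
print(sa_training_set)  # ['Md Ed Ev Vv', 'Md Ed Av Vv']
```

Because every variant here shares the wild type's structural sequence, foldseek only needs to be run once; this holds only for substitutions, since insertions or deletions would change the sequence length.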

The steps above constitute a normal pipeline for fine-tuning SaProt on your own dataset. It might be complicated for people who are not very familiar with ML techniques. Alternatively, we recommend using ColabSaprot to train your own model with only a few clicks, see here. With ColabSaprot you only have to upload your dataset, and the system will automatically train the model on your data. We also plot the training curve so you can track the training process.

avilella commented 2 months ago

Thanks, I'll have a look at ColabSaprot. My 500,000 protein sequences are part of a corpus that hasn't been seen by any model, but I could use AF2 or similar to generate 3D models for them. We don't have empirical data for the fitness, only the protein sequences, but this corpus of data hopefully will modify the existing models enough so that the answers are not biased by the species that are most represented, e.g. human or mouse. Hopefully that makes sense.

LTEnjoy commented 2 months ago

If you don't have experimental labels for the fitness, you could predict the mutational effect in a zero-shot manner. In this case, you don't have to tune the model any further and can directly make predictions for the mutations of interest. ColabSaprot provides a specific module for doing so (see this part 3.2), or you can run the provided code to make predictions (see this part).

Even though the model didn't see those protein sequences during training, I think it is capable of predicting the change in fitness to some degree. Hope you can try it out and advance your research :)
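
To make "zero-shot" concrete: one common scoring scheme for masked protein language models is to mask the mutated position and compare the log-probability the model assigns to the mutant residue with that of the wild-type residue. The sketch below shows only that arithmetic. It is an assumption that this matches what the provided SaProt code does, and `parse_mutation`, `zero_shot_score`, and the `position_probs` table (which stands in for real model output) are hypothetical names.

```python
import math
import re


def parse_mutation(mutation: str) -> tuple[str, int, str]:
    """Split a mutation such as 'V4A' into (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation: {mutation!r}")
    wt, pos, mt = match.groups()
    return wt, int(pos), mt


def zero_shot_score(sequence: str, mutation: str, position_probs: dict) -> float:
    """Log-odds of mutant vs. wild-type residue at the mutated position.

    position_probs maps a 1-based position to {residue: probability}; in
    practice these would be the model's predictions with that position masked.
    """
    wt, pos, mt = parse_mutation(mutation)
    if not 1 <= pos <= len(sequence) or sequence[pos - 1] != wt:
        raise ValueError(f"{mutation} does not match the wild-type sequence")
    probs = position_probs[pos]
    return math.log(probs[mt]) - math.log(probs[wt])


# Made-up probabilities for illustration only, not real model output.
toy_probs = {4: {"V": 0.6, "A": 0.2}}
score = zero_shot_score("MEEV", "V4A", toy_probs)
print(score)  # negative: the toy model prefers the wild-type residue V
```

Under this scheme a score below zero means the model considers the mutant residue less likely than the wild type at that position, which is usually read as a predicted loss of fitness.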

wangjiaqi8710 commented 2 months ago

I have a quick question—if we want to fine-tune SaProt with our own labeled data, how should we prepare the .mdb file? The .mdb files on the website seem to be password-protected, so we can't access the data structure. Could you provide a non-password-protected version, for instance, for the thermostability dataset? Thanks in advance!

LTEnjoy commented 2 months ago

Hi, you could refer to this issue https://github.com/westlake-repl/SaProt/issues/16 for some details.

wangjiaqi8710 commented 2 months ago

Thanks for the timely reply. Good day!

wangjs188 commented 2 months ago

Hello, extensive use of Colab's GPU requires some tricks and additional funding. I primarily conduct wet lab experiments and am not very strong in machine learning. Could you please advise how to convert a CSV file into the "foldseek" folder and the "normal" folder with .mdb files for local training, similar to what ColabSaprot does? Is there a detailed tutorial available? Thanks.

LTEnjoy commented 2 months ago

Hi, if you have local GPUs for training, you could deploy ColabSaprot on your local server without using Google Cloud. Here is a quick tutorial for the deployment: https://github.com/westlake-repl/SaprotHub/tree/main/local_server

wangjs188 commented 2 months ago

Thanks for the timely reply.

har77774 commented 1 week ago

Hello, can I post-train SaProt on my own protein sequences? I mean further training, not fine-tuning.

westlake-repl / SaProt

fine-tuning with more protein sequences #53