Open avilella opened 2 months ago
Thanks, I'll have a look at ColabSaprot. My 500,000 protein sequences are part of a corpus that hasn't been seen by any model, but I could use AF2 or similar to generate 3D models for them. We don't have empirical data for the fitness, only the protein sequences, but this corpus of data hopefully will modify the existing models enough so that the answers are not biased by the species that are most represented, e.g. human or mouse. Hopefully that makes sense.
If you don't have experimental fitness labels, you could predict mutational effects in a zero-shot manner. In this case, you don't have to tune the model any further and can directly make predictions for the mutations you are interested in. ColabSaprot provides a dedicated module for this (see this part 3.2), or you can run the provided code to make predictions (see this part).
Even though the model didn't see those protein sequences during training, I think it is capable of predicting the change in fitness to some degree. Hope you can try it out and advance your research :)
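To make the zero-shot idea concrete, here is a minimal sketch of one common way to score a substitution with a masked language model: compare the log-probability the model assigns to the mutant residue with the one it assigns to the wild-type residue at the masked position. This is an illustration under stated assumptions, not necessarily the exact scoring used by ColabSaprot or the repo's code. The function names and the probabilities are made up, and in practice the probabilities would come from the model's output at that position.

```python
import math
from typing import Dict, Tuple


def parse_mutation(mutation: str) -> Tuple[str, int, str]:
    """Split a label such as 'E2A' into (wild-type residue, 1-based position, mutant residue)."""
    return mutation[0], int(mutation[1:-1]), mutation[-1]


def zero_shot_score(position_probs: Dict[str, float], wt: str, mt: str) -> float:
    """Log-odds of mutant vs. wild-type residue; a negative value means the mutant is less likely."""
    return math.log(position_probs[mt]) - math.log(position_probs[wt])


# Made-up probabilities for position 2, for illustration only (assumption).
probs_at_pos2 = {"E": 0.50, "A": 0.25, "K": 0.25}

wt, pos, mt = parse_mutation("E2A")
score = zero_shot_score(probs_at_pos2, wt, mt)
print(f"{wt}{pos}{mt}: {score:.3f}")  # E2A: -0.693
```

Because no labels are involved, scores like this are only comparable as a ranking of candidate mutations and are not calibrated fitness values.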
Hi, typically this is a regression task, i.e. you input a protein sequence to the model and get an output value for the fitness effect. If your 500,000 protein sequences are derived from the same wild-type protein, then a normal pipeline to fine-tune SaProt would be:
- Retrieve the structure of the wild-type protein, normally from AlphaFold2, and encode it into a structural sequence using foldseek.
- Construct structure-aware (SA) sequences by combining your protein sequences with the structural sequence. For instance, if the protein sequence is `MEEV` and the structural sequence is `ddvv`, then the SA sequence is `Md Ed Ev Vv`. By doing so, you could construct an SA training set for your 500,000 sequences.
- Fine-tune SaProt. SaProt can be initialized through the Hugging Face interface. You could initialize SaProt with a regression head and then fine-tune it on your dataset.
The steps above constitute a normal pipeline for fine-tuning SaProt on your own dataset. It might be complicated for people who are not very familiar with ML techniques. Alternatively, we recommend using ColabSaprot to train your own model with only a few clicks, see here. With ColabSaprot you only have to upload your dataset and the system will automatically train the model on your data. We will also plot the training curve so you can track the training process.
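The SA-sequence construction step can be sketched as follows. This is a minimal illustration, assuming the foldseek structural sequence is already available as a string and that every variant has the same length as the wild type (substitutions only). The function name and the example variants are hypothetical. The separator is left configurable because the example in this thread shows the tokens space-separated, while the tokenizer may expect them concatenated.

```python
from typing import List


def build_sa_sequence(aa_seq: str, struct_seq: str, sep: str = "") -> str:
    """Pair each residue (uppercase) with its structure token (lowercase)."""
    if len(aa_seq) != len(struct_seq):
        raise ValueError(
            f"length mismatch: {len(aa_seq)} residues vs {len(struct_seq)} structure tokens"
        )
    tokens = [aa.upper() + st.lower() for aa, st in zip(aa_seq, struct_seq)]
    return sep.join(tokens)


# Hypothetical variants of one wild type; they all reuse its structural sequence.
wild_type_struct = "ddvv"
variants: List[str] = ["MEEV", "MEAV", "MKEV"]

sa_dataset = [build_sa_sequence(v, wild_type_struct, sep=" ") for v in variants]
print(sa_dataset[0])  # Md Ed Ev Vv
```

Reusing the wild-type structural sequence for every variant avoids predicting 500,000 separate structures, at the cost of ignoring any structural change a mutation might cause.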
I have a quick question—if we want to fine-tune SaProt with our own labeled data, how should we prepare the .mdb file? The .mdb files on the website seem to be password-protected, so we can't access the data structure. Could you provide a non-password-protected version, for instance, for the thermostability dataset? Thanks in advance!
Hi, you could refer to this issue https://github.com/westlake-repl/SaProt/issues/16 for some details.
Thanks for the timely reply. Good day!
Hello, extensive use of Colab's GPU requires some tricks and additional funding. I primarily conduct wet-lab experiments and am not very strong in machine learning. Could you please advise how to convert a CSV file into the "foldseek" and "normal" folders with MDB files for local training, similar to what ColabSaprot does? Is there a detailed tutorial available? Thanks.
Hi, if you have local GPUs for training, you could deploy ColabSaprot on your local server without using Google Cloud. Here is a quick tutorial for the deployment: https://github.com/westlake-repl/SaprotHub/tree/main/local_server
Thanks for the timely reply.
Hello, can I post-train SaProt with my own protein sequences? I mean post-training, not fine-tuning.
Hi, I have a corpus of about 500,000 protein sequences and would like to use them with existing models like ESM2 or this one to predict the fitness effect of substituting one amino acid for another. How could I add my sequences to the models referred to in this repo and then use the modified model for this task? Thanks.