Training PepLand from scratch with new tokenization

mgarort commented 7 months ago

Dear PepLand team,

Thanks once again for your great paper and model!

I am trying to apply the pretrained model to my own peptides to obtain embeddings. However, I get the message "Warning Unfound fragments", which (I believe) means my peptides contain fragments/tokens that do not occur in PepLand's pretraining set and are therefore missing from the vocabulary.

Could you please let me know how to train PepLand from scratch on my own training set with my own peptides, including tokenization, so that it considers all possible fragments in my peptides? The training section of the README specifies some of the scripts to be used, but I am unsure if this includes the tokenization step.

Thanks in advance.

zhangruochi commented 7 months ago

Thank you very much for your interest in our work!

If possible, could you share a small sample of data that is causing errors in the program? This will help me identify where the issue lies and debug it effectively.

I have been preparing for my graduation defense recently, so I apologize for any delayed response in addressing this issue.

Thanks, Richard.

mgarort commented 7 months ago

Hi Richard,

Thanks a lot for your response and best of luck with your graduation defense!

Here is a sample of 5 SMILES strings, some of which trigger the message described. They are the SMILES of 5 approved therapeutic peptides.

example_smiles.txt

zhangruochi commented 6 months ago

Hello, I apologize for the delayed response.

I have added an examples folder. Specifically, in examples/models/pepland/inference.py, I have encapsulated two models:

FeatureExtractor: utilizing a pre-trained PepLand as the feature extractor.
PepLandPredictor: employing PepLand as the feature extractor, followed by an MLP as the predictor. This serves as an example of how to fine-tune PepLand as a property predictor on a specific dataset.

You can utilize examples/main.py to test these two types of models. I will provide a complete fine-tuning example in the future.

I also tested the examples you provided, and indeed, there were some warnings because some fragments in your examples are not in my vocab table. However, the program can handle this out-of-vocabulary situation, so it can still generate the embedding.
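
For readers unfamiliar with out-of-vocabulary handling, the behaviour described above can be sketched generically as follows. This is not PepLand's actual code: the function name `encode_fragments`, the `<unk>` token, and the toy vocabulary are assumptions made for illustration only. The idea is that a fragment missing from the vocabulary is mapped to a shared "unknown" id, so an embedding can still be produced, while the missing fragments are collected for a warning.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical unknown-token name, not taken from PepLand


def encode_fragments(fragments, vocab, unk_token=UNK):
    """Map fragment strings to integer ids, using the unknown id for misses."""
    unk_id = vocab[unk_token]
    ids = []
    unfound = Counter()  # fragments that were not in the vocabulary
    for frag in fragments:
        if frag in vocab:
            ids.append(vocab[frag])
        else:
            ids.append(unk_id)
            unfound[frag] += 1
    return ids, unfound


# Toy vocabulary and fragment list, invented for this example.
vocab = {UNK: 0, "NCC(=O)O": 1, "CC(N)C(=O)O": 2}
ids, unfound = encode_fragments(
    ["NCC(=O)O", "C1CCCCC1", "CC(N)C(=O)O", "C1CCCCC1"], vocab
)
print(ids)            # [1, 0, 2, 0]
print(dict(unfound))  # {'C1CCCCC1': 2}
```

With a fallback like this, every unseen fragment shares one representation, so the model cannot tell such fragments apart.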

mgarort commented 6 months ago

Hi Richard,

Thanks for your reply. Indeed, I was able to create embeddings a few weeks ago. The issue is that the unrecognized fragments are very important to our dataset, so I would like to re-train PepLand (including recreating the vocabulary) so that the model can consider those fragments explicitly.

Is it possible to re-train PepLand from scratch on my peptides, including new tokenization / recreation of the vocabulary, so that all fragments in my dataset are recognized?
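
To make the request concrete, the vocabulary-recreation step could look roughly like the sketch below. It assumes each peptide has already been split into a list of fragment strings; the fragmentation itself is not shown, and `build_vocab`, `coverage`, the special tokens, and the toy corpus are hypothetical names and data, not part of the PepLand repository.

```python
from collections import Counter


def build_vocab(fragmented_peptides, min_count=1, specials=("<pad>", "<unk>")):
    """Build a fragment -> id mapping from already-fragmented peptides."""
    counts = Counter(frag for frags in fragmented_peptides for frag in frags)
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Sort by descending count, then alphabetically, so ids are deterministic.
    for frag, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[frag] = len(vocab)
    return vocab


def coverage(fragmented_peptides, vocab):
    """Fraction of fragment occurrences that are present in the vocabulary."""
    total = sum(len(frags) for frags in fragmented_peptides)
    found = sum(frag in vocab for frags in fragmented_peptides for frag in frags)
    return found / total if total else 1.0


# Toy corpus: each inner list stands for the fragments of one peptide.
corpus = [["A", "B"], ["A", "C"], ["A", "B", "D"]]
vocab = build_vocab(corpus, min_count=2)
print(vocab)                    # {'<pad>': 0, '<unk>': 1, 'A': 2, 'B': 3}
print(coverage(corpus, vocab))  # 5 of 7 fragment occurrences are covered
```

If fragment ids index an embedding table, a vocabulary rebuilt this way would change the table's size and ordering, which is why the question is about training from scratch, not about reusing the pretrained weights.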
