Data for downstream tasks

RaulFD-creator commented 4 months ago

Hi @zhangruochi, I found your paper quite interesting and I think the tool is really useful. I'm considering using the model for a project and wanted to experiment with it first. I've been looking for your evaluation datasets (canonical and non-canonical CPP, solubility, and canonical and non-canonical binding affinity). I'd appreciate it if you could share both the datasets and the training/validation/testing splits used in the paper, so that I can compare directly against your experiments.

Thanks in advance! Kind regards,

zhangruochi commented 3 months ago

Hi, sorry for the late reply.

I have already uploaded all the evaluation datasets we collected to the data/eval folder. Please check. Thank you very much. I would also appreciate it if you could follow our latest articles, which are currently under review.

RaulFD-creator commented 3 months ago

Hi @zhangruochi, thanks for your response, I really appreciate the help. I'm looking at the files and have a question about the setup for the binding affinity datasets: do you generate the embeddings for the target protein with PepLand (as if it were a really big peptide), or do you use a pre-trained PLM (like ESM or ProtBERT)?

Also, did you use any particular protocol for creating the training/testing splits or just the random split from scikit-learn?

I will definitely continue checking your articles, it seems like you are doing really interesting research.

zhangruochi commented 3 months ago

Sorry for the confusion in the preprint. We have described this in detail in our formally submitted paper:

For peptide-specific properties like cell-penetrating ability and solubility, we used our PepLand model to extract peptide features. For properties related to protein-peptide interactions, such as affinity prediction, we utilized ESM2_t12 to extract protein features. To ensure fairness in our evaluations, the protein feature extraction model remained consistent across all comparisons.
To evaluate the quality of the features derived from different pre-trained peptide models, we employed linear probing. This method uses the pre-trained model as a frozen feature extractor to produce feature representations for a given set of labeled examples. A linear classifier or regressor is then trained on these features. This approach assumes that good features should be linearly separable by class (or linearly related to the target in regression) in downstream tasks, and it allows feature quality to be evaluated independently of the model architecture.
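
For readers who want to try the same kind of evaluation, here is a minimal sketch of a linear probe. It is not the code from the paper: the helper name `linear_probe`, the use of scikit-learn's `LogisticRegression`, the ROC AUC metric, and the synthetic arrays standing in for frozen PepLand embeddings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def linear_probe(train_x, train_y, test_x, test_y):
    """Fit a linear classifier on frozen features and return the test ROC AUC.

    Hypothetical helper: the features are assumed to have been extracted
    beforehand by a pre-trained model whose weights are never updated here.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_x, train_y)
    scores = clf.predict_proba(test_x)[:, 1]  # probability of the positive class
    return roc_auc_score(test_y, scores)


# Synthetic stand-in for precomputed embeddings (assumption, not real data):
# 200 "peptides", 16-dimensional features, binary labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
features = rng.normal(size=(200, 16)) + labels[:, None] * 1.5

auc = linear_probe(features[:150], labels[:150], features[150:], labels[150:])
print(f"linear-probe ROC AUC: {auc:.3f}")
```

Because only the linear layer is trained, differences in the score between feature extractors reflect the features themselves rather than the capacity of the downstream model.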

zhangruochi commented 3 months ago

The training, validation, and test sets were split randomly.

Random splitting can indeed lead to some data leakage in practical scenarios.
However, since we used the same splitting method when comparing against all baseline models, the comparison is still reasonable.
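
As an illustration only, a random three-way split with a fixed seed could look like the sketch below, using scikit-learn's `train_test_split`. The helper name `random_split`, the 80/10/10 ratio, the seed, and the placeholder peptide identifiers are assumptions; the thread does not state the ratio or seed actually used.

```python
from sklearn.model_selection import train_test_split


def random_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Randomly split items into train/validation/test lists.

    Hypothetical helper: fractions and seed are illustrative defaults.
    Integer counts are passed to train_test_split so the sizes are exact.
    """
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    # First carve out the test set, then take validation from the remainder.
    train_val, test = train_test_split(items, test_size=n_test, random_state=seed)
    train, val = train_test_split(train_val, test_size=n_val, random_state=seed)
    return train, val, test


# Placeholder identifiers standing in for real peptide records (assumption).
peptides = [f"PEP{i}" for i in range(100)]
train, val, test = random_split(peptides)
print(len(train), len(val), len(test))
```

Fixing the seed means every baseline sees exactly the same partitions, which is what makes the comparison consistent even though a random split does not control for similarity between training and test peptides.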

RaulFD-creator commented 3 months ago

Perfect, thanks a lot for the clarifications! Looking forward to reading the final peer-reviewed version.

zhangruochi / pepland

Data for downstream tasks #6