About vec5_CTC.txt: what is the basis for determining these vectors?

horacehht commented 1 year ago

Thank you for making this project public. I have a few questions about it:

In the vec5_CTC.txt file, each amino acid has a corresponding vector. What is the basis for determining these vectors? Or do they follow the conventions of a previous project? Why not use a protein language model to obtain the amino acid embeddings?

horacehht commented 1 year ago

I have another question. In the vec5_CTC.txt file, each amino acid vector has 13 dimensions. However, in 'x_list.pt', I found that each amino acid has only 7 dimensions. There is a mismatch!

horacehht commented 1 year ago

Sorry to bother the authors! I found that the authors perform feature selection, which is described in Supplementary Method 1.

horacehht commented 1 year ago

In the vec5_CTC.txt file, each amino acid has a corresponding vector. What is the basis for determining these vectors? Or do they follow the conventions of a previous project?

They do follow the conventions of a previous project. While trying to reproduce the baseline results, I read the PIPR paper and found the answer there.

The amino acid embeddings are composed of two parts. The first part is a 5-dimensional vector obtained by pretraining a skip-gram model on the SHS148k protein sequences. The second part is an 8-dimensional vector obtained from a categorization of the amino acids by electrostaticity and hydrophobicity. (The original paper says this part is 7-dimensional, but the difference doesn't matter here.)

All the details can be found in the paper for the baseline model PIPR, Section 4.5, "Amino acid embeddings". Paper title: Multifaceted protein–protein interaction prediction based on Siamese residual RCNN
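
To make the two-part layout described above concrete, here is a minimal sketch of reading such a table and splitting each 13-dimensional vector into its two parts. Several things here are assumptions on my side, not confirmed by the repository: the line format (`<residue>` followed by 13 whitespace-separated numbers), the ordering (skip-gram part first, categorical part second), and the helper names `parse_embedding_table`, `split_parts`, and `encode_sequence`, which are hypothetical. The numbers in `demo_text` are placeholders, not values from vec5_CTC.txt.

```python
from typing import Dict, List, Tuple

SKIPGRAM_DIM = 5  # skip-gram part, as described in the comment above
CATEGORY_DIM = 8  # electrostaticity/hydrophobicity part, as described above
TOTAL_DIM = SKIPGRAM_DIM + CATEGORY_DIM  # 13


def parse_embedding_table(text: str) -> Dict[str, List[float]]:
    """Parse lines of the ASSUMED form '<residue> <v1> ... <v13>'."""
    table: Dict[str, List[float]] = {}
    for line in text.splitlines():
        fields = line.split()  # splits on tabs and spaces alike
        if not fields:
            continue  # skip blank lines
        residue = fields[0]
        values = [float(v) for v in fields[1:]]
        if len(values) != TOTAL_DIM:
            raise ValueError(
                f"{residue}: expected {TOTAL_DIM} values, got {len(values)}"
            )
        table[residue] = values
    return table


def split_parts(vector: List[float]) -> Tuple[List[float], List[float]]:
    """Split a vector into (skip-gram part, categorical part).

    ASSUMPTION: the skip-gram part comes first in the file.
    """
    return vector[:SKIPGRAM_DIM], vector[SKIPGRAM_DIM:]


def encode_sequence(
    sequence: str, table: Dict[str, List[float]]
) -> List[List[float]]:
    """Look up one vector per residue; raises KeyError on unknown residues."""
    return [table[residue] for residue in sequence]


# Placeholder numbers for illustration only, NOT taken from vec5_CTC.txt.
demo_text = """
A 0.1 0.2 0.3 0.4 0.5 1 0 0 0 0 0 0 0
C 0.5 0.4 0.3 0.2 0.1 0 1 0 0 0 0 0 0
"""
demo_table = parse_embedding_table(demo_text)
demo_encoded = encode_sequence("ACA", demo_table)
```

With the real file, `parse_embedding_table(open("vec5_CTC.txt").read())` would be the equivalent call, provided the file actually matches the assumed format. The length check is there so that a different layout fails loudly instead of silently producing vectors of the wrong size.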

zqgao22 commented 1 year ago

Hello Haitao, thanks for your interest. Yes, your interpretation is consistent with our approach. To introduce chemical information conveniently, we used seq2vec pre-trained embeddings that can be assigned directly to each amino acid. A more advanced approach would be to use the pre-trained representations of ESM-2 in the pre-processing stage to improve the model's generalization. We just noticed that you have raised some other questions. Please give us some time to check and reply to you. Thanks!

Best regards

zqgao22 / HIGH-PPI
