zhanghaicang / carbonmatrix_public

a series of X #5

Open chengeng17 opened 2 weeks ago

chengeng17 commented 2 weeks ago

Why do predicted sequences sometimes contain a run of X characters?

zhanghaicang commented 1 week ago

Thank you for reporting this issue. Could you please provide your input file so we can reproduce the problem?

chengeng17 commented 1 week ago

Great job! You can download the structure directly using the PDB ID 8OOM. I have noticed a few potential issues during prediction:

First, if the original PDB file contains ligands or other non-protein molecules, X may appear when the file is read, because these may not be present in your amino acid matching dictionary.

Second, the residue numbering matters. If a residue number is 0 or negative, X appears at the end of the prediction and the predicted sequence does not match.

Third, if residue numbers are missing, the missing positions also show up as X with the code you provided (possibly because of a mismatch in the mapping table for the missing amino acids).

I therefore suggest preprocessing the PDB file before prediction: keep only the coordinates of the main-chain atoms and renumber the residues. With this approach I have been able to work around the issue, and it had a significant effect on the model's sequence recovery.
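For reference, here is a minimal sketch of the kind of preprocessing I mean. It is plain Python working on the fixed-column PDB text format and is not taken from this repository: `clean_pdb_lines`, `BACKBONE_ATOMS` and `STANDARD_RESIDUES` are names I made up for illustration, and restarting the numbering at 1 for each chain is my own assumption.

```python
# Hypothetical helper, not part of carbonmatrix: filter and renumber PDB ATOM records.
BACKBONE_ATOMS = {"N", "CA", "C", "O"}
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def clean_pdb_lines(lines):
    """Keep backbone atoms of standard residues and renumber residues from 1 per chain."""
    cleaned = []
    last_key = None   # (chain ID, original residue number + insertion code)
    new_number = 0
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # drops HETATM (ligands, waters) and every other record type
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        if atom_name not in BACKBONE_ATOMS or res_name not in STANDARD_RESIDUES:
            continue
        if line[16] not in (" ", "A"):
            continue  # keep only the first alternate location
        chain_id = line[21]
        key = (chain_id, line[22:27])
        if key != last_key:
            if last_key is None or chain_id != last_key[0]:
                new_number = 0  # assumption: numbering restarts for each chain
            new_number += 1
            last_key = key
        # Columns 23-26 hold the residue number, column 27 the insertion code.
        cleaned.append(f"{line[:22]}{new_number:>4d} {line[27:]}")
    return cleaned
```

Typical use would be `cleaned = clean_pdb_lines(open("8oom.pdb").readlines())`, then writing the result to a new file. Note that renumbering closes any gaps, so residues that are genuinely missing from the structure are no longer visible in the numbering.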

CarbonMatrixLab commented 1 week ago

Thank you. We are looking into this issue and will resolve it as soon as possible.

CarbonMatrixLab commented 1 week ago

In our paper, we preprocessed the structural data for the training and testing sets in the .cif format, because it allows a perfect mapping between sequence positions and structural residue numbers. For the PDB format, however, several issues still need to be addressed.
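To check whether a given PDB file has the numbering problems reported in this thread, a small diagnostic along these lines may help. It is only a sketch and not part of this repository: `numbering_issues` is a hypothetical helper, and it reads residue numbers straight from the fixed columns of ATOM records, ignoring insertion codes.

```python
def numbering_issues(lines):
    """Report non-positive residue numbers and numbering gaps per chain (ATOM records only)."""
    issues = []
    previous = {}  # chain ID -> last residue number seen on that chain
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]
        number = int(line[22:26])  # residue number, columns 23-26
        last = previous.get(chain_id)
        if number == last:
            continue  # another atom of the same residue
        if number <= 0:
            issues.append((chain_id, number, "non-positive"))
        if last is not None and number > last + 1:
            issues.append((chain_id, number, "gap"))
        previous[chain_id] = number
    return issues
```

Each entry in the returned list is a `(chain, residue number, kind)` tuple. An empty list means the numbering in every chain starts above zero and runs without gaps.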

chengeng17 commented 1 week ago

Thank you for your reply. I look forward to a fix for these problems.