westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

Back Translate 3Di Tokens to PDB Format #11

Closed mahdip72 closed 8 months ago

mahdip72 commented 9 months ago

Hi there. I don't have a strong understanding of 3D structure formats, so my questions may seem basic to others: Can we convert the 3Di tokens produced by Foldseek back into actual 3D PDB files? I want to develop a model that takes a protein sequence and returns its 3Di tokens as a representation of its 3D structure. Suppose the model predicted the tokens perfectly. How could I use it in practice for 3D structure prediction? Should I convert the outputs to a specific format? And how can I evaluate the model's predicted 3D structure against the true 3D structure?

Many thanks to anyone who can answer my questions.

LTEnjoy commented 9 months ago

Hello, interesting questions!

>Can we convert the 3Di tokens produced by foldseek to the actual 3D PDB files?

I think the answer is "No". Each 3Di token is derived from only a few features describing the geometry between two amino acids, such as distances and torsion angles, so most of the detailed information is lost and it is hard to recover a complete 3D PDB file from a 3Di sequence alone. If you want a deeper look at how to compress and recover PDB files, I recommend the Foldcomp paper (https://www.biorxiv.org/content/10.1101/2022.12.09.519715v1.full.pdf).

>I want to develop a model to get a protein sequence and returns their 3Di tokens as its 3D structure.

That is actually what ProstT5 did. I recommend checking the paper (https://www.biorxiv.org/content/10.1101/2023.07.23.550085v1) for more details.

>How can I evaluate the 3D structure prediction of the model with respect to the true 3D structure?

I think one convincing metric would be the TM-score. Computing the TM-score between the predicted structure and the true 3D structure tells you how accurately your model predicts protein structures.
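To make the TM-score idea a bit more concrete, here is a minimal sketch of its per-residue formula. It assumes the two structures have the same residues, are already aligned one-to-one, and are already superposed, so it does not perform the superposition search that a full TM-score computation maximizes over. The function names `tm_d0` and `tm_score_fixed` and the 0.5 floor on `d0` for short chains are assumptions made for illustration; for real evaluation a dedicated TM-score program is the safer choice.

```python
import math


def tm_d0(length):
    """Distance scale d0 used by the TM-score for a target of `length` residues."""
    if length <= 21:
        # Assumption: use a 0.5 floor for short chains, where the formula
        # below would give a very small or undefined value.
        return 0.5
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score_fixed(pred_ca, true_ca):
    """TM-score term sum for already superposed, one-to-one aligned C-alpha atoms.

    pred_ca, true_ca: lists of (x, y, z) tuples of equal length.
    No rotation or translation is optimized here, so the result is only a
    lower bound on the TM-score that a full superposition search would report.
    """
    if len(pred_ca) != len(true_ca) or not true_ca:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    d0 = tm_d0(len(true_ca))
    total = 0.0
    for p, t in zip(pred_ca, true_ca):
        d = math.dist(p, t)  # Euclidean distance between paired atoms
        total += 1.0 / (1.0 + (d / d0) ** 2)
    return total / len(true_ca)


# Toy example: a straight chain compared with itself scores 1.0.
chain = [(3.8 * i, 0.0, 0.0) for i in range(100)]
print(tm_score_fixed(chain, chain))
```

Because each residue contributes at most 1 and the sum is divided by the target length, the score stays between 0 and 1, and a residue displaced by exactly `d0` contributes 0.5.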

I hope these answers resolve your questions.

mahdip72 commented 9 months ago

Thank you so much for answering the questions!

mahdip72 commented 9 months ago

@LTEnjoy I have an additional question regarding the Foldseek 3Di tokens. How can I compare two sequences of predicted and true 3Di tokens? Can I use the TM-score for that? I am working on a model to predict 3Di tokens and I am looking for the best metric to evaluate it.

LTEnjoy commented 9 months ago

I think you could refer to the original Foldseek paper (https://www.nature.com/articles/s41587-023-01773-0). They built Foldseek to enable fast and accurate alignment between structures in the same way sequences are aligned. Maybe you could use some metrics from sequence alignment, or, most simply, the accuracy of whether the predicted token is identical to the true token at each position?
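As a rough sketch of both suggestions, the snippet below computes per-position accuracy and a simple global alignment score between a predicted and a true 3Di string. The function names, the example strings, and the match, mismatch, and gap values are all hypothetical choices for illustration; in particular, the flat match/mismatch scoring is an assumption and is not Foldseek's own 3Di scoring.

```python
def token_accuracy(pred, true):
    """Fraction of positions where the predicted 3Di token equals the true one."""
    if len(pred) != len(true) or not true:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for p, t in zip(pred, true) if p == t)
    return matches / len(true)


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The default scoring values are illustrative assumptions only.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Hypothetical 3Di strings, written in lowercase letters.
pred_3di = "dvvlq"
true_3di = "dvclq"
print(token_accuracy(pred_3di, true_3di))
print(global_alignment_score(pred_3di, true_3di))
```

Per-position accuracy fits the case where the model emits exactly one token per residue, so both strings have the same length. The alignment score is only needed if the predicted and true strings can differ in length.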

mahdip72 commented 8 months ago

Thanks again.