westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

Clarification Needed on Data Types #10

Closed Finnyear closed 10 months ago

Finnyear commented 10 months ago

Hello, I am currently exploring your project and have two questions regarding the data types:

  1. Raw Data Composition: Does the raw data provided in your repository include 3D coordinates? I checked some of the mdb data and found only amino acid sequences paired with 3D structure embedding sequences. Can you provide raw data with 3D coordinates?
  2. Data Utilization in Downstream Tasks: For the downstream tasks, are you primarily using real data, or data generated by AlphaFold? All the data I found comes with pLDDT scores.

Answering these questions will help me a lot. Thanks in advance.

LTEnjoy commented 10 months ago

Hi, thank you for being interested in our work!

  1. Raw Data Composition: We used 3D coordinates when evaluating various baselines, but we didn't upload those datasets because of storage limitations (saving 3D coordinates takes much more space than saving protein sequences alone). If you are interested in the datasets containing 3D coordinates, please email me for further discussion.

  2. Data Utilization in Downstream Tasks: We used AF2-predicted structures for a series of downstream tasks, including zero-shot prediction and supervised fine-tuning. We also compared SaProt's performance on both PDB and AF2 structures (please see Section 5.2 for more details).
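
For a rough sense of the storage point in item 1, here is a back-of-the-envelope sketch. It is not taken from the SaProt codebase. The function names, the choice of three backbone atoms per residue (N, CA, C), 4-byte floats, one byte per sequence character, and no compression are all assumptions made for illustration; the actual on-disk formats may differ.

```python
# Back-of-the-envelope storage comparison (illustrative assumptions only):
#   - a sequence stores a fixed number of 1-byte characters per residue
#   - coordinates store x, y, z as 4-byte floats for each stored atom
#   - nothing is compressed


def sequence_bytes(num_residues: int, chars_per_residue: int = 1) -> int:
    """Bytes needed for a plain-text sequence of the given length."""
    return num_residues * chars_per_residue


def coordinate_bytes(
    num_residues: int,
    atoms_per_residue: int = 3,  # assumed backbone atoms: N, CA, C
    bytes_per_value: int = 4,    # assumed float32
) -> int:
    """Bytes needed for raw x, y, z coordinates of the stored atoms."""
    return num_residues * atoms_per_residue * 3 * bytes_per_value


if __name__ == "__main__":
    n = 300  # hypothetical protein length
    seq = sequence_bytes(n)
    xyz = coordinate_bytes(n)
    print(f"sequence: {seq} bytes, coordinates: {xyz} bytes, ratio: {xyz / seq:.0f}x")
```

Under these assumptions the coordinates cost 36 bytes per residue against 1 byte for the sequence. Storing two characters per residue (for example an amino acid plus a structure token, passed as `chars_per_residue=2`) would halve the ratio, while storing all atoms instead of only the backbone would raise it further.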