westlake-repl / SaProt

[ICLR'24 spotlight] SaProt: Protein Language Model with Structural Alphabet
MIT License

Mismatch between ESM2 pretraining dataset and SaProt pretraining dataset #27

Closed · ItamarChinn closed this 5 months ago

ItamarChinn commented 5 months ago

Hi there, thank you for your impressive work.

I downloaded the pretraining dataset that you published here: https://huggingface.co/datasets/westlake-repl/AF2_UniRef50

However, when I load the database, I find that the validation set contains only ~20k UniProt entries. Your paper says:

"B PRE-TRAINING DATA PROCESSING We adhere to the procedures outlined in ESM-2 Lin et al. (2022) to generate filtered sequence data, and then we retrieve all AF2 structures via the AlphaFoldDB website https://alphafold. ebi.ac.uk/ based on the UniProt ids of protein sequences, collecting approximately 40 million structures."

And ESM-2 Lin et al. (2022) says:

"A. Materials and Methods A.1. Data A.1.1. SEQUENCE DATASET USED TO TRAIN ESM-2 UniRef50, September 2021 version, is used for the training of ESM models. The training dataset was partitioned by randomly selecting 0.5% (≈ 250,000) sequences to form the validation set."

So I am wondering whether some data is missing (i.e., the remaining ~230k validation UniProt ids) or whether I have done something wrong.
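For concreteness, this is roughly the check I am running (a minimal sketch; it assumes each split is stored as an LMDB database, as in the SaProt data pipeline, and the `valid` subdirectory name is my guess at the layout):

```python
# Sketch of the validation-set size check (not an official SaProt script).
# Assumes the published dataset stores each split as an LMDB database and
# that the validation split lives under a "valid" directory; both of these
# are assumptions about the repo layout.
import lmdb
from huggingface_hub import snapshot_download

# Download the published pretraining dataset from the Hugging Face Hub.
local_dir = snapshot_download(
    repo_id="westlake-repl/AF2_UniRef50",
    repo_type="dataset",
)

# Open the (assumed) validation LMDB read-only and report its entry count.
env = lmdb.open(f"{local_dir}/valid", readonly=True, lock=False)
print("validation entries:", env.stat()["entries"])
env.close()
```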

Many thanks in advance.

LTEnjoy commented 5 months ago

Hi, thank you for your interest in our work!

We strictly followed ESM-2 to construct our training set. For the validation set, however, we didn't use that many proteins to track training, since validating on ~250K sequences would reduce training efficiency. We therefore randomly sampled 20K proteins for validation and found that this smaller set still reflects model convergence well.
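For anyone who wants to reproduce a comparable split, a minimal sketch of such a subsample (the id-list filenames and the seed are illustrative placeholders, not the values we used):

```python
# Illustrative subsampling of a lighter validation set (not the exact script
# used for SaProt). The input/output filenames and seed are placeholders.
import random

random.seed(42)  # placeholder seed; the actual seed is not published

# Hypothetical list of the ~250K ESM-2-style validation UniProt ids, one per line.
with open("esm2_validation_uniprot_ids.txt") as f:
    pool = [line.strip() for line in f if line.strip()]

# Draw 20K ids without replacement to form the lighter validation set.
subset = random.sample(pool, k=20_000)

with open("validation_20k_ids.txt", "w") as f:
    f.write("\n".join(subset))
```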

Hope this resolves your problem! :)