westlake-repl / SaProt

[ICLR'24 spotlight] SaProt: Protein Language Model with Structural Alphabet
MIT License

Mismatch between ESM2 pretraining dataset and SaProt pretraining dataset #27

Closed · ItamarChinn closed this 5 months ago

ItamarChinn commented 5 months ago

Hi there, thank you for your impressive work.

I downloaded the pretraining dataset that you published here: https://huggingface.co/datasets/westlake-repl/AF2_UniRef50

However, when I load the database, I find that the validation set contains only ~20k UniProt entries. Your paper says:

"B PRE-TRAINING DATA PROCESSING We adhere to the procedures outlined in ESM-2 Lin et al. (2022) to generate filtered sequence data, and then we retrieve all AF2 structures via the AlphaFoldDB website https://alphafold. ebi.ac.uk/ based on the UniProt ids of protein sequences, collecting approximately 40 million structures."

And ESM-2 Lin et al. (2022) says:

"A. Materials and Methods A.1. Data A.1.1. SEQUENCE DATASET USED TO TRAIN ESM-2 UniRef50, September 2021 version, is used for the training of ESM models. The training dataset was partitioned by randomly selecting 0.5% (≈ 250,000) sequences to form the validation set."

So I am wondering whether some data is missing (i.e., the remaining ~230k validation UniProt ids) or whether I have done something wrong.
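For concreteness, this is roughly the check I am running (a minimal sketch; it assumes each split is stored as an LMDB database, as in the SaProt data pipeline, and the `valid` subdirectory name is my guess at the layout):

```python
# Sketch of the validation-set size check (not an official SaProt script).
# Assumes the published dataset stores each split as an LMDB database and
# that the validation split lives under a "valid" directory; both of these
# are assumptions about the repo layout.
import lmdb
from huggingface_hub import snapshot_download

# Download the published pretraining dataset from the Hugging Face Hub.
local_dir = snapshot_download(
    repo_id="westlake-repl/AF2_UniRef50",
    repo_type="dataset",
)

# Open the (assumed) validation LMDB read-only and report its entry count.
env = lmdb.open(f"{local_dir}/valid", readonly=True, lock=False)
print("validation entries:", env.stat()["entries"])
env.close()
```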

Many thanks in advance.

LTEnjoy commented 5 months ago

Hi, thank you for your interest in our work!

We strictly followed ESM-2 to construct our training set. For the validation set, however, we didn't use that many proteins to track training, since validating on ~250K sequences would reduce training efficiency. We therefore randomly sampled 20K proteins for validation and found that this smaller set still reflects model convergence well.
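For anyone who wants to reproduce a comparable split, a minimal sketch of such a subsample (the id-list filenames and the seed are illustrative placeholders, not the values we used):

```python
# Illustrative subsampling of a lighter validation set (not the exact script
# used for SaProt). The input/output filenames and seed are placeholders.
import random

random.seed(42)  # placeholder seed; the actual seed is not published

# Hypothetical list of the ~250K ESM-2-style validation UniProt ids, one per line.
with open("esm2_validation_uniprot_ids.txt") as f:
    pool = [line.strip() for line in f if line.strip()]

# Draw 20K ids without replacement to form the lighter validation set.
subset = random.sample(pool, k=20_000)

with open("validation_20k_ids.txt", "w") as f:
    f.write("\n".join(subset))
```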

Hope this resolves your problem! :)