Closed ItamarChinn closed 5 months ago
Hi, thank you for your interest in our work!
We strictly followed ESM-2 to construct our training set. For the validation set, however, we didn't use that many proteins to track model training, as it would reduce training efficiency. Instead, we randomly sampled 20K proteins for validation and found that this subset still reflects model convergence well.
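For anyone who wants to do a similar subsampling on their own split, here is a minimal sketch of drawing a fixed-size random validation subset. It is not the exact procedure used for the published dataset: the function name `sample_validation_ids`, the seed value, and the assumption that the validation proteins are available as a list of ID strings are all hypothetical.

```python
import random


def sample_validation_ids(all_ids, k=20_000, seed=0):
    """Return a reproducible random subset of at most k IDs.

    `all_ids`, `k`, and `seed` are illustrative; the seed is an assumption,
    not a value taken from the paper or the repository.
    """
    rng = random.Random(seed)
    # Sort first so the result depends only on the seed, not on input order.
    ids = sorted(all_ids)
    if len(ids) <= k:
        # Nothing to subsample: keep the whole set.
        return ids
    # Sample without replacement, so no protein appears twice.
    return rng.sample(ids, k)
```

Sorting before sampling and fixing the seed keeps the subset identical across runs, so validation curves from different training runs remain comparable.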
Hope this resolves your problem! :)
Hi there, thank you for your impressive work.
I downloaded the pretraining dataset that you published here: https://huggingface.co/datasets/westlake-repl/AF2_UniRef50
However, when I load the DB, I find that the validation set contains only ~20k UniProt entries. Your paper says:
And ESM-2 Lin et al. (2022) says:
So I am wondering whether some data is missing (i.e., the remaining ~240k validation UniProt entries) or whether I have done something wrong.
Many thanks in advance.