westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

The length of the Metal Ion Binding dataset is different from that in the paper #21

Closed: mikochou closed this issue 6 months ago

mikochou commented 6 months ago

Hello, very nice and inspiring work!

However, I noticed that the Metal Ion Binding dataset is smaller than the size reported in the paper. In the paper, the dataset size is (valid: 1066, test: 1083), but in the uploaded lmdb file, the dataset size is (valid: 664, test: 667).

How can I get the complete data?

Thank you!

LTEnjoy commented 6 months ago

Hi, thank you for your interest in our work!

The Metal Ion Binding dataset with a validation set of 1066 samples and a test set of 1083 samples is an early version that was not clustered by sequence identity. The uploaded lmdb file is the final data on which we tested all baselines. We have updated our paper here. In the latest preprint, we revised the dataset sizes to match the data actually used.

Hope this resolves your problem!

mikochou commented 6 months ago

Thank you for your reply!