westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

Dataset overlap #40

Closed: lhallee closed this issue 3 months ago

lhallee commented 3 months ago

Hello,

It seems there is overlap between the sequences in some of your validation / test sets and the training set. I may have processed it wrong; could you confirm this result on your end?

[image: saprot_overlap]

Best, Logan
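For readers who want to reproduce this kind of check, here is a minimal sketch of an exact-match overlap test between two splits. It is not the script used in this issue: the function name `split_overlap`, the dict-of-sequences input format, and the toy sequences are all assumptions made for illustration.

```python
def split_overlap(train, test):
    """Find test entries whose sequence also occurs in the training split.

    Both arguments are assumed to be dicts mapping an ID to an amino-acid
    sequence string. Only exact sequence matches are counted.
    """
    train_seqs = set(train.values())
    # IDs in the test split whose sequence is identical to some training sequence
    shared = sorted(pid for pid, seq in test.items() if seq in train_seqs)
    fraction = len(shared) / len(test) if test else 0.0
    return shared, fraction


# Toy placeholder data (hypothetical IDs and sequences, not from SaProt)
train = {"id_a": "MKTAYIAK", "id_b": "GSHMLEDP"}
test = {"id_c": "MKTAYIAK", "id_d": "AAAAGGGG"}

shared, fraction = split_overlap(train, test)
print(shared, fraction)
```

Comparing by sequence rather than by ID is what surfaces the overlap: two entries with different IDs still count as shared if their sequences are identical.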

LTEnjoy commented 3 months ago

Hi Logan,

This is most likely caused by duplicated proteins in the UniProt database. Proteins with different UniProt IDs can share the same sequence, e.g. P0CX56 and P0CX55. We created the datasets based on UniProt IDs.

Additionally, the datasets with PDB structures, such as EC and GO, provide samples as PDB IDs and chains, and we mapped these to the corresponding UniProt IDs. Different PDB IDs and chains can map to the same UniProt ID, which causes the overlap, e.g. the protein Q92831 has the PDB IDs and chains 4NSQ-A and 1CM0-B.

[image]

Nevertheless, the overlap covers only a small part of the test set, and we use the same datasets for all baselines, so this should not affect the conclusions.
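The second cause described above, several PDB chains mapping to one UniProt ID, can be illustrated with a short sketch. This is not the SaProt preprocessing code: the function name `chains_sharing_uniprot` and the mapping format are assumptions, and the third entry in the example mapping is a made-up placeholder. Only the Q92831 example comes from this thread.

```python
from collections import defaultdict


def chains_sharing_uniprot(chain_to_uniprot):
    """Group PDB chains by UniProt ID and keep IDs with more than one chain.

    `chain_to_uniprot` is assumed to map a "PDBID-CHAIN" string to a
    UniProt accession.
    """
    groups = defaultdict(list)
    for chain, uniprot_id in chain_to_uniprot.items():
        groups[uniprot_id].append(chain)
    # Only UniProt IDs reached by several chains can produce duplicates
    return {uid: sorted(chains) for uid, chains in groups.items() if len(chains) > 1}


mapping = {
    "4NSQ-A": "Q92831",         # example given in this thread
    "1CM0-B": "Q92831",         # example given in this thread
    "XXXX-A": "HYPOTHETICAL1",  # placeholder entry, not a real record
}

dupes = chains_sharing_uniprot(mapping)
print(dupes)
```

If samples keyed by chain are assigned to splits independently, any UniProt ID returned here could end up in both training and test data.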

lhallee commented 3 months ago

Gotcha, just wanted to point it out. Thanks!