Different number of Sequences in Tree and Fasta

wtporter commented 8 months ago

Hi, I'm comparing the public-latest.all.fa sequences to the nwk and tsv metadata files, and there appears to be a discrepancy in the sample counts. The .fa file has ~6.6 million sequences, while the tsv and tree have ~8.3 million. Is the fasta reduced to just unique sequences, or is there an issue preventing all ~8.3 million sequences from being written to the fasta? http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ Thanks for this great resource!
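For anyone wanting to reproduce this kind of count comparison, here is a minimal sketch. It assumes the files have already been downloaded and decompressed, that every FASTA record starts with a `>` header line, and that the metadata TSV has a single header row. The function names and the tiny in-memory inputs are illustrative only and are not part of the UShER tooling.

```python
import io


def count_fasta_records(lines):
    """Count FASTA records: one per header line beginning with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def count_tsv_rows(lines, has_header=True):
    """Count non-blank data rows, skipping an assumed single header row."""
    n = sum(1 for line in lines if line.strip())
    return max(n - 1, 0) if has_header else n


# Tiny in-memory stand-ins for the real files (hypothetical data).
fasta = io.StringIO(">s1\nACGT\n>s2\nAC\nGT\n")
tsv = io.StringIO("strain\tdate\ns1\t2020-01-01\ns2\t2020-01-02\ns3\t2020-01-03\n")

n_fa = count_fasta_records(fasta)
n_tsv = count_tsv_rows(tsv)
print(n_fa, n_tsv, n_tsv - n_fa)  # 2 3 1
```

With real files, the same functions can be passed an open file handle (for example from `open(path)`), since they only iterate over lines and never load the whole file into memory.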

AngieHinrichs commented 8 months ago

Thanks for pointing out the discrepancy @wtporter, I will look into it.

AngieHinrichs commented 8 months ago

I've been adding new public sequences from the daily build to the public MSA, but that misses quite a few sequences over time. Sometimes a new sequence is available from GISAID earlier than from a public repository like GenBank, so the GISAID version of the sequence is aligned to the reference and added to the tree. Later, when the public version becomes available, the sequence is renamed in the tree instead of being aligned and added. So I needed to round up 1.7 million missing sequences, align them, and add them to the MSA.

I have done that for the 2023-10-30 tree:

http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2023/10/30/

Over time, the daily additions to the MSA will gradually fall behind the tree again. Let me know if you need another update in the future.
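To see how far the FASTA has drifted from the tree at any point, one option is to list the names that appear in the metadata but not in the FASTA. The sketch below is an illustration under stated assumptions, not the procedure used to build the MSA: it assumes FASTA header text (after `>`) matches the metadata name exactly, and that the name column is called `strain`, which should be checked against the actual TSV header.

```python
import csv
import io


def fasta_names(lines):
    """Collect the full header text (after '>') of every FASTA record."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def metadata_names(lines, column="strain"):
    """Collect sequence names from a tab-separated file.

    The column name 'strain' is an assumption; adjust it to the real header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def missing_from_fasta(fasta_lines, metadata_lines, column="strain"):
    """Return sorted names present in the metadata but absent from the FASTA."""
    return sorted(metadata_names(metadata_lines, column) - fasta_names(fasta_lines))


# Tiny in-memory stand-ins for the real files (hypothetical data).
fasta = io.StringIO(">s1\nACGT\n>s2\nACGT\n")
metadata = io.StringIO("strain\tdate\ns1\t2020-01-01\ns2\t2020-01-02\ns3\t2020-01-03\n")

print(missing_from_fasta(fasta, metadata))  # ['s3']
```

At the scale discussed here (millions of names), holding both name sets in memory is usually workable, but the sets will take a few hundred megabytes or more depending on name length.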

yatisht / usher

Different number of Sequences in Tree and Fasta #356