Open wtporter opened 8 months ago
Thanks for pointing out the discrepancy @wtporter, I will look into it.
I've been adding new public sequences from the daily build to the public MSA, but that misses quite a few sequences over time. Sometimes a new sequence is available from GISAID earlier than from a public repository like GenBank, so the GISAID version of the sequence is aligned to the reference and added to the tree. Later, when the public version becomes available, it is renamed in the tree instead of being aligned and added, so it never reaches the public MSA. So I needed to round up 1.7 million missing sequences, align them, and add them to the MSA.
I have done that for the 2023-10-30 tree:
http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/2023/10/30/
Over time, the daily additions will gradually fall behind relative to the tree. Let me know if you need another update in the future.
Hi, I'm comparing the public-latest.all.fa sequences to the nwk and tsv metadata files, and there appears to be a discrepancy in the sample counts. The .fa file has ~6.6 million sequences, while the tsv and tree have ~8.3 million. Is the fasta reduced to just unique sequences, or is there an issue preventing all ~8.3 million sequences from being written to the fasta? http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/ Thanks for this great resource!
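For anyone who wants to reproduce the comparison, here is a minimal Python sketch that lists the names present in the metadata but absent from the fasta. It is not part of the UShER tooling. The function names are made up for illustration, and the metadata column name `strain` is an assumption; adjust it to match the header of the tsv you downloaded.

```python
import csv
from typing import Iterable, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from FASTA header lines (the text after '>')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            names.add(line[1:].strip())
    return names


def metadata_names(lines: Iterable[str], column: str = "strain") -> Set[str]:
    """Collect sample names from a tab-separated metadata file with a header row.

    The default column name 'strain' is an assumption, not confirmed in this thread.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def missing_from_fasta(
    fasta_lines: Iterable[str],
    tsv_lines: Iterable[str],
    column: str = "strain",
) -> Set[str]:
    """Return names that appear in the metadata but have no FASTA record."""
    return metadata_names(tsv_lines, column) - fasta_names(fasta_lines)
```

Both inputs are plain iterables of text lines, so you can pass open file handles directly. If the downloads are compressed, `gzip.open(path, "rt")` or `lzma.open(path, "rt")` from the standard library yields lines in the same way. Holding millions of names in two sets uses a fair amount of memory, but it keeps the sketch simple.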