yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

NC_045512 shows up in MAT when extracting from a WNV auspice tree #378

Open jcw349 opened 3 days ago

jcw349 commented 3 days ago

Hi,

I am trying to convert a West Nile virus Auspice tree to a MAT. matUtils extract keeps including "NC_045512" in the output file, even though it is not in my input Auspice file.

matUtils extract -i auspice/WNV-global.json -o usher/WNV-global.pb

I tried specifying the reference files and metadata as well, but it's still creating the same MAT:

matUtils extract -i auspice/WNV-global.json -g config/reference.gtf -f config/reference.fasta -o usher/WNV-global.pb

reference: NC_009942

VCF from the output MAT:

matUtils extract -i usher/WNV-global.pb -g config/reference.gtf -f config/reference.fasta -v usher/mutations.vcf

First 5 rows and 12 columns of the vcf:

##fileformat=VCFv4.2
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  AY765264        KJ831223        FJ159130
NC_045512       1       A1C,A1G,A1T     A       C,G,T   .       .       AC=16,67,3;AN=4560      GT      0       0       0
NC_045512       2       G2A,G2C,G2T     G       A,C,T   .       .       AC=12,4,20;AN=4560      GT      0       0       0
NC_045512       3       T3A,T3C,T3G     T       A,C,G   .       .       AC=89,25,4;AN=4560      GT      1       0       0

Not sure what I'm doing wrong.

Thank you, Jade W

AngieHinrichs commented 2 days ago

Sorry about that @jcw349! When writing VCF, "NC_045512" is hardcoded! We should be able to do better than that.

Hopefully something like this will work for you in the meantime?:

sed -e 's/^NC_045512/NC_009942/;' usher/mutations.vcf > usher/mutations.renamed.vcf
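
A slightly stricter variant of the same workaround, as a sketch only: it uses awk rather than sed so that the rename applies only when the first tab-separated column (CHROM) is exactly the old name, and header lines are passed through unchanged. The function name rename_vcf_chrom is hypothetical and not part of matUtils; the sketch assumes a tab-delimited VCF, and it would not touch a ##contig header line if one were present.

```shell
# Hypothetical helper (not part of matUtils): rename the CHROM column of a VCF.
# Usage: rename_vcf_chrom OLD_NAME NEW_NAME input.vcf > output.vcf
rename_vcf_chrom() {
  awk -F'\t' -v OFS='\t' -v old="$1" -v new="$2" '
    /^#/ { print; next }       # header and meta lines pass through unchanged
    $1 == old { $1 = new }     # exact match on CHROM, not just a prefix match
    { print }
  ' "$3"
}

# Example call (paths taken from this thread):
# rename_vcf_chrom NC_045512 NC_009942 usher/mutations.vcf > usher/mutations.renamed.vcf
```

Running grep -v '^#' usher/mutations.renamed.vcf | cut -f1 | sort -u afterwards should list only the new name if the rename worked.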

jcw349 commented 2 days ago

No worries!! Thank you for looking into this and sharing a solution to fix the VCF.

matUtils extract -i <json_file> -o <mat.pb> also seems to label non-COVID trees with the same reference, NC_045512. It was in the output MAT file too. Not sure whether that will have a big impact on doing other things with the file?

For now I remade the MAT.pb using UShER, which did end up using NC_009942, but it didn't keep everything I had initially put into the Nextstrain tree, such as filters and colors.

AngieHinrichs commented 2 days ago

Yes, the MAT protobuf contains only the mutation-annotated tree, not the many other things that can be layered onto Nextstrain's Auspice JSON format. The matUtils extract options for adding in a reference, metadata, etc. are only used when generating JSON output, AFAIK.

jcw349 commented 1 day ago

Sorry, what I mean is that when I used matUtils extract to convert the Auspice JSON to a MAT protobuf, the resulting MAT.pb ended up having NC_045512 in it too, even though that's not in the input JSON.

AngieHinrichs commented 1 day ago

Yes, NC_045512 is hardcoded when importing JSON, sorry. Does auspice/WNV-global.json contain "NC_009942" anywhere in it? (If you're able to share auspice/WNV-global.json privately then I can take a look at how matUtils might figure out what the reference name should be.)
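
One quick way to check that, as a sketch: count how many times the accession string occurs in the file. The helper name count_accession is hypothetical and not part of matUtils. It uses grep -o so that every match is counted separately, which matters if the JSON is written as one long line.

```shell
# Hypothetical helper (not part of matUtils): count occurrences of a fixed string in a file.
# Usage: count_accession ACCESSION FILE
count_accession() {
  # -F treats the accession as a literal string; -o prints one line per match.
  # "|| true" keeps the pipeline from failing when there are no matches.
  { grep -o -F -- "$1" "$2" || true; } | wc -l | tr -d ' '
}

# Example call (path taken from this thread):
# count_accession NC_009942 auspice/WNV-global.json
```

A result of 0 would mean the reference name does not appear anywhere in the JSON and would have to be supplied some other way.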