yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License

haplotypes gives mutations and their reversions in `matUtils summary --haplotype` #326

Closed jbloom closed 1 year ago

jbloom commented 1 year ago

Not sure if this is a bug or just non-intuitive behavior, but `matUtils summary --haplotype` gives entries like `A10323G,G10323A` for sequences that got a mutation and then reverted it.

In my mind, this would be better described as a mutation history than a haplotype, as the sequence of these two mutations ends up leading to no change, so the resulting haplotype is identical to a sequence that never got a mutation at site 10323.

jmcbroome commented 1 year ago

I would describe this as non-intuitive behavior resulting from some naive implementation early on. I think, on the whole, for this output we probably want to negate these mutations with respect to the haplotype, so that counts for a haplotype with two reversions at site N will be lumped with an identical haplotype that never had any changes at site N. I will look into it this week. Thank you for raising this issue!
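To make the proposed cancellation concrete, here is a minimal Python sketch of collapsing an ordered root-to-tip mutation path into its net changes. It is not the matUtils implementation; the function name `net_haplotype` and the assumption that mutations are plain strings such as `A10323G` (base before, 1-based site, base after) are illustrative only.

```python
import re

# Assumed mutation format: <base before><site><base after>, e.g. "A10323G".
MUTATION = re.compile(r"([ACGT])(\d+)([ACGT])")


def net_haplotype(path):
    """Collapse an ordered mutation path into the net change at each site.

    A mutation followed by its reversion cancels out, so the site is
    dropped from the result entirely.
    """
    first_base = {}  # site -> base before the first mutation on the path
    last_base = {}   # site -> base after the last mutation on the path
    for mut in path:
        match = MUTATION.fullmatch(mut)
        if match is None:
            raise ValueError(f"unrecognized mutation: {mut!r}")
        before, site, after = match.group(1), int(match.group(2)), match.group(3)
        first_base.setdefault(site, before)
        last_base[site] = after
    # Keep only sites whose final base differs from the starting base.
    return [
        f"{first_base[site]}{site}{last_base[site]}"
        for site in sorted(first_base)
        if first_base[site] != last_base[site]
    ]


print(net_haplotype(["C241T", "A10323G", "G10323A"]))
```

With this approach, a path containing `A10323G` followed by `G10323A` yields the same haplotype as a path that never touched site 10323, so the two would be counted together.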

corneliusroemer commented 1 year ago

I haven't used matUtils, but I think it would be good to retain the ability to extract the "mutation history" or "path", even if not as --haplotype. Maybe this is already possible with a different subcommand.

Basically, these sorts of flip-flops can be very useful to locate problematic parts of a tree.
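As a rough illustration of how flip-flops could be pulled out of a full mutation path, the Python sketch below flags sites that return to a base they held earlier on the path. The function name `flip_flop_sites` and the `A10323G`-style string format are assumptions for illustration, not part of matUtils.

```python
import re
from collections import defaultdict

# Assumed mutation format: <base before><site><base after>, e.g. "A10323G".
MUTATION = re.compile(r"([ACGT])(\d+)([ACGT])")


def flip_flop_sites(path):
    """Return {site: [mutations]} for sites that revisit an earlier base."""
    per_site = defaultdict(list)
    for mut in path:
        match = MUTATION.fullmatch(mut)
        if match is None:
            raise ValueError(f"unrecognized mutation: {mut!r}")
        per_site[int(match.group(2))].append(mut)

    flagged = {}
    for site, muts in per_site.items():
        seen = {muts[0][0]}  # base at this site before its first mutation
        for mut in muts:
            after = mut[-1]
            if after in seen:
                # The site has returned to a base it held earlier.
                flagged[site] = muts
                break
            seen.add(after)
    return flagged


print(flip_flop_sites(["C241T", "A10323G", "G10323A"]))
```

Sites that are flagged across many samples would be natural candidates for a closer look at the surrounding part of the tree.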

If you don't want to break backwards compatibility, you could add @jbloom's proposed cancellation mode as a new option such as `--true-haplotype`, while keeping the old behaviour under the existing `--haplotype` flag.

If you are happy to break backwards compatibility you could add the old mode as --mutation-path.

Do you currently treat the haplotype as an unordered set of mutations that occur from root to tip? If so, there could be a third way to summarize: by ordered list of mutations to the tip.

Probably @AngieHinrichs knows this stuff inside out :)

I'm thinking of adding Usher paths/haplotypes to pango-sequences's summary.json (https://github.com/corneliusroemer/pango-sequences) so will look into it in a bit.

jmcbroome commented 1 year ago

You can use `matUtils extract -S` to get the full mutation path for any or all samples, which is why I'm leaning towards producing a true haplotype for `matUtils summary --haplotype` output.

jmcbroome commented 1 year ago

Additionally, if you're interested in adding UShER paths to the summary.json, you will probably want to use matUtils extract -C, which is explicitly intended for producing the paths of every annotated clade.

russcd commented 1 year ago

I believe this is resolved via @jmcbroome's PR.