yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
121 stars 40 forks source link

matOptimize sometimes failing when --vcf used #323

Closed martinghunt closed 1 year ago

martinghunt commented 1 year ago

I'm getting this error on 3 of my 8 data sets (5 run perfectly) when running matOptimize:

old node id 2052 not found, condensed vector addr 14cd87f66c98

I guess is hitting this: https://github.com/yatisht/usher/blob/b53483e1b96447ae3846688f812b16a8859368fa/src/matOptimize/condense.cpp#L137-L139

I've attached example input files. Command I'm running is:

matOptimize --vcf fatoVcf.vcf.gz -T 10 -r 8 -M 2 -i tree.pb -o out.pb

...which throws that error. It runs fine without --vcf fatoVcf.vcf.gz. I think this is a bug?

usher_files.zip

russcd commented 1 year ago

@yceh can you take a look at this when you get a moment?

yceh commented 1 year ago

Could you please tell us more on how the tree.pb is generated? It seems to be corrupted, as one leaf node name is an empty string. condensed_nodes { node_name: "node_1_condensed_2_leaves" condensed_leaves: "" condensed_leaves: "SRR18554366" }

martinghunt commented 1 year ago

tree.pb was made with Usher. The whole pipeline I'm running is:

  1. Use mafft --keeplength on each sequence vs ref
  2. faToVcf -includeNoAltN <MSA fasta from 1> <out.vcf>
  3. usher --sort-before-placement-3 --vcf <VCF from 2> --tree <empty tree file> -T 20 --save-mutation-annotated-tree tree.pb (empty tree file just contains ())
  4. matOptimize -T 20 -r 8 -M 2 -i <tree.pb from 3> -o tree.optimized.pb
  5. usher_to_taxonium --input <tree.optimized.pb from 4> …

I've attached the MSA fasta here used in step 3. Have double-checked that none of the sequence names are empty. There's the ref first NC_045512 and then the rest are all called SRR*. msa.fa.gz

yceh commented 1 year ago

Could you please use usher-sampled instead of usher (command line arguments are the same)? This is a fix in empty tree handling that is not back-ported to usher. By the way, it is better to run matOptimize with argument -m 0.000000001 for small trees, so it won't stop too early when there are still optimization opportunities. Thanks!

martinghunt commented 1 year ago

Thank you! It's all working perfectly for me now.