yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
120 stars 40 forks source link

Easily extract metadata for node ID of interest identified through ripples + general usage query #333

Closed charlesfoster closed 1 year ago

charlesfoster commented 1 year ago

Hi,

I've been trying out using ripples for exploratory tasks. Basically, I was hoping that if I come across strange samples I suspect of being recombinant that I could place them on the global tree using usher, downsample the tree to a representative set of samples per clade using matUtils ( to save on computational time, while also keeping the new sequences), then run ripples. I would then parse the results to find clades with solid evidence of being recombinant and see if the new samples fall within those clades. Finally, I would check the lineages of the putative donor and acceptor clades to try and narrow down the donor/acceptor lineages.

matUtils extract --get-representative 20 -o downsampled.pb.gz -i public-latest.all.masked.pb.gz
usher -T 10 -i downsampled.pb.gz -v aligned_seqs.vcf -o user_seqs.pb
grep -e '>' test_samples.fa | perl -pi -e 's/>//' > user_samples.txt
mkdir USER_SAMPLES/
ripples -i user_seqs.pb  -s user_samples.txt -d USER_SAMPLES/ -T 10
# investigate output
awk '$11 - $12 > 5' USER_SAMPLES/recombination.tsv
# from the output above, e.g. node_26168 is identified as recombinant
# donor: e.g. node_41876; acceptor: e.g.node_26139
matUtils extract -i user_seqs.pb -I node_41876 -o node_41876.pb.gz
matUtils summary -i node_41876.pb.gz -c node_41876.clades.tsv
matUtils extract -i user_seqs.pb -I node_26139 -o node_26139.pb.gz
matUtils summary -i node_26139.pb.gz -c node_26139.clades.tsv

Does this seem reasonable, or am I misusing/misunderstanding the purpose of these tools? Is there a smarter way to extract the information I want at the end instead of 2x extract and 2x summary commands?

Apologies for these basic questions!

Thanks, Charles

jmcbroome commented 1 year ago

Hi Charles-

I'm not the right one to speak to the details of RIPPLES or the implications of your downsampling procedure on its results, but I can say that you can probably have simpler options than 4 matUtils calls. If I understand correctly, you're interested in examining all the clades descended from your putative recombinant? You may want to try matUtils extract -C, which will yield the set of all clades that are in your subtree (along with mutations defining them), to avoid calling summary.

Alternatively, if you're willing to do some scripting in Python, you can avoid repeated matUtils calls altogether by using BTE, our Python API. It would be straightforward to write a script that finds the matching node for each entry in your recombinant output table and computes the downstream clade membership, while only requiring loading the tree once overall.

yatisht commented 1 year ago

In addition to what Jakob mentioned, I would suggest you use ripples-fast, which produces the same results ripples, but is a lot faster. Also, are you aware of RIVET (http://rivet.ucsd.edu/)? It can be used to automate your pipeline and also visualize your recombinants using instructions provided in https://github.com/TurakhiaLab/rivet.

charlesfoster commented 1 year ago

Thank you both for your replies. I'll check out using matUtils extract -C, ripples-fast, and rivet. BTE also sounds useful, thanks.