xavierdidelot / ClonalFrameML

ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes
GNU General Public License v3.0

Inquiry on difference of Recombination Visualization between cfml_results.R and Phandango #140

Closed Tonny-zhou closed 1 year ago

Tonny-zhou commented 1 year ago

Dear Prof. Xavier Didelot,

I have a question about how bacterial recombination is visualized by the cfml_results.R script compared with the Phandango tool. After examining cfml_results.R, I noticed that if a recombination event occurs on the branch leading to a non-terminal internal node, the output graphic displays the event only in the heatmap row for that node in the phylogenetic tree (i.e., just one row). In Phandango, on the other hand, it appears that the same event is displayed in the heatmap rows of all leaf nodes descending from that internal node (i.e., multiple rows). In your opinion, which of these visualization methods gives a more accurate representation of the recombination events among bacterial strains? I would appreciate any insights or suggestions you could provide. Thank you for your time and consideration.

Best regards, Tonny
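
To make the two layouts described in the question concrete, here is a minimal Python sketch. It is not taken from cfml_results.R or Phandango; the toy tree, the node names, the event tuples, and the function names are all assumptions made for illustration. It only shows how the same list of events maps to rows under each convention.

```python
from collections import defaultdict

# Hypothetical toy tree: internal node -> list of children.
TREE = {
    "root": ["N1", "C"],
    "N1": ["A", "B"],
}

# Hypothetical events as (node, start, end); the node names the branch
# leading to that node.
EVENTS = [("N1", 100, 200), ("A", 500, 650)]


def leaves_under(tree, node):
    """Return the leaves descending from node (node itself if it is a leaf)."""
    children = tree.get(node)
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves_under(tree, child))
    return found


def rows_by_branch(events):
    """Layout (1): each event is drawn once, on the row of its own node."""
    rows = defaultdict(list)
    for node, beg, end in events:
        rows[node].append((beg, end))
    return dict(rows)


def rows_by_leaf(tree, events):
    """Layout (2): each event is copied to every leaf below the affected branch."""
    rows = defaultdict(list)
    for node, beg, end in events:
        for leaf in leaves_under(tree, node):
            rows[leaf].append((beg, end))
    return dict(rows)
```

With this toy input, the event on internal node `N1` occupies one row in layout (1) but is repeated on the rows of both `A` and `B` in layout (2).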

xavierdidelot commented 1 year ago

Hi Tonny,

Yes, you are absolutely correct about this difference in how recombination events are represented by the ClonalFrameML R script (1) and Phandango (2). I think there are pros and cons to both. What is nice about (1) is that it shows each recombination event in the same way, on a single row, no matter how many genomes are affected. When an event affects many genomes (which could even be all but one), it seems a bit inefficient to repeat it for every affected genome as in (2). Also, (2) becomes problematic if there is overlapping recombination on branches where one leads to the other, since the events will be drawn on top of each other, whereas (1) always shows events on separate rows when they affect separate branches. The disadvantage of (1) is that we need a row for each node (ancestral or terminal), whereas in (2) we only need a row for each terminal node, i.e. for each genome. So when analysing many genomes, (1) can end up with more rows than can be seen clearly. I think it makes sense to use (1) if there are not too many genomes and/or quite a lot of recombination events, and (2) if there are a lot of genomes and/or not too many recombination events.

Best wishes, Xavier
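
The two trade-offs raised in the reply, row count and overlapping events, can be checked on a small example. The following Python sketch is only an illustration under stated assumptions: the toy tree, the events, and every function name are hypothetical and are not part of ClonalFrameML or Phandango.

```python
# Hypothetical toy tree (internal node -> children) and events (node, start, end).
TREE = {"root": ["N1", "C"], "N1": ["A", "B"]}
EVENTS = [("N1", 100, 300), ("A", 250, 400)]


def all_nodes(tree):
    """Every node in the tree, internal or terminal."""
    nodes = set(tree)
    for children in tree.values():
        nodes.update(children)
    return nodes


def leaves(tree):
    """Terminal nodes, i.e. nodes that have no children."""
    return {n for n in all_nodes(tree) if n not in tree}


def lineage(tree, node):
    """Return node plus all of its ancestors up to the root."""
    parent = {c: p for p, cs in tree.items() for c in cs}
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def clashes_on_leaf_row(tree, events, leaf):
    """Pairs of events from different branches that overlap on one leaf's row."""
    above = set(lineage(tree, leaf))
    on_row = [e for e in events if e[0] in above]
    clashes = []
    for i, (n1, b1, e1) in enumerate(on_row):
        for n2, b2, e2 in on_row[i + 1:]:
            # Closed intervals overlap when each starts before the other ends.
            if n1 != n2 and b1 <= e2 and b2 <= e1:
                clashes.append(((n1, b1, e1), (n2, b2, e2)))
    return clashes


rows_layout_1 = len(all_nodes(TREE))  # one row per node, ancestral or terminal
rows_layout_2 = len(leaves(TREE))     # one row per genome
```

Here layout (1) needs more rows than layout (2), while in layout (2) the row for genome `A` carries two events from different branches whose coordinates overlap, which is the case where they would be drawn on top of each other.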