High recombination sequence detection

xavierdidelot / ClonalFrameML

ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes

GNU General Public License v3.0


High recombination sequence detection #139

Closed conmeehan closed 1 year ago

conmeehan commented 1 year ago

Hi,

I have a dataset of some Mycobacteria and am trying to remove recombination from a core genome alignment. When I run CFML, some sequences show recombination signals throughout (>60% of the genome detected as recombinant, showing as dark blue in the PDF from the R script). If I remove these sequences and run everything again, a different set of sequences then shows up as highly recombinant.

Is there any reason for this, and any way to stop it? I don't want to remove sequences if they should be kept, but I need to ensure a recombination-free alignment is created. Any help appreciated!

Cheers, Conor

xavierdidelot commented 1 year ago

Hi Conor,

Could you check the value of nu inferred before and after removing sequences? It can be found in the output file with the suffix .em.txt. This value represents the level of polymorphism in recombined regions. It is possible that ClonalFrameML infers a value that is too low after you remove the most divergent sequences. You can stop this from happening by changing the prior on nu. For example, to force nu to be around 0.05, you could use -prior_mean "0.1 0.001 0.05 0.0001" -prior_sd "0.1 0.001 0.0001 0.0001". Note that ClonalFrameML is mostly designed to analyse sequences that are part of the same species, or even the same lineage within a species, so if your dataset contains multiple species you might need to help ClonalFrameML a bit by specifying a strong prior on nu as described above.

Best wishes, Xavier
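
As a rough illustration of the check suggested above, here is a minimal Python sketch that reads the nu row from a .em.txt file. It assumes the file is a whitespace- or tab-delimited table with the parameter name in the first column, followed by the posterior mean and posterior variance, as in the rows quoted later in this thread. The function name `read_em_parameter` is made up for this example and is not part of ClonalFrameML.

```python
from pathlib import Path


def read_em_parameter(em_path, name="nu"):
    """Return (posterior_mean, posterior_variance) for one parameter row.

    Assumption: each data row starts with the parameter name, followed by
    numeric columns (mean, variance, a_post, b_post).
    """
    for line in Path(em_path).read_text().splitlines():
        tokens = line.split()
        # The header row starts with "Parameter", so it never matches `name`.
        if len(tokens) >= 3 and tokens[0] == name:
            return float(tokens[1]), float(tokens[2])
    raise KeyError(f"parameter {name!r} not found in {em_path}")
```

Calling `read_em_parameter(path)` on the .em.txt file from each run (before and after removing sequences) gives the two values of nu to compare.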

conmeehan commented 1 year ago

Hi Xavier,

Thanks for the quick response. The nu before removal is:

Parameter  Posterior Mean  Posterior Variance  a_post       b_post
nu         0.0324815       1.00773e-09         1.04696e+06  3.22323e+07

After removal it is:

Parameter  Posterior Mean  Posterior Variance  a_post       b_post
nu         0.0452501       2.75491e-09         743243       1.64252e+07

So I don't think it is moving too much, but perhaps I should set it as you say and see if that affects the outcome? These are all the same species but may indeed be separate lineages.

Cheers, Conor
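
To put the two reported rows side by side, the short Python sketch below compares the posterior means and also checks how the a_post and b_post columns relate to them. The reading of a_post and b_post as the shape and rate of a gamma distribution is an assumption inferred from the numbers themselves (mean close to a/b, variance close to a/b^2); it is not stated anywhere in this thread.

```python
# Values copied from the two .em.txt rows quoted above.
rows = {
    "before removal": dict(mean=0.0324815, var=1.00773e-09, a=1.04696e+06, b=3.22323e+07),
    "after removal": dict(mean=0.0452501, var=2.75491e-09, a=743243.0, b=1.64252e+07),
}


def gamma_summary(a, b):
    """Mean and variance of a gamma distribution with shape a and rate b."""
    return a / b, a / b**2


for label, r in rows.items():
    mean, var = gamma_summary(r["a"], r["b"])
    # Assumption being checked: reported mean ~ a/b and reported variance ~ a/b^2.
    print(f"{label}: a/b = {mean:.6g} (reported mean {r['mean']:.6g}), "
          f"a/b^2 = {var:.6g} (reported variance {r['var']:.6g})")

# Relative change in the posterior mean of nu between the two runs.
ratio = rows["after removal"]["mean"] / rows["before removal"]["mean"]
print(f"nu after / nu before = {ratio:.2f}")
```

The posterior standard deviations implied by these variances are tiny compared with the means, so each estimate is tightly determined by its own dataset; whether the shift between runs matters is a judgement about the biology rather than something the script can decide.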

xavierdidelot commented 1 year ago

Yes, these values of nu look fine, both before and after removal, so there is no issue here and no need to change the prior on nu. It's good to know that all genomes are from the same species. I guess having ~60% recombined on some branches is not impossible, or there could be mistakes in the alignment that look like recombination events. When you remove sequences to get rid of recombination events, you need to make sure that you remove all sequences affected by those events, i.e. all the sequences that are below the branch on which the recombination occurs. Don't hesitate to email me if you're still having problems with this, as I would need to see what the results look like.
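
The advice about removing every sequence below a recombinant branch can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses a deliberately minimal Newick parser (no quoted labels or comments), it assumes internal nodes carry labels such as NODE_1, and the example tree and all function names are invented here rather than taken from ClonalFrameML output.

```python
import re

DELIMS = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse a simple Newick string into nested dicts with 'name' and 'children'."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in DELIMS:
            label = tokens[pos]
            pos += 1
        # Keep the node name and drop the branch length after ":".
        return {"name": label.split(":")[0].strip(), "children": children}

    return node()


def find(node, name):
    """Return the first node with the given name, or None."""
    if node["name"] == name:
        return node
    for child in node["children"]:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


def leaves(node):
    """List the names of all tips at or below this node."""
    if not node["children"]:
        return [node["name"]]
    return [tip for child in node["children"] for tip in leaves(child)]


def sequences_below(newick, branch_name):
    """All sequences affected by an event on the branch leading to branch_name."""
    target = find(parse_newick(newick), branch_name)
    if target is None:
        raise KeyError(f"no node named {branch_name!r}")
    return leaves(target)


# Invented example tree with labelled internal nodes.
example_tree = "((A:0.1,B:0.2)NODE_1:0.05,(C:0.1,D:0.1)NODE_2:0.02)NODE_3;"
print(sequences_below(example_tree, "NODE_1"))
```

If a recombination event sits on the branch leading to an internal node, every tip returned by `sequences_below` for that node would need to be removed together; dropping only some of them leaves the event in the alignment.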