xavierdidelot / BactDating

Bayesian inference of ancestral dates on bacterial phylogenetic trees
https://xavierdidelot.github.io/BactDating
MIT License

Take population structure into account in temporal signal assessment? #16

Closed: flashton2003 closed this issue 4 years ago

flashton2003 commented 5 years ago

Hi Xavier,

This is more of a 'feature request' than an issue, and even then it's really just an idea to discuss rather than a definite request. I'm raising it on GitHub rather than by email so that others can see it.

I've been reading around the issue of temporal signal detection and came across this nice paper by Murray et al. (2016). They discuss the idea of cluster permutation, first raised by Duchene et al. (2015).

Basically, they make the point that it's really easy to confound genetic similarity and temporal similarity in the type of sampling which often happens in microbial genomic epidemiology studies. Their solution is to randomly permute samples within 'clusters' rather than across the whole tree.
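As a rough illustration of the difference between permuting dates across the whole tree and permuting them only within clusters, here is a minimal Python sketch. It is not BactDating's code and not the exact procedure from either paper: the function names, the toy data, and the use of a Pearson correlation between sampling date and root-to-tip distance as the test statistic are all assumptions made for the example.

```python
import random
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def permute_within_clusters(dates, clusters, rng):
    """Shuffle dates among tips that share a cluster label; never across clusters."""
    members = defaultdict(list)
    for i, label in enumerate(clusters):
        members[label].append(i)
    shuffled = list(dates)
    for idx in members.values():
        values = [dates[i] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            shuffled[i] = v
    return shuffled


def permutation_pvalue(dates, root_to_tip, clusters, n_perm=999, seed=0):
    """One-sided p-value for the date vs root-to-tip correlation.

    Giving every tip the same cluster label reproduces the usual
    whole-tree date randomisation.
    """
    rng = random.Random(seed)
    observed = pearson(dates, root_to_tip)
    hits = 0
    for _ in range(n_perm):
        permuted = permute_within_clusters(dates, clusters, rng)
        # Small tolerance so floating-point ties count as "at least as large".
        if pearson(permuted, root_to_tip) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): clade A sampled early, clade B sampled late.
# Within each clade, later tips are NOT further from the root.
dates = [2005, 2006, 2007, 2008, 2011, 2012, 2013, 2014]
root_to_tip = [1.3, 1.2, 1.1, 1.0, 2.3, 2.2, 2.1, 2.0]
clades = ["A", "A", "A", "A", "B", "B", "B", "B"]

r, p_whole_tree = permutation_pvalue(dates, root_to_tip, ["all"] * len(dates))
_, p_clustered = permutation_pvalue(dates, root_to_tip, clades)
print(f"r = {r:.2f}, whole-tree p = {p_whole_tree:.3f}, clustered p = {p_clustered:.3f}")
```

In this toy case the whole-tree randomisation looks significant purely because the two clades were sampled in different periods, whereas the within-cluster randomisation does not.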

The BactDating analysis I was doing was actually a great example of this confounding.

[Image: Screenshot 2019-09-03 at 16 32 22]

The top clade with the later samples (tips in red) is primarily from Vietnam, while the basal clade(s) are primarily from China. Vietnamese isolates were sampled after 2010, Chinese isolates before 2009. That was enough to get a significant signal when the randomisation takes place over the whole tree (rather than within the 'clusters'). I probably should have noticed this by visually inspecting the data before running BactDating, but better late than never.

The reason I'm raising this issue is to get your opinion on whether and how this could be implemented in BactDating. Duchene et al. say that clusters are monophyletic clades with the same sampling date, which seems quite stringent (I think my tree would pass that test, as there are not many large clades which share the same sampling date). This got me thinking about whether you could do a PCA to capture population structure and use a couple of its dimensions as covariates in the root-to-tip regression. Then you could see whether the population structure was confounding your temporal signal (I think?).

Hope this makes sense, and would be great to hear what you think.

Phil

flashton2003 commented 5 years ago

Having finished reading the Murray et al. paper, I see that you can simply test for an association between the pairwise genetic distances and the absolute differences between the sampling dates, using the Mantel test. I guess this would be quite easy to implement in BactDating, and might save some people from some incorrect findings.

Let me know if you'd be interested in a pull request implementing this (fair warning - my R is ... rudimentary).
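For readers unfamiliar with it, the Mantel test described above can be sketched in a few lines of plain Python. This is only an illustrative, assumption-laden version (Pearson statistic, one-sided permutation p-value, hypothetical function names and toy data), not code from BactDating or from any existing package.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def mantel_dates(gen_dist, dates, n_perm=999, seed=0):
    """Mantel test between a genetic distance matrix and |date_i - date_j|.

    gen_dist is a symmetric list-of-lists; dates is a list of sampling dates.
    The null distribution comes from permuting the dates among the tips,
    which permutes rows and columns of the temporal matrix together.
    """
    n = len(dates)
    pairs = list(combinations(range(n), 2))
    genetic = [gen_dist[i][j] for i, j in pairs]

    def temporal(order):
        return [abs(dates[order[i]] - dates[order[j]]) for i, j in pairs]

    order = list(range(n))
    observed = pearson(genetic, temporal(order))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if pearson(genetic, temporal(order)) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): two clades of five tips, sampled in separate periods.
clades = ["A"] * 5 + ["B"] * 5
dates = [2004, 2005, 2006, 2007, 2008, 2011, 2012, 2013, 2014, 2015]
# Genetic distance: 0 on the diagonal, 2 within a clade, 20 between clades.
gen_dist = [
    [0 if i == j else (2 if clades[i] == clades[j] else 20) for j in range(10)]
    for i in range(10)
]

r, p = mantel_dates(gen_dist, dates)
print(f"Mantel r = {r:.2f}, p = {p:.3f}")
```

A significant result here flags that genetically close tips also tend to be close in sampling time, which is the confounding pattern described in this thread.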

xavierdidelot commented 5 years ago

Hi Phil,

Thanks for your posts. I find this problem of confounding between population structure and temporal signal very interesting indeed, and it would be great to implement a test in BactDating that is robust to this effect. A clustered permutation test sounds good to me, if we can define the clusters in a principled way, i.e. without assuming a cutoff of one year.

A Mantel test to assess the association between genetic and temporal distances seems a good idea, and better than PCA since it uses the phylogeny that we want to test. Perhaps this could be used repeatedly to automatically find the best temporal window within which to cluster monophyletic clades, rather than just to test whether a given value is enough? I mean we could start with a very small window (e.g. a day) and progressively increase it until the Mantel test is no longer significant, and that would be the value to use for the clustered test?
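One possible reading of this "increase the window until the test is no longer significant" idea is sketched below in Python. Everything in it is an assumption for illustration: the function names, the choice to coarsen dates by flooring them to a multiple of the window, and above all the `is_significant` callback, which stands in for a real Mantel test and is replaced here by a deliberately trivial placeholder so the loop can run.

```python
import math


def coarsen(dates, window):
    """Floor each date to the start of its window (same units as the dates)."""
    return [math.floor(d / window) * window for d in dates]


def smallest_clean_window(dates, is_significant, windows):
    """Return the first window at which the supplied test stops being significant.

    is_significant takes the coarsened dates and returns True/False; in a
    real analysis it would wrap a Mantel test on the tree. Windows are
    tried in the order given. Returns (None, None) if none of them works.
    """
    for window in windows:
        binned = coarsen(dates, window)
        if not is_significant(binned):
            return window, binned
    return None, None


# Hypothetical sampling dates in decimal years.
dates = [2005.2, 2005.9, 2006.5, 2012.1, 2013.7]


def placeholder_test(binned_dates):
    """NOT a statistical test: a stand-in that calls the signal 'significant'
    while more than two distinct date bins remain."""
    return len(set(binned_dates)) > 2


window, binned = smallest_clean_window(dates, placeholder_test, windows=[1, 2, 5, 10])
print(window, binned)
```

Passing the test in as a function keeps the search loop independent of how significance is actually assessed.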

flashton2003 commented 5 years ago

Just to be clear, would the repeated Mantel test involve modifying the resolution of the dates within the distance matrix? e.g. start with resolution of one day, and then modify to week, month, year, bi-year, ...?

Is there a reason for the clusters to be defined as monophyletic groups sharing some temporal similarity? Could you not 'just' define phylogenetic clusters and randomise tip dates within them?

What is the advantage of the clustered permutation method over the Mantel test alone? Personally, I wouldn't be confident in implementing the clustered permutation method.

xavierdidelot commented 5 years ago

I think we need both the Mantel test and the clustered permutation test. The former tells us when there is potential confounding between lineages and sampling dates, whereas the latter tells us what the significance of the temporal signal is once we take this confounding into account.

In answer to the question in your first paragraph: yes, in order to find the optimal temporal window for the Mantel test, we would need to repeat the Mantel test with an increasingly large number of clusters being combined, until the test is no longer significant. It may also be good to let the user specify the clusters to be used, as I think you are suggesting in your second paragraph, but I think it's useful to have an automatic procedure for this too.

Does that make sense? There is already a good implementation of the Mantel test in R in the vegan package so there is no need to rewrite this. The clustered permutation method would not be hard to write, but it's probably easier for me to do since I'm familiar with the code.

flashton2003 commented 5 years ago

Right, of course! Thanks Xavier.

xavierdidelot commented 5 years ago

Great, I'll try to implement this next week.

xavierdidelot commented 5 years ago

Hi Phil,

I've just implemented a first version of this "incremental" clustered permutation test. You can use it via the clusteredTest function, and there is a vignette showing an example here. Could you please give it a try and let me know what you think? The window increment is a whole year, which might be too much in your case, but this could be changed of course.

Also it's interesting that the initial Mantel test is often significant even on simulated datasets where there was no biased sampling for one lineage rather than another. This implies that even in these conditions we cluster some leaves, and lose power in the root-to-tip test. This effect will be particularly strong when the TMRCA is small relative to the sampling window.

I wonder if it might be possible to be less conservative in the way clusters are built, for example using the expectation for how significant the Mantel test would be under a coalescent model without biased sampling.

flashton2003 commented 4 years ago

Hi Xavier,

I've tried the method and didn't have any problems running the code. I was a bit surprised to see that I still had a significant relationship. Perhaps the temporal window is a bit off for this dataset, as you say. My gut instinct, looking at the output plot, is that lineage and time of sampling are still confounded, but that's just my instinct.

[Image: Screenshot 2019-09-20 at 14 06 25]

xavierdidelot commented 4 years ago

Thanks Phil. I'm going to close this issue now and take the conversation offline via email. If anybody else is interested in this, please do not hesitate to email us.