xavierdidelot / BactDating

Bayesian inference of ancestral dates on bacterial phylogenetic trees
https://xavierdidelot.github.io/BactDating
MIT License
80 stars 15 forks source link

Help on interpreting results - strenght of temporal signal #50

Closed andreaniml closed 2 years ago

andreaniml commented 2 years ago

Hello! I'm new to BactDating and I'm trying to run it on a phylogeny of 435 samples of Salmonella. The phylogeny was generated by creating a snp alignment using Snippy and passing this to Gubbins.

I got the dates from NCBI and wanted to compare BactDating with TempEst regarding temporal signal. Currently I ran into some issues (I think). My workflow: 1)As NCBI sometimes has only the year specified and sometimes year-month-day, for creating the decimal dates format I copied how TempEst imported the dates (example: a year turned from 2015 to 2015.0) 2) I generated the dates vector by importing the table with "tip \t date" information and used d = c(Table$dates) and names(d) = c(Table$tips) 3) Then I generated the analysis and a set of root to tip graphics: t = LoadGubbins(prefix ='my_prefix') rooted=initRoot(t,d) r2=roottotip(t,d) r2=roottotip(rooted,d) res=bactdate(t,d) plot(res,'treeCI') plot(res,'trace')

Unrooted tree: image Rooted Tree: image trace: image treeCI: image

So, here are my concerns:

1)Apparently I do have significant signal (the p value is low) but the line is too flat, my R is below 0.00. Both in rooted and unrooted trees. I do have two outliers (one is obvious from the BactDating plot, the other is only visible on TempEst) 2) My trace seems a little thin, but I think that given the root to tip graphics results I would rather change my dataset a little than run it longer 3)My TreeCI output does not show the years correctly on the bar, I'm wondering If I actually messed up on my input (why is it backwards? Do I have to insert a backwards = False flag somewhere and forgot?)

Thanks!

xavierdidelot commented 2 years ago

The treeCI is fine, with negative dates representing dates BCE. But it seems you have one genome that is really different from the others. I think you should remove this genome from your tree before trying to date it. This genome could still be useful though to establish where the root it (ie use if for outgroup rooting), and then you would not have to infer this as part of the dating (via usingRoot=F)

andreaniml commented 2 years ago

Thanks! I've removed the genome and the analysis is running! Just an aditional quick question: I've seen on the vignetes that you should rescale the tree if the branch lenghts are measured per site. Whenever I run an analysis on gubbins, I can generate a tree with support or without it. If I choose the SH support sometimes it rescales to "substitutions per site", I would have to do the t$edge.length=t$edge.length*L thing to use the sh supported tree, even if it is an input from Gubbins, right? (or simply avoid using sh supported tree altogether and do a simple 100 bootstrap tree instead?)

xavierdidelot commented 2 years ago

Yes that's right. If you can get Gubbins to output the tree in substitutions rather than substitutions per site then this would probably be the easiest and best.

andreaniml commented 2 years ago

Hello! Sorry to bother you again, but now I have some weird results. I removed problematic sequences and ended up with 409 When I try to do the r=roottotip(t,d) with the unrooted tree I get a negative correlation: image If I root the tree: image Now I have a positive correlation that's a little bit better than last time And here is my trace of the rooted tree: image

I found it strange that I could get a negative correlation and have an idea of what could be the problem: When I made the alignment using snippy-multi, I did not remove the reference and my reference is "younger" than some of my other samples, I told gubbins to use the Reference as an outgroup, could just one datapoint in 409 cause such trouble? (When I run initroot it roots on another genome)

I think I can run snippy-core --no-ref to generate my alignment without the Reference sequence, but I'm not sure if this is the problem.

I can then follow the vignettes to evaluate the strength of signal, I just thought that getting a negative correlation could be a red flag already.

xavierdidelot commented 2 years ago

In your first plot you get a negative correlation because the root is wrong. In the second plot I guess you rooted using initRoot which finds the root maximising the R2, but this is not really principled either. The best root is the one found when you run bactdate without constraining the root, as you did in your last plot, because as you can see in the last panel the MCMC explores the uncertainty in the root location. This last result looks good to me (ideally doing 10x more iterations)