xavierdidelot / TransPhylo

Reconstruction of transmission trees using genomic data
http://xavierdidelot.github.io/TransPhylo/
GNU General Public License v2.0
60 stars 22 forks source link

Question about inferTTree parameters #24

Closed am-epi closed 1 year ago

am-epi commented 1 year ago

Hi,

Quick question on some of the parameters that can be used within the inferTTree command as I am wanting to include startOff.r, start.Off.p, startNeg and startPi. I would just like to check I've understood what each of those parameters are so I can adjust accordingly, I'm very new to modelling and very much in at the deep end!

Off.r is the R0 number for the pathogen of interest and to clarify, is Off.p the probability of transmission from one individual to another?

Neg represents the average time of coalescence of two lineages - could I use the clock rate from my input dated BEAST tree for this?

Finally, pi = sample fraction which represents the proportion of population units that are selected in the sample - so the estimated number of individuals sampled across the phylogeny being investigated?

Thanks in advance for any help you can offer, Abbi

xavierdidelot commented 1 year ago

Hi Abbi,

off.r and off.p are the r and p parameters of the Negative Binomial distribution used as offspring distribution. I would suggest at least to start with to keep off.p=0.5 (which is what happens by default), in which case off.r is indeed the mean of the distribution R0. See https://en.wikipedia.org/wiki/Negative_binomial_distribution

Neg represents how quickly two lineages coalesce within the same host and pi is indeed the sampling fraction.

Note however that these parameters will be estimated by inferTTree (unless you ask inferTTree not to using for example updateNeg=F) and the start values are just starting points for this estimation, so in most cases you can leave these starting points as default and the correct values should be inferred.

Best wishes, Xavier

am-epi commented 1 year ago

Hi Xavier,

Thanks for your quick response, really appreciated. I wanted to give the starting values as I'm struggling to get good MCMC convergence and ESS parameters on my inferTTree runs and thought a better baseline might help with the inference.

For example using a dated BEAST tree (n=375 sequences) and the following: inferTTree(ptree,mcmcIterations=1e5,w.shape=6.76,w.scale=1.92,dateT=2022.883, thinning = 10) my outputs are: pi = 2788.710903 | neg = 6.351784 | off.r 15.913947 | off.p 0.000000 Result from TransPhylo analysis: pi=9.92e-01 [9.80e-01;9.99e-01] neg=7.49e-02 [6.12e-02;9.16e-02] off.r=1.17e+02 [1.06e+02;1.27e+02] off.p=5.00e-01 [5.00e-01;5.00e-01]

If I run the same for 1e6 I get the following:

pi = 17430.40921 | neg = 72.59712 | off.r 82.49304 | off.p 0.000000 Result from TransPhylo analysis pi=9.92e-01 [9.80e-01;9.99e-01] neg=4.27e-02 [3.46e-02;5.05e-02] off.r=1.14e+02 [1.02e+02;1.28e+02] off.p=5.00e-01 [5.00e-01;5.00e-01]`

I really feel something isn't right/happy in the inference... my w.shape and w.scale were generated using the generation time (15 days) and standard deviation (5 days) to get the coefficient of variation and then converted using epitrix... And I'm using figures from published work....

This may beyond your help on here, but I wanted to 'explain my workings' so to speak, in case I have gone wrong somewhere!

Thanks, Abbi

xavierdidelot commented 1 year ago

It seems that pi takes a value really close to 1, which would mean that almost all cases are sampled. Would that seem plausible based on what you know about the outbreak? One possible mistake would be if you have w.shape and w.scale measured in days (as you wrote) but the dated tree measured in years. You need to use the same unit for both. Could you please confirm?

am-epi commented 1 year ago

The phylogeny is based from bacterial isolates recovered as part of laboratory surveillance initiative over a 7 year period across the UK, and we can safely say that the isolates gathered are a small fraction of the total affected individuals. If we were to look at just one outbreak as an example I would say that you'll only typically get 10-20% of affected individuals sampled - although we are unable to get data on the number of unsampled individuals unfortunately, so would only be able to estimate.

I think you may be onto something with the w.shape and w.scale parameters. I did use days for them and the input tree is in years I believe - would it be a case of converting the days into a year figure then running the epitrix commands?

xavierdidelot commented 1 year ago

You could specify directly the mean and std of your generation time distribution using the w.mean and w.std arguments. The unit needs to be the same as used in the dated tree, so you could use w.mean=15/365,w.std=5/365 to have it all in years. Or equivalently you could have w.mean=15,w.std=5 and rescale your tree to be in units of days using t$edge.length=t$edge.length*365

am-epi commented 1 year ago

Ok thanks, so would the inferTTree command look something like this:

inferTTree(ptree, mcmcIterations=1e5, w.mean=15/365, w.std=5/365, w.shape=6.76, w.scale=1.92, dateT=2022.883, thinning = 10)

Or would I then exclude the w.shape and w.scale as the inference will use the w.mean and w.std? And would it then still help to include start parameters as well?

xavierdidelot commented 1 year ago

You should remove the w.shape and w.scale arguments. Try without start parameters, I think it should work fine.

am-epi commented 1 year ago

Okay I will give that a go - thank you so much for your help!

am-epi commented 1 year ago

Hi Xavier, Using the w.mean and w.std (also in the right format), and it has definitely helped change the inference (for 1e5 MCMC iterations):

pi = 38.64554 | neg =130.23302 | off.r 1637.97523 | off.p 0.0000

Result from TransPhylo analysis pi=1.00e-02 [1.00e-02;1.01e-02] neg=1.95e-02 [1.54e-02;2.48e-02] off.r=1.01e+00 [9.98e-01;1.02e+00] off.p=5.00e-01 [5.00e-01;5.00e-01]

It looks to still be struggling with the sampling fraction, shall I add in a startPi parameter to help with this? Or run the MCMC for longer?

xavierdidelot commented 1 year ago

Yes this looks a lot better. You could add startPi=0.01,startNeg=0.02 to save a bit of time, but running for longer would is also be a good idea and you should then get good mixing.

am-epi commented 1 year ago

Thanks for this, certainly helped get me in the right direction. I think I need to test a few different sampling proportion options to get the best fit but I'm getting there.

Thanks for all your help!