xavierdidelot / ClonalFrameML

ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes
GNU General Public License v3.0
108 stars 27 forks source link

Question: r/m calculation with CI95%! #119

Closed maesaar closed 3 years ago

maesaar commented 3 years ago

Hello @xavierdidelot, I have read the issues #99 and #114 regarding CI95% calculations for r/m. Do I understand it correctly:

CFML has been ran with following parameters:

ClonalFrameML v1.12
xmfa_file = true
em = true
emsim = 100
output_filtered = true
show_progress = true

First to get overall CI95% I have to use following input like in issue #99.

Parameter Posterior Mean Posterior Variance a_post b_post
R/theta 0.297706 4.0978e-05 2162.84 7265.02
1/delta 0.000530892 1.40412e-10 2007.28 3.78095e+06
nu 0.0157936 4.17345e-09 59768 3.78431e+06

Second I calculate r/m as follows: r/m = 1/1/delta * R/theta * nu therefore the r/m = 1/0.000530892 * 0.297706 * 0.0157936 = ~8.86

Third I calculate CI95% for each parameter using the formula in R qgamma(c(0.025,0.975),shape=a_post,rate=b_post): R/theta CI95% is calculated qgamma(c(0.025,0.975),shape=2162.84,rate=7261.02) 1/delta CI95% is calculated qgamma(c(0.025,0.975),shape=2007.28,rate=3780950) nu CI95% is calculated qgamma(c(0.025,0.975),shape=59768,rate=3784310)

Summary of results of CI95% calculations:

Parameter 0.025 0.975
R/theta 0.2854475 0.3105534
1/delta 0.0005079198 0.0005543673
nu 0.01566726 0.01592050

Fourth I will calculate 0.025:r/m(0.025) = 1/0.0005079198 * 0.2854475 * 0.01566726 = ~8.81

Fifth I will calculate 0.975: r/m(0.975) = 1/0.0005543673 * 0.3105534 * 0.01592050 = 8.92

Finally the r/m(CI95%) is 8.86[8.81-8.92]

So here are my question: 1) Are my previous calculations correct? 2) When comparing r/m from different CFML runs (same bacteria, different STs) is it correct to assume if CI95% ranges don't overlap that the r/m differs significantly (p=0.05)? 3) In the issue #114 the r/m is calculated slightly differently (per-branch CI95% values multiplied with mean) is it due to the -embranch true parameter and because of the values being factors as stated in issue #87.

Thank you in advance

xavierdidelot commented 3 years ago

When using the emsim option there is a much simpler (and better!) option to calculate the 95%CI of r/m. Each line of the file ending with suffix emsim.txtcontains sampled values of R/theta, delta and nu. Multiply these three values to get the sampled values of r/m for each line. The 95%CI is then obtained by taking the 2.5% and 97.5% quantiles of these sampled values. In R you can do this using: quantile(values,probs=c(0.025,0.975))

Yes if two 95%CI do not overlap then it suggests that there is a significant difference at the p=0.05 level.

When using embranch option things are a bit different since there will be parameter estimated for each branches.

maesaar commented 3 years ago

@xavierdidelot thank you!

But in principle the calculations are correct either way? By calculating as I did and also as you suggested?

xavierdidelot commented 3 years ago

The intervals you get using my method based on the output of emsim are more statistically valid than the ones calculated using the method you described, but otherwise yes what you described is a correct application of what I had described in other issues.

maesaar commented 3 years ago

Thank you again