zdk123 / SpiecEasi

Sparse InversE Covariance estimation for Ecological Association and Statistical Inference
GNU General Public License v3.0

Problem reproducing AUPR values from the paper #89

Open vincentprost opened 5 years ago

vincentprost commented 5 years ago

Hi,

I'm trying to reproduce the AUPR values from Figure 4 of the SpiecEasi PLoS paper by following the tutorial code examples, but I end up with values that are quantitatively very different from those in the paper. More specifically, I tried to reproduce the AUPR values for the scale-free topology with a kappa (condition number) of 10 using the following code:

dim(amgut1.filt)
[1] 289 127

set.seed(10010)
graph <- make_graph('scale_free', d, e)
Prec <- graph2prec(graph, targetCondition=10)
Cor <- cov2cor(prec2cov(Prec))
X <- synth_comm_from_counts(amgut1.filt.cs, mar=2, distr='zinegbin', Sigma=Cor, n=1360)
se <- spiec.easi(X, method='mb', lambda.min.ratio=1e-2, nlambda=30)

huge::huge.roc(se$est$path, graph, verbose=FALSE)
True Postive Rate: from 0.007874016 to 1
False Positive Rate: from 0 to 1
Area under Curve: 0.7009949
Maximum F1 Score: 0.6958635

stars.pr(getOptMerge(se), graph, verbose=FALSE)
True Postive Rate: from 0 to 0.2755906
False Positive Rate: from 0 to 0.1195072
Area under Curve: 0.01535282
Maximum F1 Score: 0.3950842

The corresponding panel of Figure 4 shows an AUPR close to 1 for that topology, which contrasts with the value (0.01535282) obtained above.

Thank you for any hint and for the very nice toolbox!

PS: by the way, the dimensionality of the amgut1.filt dataset from the SpiecEasi repository (p=127) appears to differ from the one used in the paper (p=205).

zdk123 commented 5 years ago

Thanks for the report. I am aware of the issue - and I have a fix that is awaiting integration.

The README was not originally intended to be a reproduction of the paper (I used a stricter threshold here, purely for faster execution time), so apologies for the confusion.

vincentprost commented 5 years ago

Hi,

I noticed recent updates to the spiec-easi repository and thought these might affect the AUPR value problem reported above, but this appears not to be the case.

To recap, I have trouble with the following two observations:

1) The spiec-easi functions that compute the AUPR yield very low values; for example, the attached plot PR_curve is associated with the numerical value 0.0668 (both the curve and the value are generated by spiec-easi). On the other hand, when I recompute the AUPR from scratch (across the full [0,1] domain) from the same edges sorted by StARS stability, I get a much higher value (0.856); a sketch of that computation follows this list.

2) However, and more importantly, neither the PR plot nor the AUPR values computed for, e.g., the scale-free topology (see attached plot aupr_plots) come close to the values shown in Figure 4 of the PLoS paper.
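For reference, here is a minimal sketch of the from-scratch computation mentioned in point 1 (not necessarily the exact code I ran; the variable names are mine, and it assumes se is the spiec.easi fit and graph the true adjacency matrix from the code in my first comment above):

merge  <- as.matrix(getOptMerge(se))            # StARS edge stabilities; never-selected edges are 0
truth  <- as.matrix(graph) != 0
ut     <- upper.tri(merge)                      # one score per possible edge
ord    <- order(merge[ut], decreasing = TRUE)   # ties (the many zeros) are broken arbitrarily
labels <- truth[ut][ord]

tp   <- cumsum(labels)
prec <- tp / seq_along(tp)
rec  <- tp / sum(labels)                        # recall runs from 0 to 1 over the full ranking
aupr <- sum(diff(c(0, rec)) * prec)             # step-wise integration of the PR curve
aupr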

I'm aware that the amgut1 dataset used in these tests is thresholded differently than in the paper, but I assume this alone can't account for these observations.

Thanks for any help on that matter. With best wishes, Vincent Prost

vincentprost commented 5 years ago

I've noticed a bug that might explain my issue. I think the matrix conversion at line 178 of mvdistributions.R:

data <- matrix(VGAM::qzinegbin(unif, munb=munbs, size=ks, pstr0=ps, ...), n, d)

leads to incorrect generation of the correlated variables.
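To illustrate the mechanism I suspect, here is a toy example (using qpois in place of VGAM::qzinegbin, so this is not the actual SpiecEasi code): the per-OTU parameter vector is recycled against the column-major flattened matrix, so the parameters cycle within each column instead of staying constant per column.

set.seed(1)
n <- 1000; d <- 3
unif   <- matrix(runif(n * d), n, d)   # n samples x d OTUs of (here uncorrelated) uniforms
lambda <- c(1, 10, 100)                # one marginal parameter per OTU

wrong <- matrix(qpois(unif, lambda), n, d)        # parameters recycled down the columns
right <- t(matrix(qpois(t(unif), lambda), d, n))  # transpose first so column j keeps lambda[j]

colMeans(wrong)   # every column is a mixture of the three parameters
colMeans(right)   # columns track 1, 10, 100 as intended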

zdk123 commented 5 years ago

Thanks for bumping this issue @vincentprost. Your PR fixes seem correct - the main issue in the main branch is that precision and recall are swapped when calculating the AUPR and the end points are not anchored. Clearly this doesn't completely resolve the issue.

In your last comment, can you explain how you think that variables are generated incorrectly? I took a look this morning and, for example, filling by row seems to do much worse.

vincentprost commented 5 years ago

Many thanks for the feedback! To be more explicit, besides the issue of anchoring the endpoints, changing line 178 of mvdistributions.R (which generates the data points from the zero-inflated negative binomial distribution) from:

data <- matrix(VGAM::qzinegbin(unif, munb=munbs, size=ks, pstr0=ps, ...), n, d)

to

data <- t(matrix(VGAM::qzinegbin(t(unif), munb=munbs, size=ks, pstr0=ps, ...), d, n))

significantly improves the AUPR plots (see below), especially for the scale-free topology (though still not matching the values of Figure 4 of the PLoS paper for that topology). What do you think?

aupr_plots

Thanks again for your valuable work! Vincent

PS: another striking observation related to the AUPR for the scale-free topology: the effect of the condition number (kappa) in the plot above runs in the opposite direction from Figure 4 of the PLoS paper (i.e. kappa = 100 gives the better results here).

IsabellaMendler commented 4 years ago

Hi,

is there any news regarding this issue?

@vincentprost I think your suggestion to change the code in mvdistributions.R is correct. The heatmap for the synthetic data looks like this if I do not use your fix: Rplot04.pdf, which is very different from the heatmaps shown in Fig. S2 of the paper. After using your fix, it looks like this: Rplot05.pdf, which makes much more sense.

Could you please explain how the code needs to be changed in order to compute the AUPR values across the full [0,1] domain? I have some trouble doing this, since getOptMerge(se) does not provide a value for each possible edge in the network.
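To make my trouble concrete, this is roughly what I have so far (a sketch using the se object from the README example; the variable names are my own):

merge  <- as.matrix(getOptMerge(se))         # sparse merge matrix coerced to dense
scores <- merge[upper.tri(merge)]            # one entry per possible edge
length(scores) == choose(ncol(merge), 2)     # TRUE
sum(scores == 0)                             # but most entries are 0 and therefore tied

so I am unsure whether simply ranking these scores, with the many zero entries tied at the bottom, is the intended way to cover the full [0,1] range.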

Thanks for any help on that matter.

Best regards, Isabella

zdk123 commented 4 years ago

Apologies for the delay. I have a branch/PR with a fix, but have fallen very behind on keeping various patches up-to-date with main.

IsabellaMendler commented 4 years ago

Thanks for the fast reply! The generation of the correlated variables and the calculation of the AUPR now work fine.

However, I am also interested in the area under the ROC curve (AUC) and I've noticed that if I use the stars.roc function to determine the AUC, I only obtain very small values. If I use the example from the README.md page and call stars.roc(getOptMerge(se), graph, verbose=TRUE, plot=TRUE), I get the following result: Rplot07

Computing F1 scores, false positive rates and true positive rates....done.
True Postive Rate: from 0 to 0.9527559
False Positive Rate: from 0 to 0.08686817
Area under Curve: 0.07907616
Maximum F1 Score: 0.947914

I suppose that the problem is again that the AUC is not computed over the full [0,1] domain...

Do you have any idea how this problem could be fixed?

I can of course use the ROC over the lambda path (as in the example), but then the edge predictions are not ranked according to edge stability, which I would like to avoid.
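In case it helps, here is the kind of full-domain, stability-ranked curve I am after (my own sketch, again assuming the se and graph objects from the README example; this is not SpiecEasi code):

merge  <- as.matrix(getOptMerge(se))
truth  <- as.matrix(graph) != 0
ut     <- upper.tri(merge)
ord    <- order(merge[ut], decreasing = TRUE)   # rank every possible edge by StARS stability
labels <- truth[ut][ord]

tpr <- c(0, cumsum(labels)  / sum(labels))
fpr <- c(0, cumsum(!labels) / sum(!labels))
auc <- sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2)   # trapezoidal rule over the full [0,1] FPR range
plot(fpr, tpr, type = "l", xlab = "FPR", ylab = "TPR")
auc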

Thanks again and best wishes, Isabella

vincentprost commented 4 years ago

Nice to have the AUPR calculation issue fixed! Besides that problem, do you have any hint as to what could account for the remaining AUPR differences between the values computed for the scale-free topology (using the amgut1 data and the zero-inflated negative binomial) and those in Figure 4 of the PLoS article (and also the opposite trend with respect to kappa values, as mentioned previously)?

Thanks! Vincent

suntzuisafterU commented 4 years ago

Hi,

First thanks for writing this library. It has been very useful and the synthetic data generation technique in particular has been helpful.

I also recently noticed the problem with the synthetic data generation. I implemented an alternative solution before finding this thread. I am posting the alternative solution here because it is easier to read.

data <- matrix(VGAM::qzinegbin(t(unif), munb=munbs, size=ks, pstr0=ps, ...), n, d, byrow=TRUE)
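As a quick sanity check, here is a toy comparison (qpois standing in for VGAM::qzinegbin and made-up parameters, so not the actual package code) showing that this byrow fill agrees with the double-transpose version proposed above:

set.seed(1)
n <- 5; d <- 3
unif   <- matrix(runif(n * d), n, d)
lambda <- c(1, 10, 100)                 # one parameter per column/OTU

a <- t(matrix(qpois(t(unif), lambda), d, n))             # double-transpose fix
b <- matrix(qpois(t(unif), lambda), n, d, byrow = TRUE)  # byrow fill
identical(a, b)   # TRUE: both assign lambda[j] to column j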

I am opening a PR; apologies, I have only implemented the test and fix for the ZINB distribution.

Regardless of which implementation you choose, please publish the fix soon.