xrobin / pROC

Display and analyze ROC curves in R and S+
https://cran.r-project.org/web/packages/pROC/
GNU General Public License v3.0
123 stars 31 forks source link

Error in delongPlacements while calculating DeLong's theta #25

Closed mirshahin closed 6 years ago

mirshahin commented 6 years ago

I was running a piece of code and my code threw this error message:

Error in delongPlacements(roc) : A problem occured while calculating DeLong's theta: got 0.50057161522129678399 instead of 0.50032663726931247972. This is a bug in pROC, please report it to the maintainer.

Does anyone know what I should do? Thanks

xrobin commented 6 years ago

Thank you for your report. Can you please provide a reproducible example that produces this error? Also please report the output of sessionInfo().

xrobin commented 6 years ago

Just to provide a bit more details about the error:

pROC calculates the AUC with the trapezoidal method. This is the AUC you get when you call auc(roc). With your data the AUC value was 0.50032663726931247972.

When using the DeLong method (for roc.test, variance etc), the AUC is also calculated with an other method (similar to the Wilcoxon/Mann-Whitney statistic). With your data the result is 0.50057161522129678399.

These two values should be identical, at least down to something close to the floating point precision of your machine, typically below 10^-8. To be on the safe side, pROC checks this assumption after calculating the DeLong AUC.

For some reason with your data the error was around 0.5%, in the 10^4 range. This is far too high to be caused by numerical approximations of floating point arithmetic, and highly suggestive of a bug, either in the trapezoid or in the DeLong calculation of the AUC.

mirshahin commented 6 years ago

R version 3.4.4 (2018-03-15) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] nricens_1.5 survival_2.41-3 lazyeval_0.2.1 ipred_0.9-6 pROC_1.11.0 caret_6.0-78 bindrcpp_0.2 dplyr_0.7.4 plyr_1.8.4 reshape2_1.4.3 ggplot2_2.2.1
[12] lattice_0.20-35

loaded via a namespace (and not attached): [1] tidyselect_0.2.4 purrr_0.2.4 kernlab_0.9-25 splines_3.4.4 colorspace_1.3-2 stats4_3.4.4 prodlim_1.6.1 rlang_0.2.0 ModelMetrics_1.1.0 [10] pillar_1.2.1 withr_2.1.2 foreign_0.8-69 glue_1.2.0 foreach_1.4.4 bindr_0.1.1 dimRed_0.1.0 lava_1.6 robustbase_0.92-8 [19] stringr_1.3.0 timeDate_3043.102 munsell_0.4.3 gtable_0.2.0 recipes_0.1.2 codetools_0.2-15 psych_1.7.8 parallel_3.4.4 class_7.3-14
[28] DEoptimR_1.0-8 broom_0.4.3 Rcpp_0.12.16 scales_0.5.0 CVST_0.2-1 mnormt_1.5-5 stringi_1.1.7 RcppRoll_0.2.2 ddalpha_1.3.1.1
[37] grid_3.4.4 tools_3.4.4 magrittr_1.5 tibble_1.4.2 tidyr_0.8.0 DRR_0.0.3 pkgconfig_2.0.1 MASS_7.3-49 Matrix_1.2-12
[46] lubridate_1.7.2 gower_0.1.2 assertthat_0.2.0 iterators_1.0.9 R6_2.2.2 rpart_4.1-13 sfsmisc_1.1-2 nnet_7.3-12 nlme_3.1-131.1
[55] compiler_3.4.4

mirshahin commented 6 years ago

The code builds different size models with two sets of covariates (clinical features and biomarker values) in several loops. I use glm from caret, metric = "ROC". This is for a clinical test so I have to limit the specificity and get the highest sensitivity for that specificity. I use pROC to reconstruct the ROC curve. Then I calculate the NRI to see if the clinical features made any improvement in the reclassification of the patients we are interested in.

I know the indices of the loops that caused the code to stop. What do you exactly need?

ogroendal commented 6 years ago

Just to chime in. I just got the same issue today. I had some code that ran on the very same data, unchanged. Today I upgraded my R version from 3.2.* to 3.4.4, this seems to have caused the problem (I also redownloaded pROC, but the previous download was also within the last few weeks.)

The error that I got was :

A problem occured while calculating DeLong's theta: got 0.79243483788938329226 instead of 0.79307056579783852257. This is a bug in pROC, please report it to the maintainer

I hope this information helps!

xrobin commented 6 years ago

This seems very concerning and I will give it the highest priority to fix this issue.

This bit of pROC code is thoroughly tested by automated unit tests, which are currently all successfully passing on the CRAN.

This error is obviously triggered by a specific set of data and arguments, so in order to move on I need a reproducible example that triggers this bug. Please provide the exact code and data that triggers this error.

xrobin commented 6 years ago

@ogroendal and @mirshahin have you managed to resolve the issue? I need a reproducible example to move forward with this issue.

eudesbarbosa commented 6 years ago

Hello,

I had the same issue. I included some files that reproduce the error on a small dataset I was playing with. Please let me know if you need more information. The function is under development.

_R version 3.2.3 (2015-12-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.4 LTS

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=de_DE.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] pROC_1.11.0 gdata_2.18.0

loaded via a namespace (and not attached): [1] colorspace_1.3-2 scales_0.5.0 lazyeval_0.2.1 plyr_1.8.4 tools_3.2.3 pillar_1.1.0 gtable_0.2.0 tibble_1.4.2 [9] yaml_2.1.18 Rcpp_0.12.16 ggplot2_2.2.1 grid_3.2.3 rlang_0.2.0 munsell_0.4.3 gtools3.5.0 pROC_debug.zip

xrobin commented 6 years ago

Thanks a lot eudesbarbosa, this is extremely useful.

The following seems to be a pretty minimal reproducible example:

library(pROC)
roc(
c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", 
"A", "A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "A", "A", 
"A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A", 
"A", "A", "A", "A", "A", "B", "B", "B", "B"),
c(1, 0.98, 1, 0.98, 1, 1, 0.98, 1, 0.14, 0.14, 0.2, 0.2, 1, 1, 
1, 0.98, 1, 0.98, 1, 1, 0.16, 0.16, 0.04, 0.12, 1, 1, 1, 1, 1, 
1, 1, 1, 0.14, 0.14, 0.14, 0.14, 0.98, 1, 1, 0.98, 1, 1, 1, 1, 
0.14, 0.2, 0.16, 0.24, 0.98, 1, 1, 1, 1, 1, 1, 1, 0.18, 0.12, 
0.16, 0.22), ci=TRUE, percent=TRUE)

I will take it from there.

xrobin commented 6 years ago

Until I find out what is going on exactly, and so that you can carry on, setting percent = FALSE appears to be a workaround.

xrobin commented 6 years ago

This seems to be a minimal example:

r <- roc(c("A", "B", "A", "B"), c(1, 0, 1, 0), percent = TRUE)
ci(r)

It seems to be important to have direction=">". The following curve also seems to produce erroneous NaN results, and is probably related:

r <- roc(c("A", "B"), c(0, 1))
ci(r)
xrobin commented 6 years ago

It turns out this specific combination wasn't tested. Fixed in commit 41869d8c67bdb990be41538736b4b7f75704350e (automated tests should fail from now on)

xrobin commented 6 years ago

The bug is now be fixed on the master branch. I will aim to push it to CRAN during the weekend or early next week.

The reason for the bug was a broken conversion of the roc curve from percent to fractions for internal calculations, followed by a broken check that the DeLong code produced the correct AUC, which in combination caused an error in the specific case where percent=TRUE and direction=">". The check was introduced in pROC version 1.9.1. The bug in the conversion from percent to fraction was present in earlier versions, however it never affected calculations, which is why it was left unnoticed.

@eudesbarbosa thanks a lot for helping pinpointing the bug with a reproducible example. I tried to run your code and I get unrelated errors about missing errormult. If you want you can installing the development version of pROC with:

install.packages("devtools")
devtools::install_github("xrobin/pROC")
ogroendal commented 6 years ago

Hey @xrobin I just installed the newest version from the github repo, but it seems to give the same error as before.

Error in delongPlacements(roc) : 
  A problem occured while calculating DeLong's theta: got 0.74011194029850746468 instead of 0.74048507462686563585. This is a bug in pROC, please report it to the maintainer.

This is my sessionInfo :

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] bindrcpp_0.2.2  doBy_4.6-1      gridExtra_2.3   survminer_0.4.2 ggpubr_0.1.6    survival_2.42-3 ggplot2_2.2.1   pROC_1.11.9     purrr_0.2.4     lubridate_1.7.4 dplyr_0.7.4     magrittr_1.5   
[13] nvimcom_0.9-71  colorout_1.2-0 

loaded via a namespace (and not attached):
 [1] zoo_1.8-1           reshape2_1.4.3      splines_3.4.4       lattice_0.20-35     colorspace_1.3-2    survMisc_0.5.4      rlang_0.2.0         pillar_1.2.2        foreign_0.8-69      glue_1.2.0         
[11] withr_2.1.2         bindr_0.1.1         plyr_1.8.4          stringr_1.3.0       munsell_0.4.3       gtable_0.2.0        devtools_1.13.5     psych_1.8.3.3       memoise_1.1.0       labeling_0.3       
[21] knitr_1.20          curl_3.2            parallel_3.4.4      broom_0.4.4         Rcpp_0.12.16        xtable_1.8-2        scales_0.5.0        cmprsk_2.2-7        km.ci_0.5-2         mnormt_1.5-5       
[31] digest_0.6.15       stringi_1.1.7       KMsurv_0.1-5        grid_3.4.4          tools_3.4.4         lazyeval_0.2.1      tibble_1.4.2        tidyr_0.8.0         pkgconfig_2.0.1     MASS_7.3-49        
[41] Matrix_1.2-12       data.table_1.10.4-3 assertthat_0.2.0    httr_1.3.1          R6_2.2.2            nlme_3.1-131        git2r_0.21.0        compiler_3.4.4
xrobin commented 6 years ago

@ogroendal can you please also send the code and data that trigger this error? Your sessionInfo doesn't point to anything obvious.

eudesbarbosa commented 6 years ago

@xrobin , thanks for the quick feedback and I'm glad I could help a little =)

The errormult error is because I didn't included that particular function in the code I sent you. Nothing to worry about, it runs fine with the "percent = FALSE".

xrobin commented 6 years ago

@eudesbarbosa can you please confirm if the fix working for you (ie can you use "percent=TRUE" now)?

eudesbarbosa commented 6 years ago

After installing the development version of pROC, I no longer have the error. It seems to be working fine with "percent = TRUE".

xrobin commented 6 years ago

@eudesbarbosa thanks for the confirmation, and I'm glad to hear it works for you now. So we're really looking at two different issues here.

@mirshahin @ogroendal please provide a reproducible example. I've been asking for it for a month now, this bug is not going to get fixed until I can reproduce it. Please post the command you're running with the exact data that you feed into it (use dump to dump it as a command that can be used to re-generate it).

xrobin commented 6 years ago

With some googling and a bit of luck, I found a cached page of the automated checks of an old version of a CRAN package triggering that error and could isolate a reproducible example from it:

response <- c(2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2)
predictor <- c(0.960602681556147, 0.0794407386056549, 0.144842404246611, 
0.931816485855784, 0.931816485855784, 0.97764041048215, 0.653549466997938699464, 
0.796401132206396, 0.427720540184519, 0.811278021288732, 0.0188323116581187, 
0.653549466997938588442, 0.653549466997938477419, 0.959111701445925, 0.931816485855784, 
0.663663279418747, 0.800100838413179, 0.780456095511079)
r <- roc(response, predictor)
ci(r)
var(r)

There are 3 nearly but not exactly identical predictor values (which dump doesn't preserve). Due to the values being so close to each other, they are effectively identical to the threshold chosen by pROC:

> r$thresholds[6] == predictor
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13]  TRUE FALSE FALSE FALSE FALSE FALSE
> r$thresholds[7] == predictor
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE

Also different algorithms handle this case differently:

> auc(r)
Area under the curve: 0.7727
> auc(roc(response, predictor, algorithm=2))
Area under the curve: 0.7792
xrobin commented 6 years ago

The bug was in fact in the selection of the threshold. When two predictor values are too close, their mean could not be represented in the ieee 754 arithmetic and would be equal to one or the other value, pretty much arbitrarily depending on the implementation.

> a <- c(0.65354946699793847742, 0.65354946699793858844)
> print(mean(a), digits = 20)
[1] 0.65354946699793847742
> mean(a) == a
[1]  TRUE FALSE

Commit 87fceab1a8203ffb1335fade4194f6bc7874b457 should fix the issue by selecting the appropriate threshold, only in the case a threshold is identical to some of the predictor values in the first place.

This fix is modifying a core algorithm of pROC (the threshold selection) and I need to do more tests before I push this to CRAN in order to make sure it doesn't break something elsewhere in pROC.

For now the build seems to be passing the automated tests, with tests for both direction arguments. Some cases may not be handle properly still and will result in a new error message. I can't think of a way this would be triggered, but who knows. As we are talking about near ties, close to the precision limit of the ieee 754 arithmetic, they are effectively ties and I will probably convert those errors to warnings later on.

ealeph commented 6 years ago

Hello,

Also received the error today: Error in delongPlacements(roc) : A problem occured while calculating DeLong's theta: got 0.74135039370078736898 instead of 0.74135826771653545730. This is a bug in pROC, please report it to the maintainer.

pROC worked fine with this exact data until I needed to uninstall and reinstall R 5 days ago (working now with version 3.4.0)

I'm running several roc() and have this problem only with one of them.

This is my code (data attached): roc2<-roc(proc_problem$var1, proc_problem$var_works) plot=T,add=T,lty=2,ci=T,levels=c(0, 1)) roc3<-roc(proc_problem$var1, proc_problem$var_problem, plot=T, add=T,lty=3,ci=T,levels=c(0, 1))

The "percent = FALSE" fix didn't work for me. proc_problem.txt

Thanks!

xrobin commented 6 years ago

@ealeph could you please test with the latest version on github to confirm if the fix works for you?

if (! requireNamespace("devtools")) install.packages("devtools")
devtools::install_github("xrobin/pROC")

I must say your data works fine for me. It could be because it wasn't saved with enough precision in the CSV file, or something else.

> proc_problem <- read.csv("proc_problem.txt")
> roc3<-roc(proc_problem$var1, proc_problem$var_problem, plot=T, lty=3,ci=T,levels=c(0, 1))
> roc3

Call:
roc.default(response = proc_problem$var1, predictor = proc_problem$var_problem,     levels = c(0, 1), ci = T, plot = T, lty = 3)

Data: proc_problem$var_problem in 1016 controls (proc_problem$var1 0) < 125 cases (proc_problem$var1 1).
Area under the curve: 0.7414
95% CI: 0.6915-0.7914 (DeLong)

If my analysis is correct, the bug shouldn't be dependent on a specific R version. pROC has been checking for this inconsistency since pROC 1.9.1 released more than a year ago (Feb. 2017). Unless you were using a very outdated version I don't see why you didn't get the same error. So I would really appreciate if you could check that the fix indeed solves your problem. As you can read I'm pretty much shooting in the dark here.

ealeph commented 6 years ago

Dear Xavier,

Thanks for getting back to me so quickly.

I cannot download from Github, I keep getting: "Installation failed: Timeout was reached: Connection timed out after 10015 milliseconds". I tried multiple times both in RStudio and directly in R. I can install packages the regular way from CRAN. Is there another way to download the version from Github and install it in R? (I'm not a data scientist, but an epidemiologist using R for statistical analyses)

This morning I updated R to version 3.4.4 and downloaded again pROC in case it does the trick but still not working.

I attach to this email the RData version of the data file I sent you yesterday - maybe it will help you find what's wrong.

Thanks again, Effie

On Wed, May 2, 2018 at 6:12 PM, Xavier Robin notifications@github.com wrote:

@ealeph https://github.com/ealeph could you please test with the latest version on github to confirm if the fix works for you?

if (! requireNamespace("devtools")) install.packages("devtools") devtools::install_github("xrobin/pROC")

I must say your data works fine for me. It could be because it wasn't saved with enough precision in the CSV file, or something else.

proc_problem <- read.csv("proc_problem.txt") roc3<-roc(proc_problem$var1, proc_problem$var_problem, plot=T, lty=3,ci=T,levels=c(0, 1)) roc3

Call: roc.default(response = proc_problem$var1, predictor = proc_problem$var_problem, levels = c(0, 1), ci = T, plot = T, lty = 3)

Data: proc_problem$var_problem in 1016 controls (proc_problem$var1 0) < 125 cases (proc_problem$var1 1). Area under the curve: 0.7414 95% CI: 0.6915-0.7914 (DeLong)

If my analysis is correct, the bug shouldn't be dependent on a specific R version. pROC has been checking for this inconsistency since pROC 1.9.1 released more than a year ago (Feb. 2017). Unless you were using a very outdated version I don't see why you didn't get the same error. So I would really appreciate if you could check that the fix indeed solves your problem. As you can read I'm pretty much shooting in the dark here.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/xrobin/pROC/issues/25#issuecomment-386033326, or mute the thread https://github.com/notifications/unsubscribe-auth/ALznIldtx-5b-R3cFk8B99YWCHrtLLJKks5tudrcgaJpZM4TA6nQ .

xrobin commented 6 years ago

@ealeph you can clone the repository locally with git clone https://github.com/xrobin/pROC.git or download the code as a zip file, and then use devtool's install_local, pointing to the folder where the source code is:

devtools::install_local("pROC") 
#or
devtools::install_local("pROC-master") 

If you have RStudio you can also do File > Open Project and point to the default.Rproj file within the git repository that you just cloned/downloaded. Once it is open you can use menu Build > Install and Restart.

The RData file you sent wasn't copied to the GitHub issue so I can't test unfortunately.

xrobin commented 6 years ago

I am preparing a CRAN release. I tweaked the error message a bit, and made it dump some relevant data into a file in order to make it easier to debug if the problem persists.

xrobin commented 6 years ago

In addition, I ran the full test suite (including testthat tests with RUN_SLOW_TESTS and man page examples with \dontrun) with a stop() in if (any(o <- outer(thresholds, predictor,==))) { to make sure this is not triggered in normal use. All tests ran fine so I am happy it shouldn't impact the calculations in normal cases.

xrobin commented 6 years ago

pROC 1.12.0 is on CRAN now. Please update:

 install.packages("pROC")

It may take a few hours for Mac/Windows binaries to be available. If you still get the error, please open a new issue with the output of sessionInfo() and either steps to reproduce or the contents of the pROC_bug.RData dump file (make sure it doesn't contain sensitive data first before posting it as it will be available publicly).

ealeph commented 6 years ago

Hello,

Sorry for the late reply - I was away from the office for a few days. I just installed pROC 1.12.0 from CRAN and it works perfectly as before (Checked on R versions 3.4.0 and 3.4.4).

Thank you!

On Sat, May 5, 2018 at 6:06 PM, Xavier Robin notifications@github.com wrote:

pROC 1.12.0 is on CRAN now. Please update:

install.packages("pROC")

It may take a few hours for Mac/Windows binaries to be available. If you still get the error, please open a new issue with the output of sessionInfo() and either steps to reproduce or the contents of the pROC_bug.RData dump file (make sure it doesn't contain sensitive data first before posting it as it will be available publicly).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/xrobin/pROC/issues/25#issuecomment-386815889, or mute the thread https://github.com/notifications/unsubscribe-auth/ALznIqpJYqKZbfqY9rCHxxFcRhOO0dLrks5tvc4CgaJpZM4TA6nQ .

xrobin commented 6 years ago

Sure! By the way please make sure you get 1.12.1 because 1.12.0 has a major performance regression.