xrobin / pROC

Display and analyze ROC curves in R and S+
https://cran.r-project.org/web/packages/pROC/
GNU General Public License v3.0

roc.utils.thresholds function #77

Closed marvinquiet closed 4 years ago

marvinquiet commented 4 years ago

**Describe the bug**

Line 121: `if (thresholds[tie.idx] == unique.candidates[tie.idx - 1]) {`

When `tie.idx = 1`, this statement throws the error "argument is of length zero".

**To Reproduce**
Steps to reproduce the behavior:

  1. What packages were loaded? Run sessionInfo() and report the output.
    
    R version 3.6.3 (2020-02-29)
    Platform: x86_64-apple-darwin15.6.0 (64-bit)
    Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
RNG: Mersenne-Twister
Normal: Inversion
Sample: Rounding

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel tools stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.8 fgsea_1.12.0 Rcpp_1.0.4.6 EnhancedVolcano_1.4.0 RColorBrewer_1.1-2
[6] monocle3_0.2.1 SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.3 BiocParallel_1.20.1
[11] matrixStats_0.56.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 IRanges_2.20.2 S4Vectors_0.24.4
[16] Biobase_2.46.0 BiocGenerics_0.32.0 pROC_1.16.2 forcats_0.5.0 stringr_1.4.0
[21] purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 tibble_3.0.1 tidyverse_1.3.0
[26] dichromat_2.0-0 ggrepel_0.8.2 reshape2_1.4.4 gplots_3.0.3 ggplot2_3.3.0
[31] dplyr_0.8.5

loaded via a namespace (and not attached):
[1] colorspace_1.4-1 deldir_0.1-25 ellipsis_0.3.1 class_7.3-17 XVector_0.26.0 fs_1.4.1
[7] rstudioapi_0.11 proxy_0.4-24 farver_2.0.3 RSpectra_0.16-0 fansi_0.4.1 lubridate_1.7.8
[13] xml2_1.3.2 splines_3.6.3 codetools_0.2-16 jsonlite_1.6.1 broom_0.5.6 dbplyr_1.4.3
[19] pheatmap_1.0.12 uwot_0.1.8 BiocManager_1.30.10 compiler_3.6.3 httr_1.4.1 backports_1.1.7
[25] assertthat_0.2.1 Matrix_1.2-18 cli_2.0.2 igraph_1.2.5 coda_0.19-3 gtable_0.3.0
[31] glue_1.4.1 GenomeInfoDbData_1.2.2 RANN_2.6.1 gmodels_2.18.1 fastmatch_1.1-0 slam_0.1-47
[37] cellranger_1.1.0 raster_3.1-5 vctrs_0.3.0 spdep_1.1-3 gdata_2.18.0 nlme_3.1-148
[43] DelayedMatrixStats_1.8.0 rvest_0.3.5 lifecycle_0.2.0 irlba_2.3.3 gtools_3.8.2 LearnBayes_2.15.1
[49] MASS_7.3-51.6 zlibbioc_1.32.0 scales_1.1.1 hms_0.5.3 expm_0.999-4 leidenbase_0.1.0
[55] gridExtra_2.3 stringi_1.4.6 e1071_1.7-3 caTools_1.18.0 boot_1.3-25 spData_0.3.5
[61] rlang_0.4.6 pkgconfig_2.0.3 bitops_1.0-6 lattice_0.20-41 sf_0.9-3 labeling_0.3
[67] tidyselect_1.1.0 RcppAnnoy_0.0.16 plyr_1.8.6 magrittr_1.5 R6_2.4.1 generics_0.0.2
[73] DBI_1.1.0 pillar_1.4.4 haven_2.3.0 withr_2.2.0 units_0.6-6 RCurl_1.98-1.2
[79] sp_1.4-2 modelr_0.1.8 crayon_1.3.4 KernSmooth_2.23-17 viridis_0.5.1 grid_3.6.3
[85] readxl_1.3.1 reprex_0.3.0 digest_0.6.25 classInt_0.4-3 pbmcapply_1.5.0 munsell_0.5.0
[91] viridisLite_0.3.0


2. What command did you run?

    pROC_obj <- roc(labels, predictors, direction=c("<"))
    coords(pROC_obj, ret = c("tpr", "fpr"), transpose=FALSE) # this causes the error



3. What data did you use? Use `save(myData, file="data.RData")` or `save.image("data.RData")`
4. What error or output did you get?
_argument is of length zero_

**Expected behavior**

I expected to get the list of TPR and FPR values from the `coords()` function.

**Additional context**

I just wonder if we could change this line of code into:
`if (tie.idx > 1 & thresholds[tie.idx] == unique.candidates[tie.idx - 1]) {`
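
One caveat on the guard proposed above, sketched in base R with made-up vectors (the names mirror the issue, but the values are hypothetical and this is not pROC's code): `&` is vectorized and evaluates both sides, so when `tie.idx` is 1 the comparison is still zero-length and `if` would fail in the same way. The short-circuiting `&&` skips the right-hand side instead.

```r
# Hypothetical stand-ins for the vectors discussed in the issue.
thresholds <- c(-Inf, 0.5, 1.5)
unique.candidates <- c(0.5, 1.5)
tie.idx <- 1

# Index 0 selects nothing, so the comparison has length zero.
cmp <- thresholds[tie.idx] == unique.candidates[tie.idx - 1]
length(cmp)

# Vectorized `&` keeps the zero length, so `if` would still fail.
guard_vectorized <- (tie.idx > 1) & cmp
length(guard_vectorized)

# `&&` short-circuits: the right-hand side is never evaluated.
guard_short <- tie.idx > 1 && thresholds[tie.idx] == unique.candidates[tie.idx - 1]
guard_short

# Confirm that `if` on the vectorized guard raises an error.
failed <- inherits(try(if (guard_vectorized) NULL, silent = TRUE), "try-error")
failed
```

Under these assumptions, a guard written with `&&` would be the safer variant of the suggested change.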

BTW, thank you for your efforts in implementing this R package!!

xrobin commented 4 years ago

Hi, thanks for the report!

I don't see how `tie.idx` could possibly be empty. Could you attach the data that triggers this problem? It would help me find out exactly what's going on and fix it.

This piece of code is pretty tricky. It's there to handle cases where the mean of two numbers can't be represented due to the limited precision of floating-point numbers. But this precision is really high, so it's extremely unlikely to happen in the first place. In any case, it's quite important to select the right threshold to be numerically accurate, so I'll need a test case to be able to do it right.

As a quick workaround, you could try adding a very small jitter to your data:

    predictors <- jitter(predictors, factor=1e-12)

which will break the near-ties. Make sure the factor is low enough not to affect your data in a meaningful way. Here I used 1e-12; I doubt you have numbers down to that precision.
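
To check that a jitter of this size is harmless, one could compare the data before and after. The sketch below uses made-up predictor values and an arbitrary seed (both assumptions, not taken from the report), and it first confirms the data are finite, since jitter is only meaningful in that case.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical predictor values containing an exact tie.
predictors <- c(0.1, 0.2, 0.2, 0.3)

# Jitter only makes sense for finite data, so check that first.
all_finite <- all(is.finite(predictors))

jittered <- jitter(predictors, factor = 1e-12)

# Largest change introduced by the jitter; it should be negligible
# compared with the scale of the data.
max_shift <- max(abs(jittered - predictors))
max_shift
```

If `max_shift` is many orders of magnitude below the precision of the measurements, the jittered values can reasonably stand in for the originals.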

marvinquiet commented 4 years ago

Hi, Thank you for your prompt reply!

Please find the data attached. It contains two variables, values and labels; I used labels as the response and values as the predictors. pROC_test.RData.gz

At first, I thought it was caused by `tie.idx = 1`, which makes `tie.idx - 1 = 0`, whereas R indices start at 1.

> if (thresholds[1] == unique.candidates[0]) {print("test")}
Error in if (thresholds[1] == unique.candidates[0]) { : 
  argument is of length zero
> length(thresholds)
[1] 18819
> length(unique.candidates)
[1] 18818

It seems that adding jitter does not work for this tie problem. Please let me know if there is anything else I can help with.

xrobin commented 4 years ago

I can see that you have a -Inf value in values. Indeed jitter is not going to help.

range(values)
[1]        -Inf 26.96443

Infinite values are generally disallowed in ROC curves. The reason is that a ROC curve must test all thresholds from -Inf to +Inf, so it is difficult to compare your -Inf value with the -Inf threshold.

Although it is possible to compute `-Inf <= -Inf` in R, most packages, when supplied with an infinite value, may generate an "invalid" ROC curve that does not reach the points (0, 0) or (1, 1), or worse, an inaccurate one. This wreaks havoc on AUC calculations in particular and is generally undesirable. To avoid that, pROC rejects inputs containing infinite values.
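
The endpoint problem can be seen in a small hand-rolled sketch. It is not pROC's internal code: the scores are made up, and the rule "positive when the score is strictly greater than the threshold" is an assumption chosen for illustration.

```r
# Made-up scores; the first case has an infinite score.
scores  <- c(-Inf, 0.2, 0.7, 1.5)
is_case <- c(TRUE, FALSE, TRUE, TRUE)

# Assumed rule for this sketch: positive when score > threshold.
tpr_at <- function(threshold) mean(scores[is_case] > threshold)
fpr_at <- function(threshold) mean(scores[!is_case] > threshold)

# At the lowest threshold every observation should be positive,
# which would give the point (1, 1).
low_point <- c(fpr = fpr_at(-Inf), tpr = tpr_at(-Inf))
low_point

# The -Inf case is not greater than the -Inf threshold,
# so the TPR stays below 1 and the curve misses (1, 1).
-Inf > -Inf
```

With a non-strict `>=` rule the same thing would happen at the other end: a +Inf score would still count as positive at the +Inf threshold, so the curve would miss (0, 0).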

At this point I don't know why pROC didn't display an error message for your data. I will investigate that.

Regarding your analysis, you should probably remove the infinite value from your data like you would remove a missing value.
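
A minimal sketch of that clean-up in base R, using hypothetical data that only mirrors the `values` and `labels` variable names from the attachment: `is.finite()` is `FALSE` for `-Inf`, `Inf`, `NA` and `NaN`, so one mask removes infinite and missing values together, and applying it to both vectors keeps them aligned.

```r
# Hypothetical data mirroring the `values` and `labels` variables.
values <- c(-Inf, 1.2, 3.4, NA, 26.96443)
labels <- c(0, 0, 1, 1, 1)

# is.finite() is FALSE for -Inf, Inf, NA and NaN.
keep <- is.finite(values)

# Apply the same mask to both vectors so they stay aligned.
values_clean <- values[keep]
labels_clean <- labels[keep]

range(values_clean)
```

The cleaned vectors could then be passed to `roc()` in place of the originals.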

marvinquiet commented 4 years ago

Yes, I guess that's why it generates two ties there. I can definitely remove the -Inf and try it again! Thanks so much for your quick reply and support!