xrobin / pROC

Display and analyze ROC curves in R and S+
https://cran.r-project.org/web/packages/pROC/
GNU General Public License v3.0
121 stars 31 forks source link

Fast Calculation for Area Under ROC curve #17

Closed mfrasco closed 6 years ago

mfrasco commented 7 years ago

The area under the ROC curve can be calculated directly from a vector of predictions and a vector of binary labels using the Mann-Whitney U Test. Since this algorithm does not require calculating the ROC curve, it can provide a significant performance increase. My benchmarks on show that, on 10 thousand observations, this algorithm is 1,000 times faster than calculating AUROC with your package (2100 milliseconds seconds vs 2.3 milliseconds).

Would you be interested in adding a C++ implementation of this algorithm to your package? The speedup that this algorithm provides would be valuable for users who need to evaluate hundreds to thousands of models (e.g. with a grid search over a feature / hyper-parameter space).

If you are interested in this contributions to your package, please let me know.

xrobin commented 7 years ago

Hi Michael,

Of course I am always open to contributions.

First a question, how does this compare to the DeLong code that I have in src/delong.cpp? This code also calculates the AUC (it is returned in the theta item, see line 105), and they are supposed to be equivalent. You may also want to check the RcppBootstrap branch: xrobin/pROC/RcppBootstrap. I have a C++ implementation for ROC curve and AUC that is much faster than the current one (in src and inst/include specifically). I never finalized it and it is still failing some tests when direction is changed but otherwise should be working.

I guess the real question here is how to get rid of the overhead due to the construction of the ROC curve. I see the following things to think about:

mfrasco commented 7 years ago

Hi

Thanks for responding quickly. Yes, the code in src/delong.cpp is the fast algorithm that I am referring to. I looked at the source code for pROC::auc when a vector of responses and predictors was passed and saw that it called pROC::roc.default. The interface that I imagine would be most useful is a user-facing function that calculates AUC without creating the ROC curve as an intermediate step. For an example of the code that the user would call, look at Metrics::auc. Unfortunately, the implementation of that function struggles with integer overflow if the size of the data is large. Also, Metrics::auc uses base::rank which could be improved with a c++ version like the one provided by data.table::frank. However, this performance speed-up is small. On my computer, with 1 million observation, base::rank runs in 0.6 seconds and data.table::frank runs in 0.1 second. With 10 million observations, it is 11 seconds and 1 second.

Since this method is restricted to the full AUC, it might make sense for this functionality to be provided in a separate function that exists outside of the primary pROC::roc pipeline. What do you think? A primary use for this function would be to evaluate the performance of a model quickly when performing a search over the feature or hyper-parameter space.

xrobin commented 7 years ago

I agree it should be separate, and actually it sounds like it should be a separate package.

pROC does a lot of checks on the inputs, and accepts a pretty large range of formats (numeric, ordered, dealing with NAs etc.), using arbitrary levels and direction for the comparison. This has of course a significant impact on the runtime that you'll probably want to avoid if you're interested in pure speed. It has little impact when dealing with large data sets, but I can see a usefulness for your code also when dealing with a large number of curves. I think it would be confusing to have some functions not check their inputs as thoroughly in pROC and I'd rather avoid that.

A separate package such as fastAUC would give you the freedom to skip those checks altogether. Starting from the delong.cpp code it should be pretty straightforward. The function really only takes the case and control vectors as input and calculates the AUC. Or did you have a different implementation in mind?