Closed mfrasco closed 6 years ago
Hi Michael,
Of course I am always open to contributions.
First a question, how does this compare to the DeLong code that I have in src/delong.cpp
? This code also calculates the AUC (it is returned in the theta
item, see line 105), and they are supposed to be equivalent.
You may also want to check the RcppBootstrap branch: xrobin/pROC/RcppBootstrap. I have a C++ implementation for ROC curve and AUC that is much faster than the current one (in src
and inst/include
specifically). I never finalized it and it is still failing some tests when direction
is changed but otherwise should be working.
I guess the real question here is how to get rid of the overhead due to the construction of the ROC curve. I see the following things to think about:
roc()
object and then call auc()
. At the moment you can call auc
directly with the data. The AUC object contains a reference to the ROC curve (actually a copy) but I think it would be possible to either change that, or introduce a roc.auc
constructor (that might have an overhead too). Do you have an example code in mind the the user would call?Hi
Thanks for responding quickly. Yes, the code in src/delong.cpp
is the fast algorithm that I am referring to. I looked at the source code for pROC::auc
when a vector of responses and predictors was passed and saw that it called pROC::roc.default
. The interface that I imagine would be most useful is a user-facing function that calculates AUC without creating the ROC curve as an intermediate step. For an example of the code that the user would call, look at Metrics::auc
. Unfortunately, the implementation of that function struggles with integer overflow if the size of the data is large. Also, Metrics::auc
uses base::rank
which could be improved with a c++ version like the one provided by data.table::frank
. However, this performance speed-up is small. On my computer, with 1 million observation, base::rank
runs in 0.6 seconds and data.table::frank
runs in 0.1 second. With 10 million observations, it is 11 seconds and 1 second.
Since this method is restricted to the full AUC, it might make sense for this functionality to be provided in a separate function that exists outside of the primary pROC::roc
pipeline. What do you think? A primary use for this function would be to evaluate the performance of a model quickly when performing a search over the feature or hyper-parameter space.
I agree it should be separate, and actually it sounds like it should be a separate package.
pROC does a lot of checks on the inputs, and accepts a pretty large range of formats (numeric, ordered, dealing with NAs etc.), using arbitrary levels and direction for the comparison. This has of course a significant impact on the runtime that you'll probably want to avoid if you're interested in pure speed. It has little impact when dealing with large data sets, but I can see a usefulness for your code also when dealing with a large number of curves. I think it would be confusing to have some functions not check their inputs as thoroughly in pROC and I'd rather avoid that.
A separate package such as fastAUC
would give you the freedom to skip those checks altogether. Starting from the delong.cpp code it should be pretty straightforward. The function really only takes the case and control vectors as input and calculates the AUC. Or did you have a different implementation in mind?
The area under the ROC curve can be calculated directly from a vector of predictions and a vector of binary labels using the Mann-Whitney U Test. Since this algorithm does not require calculating the ROC curve, it can provide a significant performance increase. My benchmarks on show that, on 10 thousand observations, this algorithm is 1,000 times faster than calculating AUROC with your package (2100 milliseconds seconds vs 2.3 milliseconds).
Would you be interested in adding a C++ implementation of this algorithm to your package? The speedup that this algorithm provides would be valuable for users who need to evaluate hundreds to thousands of models (e.g. with a grid search over a feature / hyper-parameter space).
If you are interested in this contributions to your package, please let me know.