goldingn closed 7 years ago
Here's a working version of a Binarise module that would work if we had Chain for outputs:
Binarise <- function (.model, .ras, threshold = 0.05) {
  # modify a model object to make binary predictions, based on an omission
  # threshold
  # get prediction threshold as a single numeric
  occ_data <- .model$data[.model$data$value == 1, ]
  cutoff <- quantile(occ_data$predictions, threshold)
  cutoff <- as.numeric(cutoff)
  # get old prediction code as a function
  old_fun_text <- sprintf("old_fun <- function (model, newdata) {%s}",
                          .model$model$code)
  # add a couple of extra lines to apply the cutoff
  new_code <- sprintf("%s
    p <- old_fun(model, newdata)
    p <- ifelse(p > %s, 1, 0)
    return(p)",
                      old_fun_text,
                      cutoff)
  # update the code & return
  .model$model$code <- new_code
  return (list(.model = .model, .ras = .ras))
}
Execute this after running the Feng_Papes workflow in the manuscript:
bin <- Binarise(.model = Feng_Papes$model.output[[1]],
                .ras = Feng_Papes$process.output[[1]]$ras)
zoon:::GetModule('PrintMap', forceReproducible = FALSE)
PrintMap(.model = bin$.model,
         .ras = bin$.ras)
I wonder if this could have been more easily solved by adding a threshold parameter to the PrintMap module. Looks like a useful feature?
In that case yes. There are various ways of calculating thresholds though, and lots of things that are done with thresholded predictions too. E.g. summarising total area considered suitable, identifying metapopulation structure, estimating populations at risk of disease.
Just to put another POV out there... (This should possibly be migrated to the zoon issues at some point).
In my head this is part of the model. We've moved from f(environment) = p(occurrence) to f(environment, threshold) = occurrence.
So perhaps this is a case where chaining models makes sense. Chain(RandomForest, Binarise) trains a random forest which returns a probability and then binarises it.
This makes sense in a number of cases. Instead of writing a binarisePlot module and a binarisePerformanceMeasures module and a binariseInteractivePlot etc. etc., you binarise at the modelling stage and then most of the output modules will automatically handle the output. Especially as output modules should be able to handle binary output anyway, as some ML methods only give binary outputs.
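A toy sketch of the chaining idea in plain R, assuming a model can be reduced to a predict function. The names chain_predict and toy_model are hypothetical and only stand in for the concept; zoon had no Chain for model modules at the time of this thread, so this is not how zoon does it.

```r
# Hypothetical helper: wrap a probability-returning predict function
# so that it returns 0/1 instead
chain_predict <- function(predict_fun, cutoff) {
  function(newdata) {
    p <- predict_fun(newdata)
    ifelse(p > cutoff, 1, 0)
  }
}

# toy stand-in for a fitted model: logistic response to one covariate
toy_model <- function(newdata) plogis(newdata$temp - 20)

# "chained" model: same inputs as toy_model, binary output
binary_model <- chain_predict(toy_model, cutoff = 0.5)

env <- data.frame(temp = c(15, 19, 25))
binary_pred <- binary_model(env)
```

Anything downstream that only calls the predict function would then see binary values without needing its own threshold logic.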
Also, the examples in Nick's comment above all make more sense if the model handles the threshold; then outputs can calculate total suitable area, etc.
The main problem I see with this view is that people may often want both the binarised and non-binarised versions of the model. To get that they would probably have to do list(Chain(RandomForest, Binarise), RandomForest). But this then is a slightly odd list. The workflow isn't being split and compared with list...
Anyway, my 1 or 2 cents.
That's a good point, thresholding is a weird one.
What would happen if someone used only the threshold model module, not in a chain though?
Hmmm...
Neither are good answers but possibly one of:
I guess it's slightly hard to know as we haven't defined Chained model modules yet. Feels a little complicated. If they are to be like process modules they would have to accept and return the same arguments, and I can't quite see how that will work for the first module in the Chain.
I'm coming round to it being part of the post processing (i.e. output). If outputs have Chains, the binarisePerformanceMeasures, binariseInteractivePlot issue doesn't hold.
Was just about to open a new issue about this. Forgotten we'd already discussed it.
I think for now I'll add different methods to PrintMap. Want to get this going and submitted...
Cleanest that I can think of is to add a threshmethod argument. Makes it extendable.
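To make the idea concrete, here is a small sketch of how a method-switching argument could look. This is not the PrintMap implementation: calc_threshold, its two method names and its value argument are all hypothetical, chosen only to show the dispatch pattern.

```r
# Hypothetical sketch, not the actual PrintMap code
calc_threshold <- function(pred, observed,
                           threshmethod = c("fixed", "omission"),
                           value = 0.5) {
  # match.arg() rejects method names that are not in the list above
  threshmethod <- match.arg(threshmethod)
  switch(threshmethod,
         # use `value` directly as the cutoff
         fixed = value,
         # cutoff that omits a fraction `value` of the presence records
         omission = as.numeric(quantile(pred[observed == 1], probs = value)))
}

# made-up predictions and presence/absence records
pred <- c(0.1, 0.4, 0.6, 0.8, 0.9)
observed <- c(0, 0, 1, 1, 1)

fixed_cut <- calc_threshold(pred, observed, "fixed", value = 0.5)
omission_cut <- calc_threshold(pred, observed, "omission", value = 0.05)
```

Supporting another threshold calculation would then mean adding one name to the threshmethod vector and one branch to the switch.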
Just to add:
Let me know any other commonly used threshold calcs. Might as well try and add enough to cover 95% of analyses.
Fix in PR #36 here and in PR to modules. https://github.com/zoonproject/modules/pull/112
Nice one Tim!
Feng & Papes converted their armadillo suitability maps to binary presence/absence maps using the 5% omission error threshold (keeping 95% occurrences in presence area). To replicate their analysis, we should do the same.
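As a rough illustration of the 5% omission rule, here is a minimal base-R sketch. The prediction values are made up for the example, and omission_cutoff is a hypothetical helper name rather than anything in zoon or in Feng & Papes' code; it assumes the cutoff is taken as the 5% quantile of predicted suitability at the occurrence records.

```r
# Hypothetical helper: cutoff that omits a given fraction of occurrence records
omission_cutoff <- function(pred_at_presences, omission = 0.05) {
  # the `omission` quantile of the predictions at known presences
  as.numeric(quantile(pred_at_presences, probs = omission))
}

# made-up predicted suitabilities at 20 presence records
pres_pred <- (1:20) / 20
cutoff <- omission_cutoff(pres_pred, omission = 0.05)

# binarise any vector of predictions with that cutoff
new_pred <- c(0.02, 0.30, 0.95)
binary <- ifelse(new_pred > cutoff, 1, 0)

# share of presence records that fall in the predicted presence area
retained <- mean(pres_pred > cutoff)
```

With these toy values one of the 20 presence records falls below the cutoff, so 95% of occurrences are kept in the presence area.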
We could define an output module which overwrites the predict method in the ZoonModel object to binarise it. It would be really nice if this could be chained into a map visualisation module. Chaining outputs is still under discussion though (see zoon issue).