zoonproject / zoon_app_paper

A reproducible manuscript describing the zoon R package

binary outputs #21

Closed goldingn closed 7 years ago

goldingn commented 8 years ago

Feng & Papes converted their armadillo suitability maps to binary presence/absence maps using the 5% omission error threshold (keeping 95% occurrences in presence area). To replicate their analysis, we should do the same.
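As a rough illustration of the 5% omission threshold, here is a minimal base R sketch. The suitability values are made up, and `occ_pred`, `omission` and `binary` are hypothetical names, not anything from Feng & Papes or from zoon.

```r
# Toy data: predicted suitability at 20 known occurrence points (made-up values)
occ_pred <- (1:20) / 20

# 5% omission threshold: the 5th percentile of predictions at occurrences
omission <- 0.05
cutoff <- as.numeric(quantile(occ_pred, probs = omission))

# Binarise: 1 = predicted presence, 0 = predicted absence
binary <- as.integer(occ_pred > cutoff)

# Proportion of occurrences kept inside the predicted presence area
mean(binary)
```

With these toy values only the lowest-scoring occurrence falls below the cutoff, so 95% of occurrences stay in the presence area.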

We could define an output module which overwrites the predict method in the ZoonModel object to binarise it. It would be really nice if this could be chained into a map visualisation module. Chaining outputs is still under discussion though (see zoon issue).

goldingn commented 8 years ago

Here's a version of a Binarise module that would work if we had Chain for outputs:

Binarise <- function (.model, .ras, threshold = 0.05) {
  # modify a model object to make binary predictions, based on an omission
  # threshold

  # get prediction threshold as a single numeric
  occ_data <- .model$data[.model$data$value == 1, ]
  cutoff <- quantile(occ_data$predictions, threshold)
  cutoff <- as.numeric(cutoff)

  # get old prediction code as a function
  old_fun_text <- sprintf("old_fun <- function (model, newdata) {%s}", 
                         .model$model$code)

  # add a couple of extra lines to apply the cutoff
  new_code <- sprintf("%s
                      p <- old_fun(model, newdata)
                      p <- ifelse(p > %s, 1, 0)
                      return(p)",
                      old_fun_text,
                      cutoff)

  # update the code & return
  .model$model$code <- new_code  
  return (list(.model = .model, .ras = .ras))
}

Execute this after running the Feng_Papes workflow in the ms:

bin <- Binarise(.model = Feng_Papes$model.output[[1]],
                .ras = Feng_Papes$process.output[[1]]$ras)

zoon:::GetModule('PrintMap', forceReproducible = FALSE)
PrintMap(.model = bin$.model,
         .ras = bin$.ras)


AugustT commented 8 years ago

I wonder if this could have been more easily solved by adding a threshold parameter to the PrintMap module, looks like a useful feature?

goldingn commented 8 years ago

In that case yes. There are various ways of calculating thresholds though, and lots of things that are done with thresholded predictions too. E.g. summarising total area considered suitable, identifying metapopulation structure, estimating populations at risk of disease.
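For instance, total suitable area could be summarised from a thresholded prediction roughly like this. It is a sketch in base R on a toy matrix; the grid values, the cutoff and the cell area (`cell_area_km2`) are all assumptions for illustration.

```r
# Toy suitability surface as a 2 x 3 grid (made-up values)
suit <- matrix(c(0.1, 0.4, 0.8, 0.9, 0.2, 0.6), nrow = 2)

# Assumed cutoff and assumed area of each grid cell
cutoff <- 0.5
cell_area_km2 <- 25

# Threshold the surface, then sum the area of the cells classed as suitable
binary <- suit > cutoff
suitable_area_km2 <- sum(binary) * cell_area_km2
suitable_area_km2
```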

timcdlucas commented 8 years ago

Just to put another POV out there... (This should possibly be migrated to the zoon issues at some point).

In my head this is part of the model. We've moved from f(environment) = p(occurrence) to f(environment, threshold) = occurrence.

So perhaps this is a case where chaining models makes sense. Chain(RandomForest, Binarise) trains a random forest which returns a probability and then binarises it.

This makes sense in a number of cases. Instead of writing a binarisePlot module and a binarisePerformanceMeasures module and a binariseInteractivePlot etc. etc. you binarise at the modelling stage and then most of the output modules will automatically handle the output. Especially as output modules should be able to handle binary output anyway as some ML methods only give binary outputs.

Also the examples in Nick's comment above all make more sense if the model handles the threshold then outputs can calculate total suitable area, etc.
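A minimal sketch of that idea outside zoon, assuming a plain `glm` stands in for the random forest: the probabilistic predictor is wrapped so that anything downstream only ever sees 0/1. `predict_prob`, `binarise` and `predict_bin` are hypothetical names, not zoon functions, and this is not how chained model modules would necessarily be implemented.

```r
# Toy presence/absence data with one covariate (made-up values)
dat <- data.frame(
  value = c(0, 0, 0, 1, 0, 1, 1, 1),
  x     = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Step 1: a probabilistic model, f(environment) = p(occurrence)
fit <- glm(value ~ x, data = dat, family = binomial)
predict_prob <- function(newdata) {
  as.numeric(predict(fit, newdata = newdata, type = "response"))
}

# Step 2: wrap it so that f(environment, threshold) = occurrence
binarise <- function(predict_fun, cutoff) {
  force(predict_fun)
  force(cutoff)
  function(newdata) as.integer(predict_fun(newdata) > cutoff)
}

# Downstream code calls predict_bin() and only ever receives 0s and 1s
predict_bin <- binarise(predict_prob, cutoff = 0.5)
predict_bin(data.frame(x = c(1, 8)))
```

Keeping both versions then just means holding on to `predict_prob` as well as `predict_bin`.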

The main problem I see with this view is that people may often want both the binarised and non-binarised versions of the model. To get that they would probably have to do list(Chain(RandomForest, Binarise), RandomForest). But this is then a slightly odd list. The workflow isn't being split and compared with list...

Anyway, my 1 or 2 cents.

goldingn commented 8 years ago

That's a good point, thresholding is a weird one.

What would happen if someone used only the threshold model module, not in a chain though?

timcdlucas commented 8 years ago

Hmmm...

Neither are good answers but possibly one of:

  1. Defaults to fitting a logistic lm
  2. A threshold on the first environmental variable

I guess it's slightly hard to know as we haven't defined Chained model modules yet. Feels a little complicated. If they are to be like process modules they would have to accept and return the same arguments, and I can't quite see how that would work for the first module in the Chain.

I'm coming round to it being part of the post processing (i.e. output). If outputs have Chains, the binarisePerformanceMeasures, binariseInteractivePlot issue doesn't hold.

timcdlucas commented 7 years ago

Was just about to open a new issue about this. I'd forgotten we'd already discussed it.

I think for now I'll add different methods to PrintMap. Want to get this going and submitted...

Cleanest that I can think of is to add a threshmethod argument. Makes it extendable.
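Roughly the shape I have in mind, as a standalone sketch rather than the actual PrintMap code. The helper name `calc_cutoff` and the two method names are hypothetical; only the `threshmethod` argument name comes from this thread.

```r
# Hypothetical helper: pick a cutoff according to the chosen method
calc_cutoff <- function(pred, obs, threshold = 0.5,
                        threshmethod = c("probability", "quantile")) {
  threshmethod <- match.arg(threshmethod)
  switch(threshmethod,
    # use the threshold directly as a probability cutoff
    probability = threshold,
    # omission-style: quantile of predictions at observed presences
    quantile = as.numeric(quantile(pred[obs == 1], probs = threshold))
  )
}

# Toy predictions and observations (made-up values)
pred <- c(0.1, 0.3, 0.6, 0.7, 0.9)
obs  <- c(0, 0, 1, 1, 1)

cut_prob <- calc_cutoff(pred, obs, threshold = 0.5, threshmethod = "probability")
cut_quant <- calc_cutoff(pred, obs, threshold = 0.05, threshmethod = "quantile")
```

New threshold calculations would then just be extra branches in the `switch`, and `match.arg` rejects any method name that isn't listed.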

timcdlucas commented 7 years ago

Just to add:

Let me know any other commonly used threshold calcs. Might as well try to add enough to cover 95% of analyses.

timcdlucas commented 7 years ago

Fixed in PR #36 here and in a PR to modules: https://github.com/zoonproject/modules/pull/112

goldingn commented 7 years ago

Nice one Tim!