zoonproject / zoon

The zoon R package

add 'modify' module type #392

Open goldingn opened 7 years ago

goldingn commented 7 years ago

There's interest in being able to do ensemble SDMs and stacked SDMs in zoon. We've also run into some awkwardness with thresholding and MESS masks, which need to be applied to rasters either before or after modelling.

In the past we've briefly thought about changing the core setup to enable things like ensemble models, but haven't settled on a way of integrating it into zoon's interoperable module types. We've just had a little brainstorming session here, and come up with something that might work well within what zoon already does. I'd be keen to hear your thoughts.


We could add an additional module type modify (name up for discussion) between the model and output steps. modify would take as input a list of ZoonModel objects (returned by one or more model modules) and return a list of ZoonModel objects, of the same or different length. The ZoonModel objects would then be pulled out of the list and passed to the output modules.

In the default case, the input and output lists would be the same, so the workflow would run as it currently does (a 'noModify' modify module could be used by default, for backwards compatibility). E.g.:

workflow(occurrence = .,
         covariate = .,
         process = .,
         model = list(., ., .),
         output = .)

(three outputs, one per model; modify has 'noModify' as its default argument, so it need not be specified)
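To make the list-in, list-out contract concrete, here is a minimal sketch in plain R. It does not use zoon's real ZoonModel class: `make_model` is a hypothetical stand-in that just holds a name and a predict function, and `noModify` is written as a bare function rather than a proper zoon module.

```r
# Hypothetical stand-in for a fitted model (NOT zoon's real ZoonModel class):
# a list holding a name and a predict function.
make_model <- function(name, predict_fun) {
  list(name = name, predict = predict_fun)
}

# A modify module maps a list of models to a list of models.
# The default simply hands the input list back unchanged.
noModify <- function(models) {
  models
}

# Three toy "fitted" models, as if returned by three model modules.
models <- list(
  make_model("a", function(newdata) plogis(newdata$x)),
  make_model("b", function(newdata) plogis(2 * newdata$x)),
  make_model("c", function(newdata) plogis(-newdata$x))
)

modified <- noModify(models)
model_names <- vapply(modified, function(m) m$name, character(1))
```

Each element of `modified` would then be passed to the output modules one at a time, exactly as happens now.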

If the user provided a modify module like 'threshold', that module would return a list of ZoonModel objects, with prediction methods modified to predict 1 above the threshold or 0 below. This could be handled by nesting one ZoonModel inside another, or by adding a new decorator function. These could be chained to do multiple things. E.g.:

workflow(occurrence = .,
         covariate = .,
         process = .,
         model = list(., ., .),
         modify = chain(threshold(0.5), clamp),
         output = .)

(three outputs, one per model, with predictions set to 0 or 1 and clamped to the extreme values of the observed data)
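A rough sketch of the decorator idea in plain R, under several assumptions: models are hypothetical lists with a `predict` element (not real ZoonModel objects), `clampX` is a hypothetical clamp that takes the observed covariate range explicitly, and predictions exactly at the threshold are treated as 1.

```r
# Hypothetical stand-in for a fitted model (not zoon's ZoonModel class).
make_model <- function(name, predict_fun) {
  list(name = name, predict = predict_fun)
}

# threshold(cutoff) returns a modify module that wraps each predict function.
# Assumption: predictions exactly at the cutoff count as 1.
threshold <- function(cutoff) {
  force(cutoff)
  function(models) {
    lapply(models, function(m) {
      inner <- m$predict
      m$predict <- function(newdata) as.numeric(inner(newdata) >= cutoff)
      m
    })
  }
}

# Hypothetical clamp: squeezes covariate x into [lower, upper] before predicting.
clampX <- function(lower, upper) {
  force(lower); force(upper)
  function(models) {
    lapply(models, function(m) {
      inner <- m$predict
      m$predict <- function(newdata) {
        newdata$x <- pmin(pmax(newdata$x, lower), upper)
        inner(newdata)
      }
      m
    })
  }
}

# chain() composes modify modules, applying them left to right.
chain <- function(...) {
  modules <- list(...)
  function(models) Reduce(function(acc, f) f(acc), modules, models)
}

models <- list(
  make_model("a", function(newdata) plogis(newdata$x)),
  make_model("b", function(newdata) plogis(-newdata$x))
)

modify <- chain(threshold(0.5), clampX(-1, 1))
modified <- modify(models)
newdata <- data.frame(x = c(-2, 0.5, 3))
preds <- lapply(modified, function(m) m$predict(newdata))
```

Because each module wraps the previous predict function, the clamp (applied last) runs first at prediction time, then the thresholded model sees the clamped covariates.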

If the user provided a modify module like 'ensemble', that module would return a list of only one ZoonModel object, making predictions from the ensemble. E.g.:

workflow(occurrence = .,
         covariate = .,
         process = .,
         model = list(., ., .),
         modify = ensemble(weight = TRUE),
         output = .)

(one output, for an ensemble model making averaged predictions)
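As an illustration only, an ensemble modify module could look something like the sketch below. The models are hypothetical lists with a `predict` element, and the `weights` argument is an assumption: the proposal only says `weight = TRUE` and does not say where the weights come from.

```r
# Hypothetical stand-in for a fitted model (not zoon's ZoonModel class).
make_model <- function(name, predict_fun) {
  list(name = name, predict = predict_fun)
}

# ensemble(weights) returns a modify module that collapses a list of models
# into a list of ONE model predicting the weighted mean.
# 'weights' is a hypothetical argument; NULL means equal weights.
ensemble <- function(weights = NULL) {
  function(models) {
    if (is.null(weights)) weights <- rep(1, length(models))
    stopifnot(length(weights) == length(models))
    w <- weights / sum(weights)
    combined <- make_model("ensemble", function(newdata) {
      # One column of predictions per component model.
      preds <- do.call(cbind, lapply(models, function(m) m$predict(newdata)))
      as.vector(preds %*% w)
    })
    list(combined)
  }
}

models <- list(
  make_model("low",  function(newdata) rep(0.2, nrow(newdata))),
  make_model("high", function(newdata) rep(0.8, nrow(newdata)))
)

modified <- ensemble(weights = c(1, 3))(models)
pred <- modified[[1]]$predict(data.frame(x = 1:2))
```

With weights 1 and 3, the second model contributes three quarters of the averaged prediction.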

Similarly a 'stack' modify module would return a list with a single ZoonModel object to predict the number of species (like an abundance model). Users could list modify modules if they wanted, to return both the original models, and the ensemble models:

workflow(occurrence = .,
         covariate = .,
         process = .,
         model = list(., ., .),
         modify = list(noModify, ensemble(weight = TRUE)),
         output = .)
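One way the list form could behave is to run every modify module on the same input and concatenate the results. The sketch below assumes that, and also assumes a 'stack' module that simply sums the per-model probabilities; `runModify` and `stackModels` are hypothetical names, and the models are plain lists rather than real ZoonModel objects.

```r
# Hypothetical stand-in for a fitted model (not zoon's ZoonModel class).
make_model <- function(name, predict_fun) {
  list(name = name, predict = predict_fun)
}

noModify <- function(models) models

# Hypothetical 'stack' module: one model whose prediction is the sum of the
# component predictions (e.g. expected number of species present).
stackModels <- function(models) {
  list(make_model("stack", function(newdata) {
    Reduce(`+`, lapply(models, function(m) m$predict(newdata)))
  }))
}

# Run each modify module on the same input list and concatenate the outputs.
runModify <- function(modules, models) {
  do.call(c, lapply(modules, function(f) f(models)))
}

models <- list(
  make_model("sp1", function(newdata) rep(0.2, nrow(newdata))),
  make_model("sp2", function(newdata) rep(0.5, nrow(newdata))),
  make_model("sp3", function(newdata) rep(0.9, nrow(newdata)))
)

out <- runModify(list(noModify, stackModels), models)
out_names <- vapply(out, function(m) m$name, character(1))
richness <- out[[4]]$predict(data.frame(x = 1:2))
```

Here the output step would receive four models: the three originals plus the stacked one.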

This would take a little work, but not too much. It would also make zoon a more attractive prospect to ensemblers and richness modellers.

What do you reckon?

timcdlucas commented 7 years ago

It's late, so I'll just garble some words and hope they make vague sense...

1) It's great that there's interest. Guiding these decisions by what people want is ideal.

2) Just to check, by stacked SDMs do you mean using outputs from one SDM as an input to another, e.g. macaques -> knowlesi? (P.S. anyone you know mapping that Brazilian howler monkey malaria?)

3) As part of the discussion I guess it's worth mentioning how some of these things would fit without adding a new module type.

3a) Stacked SDMs. This could be a covariate module that takes a workflow object or the call that would make a workflow.

workflow(occurrence = KnowlesiiData,
         covariate = StackedSDM({
           workflow(occurrence = MacaqueData,
                    covariate = .,
                    process = .,
                    model = .,
                    output = .)
         }),
         process = .,
         model = list(., ., .),
         output = .)

Type thing. I believe this keeps the whole analysis reproducible in a way that anything that takes workflow objects as arguments wouldn't.

3b) Ensemble. I think 99% of use cases could be covered with a fairly simple model module (in fact it's on my to do list). Just a model module that allows for multiple models from MachineLearn, MaxEnt and Biomod with a few reasonable options for the top level learner would be pretty comprehensive. My aim was to add weighted average, elasticnet, perhaps a further call to caret (giving loads of options) and INLA (i.e. Sam's "stacked generalisation") as top level learners. Given the building blocks that are/will be in place, this isn't a massive amount of effort.

4) The other thing that came up at the workshop, for example, is bootstrapping. I believe replicate + modify would handle this.

goldingn commented 7 years ago

3a) By stacked SDM I mean running presence-absence SDMs for lots of different species, then summing the probabilities of presence to make predictions of species richness. I.e. like an ensemble, but for different species. So slightly different to your example & I'm not sure how it could be made to work with the existing zoon setup.

3b) Yeah, an option for the ensemble would be a type of meta module (#176) that allows the user to provide a bunch of model modules, like this:

model = ensemble(models = list(MaxEnt,
                               LogisticRegression,
                               MachineLearn("something"),
                               MachineLearn("something_else")),
                 weighting = "AUC")

(where weighting could also be a stacker model, like the one you would use, though that isn't common in SDM). Is that what you meant, or were you thinking of hard-coding the models, like the existing BiomodModel module?
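For what it's worth, one reading of weighting = "AUC" is weights proportional to each component model's AUC. A small base-R sketch of that reading follows; the `auc` helper and the toy predictions are hypothetical, and in practice the AUCs would come from cross-validation rather than the handful of points used here.

```r
# AUC via the rank-sum (Mann-Whitney) identity; obs is 0/1, pred is numeric.
auc <- function(pred, obs) {
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(rank(pred)[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy observations and predictions from two hypothetical component models.
obs <- c(0, 0, 1, 1)
preds <- list(
  good = c(0.1, 0.2, 0.8, 0.9),
  weak = c(0.1, 0.8, 0.2, 0.9)
)

# Assumption: ensemble weights are the AUCs rescaled to sum to one.
aucs <- vapply(preds, auc, numeric(1), obs = obs)
weights <- aucs / sum(aucs)

# Weighted-average ensemble prediction for the same four points.
ensemble_pred <- as.vector(do.call(cbind, preds) %*% weights)
```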

One downside of this is that it would be hard to investigate each of the component models, by passing them to output. I don't think zoon would record the module versions either. Though neither of these are critical.


P.S. no one I know is modelling the new malaria. Could re-use Freya and Catherine's knowlesi reservoir maps though?