Open goldingn opened 7 years ago
It's late, so I'll just garble some words and hope they make vague sense...
1) It's great that there's interest. Guiding these decisions by what people want is ideal.
2) Just to check, by stacked SDMs do you mean using outputs from one SDM as an input to another, e.g. macaques -> knowlesi? (P.S. anyone you know mapping that Brazilian howler monkey malaria?)
3) As part of the discussion I guess it's worth mentioning how some of these things would fit without adding a new module type.
3a) Stacked SDMs. This could be a covariate module that takes a workflow object or the call that would make a workflow.
    workflow(occurrence = KnowlesiiData,
             covariate = StackedSDM({
               workflow(occurrence = MacaqueData,
                        covariate = .,
                        process = .,
                        model = .,
                        output = .)
             }),
             process = .,
             model = list(., ., .),
             output = .)
That type of thing. I believe this keeps the whole analysis reproducible, in a way that anything that takes workflow objects as arguments wouldn't.
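A minimal sketch of the idea behind 3a, in base R with simulated data rather than zoon modules: the inner model's predictions become an ordinary covariate for the outer model. All variable names and effect sizes here are made up for illustration.

```r
# Hypothetical illustration only: base R, simulated data, no zoon calls.
set.seed(7)
n <- 300
env <- data.frame(forest = runif(n), elevation = rnorm(n))

# Simulated reservoir host presence, driven by forest cover
macaque <- rbinom(n, 1, plogis(-2 + 4 * env$forest))
# Simulated parasite presence, driven by host presence and elevation
knowlesi <- rbinom(n, 1, plogis(-1 + 2 * macaque - 0.5 * env$elevation))

# Inner "workflow": model the host and predict host suitability
host_model <- glm(macaque ~ forest, family = binomial, data = env)
env$host_suitability <- predict(host_model, env, type = "response")

# Outer "workflow": host suitability enters as an ordinary covariate
outer_model <- glm(knowlesi ~ host_suitability + elevation,
                   family = binomial, data = env)
```

Because the inner model is fitted inside the same script, rerunning it reproduces the covariate as well as the final model, which is the reproducibility argument made above.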
3b) Ensemble.
I think 99% of use cases could be covered with a fairly simple model module (in fact it's on my to-do list).
Just a model module that allows for multiple models from `MachineLearn`, `MaxEnt` and `Biomod` with a few reasonable options for the top level learner would be pretty comprehensive. My aim was to add weighted average, elasticnet, perhaps a further call to caret (giving loads of options) and INLA (i.e. Sam's "stacked generalisation") as top level learners. Given the building blocks that are/will be in place, this isn't a massive amount of effort.
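A rough sketch of two of those top level learners, in base R on simulated data. The component models are plain GLMs standing in for the real model modules, and the weights are arbitrary. The stacker here is fitted to in-sample predictions to keep things short; a real implementation would presumably use out-of-fold predictions.

```r
# Hypothetical illustration only: base R, simulated data, no zoon calls.
set.seed(1)
n <- 200
dat <- data.frame(temp = rnorm(n), rain = rnorm(n))
dat$presence <- rbinom(n, 1, plogis(1.5 * dat$temp - dat$rain))

# Two simple component models standing in for MaxEnt, MachineLearn, etc.
m1 <- glm(presence ~ temp, family = binomial, data = dat)
m2 <- glm(presence ~ rain, family = binomial, data = dat)

# Component predictions on the probability scale
preds <- data.frame(p1 = unname(predict(m1, dat, type = "response")),
                    p2 = unname(predict(m2, dat, type = "response")))

# Top level learner 1: weighted average with user-supplied weights
weights <- c(0.7, 0.3)
weighted_avg <- as.numeric(as.matrix(preds) %*% weights / sum(weights))

# Top level learner 2: a stacker, here a logistic regression fitted to the
# component predictions
stack_dat <- cbind(preds, presence = dat$presence)
stacker <- glm(presence ~ p1 + p2, family = binomial, data = stack_dat)
stacked <- unname(predict(stacker, preds, type = "response"))
```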
4) The other thing that came up at the workshop, for example, is bootstrapping. I believe `replicate` + `modify` would handle this.
3a) By stacked SDM I mean running presence-absence SDMs for lots of different species, then summing the probabilities of presence to make predictions of species richness. I.e. like an ensemble, but for different species. So slightly different to your example & I'm not sure how it could be made to work with the existing zoon setup.
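A minimal sketch of that summing step, in base R with simulated data. The species names, covariates and GLMs are illustrative assumptions, not zoon modules.

```r
# Hypothetical illustration only: base R, simulated data, no zoon calls.
set.seed(42)
n_sites <- 50
sites <- data.frame(temp = rnorm(n_sites), rain = rnorm(n_sites))

# Simulated presence-absence for three made-up species
species <- c("sp_a", "sp_b", "sp_c")
slopes <- c(sp_a = 1, sp_b = -1, sp_c = 0.5)
pa <- sapply(species, function(sp) {
  rbinom(n_sites, 1, plogis(slopes[[sp]] * sites$temp))
})

# One presence-absence model per species
models <- lapply(species, function(sp) {
  glm(y ~ temp + rain, family = binomial, data = cbind(sites, y = pa[, sp]))
})

# Probability of presence per site and species (sites x species matrix)
probs <- sapply(models, predict, newdata = sites, type = "response")

# Stacked prediction: expected richness is the sum of the probabilities
richness <- rowSums(probs)
```

Each site's predicted richness lies between 0 and the number of species modelled.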
3b) Yeah, an option for the ensemble would be a type of meta module (#176) that allows the user to provide a bunch of model modules, like this:
    model = ensemble(models = list(MaxEnt,
                                   LogisticRegression,
                                   MachineLearn("something"),
                                   MachineLearn("something_else")),
                     weighting = "AUC")
(where `weighting` could also be a stacker model like you would use, but which isn't common in SDM). Is that what you meant, or were you thinking of hard-coding the models, like the existing `BiomodModel` module?
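To make the `weighting = "AUC"` option concrete, here is a small base R sketch of AUC-weighted averaging. The helper names `auc` and `ensemble_predict` are made up for this example, and the AUC is computed on the same observations used for the demo; in practice it would presumably come from held-out data.

```r
# Hypothetical illustration only: base R, no zoon calls.

# AUC via the rank-sum (Mann-Whitney) formulation
auc <- function(obs, pred) {
  r <- rank(pred)
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(r[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Weight each column of predictions by its share of the total AUC
ensemble_predict <- function(pred_matrix, obs) {
  aucs <- apply(pred_matrix, 2, auc, obs = obs)
  w <- aucs / sum(aucs)
  list(weights = w, prediction = as.numeric(pred_matrix %*% w))
}

# Toy data: one model that ranks sites well, one that does not
obs  <- c(0, 0, 1, 1, 0, 1)
good <- c(0.1, 0.2, 0.8, 0.9, 0.3, 0.7)
poor <- c(0.5, 0.6, 0.4, 0.7, 0.3, 0.5)
ens <- ensemble_predict(cbind(good = good, poor = poor), obs)
```

The better-ranking model gets the larger weight, and the weights sum to one.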
One downside of this is that it would be hard to investigate each of the component models, by passing them to output. I don't think zoon would record the module versions either. Though neither of these are critical.
P.S. no one I know is modelling the new malaria. Could re-use Freya and Catherine's knowlesi reservoir maps though?
There's interest in being able to do ensemble SDMs and stacked SDMs in zoon. We've also run into some awkwardness with thresholding and MESS masks, which need to be applied to rasters either before or after modelling.
In the past we've briefly thought about changing the core setup to enable things like ensemble models, but haven't settled on a way of integrating it into zoon's interoperable module types. We've just had a little brainstorming session here, and come up with something that might work well within what zoon already does. I'd be keen to hear your thoughts.
We could add an additional module type, `modify` (name up for discussion), between the `model` and `output` steps. `modify` would take as input a list of `ZoonModel` objects (returned by one or more `model` modules) and return a list of `ZoonModel` objects, of the same or different length. The `ZoonModel` objects would then be pulled out of the list and passed to the `output` modules.

In the default case (i.e. a 'noModify' `modify` module could be used by default, for backwards compatibility), the input and output lists would be the same, so the workflow would run as it currently does. E.g.:

(three outputs, one per model; `modify` has 'noModify' as a default argument so need not be specified)

If the user provided a `modify` module like 'threshold', that module would return a list of `ZoonModel` objects, with prediction methods modified to predict 1 above the threshold or 0 below. This could be handled by nesting one `ZoonModel` inside another, or by adding a new decorator function. These could be chained to do multiple things. E.g.:

(three outputs, one per model, with predictions set to 0 or 1 and clamped to the extreme values of the observed data)

If the user provided a `modify` module like 'ensemble', that module would return a list of only one `ZoonModel` object, making predictions from the ensemble. E.g.:

(one output, for an ensemble model making averaged predictions)

Similarly, a 'stack' `modify` module would return a list with a single `ZoonModel` object to predict the number of species (like an abundance model). Users could list `modify` modules if they wanted, to return both the original models and the ensemble models:

This would take a little work, but not too much. It would also make zoon a more attractive prospect to the ensemblers and richness modellers.
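A toy sketch of the list-in, list-out contract described above, in base R. Nothing here is zoon code: `make_model` is a stand-in for a `ZoonModel`, and `threshold`, `ensemble_mean`, `stack_models` and `run_modify` are hypothetical names chosen for the example.

```r
# Hypothetical illustration only: base R, no zoon calls.

# Toy stand-in for a ZoonModel: a name plus a predict function
make_model <- function(name, slope) {
  force(slope)
  list(name = name, predict = function(newdata) plogis(slope * newdata$temp))
}

# modify modules: each takes a list of models and returns a list of models
noModify <- function(models) models

threshold <- function(cutoff = 0.5) {
  force(cutoff)
  function(models) lapply(models, function(m) {
    # Nest the original model inside a new one with a wrapped predict
    list(name = paste0(m$name, "_thresholded"),
         predict = function(newdata) as.numeric(m$predict(newdata) > cutoff))
  })
}

ensemble_mean <- function(models) {
  list(list(name = "ensemble", predict = function(newdata) {
    preds <- lapply(models, function(m) m$predict(newdata))
    Reduce(`+`, preds) / length(preds)  # average across models
  }))
}

stack_models <- function(models) {
  list(list(name = "stack", predict = function(newdata) {
    # Sum of per-species probabilities, i.e. expected richness
    Reduce(`+`, lapply(models, function(m) m$predict(newdata)))
  }))
}

# Apply modify modules in order (chaining)
run_modify <- function(models, ...) {
  for (f in list(...)) models <- f(models)
  models
}

models <- list(make_model("a", 1), make_model("b", 2), make_model("c", -1))
newdata <- data.frame(temp = c(-1, 0, 1))

unchanged   <- run_modify(models, noModify)             # three models
thresholded <- run_modify(models, threshold(0.5))       # three 0/1 models
averaged    <- run_modify(models, ensemble_mean)        # one model
both        <- c(noModify(models), ensemble_mean(models))  # originals + ensemble
```

Wrapping `predict` in a closure is one way to get the "nesting one model inside another" behaviour; a decorator function would be an equally plausible design.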
What do you reckon?