Keep corresponding data, lda, and ts models together

weecology / MATSS-LDATS

Macroecological LDA analysis of time series

MIT License

In order to compute the full likelihood, we need a TS model plus the data and LDA that went into it. The way drake is set up right now, I've been reconstructing these relationships by parsing the names of individual TS models. This gets trickier if we want to add two steps to the pipeline: 1) calculating the full likelihood and 2) generating data-prediction comparisons for document-term abundances. I think we do want both in the pipeline, because it's a lot of heavy lifting for a .Rmd.

The most straightforward way I see to do this is to wrap each object in a list of two elements: the object itself and a list of the objects upstream of it. So the output of run_LDA, for example, would be list(lda = [LDA model set], upstream = list(data = [data])), and the output of run_TS would be list(ts = [TS model set], upstream = list(data = [data], lda = [LDA model set])).
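A minimal sketch of that wrapping, in base R. The helper name `wrap_with_upstream()` and the placeholder objects are hypothetical stand-ins for illustration; this is not the actual run_LDA / run_TS code.

```r
# Hypothetical helper: bundle an object with the objects upstream of it.
wrap_with_upstream <- function(object, name, upstream = list()) {
  out <- list(object, upstream)
  names(out) <- c(name, "upstream")
  out
}

# Placeholder stand-ins (assumptions, not real LDATS output).
data_obj <- list(abundance = matrix(1:6, nrow = 3))
lda_set  <- list(k2_seed2 = "placeholder LDA model")
ts_set   <- list(model1 = "placeholder TS model")

# What run_LDA would return under this scheme.
lda_out <- wrap_with_upstream(lda_set, "lda", upstream = list(data = data_obj))

# What run_TS would return: carry forward the LDA's upstream, plus the LDA.
ts_out <- wrap_with_upstream(
  ts_set, "ts",
  upstream = c(lda_out$upstream, list(lda = lda_out$lda))
)

names(ts_out)           # "ts" "upstream"
names(ts_out$upstream)  # "data" "lda"
```

With this shape, anything downstream of a TS model set can reach its data and LDA via `ts_out$upstream` rather than by parsing target names.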

I'll try this out in a branch...

(If size became an issue the second list could be a list of names, and then we'd do some rlang stuff on it, but I think sticking an LDA_set + an empirical dataset to a TS model set won't be a big deal).
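For comparison, a rough sketch of the names-only variant. It uses base R `get()` on an ordinary environment rather than rlang or the drake cache, purely to keep the example self-contained; `store`, `resolve_upstream()`, and the target names are hypothetical.

```r
# An environment standing in for wherever the upstream objects live
# (an assumption; in the real pipeline this would be the drake cache).
store <- new.env()
assign("data_portal", list(abundance = matrix(1:6, nrow = 3)), envir = store)
assign("lda_portal", "placeholder LDA model set", envir = store)

# The TS output keeps only the names of its upstream objects.
ts_out <- list(
  ts = "placeholder TS model set",
  upstream = list(data = "data_portal", lda = "lda_portal")
)

# Hypothetical resolver: swap each stored name for the object it refers to.
resolve_upstream <- function(wrapped, envir) {
  lapply(wrapped$upstream, function(nm) {
    get(nm, envir = envir, inherits = FALSE)
  })
}

resolved <- resolve_upstream(ts_out, store)
names(resolved)  # "data" "lda"
```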

🙃 When scaling up to ~50 seeds, ~10 options for n topics, and 1000 iterations (which is not that many), this approach starts to break down for the _results objects and produces a cache that is slower to copy than I'd like. As of f302d1d I've changed to:

- Compute the mean and median AICc for all TS models (over all iterations) and return those instead of all nit theta matrices.
- Work mostly from a model_info data frame holding the names and indices of associated objects, instead of the actual objects.
- Do all predictions post hoc; don't run them by default as part of the pipeline. The function I currently have for this can work two ways: 1) it loads the relevant lda, ts model, and data objects from the cache, meaning you can only call it from within an environment with the cache location already set up, or 2) you give it the ts model, data, and lda as arguments. You can also tell it to run predictions for a specific sim, or set the seed and get a random one.
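As a rough illustration of the first two points, here is a base R sketch of a model_info-style data frame that stores names plus summary AICc values rather than the model objects themselves. The column names, target names, and AICc values are all made up for the example; the real model_info layout may differ.

```r
# Hypothetical helper: collapse per-iteration AICc values to two summaries.
summarize_aicc <- function(aicc_by_iteration) {
  data.frame(
    mean_aicc   = mean(aicc_by_iteration),
    median_aicc = median(aicc_by_iteration)
  )
}

# Made-up per-iteration AICc values for two TS models.
aicc_draws <- list(
  ts_a = c(100, 102, 110),
  ts_b = c(95, 96, 130)
)

# Names of the associated objects, not the objects themselves.
model_info <- data.frame(
  ts_name   = names(aicc_draws),
  lda_name  = c("lda_k2_seed2", "lda_k3_seed4"),
  data_name = c("data_portal", "data_portal"),
  stringsAsFactors = FALSE
)

# One summary row per TS model, bound onto the name columns.
model_info <- cbind(
  model_info,
  do.call(rbind, lapply(aicc_draws, summarize_aicc))
)
rownames(model_info) <- NULL

model_info
```

Each row stays small no matter how many iterations were run, which is the point of returning summaries instead of every theta matrix.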

Prediction plots are now limited to the 25 most abundant species if the number of species exceeds 25. This is pretty flexible.
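One way the species cap could be sketched in base R, assuming "most abundant" means the largest total abundance summed over all time steps and that species are the columns of the abundance matrix. `limit_to_top_species()` is a hypothetical name, not the function in the repo.

```r
# Hypothetical helper: keep at most `max_species` columns, chosen by
# total abundance, preserving the original column order.
limit_to_top_species <- function(abundance, max_species = 25) {
  if (ncol(abundance) <= max_species) {
    return(abundance)
  }
  totals <- colSums(abundance)
  keep <- order(totals, decreasing = TRUE)[seq_len(max_species)]
  abundance[, sort(keep), drop = FALSE]
}

# Simulated abundances for 30 species over 10 time steps.
set.seed(42)
abund <- matrix(
  rpois(10 * 30, lambda = 5),
  nrow = 10,
  dimnames = list(NULL, paste0("sp", 1:30))
)

ncol(limit_to_top_species(abund))        # 25
ncol(limit_to_top_species(abund[, 1:5])) # 5
```

Exposing the cap as an argument (`max_species`) keeps the limit adjustable per plot.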

Keep corresponding data, lda, and ts models together #40