The Christensen et al. paper, and LDATS, were set up to fit many LDA models > pick the best-fitting LDA model > run many TS models > pick the best-fitting TS model. In applying this at medium scale, we've found that the LDA selection step often selects models with very large k, and that the TS models often have trouble getting a meaningful fit to so many time series simultaneously.
The `crossval`, `loo`, `loo-major`, `clean`, `lda-crossval`, and `full-likelihood` branches of this repo and, most recently, the ldats-sandbox repo are developing ways to evaluate the combined fit of an LDA and TS model to observed data. Most of these (ha) are stale. Active development is ongoing in ldats-sandbox, because that work involves changing the shape of the drake pipeline and would be a significant change to this repo.
Here is the general rationale of the approaches we've tried and where we are now.
Full likelihood calculations
The LDA model fits a matrix of the term-topic probabilities. For every term-topic combination, what's the probability of that term occurring in a sample of that topic?
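As a concrete sketch (assuming the LDA was fit via the topicmodels package, which LDATS wraps), the term-topic matrix can be pulled out of a fitted model like this:

```r
library(topicmodels)

# Small illustrative fit; the AssociatedPress data ships with topicmodels
data("AssociatedPress")
lda_fit <- LDA(AssociatedPress[1:50, ], k = 4, control = list(seed = 42))

# One row per topic, one column per term; each row sums to 1
beta <- posterior(lda_fit)$terms
```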
The TS model fits intercepts, coefficients, and changepoint locations to the time series of LDA topic proportions.
Combining the TS parameters with any covariates generates a topic-document probability matrix; what is the probability of a topic showing up in a given document?
Combining the topic-document and term-topic matrices gives a term-document multinomial distribution. We can then calculate the likelihood of the observed term-document frequencies given the term-document probabilities.
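In matrix terms: with theta as a documents × topics matrix and beta as a topics × terms matrix, theta %*% beta is a documents × terms probability matrix, and each document's observed counts can be scored against its row. A minimal sketch (theta, beta, and counts are illustrative names, not LDATS internals):

```r
# theta:  n_docs x k matrix of topic proportions from the TS model
# beta:   k x n_terms matrix of term-topic probabilities from the LDA
# counts: n_docs x n_terms matrix of observed term frequencies

loglik_term_document <- function(theta, beta, counts) {
  probs <- theta %*% beta  # n_docs x n_terms term-document probabilities
  # Each document is a multinomial draw; sum per-document log-likelihoods
  sum(vapply(seq_len(nrow(counts)), function(d) {
    dmultinom(counts[d, ], prob = probs[d, ], log = TRUE)
  }, numeric(1)))
}
```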
Because the TS models are Bayesian, each fit consists of 100s or 1000s of draws from the posterior distribution for the parameters. Each model therefore has 100s or 1000s of estimates of the likelihood corresponding to the draws.
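Sketching that repetition over draws, continuing the names above (thetas here is a hypothetical list holding one topic-proportion matrix per posterior draw):

```r
# thetas: list of n_draws topic-proportion matrices, one per posterior draw
draw_logliks <- vapply(thetas, loglik_term_document, numeric(1),
                       beta = beta, counts = counts)
# draw_logliks: one log-likelihood estimate per posterior draw
```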
We can also use the term-document multinomials to generate predicted term-document frequencies for a given set of parameters.
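For example (continuing the sketch above), predicted term frequencies for one document can be simulated from its fitted multinomial, conditioning on the observed document total:

```r
probs <- theta %*% beta
d <- 1  # any document index
predicted <- rmultinom(n = 1, size = sum(counts[d, ]), prob = probs[d, ])
```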
AICc (not using)
Branches in this repo (`clean`, others?) tried using AICc as the evaluation statistic. This continued to select models with very high k. Comparing observed abundances to predicted abundances raised concerns about overfitting, so we moved to a crossvalidation approach.
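For reference, the correction being applied was the standard small-sample AICc; a minimal sketch, where p is the number of estimated parameters (not the number of topics k) and n the number of observations:

```r
# AICc from a log-likelihood, p estimated parameters, and n observations
aicc <- function(loglik, p, n) {
  -2 * loglik + 2 * p + (2 * p * (p + 1)) / (n - p - 1)
}
```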
Initial crossvalidation
Crossvalidation in this repo tried various ways to withhold random and sequential subsets of data, including some novice decision-making by RMD. This settled on leave-one-out crossvalidation over the entire time series. These efforts are stale and probably extremely fragile.
Leave-one-out crossvalidation
Leave out one year (plus a two-year buffer on each side); train the models; test using the withheld year. Repeat with every year of the time series.
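A sketch of how those train/test splits can be built (years is an illustrative vector of sampled years; the buffer is dropped from training entirely):

```r
years <- 1990:2010  # illustrative; use the actual years in the time series

# For each test year, training excludes the test year plus a two-year
# buffer on each side; the test set is the single withheld year.
splits <- lapply(years, function(test_year) {
  list(test  = test_year,
       train = setdiff(years, seq(test_year - 2, test_year + 2)))
})
```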
This generates a suite of models, one for each training/test set, under the umbrella of every qualitative model configuration. Calculating the overall likelihood of all the models under a particular umbrella is a little tricky.
JS and RMD settled on stitching together time series: for each year, take a set of parameter draws plus that year's withheld test data to get a likelihood from the model trained with that year missing, combine those per-year values into an overall likelihood across the whole time series, and repeat this for every draw from the posterior. Because there might be covariance in the draws, we shuffle which draw we use for each year.
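A sketch of that stitching, assuming year_logliks is a hypothetical n_draws × n_years matrix of test-year log-likelihoods (each column coming from the model trained with that year withheld):

```r
n_draws <- nrow(year_logliks)
n_years <- ncol(year_logliks)

# For each stitched "draw", pick an independently shuffled posterior draw
# for every year, then sum the per-year log-likelihoods across the series.
stitched_logliks <- replicate(n_draws, {
  picks <- sample(n_draws, n_years, replace = TRUE)
  sum(year_logliks[cbind(picks, seq_len(n_years))])
})
```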
At this point, keeping track of which models from under which umbrellas belonged together became intractable. MATSS-LDATS is organized around the dataset, which in this case fractures into a separate dataset for every timestep. RMD moved to a new, smaller repo, ldats-sandbox, to continue development with a different drake structure. This is a lot more straightforward and hopefully more robust than the early efforts in this repo.
Also, LDATS was initially set up to run LDA models on a suite of n seeds, but not to let the user specify which specific seed to use. To speed up compute and maximize parallelization in drake, RMD created a patch branch off of `weecology/LDATS` to allow the user to specify a particular seed. The crossvalidation functions rely on this version of LDATS (not the CRAN release or master branch).