weecology / LDATS

Latent Dirichlet Allocation coupled with Bayesian Time Series analyses
https://weecology.github.io/LDATS

Data API design questions #119

Closed ethanwhite closed 5 years ago

ethanwhite commented 5 years ago

The package currently consumes two data objects:

1) a document term table that is cross-tab, where the time component is implied by position
2) a document covariate table that is long and contains both an explicit time-series component and any associated covariates

Assembling these two objects for commonly formatted data would require some potentially fragile work and so I'm wondering if it's worth having a conversation about the data API at least for the top-level LDA_TS function.

I'm envisioning most users having long data in the general form of (year, species, count) and (year, covariate_value), often with a site variable in both tables that could, in concept, be used for grouping. Using these with the current API would require cross-tabbing the first table, and if the two tables aren't sorted the same way for some reason, this will produce the wrong answer (hence my concern about this being a bit fragile).
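To make the sort-order concern concrete, here is a minimal base R sketch. The column names (year, species, count, temp) and the toy values are assumptions for illustration only and are not part of LDATS. The cross-tabbed rows come out ordered by year, so the covariate table has to be aligned by key rather than by position.

```r
# Toy long-format tables (hypothetical column names and values)
long_counts <- data.frame(
  year    = c(2002, 2002, 2001, 2001, 2003),
  species = c("A", "B", "A", "B", "A"),
  count   = c(5, 1, 3, 2, 4)
)
covars <- data.frame(year = c(2003, 2001, 2002),
                     temp = c(14.2, 12.5, 13.1))

# Cross-tab: rows are documents (years), columns are terms (species);
# combinations that never occur are filled with 0
doc_term <- as.data.frame.matrix(
  xtabs(count ~ year + species, data = long_counts)
)

# Align the covariates to the document order by key, not by position
row_order <- match(rownames(doc_term), as.character(covars$year))
covars_aligned <- covars[row_order, ]

# Guard against a silent mismatch between the two tables
stopifnot(identical(as.character(covars_aligned$year), rownames(doc_term)))
```

Passing `covars` as-is alongside `doc_term` would pair the 2001 counts with the 2003 covariates, which is the failure mode described above.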

My first thought was that the data should be long in both cases. I can see why this wasn't the initial implementation because it comes with its own set of issues, specifically that using long data would require passing the names of the "words" and "documents" columns so that the LDA step understands what it is supposed to work with. That said, I think this is more robust than assuming that the rows are the documents and the columns are the words since it's easy enough to mess up the cross-tabbing and get the components switched around. Given that this package is explicitly temporal, we could also use the opportunity to codify that in the API and outputs by passing the timename directly rather than via control = TS_controls_list(timename = "time_column").

I definitely don't fully understand the under-the-hood stuff that might make this change more complicated and think this is probably a pretty in-depth discussion, so I'd be happy to set up some time with whoever is interested to talk through the optimal API design (which could end up being what's already here).

ethanwhite commented 5 years ago

A change in the direction described above would also allow time-series only analyses without the need to create a covariates table. So, if the dataset is just an aggregation to time-series and counts (a common format), the current API requires splitting this into two tables and then cross-tabbing one of them. The alternative design would just consume the table as is and use the timename column (and potentially permutations of it) for modeling.

E.g.,

r_LDATS <- LDA_TS(data, time_col = newmoon, word_col = abundance,
                  topics = 2:5, nseeds = 2, changepoints = 1,
                  formulas = ~time + sin(time) + cos(time))

diazrenata commented 5 years ago

After chatting with @juniperlsimonis last week, I thought I'd add my thoughts re: the LDATS data format in the notes here. I don't have a well-informed idea of what's most robust or best-practice, but I've been interacting with LDATS as a user for a while now.

From my perspective, the main thing is that we've adopted the list(abundance, metadata, covariates) format for MATSS. If there is a major change to this format, we'll want to either re-organize MATSS to match it or keep LDATS support for the current format so that MATSS keeps working. So it would be nice to know if there are going to be changes sooner rather than later.

The other thing is that capacity for taking a table of multiple communities and subsetting them within LDATS might be very nice, but I'm not sure if it makes sense for that subsetting and re-bundling to happen within LDATS as opposed to on the user's end.

ethanwhite commented 5 years ago

I'd be happy to set up some time to chat through the API with anyone who's interested and see if we can come up with the best solution. Since neither MATSS nor LDATS is at 1.0 or being used outside the group, now is definitely the time to figure out what the core API should look like.

I'm fine with LDATS only handling one community at a time, but we want to make sure that it's easy to do that with the most common data formats. That said, we may also want to include a convenience wrapper function at some point that handles the multi-community version to make it as easy as possible for end users.

juniperlsimonis commented 5 years ago

I'd definitely be game to chat through data options.

ethanwhite commented 5 years ago

If LDA_TS() consumed long data, this would be an option for the multi-site situation:

library(dplyr)  # for %>%, filter(), group_by()

ts_data = read.csv('ts_data.csv')
covar_data = read.csv('covar_data.csv')

# Proposed long-data API, not the current LDA_TS() signature
ts_data %>%
  filter(species != 'bob') %>%
  group_by(site) %>%
  LDA_TS(time_col = newmoon, word_col = abundance,
         covariates = covar_data)

ethanwhite commented 5 years ago

If we consume wide data, this would be an option for the multi-site situation:

multi_ts_data = read.csv("multi_ts_data.csv")
covar_data = read.csv("covar_data.csv")

prep_long_ts_data(multi_ts_data, covar_data = NULL,
                  time_col = time, word_col = abundance,
                  group_col = site, term_col = species) %>%
  purrr::pmap(LDA_TS)

And group_col could be set to NULL for single site data preparation from long data.
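As a rough sketch of what such a preparation helper might do, the base R function below splits a long table by an optional grouping column and cross-tabs each piece. The function name `crosstab_by_group`, its arguments, and the toy data are all hypothetical and are not part of LDATS; it only illustrates the reshaping step, not the hand-off to LDA_TS().

```r
# Hypothetical helper (not part of LDATS): split a long table by an optional
# grouping column and cross-tab each piece into a document-term table.
crosstab_by_group <- function(long, time_col, term_col, count_col,
                              group_col = NULL) {
  crosstab <- function(d) {
    tab <- tapply(d[[count_col]], list(d[[time_col]], d[[term_col]]), sum)
    tab[is.na(tab)] <- 0  # time/term combinations that were never observed
    as.data.frame(tab)
  }
  if (is.null(group_col)) {
    # Single-site case: one table, returned in a list for a uniform interface
    return(list(all = crosstab(long)))
  }
  # Multi-site case: one document-term table per group
  lapply(split(long, long[[group_col]]), crosstab)
}

# Toy long-format data with two sites (hypothetical values)
long <- data.frame(
  site    = c("s1", "s1", "s1", "s2", "s2"),
  time    = c(1, 1, 2, 1, 2),
  species = c("A", "B", "A", "A", "B"),
  count   = c(3, 2, 5, 7, 1)
)

tables <- crosstab_by_group(long, "time", "species", "count",
                            group_col = "site")
single <- crosstab_by_group(long, "time", "species", "count")
```

Each element of `tables` is a wide table with one row per time step, so the list could then be mapped over one site at a time. Note that with character (non-factor) term columns, a site only gets columns for the terms it actually contains.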

juniperlsimonis commented 5 years ago

currently, the API is

LDA_TS(doc_term_tab, doc_covar_tab, topics, nseeds, formulas, nchangepoints, weights, control)

inputs topics, nseeds, formulas, nchangepoints allow for expansion

juniperlsimonis commented 5 years ago

inputs in vegan tend to be like this

cca(dune ~ A1 + Management, data = dune.env)

where the two data sets are managed as [1] response variable in the formula for the term table and [2] data = for any covariates

ha0ye commented 5 years ago

Abundance input as long-form multi-site data, reshaped into wide-form for LDA_TS: Here, group_split creates a list of data.frames (1 for each value of site).

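library(dplyr)  # for %>%, filter(), group_split()
library(tidyr)  # for spread()

ts_data = read.csv('ts_data.csv')
covar_data = read.csv('covar_data.csv')

ts_data %>%
  filter(species != 'bob') %>%
  spread(species, abundance) %>%   # long to wide: one column per species
  group_split(site) %>%            # one data.frame per site
  purrr::map(LDA_TS, covariates = covar_data)

ha0ye commented 5 years ago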

Abundance and covariates for a single-site, cross-tab format as a combined data structure in MATSS format:

ts_data <- read.csv("ts_data.csv")
covar_data <- read.csv("covar_data.csv")

dat <- list(abundance = ts_data,
            covariates = covar_data)

LDA_TS(dat) # or maybe LDA_TS(data = dat)

juniperlsimonis commented 5 years ago

take home from the conversation: the API for LDA_TS is likely going to stay the same (or very similar*), but we need to provide helper functions to facilitate translation from common data structures (esp long tables) to the format used here

*the two data structures might be combined into a single object/list

juniperlsimonis commented 5 years ago

The currently drafted v0.2.0 now has the API for LDA_TS() taking a single data argument, which can presently be either just a document-term data table (matrix or data frame) or a list that includes at least a document-term data table and possibly also a covariate table (a la MATSS, although note that the term data are selected using a regexp that assumes the letters "term" are in the name of the list element identifying the document-term table). If no covariate table is included, the function assumes that the data are equally spaced in time. The way this is set up now will also allow us to expand the processing of the data argument in the future so that we can differentiate between different types of data tables.

For now, I'm going to close this issue and then create a new one focused on creating the data helper functions, as I think that's what remains from here and I don't want it to get lost. Let me know if this issue should stay open or if there are other things that need their own new issues.
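For readers trying to picture the single data argument, here is a minimal base R sketch of that kind of handling. It is not the LDATS implementation: the function name `unpack_data`, the returned element names, and the "covar" pattern used to find the covariate table are assumptions; only the "term" pattern and the equal-spacing fallback come from the description above.

```r
# Hypothetical sketch (not the LDATS code) of unpacking a single data argument
unpack_data <- function(data) {
  if (is.matrix(data) || is.data.frame(data)) {
    # A bare table is treated as the document-term table
    terms <- data
    covars <- NULL
  } else if (is.list(data)) {
    # The term table is the list element with "term" in its name
    term_idx <- grep("term", names(data))
    if (length(term_idx) != 1) {
      stop("expected exactly one list element with 'term' in its name")
    }
    # Assumption: the covariate table is found by a "covar" pattern
    covar_idx <- grep("covar", names(data))
    terms <- data[[term_idx]]
    covars <- if (length(covar_idx) > 0) data[[covar_idx[1]]] else NULL
  } else {
    stop("data must be a matrix, data frame, or list")
  }
  if (is.null(covars)) {
    # No covariate table: assume documents are equally spaced in time
    covars <- data.frame(time = seq_len(nrow(terms)))
  }
  list(document_term_table = terms, document_covariate_table = covars)
}

# Toy document-term table (hypothetical values)
dtt <- matrix(c(3, 2, 5, 0, 4, 1), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("A", "B")))

from_table <- unpack_data(dtt)
from_list <- unpack_data(list(
  document_term_table = dtt,
  document_covariate_table = data.frame(time = c(1, 3, 4))
))
```

Under this sketch, a MATSS-style list whose count table is named `abundance` would fail the "term" check, which is the naming caveat noted above.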