wjchulme / OSWGmcr-MAPS-collaboration


Statistical modelling approach #2

Open · wjchulme opened 5 years ago

wjchulme commented 5 years ago

Firstly, Bayesian – yes or no? It might be a good learning opportunity for people with limited experience of Bayesian inference, including me. But I don't have a strong feeling about this either way and I'm happy to go with the majority view.

As for the actual model, an odds ratio is required for the final output so the outcome variable is necessarily binary - depression at 18, yes/no. Logistic regression is the obvious candidate but it’s possible to recover an odds ratio from any model that can (be coerced to) provide the probability of the outcome with/without the exposure - including models with non-binary outcomes.
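To make that concrete, here's a minimal sketch of recovering a marginal odds ratio from predicted probabilities (dep_18, screen_time, and dat are placeholder names, not the real MAPS variables):

```r
# Hypothetical sketch: any model yielding P(outcome) can give an odds ratio.
# dep_18 (binary outcome) and screen_time (binary exposure) are placeholders.
fit <- glm(dep_18 ~ screen_time, family = binomial, data = dat)

# Predicted probability of depression for everyone, with and without exposure
p1 <- predict(fit, newdata = transform(dat, screen_time = 1), type = "response")
p0 <- predict(fit, newdata = transform(dat, screen_time = 0), type = "response")

# Marginal (population-averaged) odds ratio
odds <- function(p) p / (1 - p)
or_marginal <- odds(mean(p1)) / odds(mean(p0))
```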

So it will depend on how depression is represented in the dataset, which we won't know until we get access. There's a variable for a clinical diagnosis of depression at age 18, has_dep_diag, but also some other variables relating to depressive symptoms at 18.

I can see where the path of least resistance will take us, but if anyone has a desire to do the non-obvious thing then let's hear it! Either way, it would be helpful to get a consensus on these issues as early as possible.

jspickering commented 5 years ago

I'm pretty keen on Bayesian. I've done it before but I'm no expert!

ajstewartlang commented 5 years ago

I have very limited experience with Bayesian approaches - when I've tried Bayesian models in the past (and really that's been limited to tinkering with the brms package in R) I've often worried about where to get my priors from.

jspickering commented 5 years ago

I may be wrong, but I think @OliJimbo is a pretty good Bayesian

wjchulme commented 5 years ago

Appropriate choice of priors is obviously the biggie for Bayes, but this shouldn't put us off - there are 13,734 observations, which is a fairly healthy sample size, so even modest effect sizes will probably dominate an agreed set of sceptical priors.

What might put us off is that the submission guidelines say "if you're using jags/bugs or stan to run a Bayesian analysis, please contact us". I presume this is because we need to submit 'analysis chunks' (discrete functions that load data, transform data, deal with missing values, etc.) which might not be particularly compatible with a Stan workflow.
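For illustration, my guess at what an 'analysis chunk' might look like (the function name and the "Yes" coding of has_dep_diag are assumptions on my part):

```r
# Hypothetical 'analysis chunk': a self-contained function that takes the
# data frame and returns it transformed, so chunks can be chained together.
# The "Yes" coding of has_dep_diag is an assumption.
recode_outcome <- function(df) {
  df$has_dep_diag <- df$has_dep_diag == "Yes"
  df
}
```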

Can I suggest one or more of @ajstewartlang, @jspickering, and @OliJimbo (or anyone else familiar with brms/Stan) look into this and give some opinionated feedback on the pros/cons?

ajstewartlang commented 5 years ago

I'm just reading through the instructions now - shall we just use the brms package for any Bayesian stuff we want to do? That seems like the most straightforward option as it will allow the Bayesian chunk to sit within our final R script.

ajstewartlang commented 5 years ago

It looks like our outcome measure is dep_score (Child's depression score on CIS-R), which is an ordinal variable with possible values of NA, 0, 1, 2, 3, 4. I'm assuming an NA here means the child wasn't measured (i.e., a genuinely missing data point). Would a cumulative link mixed model be a good starting point to explore how our predictors predict this dep_score outcome?
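Perhaps something like this sketch with the ordinal package (screen_time and sex are placeholder predictors until we see the real data):

```r
library(ordinal)

# dep_score as an ordered factor (values 0-4, NAs preserved);
# screen_time and sex are hypothetical predictor names.
dat$dep_score <- factor(dat$dep_score, levels = 0:4, ordered = TRUE)

# Cumulative link model; clmm() would add random effects if we later find
# a grouping structure worth exploiting.
fit_ord <- clm(dep_score ~ screen_time + sex, data = dat, link = "logit")
summary(fit_ord)
```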

ajstewartlang commented 5 years ago

Or I guess a straight cumulative link model (i.e., no mixed part) would work too - @wjchulme ?

wjchulme commented 5 years ago

Thanks, Andrew.

I doubt we can gain much from a mixed model (with a random effect component) since from what I can tell there's no data on clusters/hierarchies to exploit.

I've just had a chance to look through the data this afternoon. As you say, the primary outcome variable is ordinal, so it makes sense to use ordinal models. In my experience, the choice of link function is both important (it materially affects the results) and arbitrary (you can rarely make a case for one function over another based on some underlying theory of the process at play). I'll leave it to the psychologists amongst us to decide if one makes more sense than another! Failing that, I'm sure a Normal distribution (probit link) is as good as any.

We've been asked to produce odds ratios to describe the computer use-depression relationship - as I've said above, these can be recovered from an ordinal model for a given yes/no definition of depression.
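As a rough sketch, continuing the cumulative link example above and assuming a hypothetical cutoff of dep_score >= 3 counting as 'depressed':

```r
# Drop the response column so predict() returns the full matrix of
# category probabilities rather than P(observed category)
nd <- dat[setdiff(names(dat), "dep_score")]

p_cat1 <- predict(fit_ord, newdata = transform(nd, screen_time = 1), type = "prob")$fit
p_cat0 <- predict(fit_ord, newdata = transform(nd, screen_time = 0), type = "prob")$fit

# P(depressed) under the hypothetical cutoff dep_score >= 3
p1 <- rowSums(p_cat1[, c("3", "4")])
p0 <- rowSums(p_cat0[, c("3", "4")])

odds <- function(p) p / (1 - p)
or_from_ordinal <- odds(mean(p1)) / odds(mean(p0))
```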

However, there's an interesting detail in the data dictionary for the depression variables, which says that "the categories range from <0.1% (0.1% children in this band have depression) to >70% (>70% of children in this band have depression)". This means we could marginalise the probability of depression given each category over the probability of being in that category itself (as per an ordinal model). Then we don't have to define depression dichotomously, and instead we use the definition that matches the depression variables we've been given. EDIT: this applies only to the depband* variables, which aren't the outcome we're interested in, so this doesn't work!

In both approaches we develop an ordinal model - it's the definition of depression that differs, and with it the way we calculate our odds ratio.

This second approach is probably the most faithful to the information we have available. But let's start with the ordinal model and take it from there.

ajstewartlang commented 5 years ago

Thanks Will - an ordinal model sounds good. How do we decide which predictors/explanatory variables to add? I could see a case being made for quite a lot of those present in the dataset. Do we start with a model with just a few 'common sense' predictors?

wjchulme commented 5 years ago

Obviously there are a lot to choose from - we can restrict ourselves to variables that behave nicely (few missing values, sufficient variance, low collinearity, plausibility), but the use of DAGs here may also help us decide based on some underlying mechanisms. I'll leave this discussion for issue #4.

lanabojanic commented 5 years ago

Hello,

sorry for the late jump-in! I think we should do both frequentist and Bayesian - that said, I do have experience with Bayesian (my thesis), but unfortunately only in JASP. I agree with the use of the ordinal model as well. It might be a silly suggestion, but regarding predictors, it might be good to go with the most common ones, based on a literature review of the topic. I volunteer to do one if we agree on it.

wjchulme commented 5 years ago

If you're able to put something together @lanabojanic, that would be fantastic.

wjchulme commented 5 years ago

I've been having a play around this afternoon. I've realised that the "interesting detail" (see my comment above) only applies to the depband* variables, which aren't the outcome variables we're interested in. So all I said about marginalising over those given probabilities doesn't make any sense, and we can ignore it.

OliJimbo commented 5 years ago

I agree with the use of ordinal models, and this is easily justifiable given the recent tutorials on the subject. I'm currently working on ordinal regression for my thesis using brms - it really hates the default priors, which place too much probability on log(0) (i.e., -inf), so even a general hypothesised direction would be fine (i.e., negative or positive). We could even do a sensitivity analysis to check.
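For example, a sketch (the prior scales are illustrative only, and screen_time and sex are placeholder predictors):

```r
library(brms)

# Weakly informative / sceptical prior on the regression coefficients;
# cumulative("probit") matches the probit link Will suggested above.
# dep_score should already be an ordered factor.
fit_bayes <- brm(
  dep_score ~ screen_time + sex,
  data   = dat,
  family = cumulative("probit"),
  prior  = prior(normal(0, 1), class = "b")
)

# Sensitivity analysis: refit with a wider prior and compare posteriors
fit_wide <- update(fit_bayes, prior = prior(normal(0, 5), class = "b"))
```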

I admit that I haven't had a look at the structure yet - but I'll take a look ASAP! I also concur with using both frequentist and Bayesian methods! The dataset is so large that the false positive rate might be inflated (if using significance testing in isolation!), so some sort of Bayes factor or model-selection method would be a good idea - a sketch of what that could look like is below.
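Building on the brms fit above (note the Savage-Dickey Bayes factor needs sample_prior = "yes" in the original brm() call):

```r
# Compare models with and without the exposure via approximate
# leave-one-out cross-validation, rather than p-values alone
fit_null <- update(fit_bayes, formula. = dep_score ~ sex)

loo_compare(loo(fit_bayes), loo(fit_null))

# Bayes factor for the exposure effect via the Savage-Dickey ratio;
# requires sample_prior = "yes" when fitting fit_bayes
hypothesis(fit_bayes, "screen_time = 0")
```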

Oli