wjchulme / OSWGmcr-MAPS-collaboration

Confounders #4

Open wjchulme opened 5 years ago

wjchulme commented 5 years ago

This is where it gets spicy. The wonderful thing about synthetic data is that you can’t p-hack - you have to specify your study design based on the data schema, and not the data values, since you can’t rely on the results.

An obvious important confounder is the pre-existence of depression at or before age 16. We have variables dep_band_10, dep_band_13, and dep_band_15, which are caregiver-reported (e.g. by mum or dad) depression of the child at ages 10, 13, and 15. We should consider at least dep_band_15.

After that, it’s a free-for-all. We can’t use everything, but we want to be sure we’re adjusting for relevant confounders. We can’t (and shouldn’t) use univariate associations or step-wise variable selection, etc. Causal inference would be helpful here – see https://doi.org/10.1007/s10654-019-00494-6

A first step might be for somebody to draw a DAG representing beliefs about causal relationships. We can reason about confounders from that starting point.

jspickering commented 5 years ago

What does the acronym "DAG" stand for? I'm sure it'll be obvious when you say it.

Thanks for the link to the paper btw, very useful.

wjchulme commented 5 years ago

I should've been explicit - a DAG is a Directed Acyclic Graph

They are widely used in the causal inference literature to describe causal relationships between different variables.

It's been argued (somewhere!) that DAGs should be used more often in observational studies (such as MAPS) where we want to control for confounders of the y~x relationship in a more principled way. "More principled" = not throwing in any potential confounder z such that y~z and x~z, and not using stepwise variable selection.

Here's a nice overview: https://doi.org/10.1093/ije/dyw341
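
To make the idea concrete, here is a minimal sketch of how a DAG could be written down and queried for common causes of the exposure and the outcome. Everything in it is an assumption for illustration: only dep_band_15 is a variable named in this thread, while `computer_use`, `depression_outcome`, `parental_income`, and `sleep` are hypothetical placeholders, and the arrows are not a claim about the real causal structure. The check is also a simplification (common ancestors after cutting the exposure's outgoing arrows), not the full back-door criterion.

```python
# Hypothetical DAG: each key is a cause, each list holds its direct effects.
# Only dep_band_15 comes from the dataset; the other names are placeholders.
dag = {
    "dep_band_15": ["computer_use", "depression_outcome"],
    "parental_income": ["computer_use", "depression_outcome"],
    "computer_use": ["sleep", "depression_outcome"],
    "sleep": ["depression_outcome"],
    "depression_outcome": [],
}


def descendants(graph, node):
    """Return every node reachable from `node` by following arrows."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def common_causes(graph, exposure, outcome):
    """List nodes that cause the exposure and also cause the outcome
    by a route that does not pass through the exposure."""
    # Cut the exposure's outgoing arrows so that effects transmitted
    # through the exposure do not count as a separate route to the outcome.
    pruned = {
        node: ([] if node == exposure else list(children))
        for node, children in graph.items()
    }
    found = set()
    for node in graph:
        if node in (exposure, outcome):
            continue
        causes_exposure = exposure in descendants(graph, node)
        causes_outcome = outcome in descendants(pruned, node)
        if causes_exposure and causes_outcome:
            found.add(node)
    return sorted(found)


candidates = common_causes(dag, "computer_use", "depression_outcome")
print(candidates)
```

In this toy graph `sleep` is not flagged, because it sits on the path from the exposure to the outcome (a mediator) rather than causing both. That is the kind of distinction that y~z and x~z associations alone cannot make.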

jspickering commented 5 years ago

Sorry - I thought I'd responded to this! I'll look into that a bit more.

I think year of birth might also be an important confounder. Children of the 90s presumably covers the whole span of 1990-1999, but we might have to check. I was born in 1991, and my computer usage drastically changed with the times as things like Neopets, MySpace, Bebo, Facebook, Twitter cropped up. I didn't have a smartphone at 16, but probably a lot of kids born in 1999 did. Do we know if "computer" usage covers laptops/PCs only?

I'm just thinking out loud here really! See my comment below: the cohort only includes children born between 1990 and 1992.

wjchulme commented 5 years ago

This is a really interesting point. Year of birth as a proxy for accessibility/popularity of computers. I don't think this is available in the dataset though?

jspickering commented 5 years ago

I have no idea because the public data dictionary has been deleted and I've not yet been approved for the collaborator sections of the OSF page! Is there any other variable that can act as a proxy?

jspickering commented 5 years ago

Never mind, it looks as though they only recruited children born between 1990 and 1992, so birth year is unlikely to be an issue: http://www.bristol.ac.uk/alspac/researchers/cohort-profile/

ajstewartlang commented 5 years ago

Given the large number of variables (84) in the dataset, should we consider something like specification curve analysis? I came across it recently in Amy Orben's paper:

https://www.nature.com/articles/s41562-018-0506-1

It looks like it can put the magnitude of effects in context and seems quite appropriate for such a large dataset.

wjchulme commented 5 years ago

Hi Andrew. I think the SCA approach is pretty much what the MAPS project is trying to do as a whole. And anyway we don't really have the time to design and run multiple analyses since we've only a few weeks left! If we just iterated through all confounder combinations it would take forever, so we'd have to be selective, and that process itself takes time. Even if we did do it, we'd only be reducing it all down to a single odds-ratio to pass on to MAPS anyway, and we wouldn't really have the opportunity to explore the variation in results, a key purpose of the SCA approach.

I hope you'll agree it's best to agree on a single analysis plan and run with that.
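
As a rough illustration of why iterating through all confounder combinations would take forever, the sketch below counts the possible adjustment sets. It assumes 82 candidate covariates (the 84 variables minus one exposure and one outcome) and one model fit per second; both figures are assumptions for illustration, not properties of the dataset or of our analysis.

```python
from itertools import combinations


def count_adjustment_sets(n_candidates):
    """Each candidate is either in or out of the model, so 2**n subsets."""
    return 2 ** n_candidates


def enumerate_adjustment_sets(candidates):
    """Yield every subset of the candidates, from the empty set upwards."""
    for size in range(len(candidates) + 1):
        yield from combinations(candidates, size)


# Small case using the three dep_band variables mentioned in this thread.
small = ["dep_band_10", "dep_band_13", "dep_band_15"]
all_sets = list(enumerate_adjustment_sets(small))
print(len(all_sets))

# Assumed full case: 82 candidates, one model fit per second.
n_total = count_adjustment_sets(82)
years = n_total / (60 * 60 * 24 * 365)
print(f"{n_total:.2e} models, roughly {years:.1e} years at one fit per second")
```

Even restricting to a handful of candidates, the count doubles with every variable added, which is why any multi-specification approach would need a selective, pre-agreed list.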

ajstewartlang commented 5 years ago

Yup, this approach sounds sensible to me.