Missing values

wjchulme commented 5 years ago

How we choose to deal with missing values depends on how they are distributed throughout the dataset. Complete-case analysis (removing all observations where some information is missing) is probably justified if you’re removing no more than about 10% of observations. Otherwise, multiple imputation is usually an approximately valid approach to take, depending on what we’re willing to assume about why the data might be missing.
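To make the 10% rule of thumb concrete, here is a minimal sketch of how one might check what share of rows a complete-case analysis would drop. The column names (`age`, `dep_score`, `comp_use`) and the toy values are made up for illustration, and `None` is assumed to mark a missing value; the real dataset will look different.

```python
# Toy records; None marks a missing value. Column names are hypothetical.
rows = [
    {"age": 16, "dep_score": 3, "comp_use": 2},
    {"age": 16, "dep_score": None, "comp_use": 1},
    {"age": 18, "dep_score": 5, "comp_use": None},
    {"age": 18, "dep_score": 1, "comp_use": 4},
    {"age": 18, "dep_score": 2, "comp_use": 0},
]


def incomplete_fraction(records):
    """Share of records with at least one missing (None) field."""
    incomplete = sum(1 for r in records if any(v is None for v in r.values()))
    return incomplete / len(records)


# Fraction of rows a complete-case analysis would throw away.
frac = incomplete_fraction(rows)

# The complete cases themselves: rows with no missing field.
complete_cases = [r for r in rows if all(v is not None for v in r.values())]

print(f"{frac:.0%} of rows would be dropped by complete-case analysis")
```

If the printed share is well above roughly 10%, that would be a signal to look at imputation rather than simply dropping rows.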

The use of synthetic data forces us to consider this up-front. The first thing to investigate once we get the dataset is the pattern of missing values. Who wants to volunteer? Some inspiration.
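As a starting point for whoever picks this up, a rough sketch of tabulating missing-value patterns might look like the following. It counts which combinations of variables are missing together and how often each variable is missing on its own. The column names and values are invented, and `None` is assumed to be the missing marker.

```python
from collections import Counter

# Toy records; None marks a missing value. Column names are hypothetical.
rows = [
    {"age": 16, "dep_score": 3, "comp_use": 2},
    {"age": 16, "dep_score": None, "comp_use": None},
    {"age": 18, "dep_score": None, "comp_use": None},
    {"age": 18, "dep_score": 1, "comp_use": None},
    {"age": 18, "dep_score": 2, "comp_use": 0},
]


def missing_pattern(record):
    """Sorted tuple of the field names that are missing in one record."""
    return tuple(sorted(k for k, v in record.items() if v is None))


# How often each combination of missing fields occurs.
patterns = Counter(missing_pattern(r) for r in rows)

# How often each individual field is missing.
per_column = Counter(k for r in rows for k, v in r.items() if v is None)

for pattern, n in patterns.most_common():
    print(n, pattern or "(complete)")
```

Fields that tend to go missing together, or missingness concentrated in one subgroup, would be a hint that the data are not missing completely at random.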

jspickering commented 5 years ago

I'll leave this to someone more experienced, but presumably it's sensible to assume that missing data is unlikely to be random for the depression vs computer use data, and for the age 16 vs age 18 data?

lanabojanic commented 5 years ago

Not too experienced either, but I volunteer!

wjchulme commented 5 years ago

Excellent, thanks @lanabojanic. Be aware that, as @ajstewartlang has said, many NA values are actually not missing but encode a specific value, like "no". So these will need to be recoded first.
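A minimal sketch of that recoding step, run before any missingness is counted, could look like this. The column names and the mapping `NA_MEANS` are purely hypothetical; which variables use NA to mean "no" has to come from the data dictionary.

```python
# Hypothetical mapping: in these columns an NA (None) encodes a real answer.
NA_MEANS = {"has_computer": "no"}

# Toy records; column names and values are made up for illustration.
rows = [
    {"has_computer": "yes", "dep_score": 3},
    {"has_computer": None, "dep_score": 4},     # NA here really means "no"
    {"has_computer": None, "dep_score": None},  # dep_score is truly missing
]


def recode(record, na_means=NA_MEANS):
    """Return a copy where structural NAs are replaced by the value they encode."""
    out = dict(record)
    for col, value in na_means.items():
        if col in out and out[col] is None:
            out[col] = value
    return out


recoded = [recode(r) for r in rows]

# Only genuinely missing values remain as None after recoding.
still_missing = sum(v is None for r in recoded for v in r.values())
print(still_missing)
```

Columns not listed in the mapping are left untouched, so their NAs still count as missing in any later pattern analysis.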

wjchulme / OSWGmcr-MAPS-collaboration

Missing values #6