Data treatment/tidying

It looks like we need to do a fair bit of data tidying at the start. For example, for the variables comp_int_bed_16 and comp_noint_bed_16 the only values in the dataset are "Yes" and NA. Presumably the NA corresponds to "No" rather than reflecting real missing data:

> unique(maps_synthetic_data$comp_int_bed_16)
[1] NA    "Yes"

> unique(maps_synthetic_data$comp_noint_bed_16)
[1] NA    "Yes"

So I'm thinking we replace the NAs in these two variables with "No" and then turn them into 2-level factors:

maps_synthetic_data[is.na(maps_synthetic_data$comp_int_bed_16), ]$comp_int_bed_16 <- "No"
maps_synthetic_data[is.na(maps_synthetic_data$comp_noint_bed_16), ]$comp_noint_bed_16 <- "No"
maps_synthetic_data$comp_int_bed_16 <- factor(maps_synthetic_data$comp_int_bed_16)
maps_synthetic_data$comp_noint_bed_16 <- factor(maps_synthetic_data$comp_noint_bed_16)
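As a rough sketch of the same recode, here is a version that loops over both variables and sets the factor levels explicitly, so that "No" is always the first level. The toy data frame and the name bed_vars are made up for illustration, and the code still rests on the assumption that NA really means "No":

```r
# Toy data frame standing in for maps_synthetic_data (values are made up)
toy <- data.frame(
  comp_int_bed_16   = c("Yes", NA, NA, "Yes"),
  comp_noint_bed_16 = c(NA, NA, "Yes", NA),
  stringsAsFactors  = FALSE
)

# Hypothetical helper vector naming the columns to recode
bed_vars <- c("comp_int_bed_16", "comp_noint_bed_16")

for (v in bed_vars) {
  x <- toy[[v]]
  x[is.na(x)] <- "No"  # assumption: NA means "No", not missing
  # Explicit levels fix the level order and keep both levels
  # even if one of them never occurs in a column
  toy[[v]] <- factor(x, levels = c("No", "Yes"))
}

table(toy$comp_int_bed_16, useNA = "ifany")
```

Indexing the column directly (x[is.na(x)] <- "No") also avoids an error in the case where a column happens to contain no NAs.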

For the anxiety measure at age 15 (anx_band_15), it looks like the NAs correspond to 0 rather than missing data:

> unique(maps_synthetic_data$anx_band_15)
[1] "~0.5%" NA      "~3%"   "~15%"  "~50%"  "<0.1%"

So we might want to replace the NAs there with zeros.

maps_synthetic_data[is.na(maps_synthetic_data$anx_band_15), ]$anx_band_15 <- 0

But what about the other values? Should we make this an ordered factor or treat as numerical? If treating as numerical we could recode as:

library(dplyr)

# as.numeric() rather than as.integer(), which would truncate 0.5 to 0
maps_synthetic_data <- maps_synthetic_data %>%
  mutate(anx_band_15 = as.numeric(recode(anx_band_15,
    "~0.5%" = "0.5",
    "~3%"   = "3",
    "~15%"  = "15",
    "~50%"  = "50",
    "<0.1%" = "0")))

Although that forces the <0.1% values to be 0. Would an ordered factor be better, do you think? @wjchulme, @jspickering, @OliJimbo
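For comparison, a minimal sketch of the ordered-factor option on a toy vector. The ascending order of the bands is an assumption read off the percentages, band_levels and anx_band_15_ord are made-up names, and the NAs are deliberately left as NA here since it is still open whether they mean 0:

```r
# Toy vector standing in for maps_synthetic_data$anx_band_15
anx_band_15 <- c("~0.5%", NA, "~3%", "~15%", "~50%", "<0.1%")

# Band labels in ascending order (assumed from the percentages)
band_levels <- c("<0.1%", "~0.5%", "~3%", "~15%", "~50%")

# Ordered factor: keeps "<0.1%" distinct from 0 and preserves the ranking
anx_band_15_ord <- factor(anx_band_15, levels = band_levels, ordered = TRUE)

# Ordered factors support comparisons against a level label
anx_band_15_ord >= "~3%"

# as.integer() gives the rank of each band (1 = lowest), NA stays NA
as.integer(anx_band_15_ord)
```

This keeps the ordering without committing to the spacing between bands, which the numeric recode does implicitly.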

wjchulme / OSWGmcr-MAPS-collaboration

Data treatment/tidying #7