Open ajstewartlang opened 5 years ago
Perhaps recoding as an ordered factor is the better option, as it seems to capture the differences between the discrete scores more faithfully:
maps_synthetic_data <- maps_synthetic_data %>% mutate(anx_band_15 = recode_factor(anx_band_15, "0" = "0", "<0.1%" = "0.1", "~0.5%" = "0.5", "~3%" = "3", "~15%" = "15", "~50%" = "50", .ordered = TRUE))
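To illustrate what an ordered factor gives us here, below is a minimal base R sketch on a toy vector. The vector `anx` is a hypothetical stand-in for `anx_band_15`, not the real data, and treating NA as the lowest band is the assumption discussed in this thread rather than something confirmed by the data dictionary.

```r
# Toy stand-in for anx_band_15 (hypothetical values, not the real data)
anx <- c("~0.5%", NA, "~3%", "~15%", "~50%", "<0.1%")

# Assumption from this thread: NA means the lowest band, coded as "0"
anx[is.na(anx)] <- "0"

# Levels listed from lowest to highest band
band_levels <- c("0", "<0.1%", "~0.5%", "~3%", "~15%", "~50%")
anx_ord <- factor(anx, levels = band_levels, ordered = TRUE)

levels(anx_ord)      # the bands in their intended order
anx_ord >= "~3%"     # comparisons follow the band order, not alphabetical order
as.integer(anx_ord)  # rank of each band, 1 = lowest
```

The ordered factor keeps "<0.1%" distinct from "0" while still allowing comparisons such as `>=`, which a plain character column would not support.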
It looks like we need to do a fair bit of data tidying at the start. For example, for the variables comp_int_bed_16 and comp_noint_bed_16 the only values in the dataset are "Yes" and NA. Presumably the NA corresponds to "No" rather than reflecting real missing data:
> unique(maps_synthetic_data$comp_int_bed_16)
[1] NA "Yes"
> unique(maps_synthetic_data$comp_noint_bed_16)
[1] NA "Yes"
So I'm thinking we replace the NAs in these two variables with "No" and then turn them into 2-level factors:
maps_synthetic_data$comp_int_bed_16[is.na(maps_synthetic_data$comp_int_bed_16)] <- "No"
maps_synthetic_data$comp_noint_bed_16[is.na(maps_synthetic_data$comp_noint_bed_16)] <- "No"
maps_synthetic_data$comp_int_bed_16 <- factor(maps_synthetic_data$comp_int_bed_16, levels = c("No", "Yes"))
maps_synthetic_data$comp_noint_bed_16 <- factor(maps_synthetic_data$comp_noint_bed_16, levels = c("No", "Yes"))
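The same tidying step could also be written in one pass with dplyr and tidyr. This is only a sketch on a toy tibble: `toy` and its values are hypothetical, and it assumes, as above, that NA really does mean "No" in these two columns.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the two columns (hypothetical values, not the real data)
toy <- tibble(
  comp_int_bed_16   = c(NA, "Yes", NA),
  comp_noint_bed_16 = c("Yes", NA, NA)
)

# Replace NA with "No" and convert to a 2-level factor, for both columns at once
toy <- toy %>%
  mutate(across(
    c(comp_int_bed_16, comp_noint_bed_16),
    ~ factor(replace_na(.x, "No"), levels = c("No", "Yes"))
  ))

# Check the result: counts per level, flagging any NA that remains
table(toy$comp_int_bed_16, useNA = "ifany")
table(toy$comp_noint_bed_16, useNA = "ifany")
```

Setting `levels` explicitly makes "No" the reference level regardless of which value happens to appear first in the data.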
For the anxiety measure at age 15 (anx_band_15), it looks like the NAs correspond to 0 rather than missing data:
> unique(maps_synthetic_data$anx_band_15)
[1] "~0.5%" NA "~3%" "~15%" "~50%" "<0.1%"
So we might want to replace the NAs there with zeros.
maps_synthetic_data$anx_band_15[is.na(maps_synthetic_data$anx_band_15)] <- "0"
But what about the other values? Should we make this an ordered factor or treat as numerical? If treating as numerical we could recode as:
maps_synthetic_data <- maps_synthetic_data %>% mutate(anx_band_15 = as.numeric(recode(anx_band_15, "~0.5%" = "0.5", "~3%" = "3", "~15%" = "15", "~50%" = "50", "<0.1%" = "0")))
Although that forces the <0.1% values to be 0. Would an ordered factor be better do you think? @wjchulme, @jspickering, @OliJimbo
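For comparison, here is a rough sketch of the numeric route using a named lookup vector. The toy vector `anx` is hypothetical, and coding "<0.1%" as 0.1 (its upper bound) is an assumption made purely for illustration; any other value could be substituted in `band_values`.

```r
# Lookup from band label to a numeric value.
# Assumption: "<0.1%" is coded as 0.1 so that it stays distinct from 0.
band_values <- c(
  "0" = 0, "<0.1%" = 0.1, "~0.5%" = 0.5,
  "~3%" = 3, "~15%" = 15, "~50%" = 50
)

# Toy stand-in for anx_band_15 after NAs have been set to "0" (not the real data)
anx <- c("~0.5%", "0", "~3%", "~15%", "~50%", "<0.1%")

# Index the lookup by label; unname() drops the labels from the result
anx_num <- unname(band_values[anx])
anx_num

# Note that integer conversion truncates fractional bands, so a double is needed
as.integer("0.5")
as.numeric("0.5")
```

The numeric coding treats the gaps between bands as meaningful distances (for example, 15 to 50), whereas an ordered factor only assumes the ranking, which is the trade-off raised in the question above.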