wkumler / MS_metrics

5 stars 0 forks source link

How does the XCMS-only model compare to the additional metrics we've calculated? #13

Closed wkumler closed 1 year ago

wkumler commented 1 year ago

It would be really nice if we could get a good model from the XCMS parameters only. It's super fast to calculate those params because it's just summarizing across the dataset and there's no need to even access the raw data. If we get reasonably good performance with the XCMS params alone then that might be the only model we need.

wkumler commented 1 year ago

XCMS metrics are:

xcms_params <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw", 
                 "sd_pw", "log_mean_height", "log_sd_height", "sn", "f", "scale", 
                 "lmin", "feat_npeaks", "n_found", "samps_found", "stans_found")

and using only those metrics to train a model on the Falkor dataset

falkor_xcms_model <- FT2040_features %>% 
  select(all_of(xcms_params), feat_class) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good") %>%
  glm(formula=feat_class~., family = binomial)
MS3000_preds <- MS3000_features %>%
  select(-feat_id) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(pred_prob=predict(object=falkor_full_model,newdata = ., type = "response")) %>%
  mutate(pred_class=pred_prob>0.5) %>%
  mutate(pred_class=ifelse(pred_class, "Good", "Bad"))
with(MS3000_preds, table(pred_class, feat_class))
Bad Good
pred_bad 1859 21
pred_good 85 228

compared to the "full" model of

Bad Good
pred_bad 1904 98
pred_good 40 151
wkumler commented 1 year ago

So with the XCMS-only model we get actually far fewer false positives, but we also get much fewer peaks overall. Our FDR with XCMS-only is about 20% while the full model is closer to 30%, but we only get about half as many "good" peaks in total.

wkumler commented 1 year ago

Slightly cleaner code than above that also compares the "minimal" model of just med_cor, med_SNR, and sd_rt:

library(tidyverse)
xcms_params <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw", 
                 "sd_pw", "log_mean_height", "log_sd_height", "sn", "f", "scale", 
                 "lmin", "feat_npeaks", "n_found", "samps_found", "stans_found")
FT2040_features <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  select(-blank_found, -feat_id) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good")
MS3000_features <- read_csv("made_data_MS3000/features_extracted.csv")

falkor_full_model <- FT2040_features %>%
  glm(formula=feat_class~., family = binomial)
falkor_xcms_model <- FT2040_features %>%
  select(all_of(xcms_params), feat_class) %>%
  glm(formula=feat_class~., family = binomial)
falkor_min_model <- FT2040_features %>%
  select(feat_class, med_cor, med_SNR, sd_rt) %>%
  glm(formula=feat_class~., family = binomial)

list(full=falkor_full_model, xcms=falkor_xcms_model, minimal=falkor_min_model) %>%
  sapply(function(mod){
    MS3000_features %>%
      mutate(pred_prob=predict(object=mod, newdata = ., type = "response")) %>%
      mutate(pred_class=pred_prob>0.5) %>%
      mutate(pred_class=ifelse(pred_class, "Good", "Bad")) %>%
      with(table(pred_class, feat_class)) %>%
      as.data.frame()
  }, simplify = FALSE) %>%
  bind_rows(.id="model") %>%
  pivot_wider(names_from = feat_class, values_from=Freq) %>%
  knitr::kable()
model pred_class Bad Good Meh Stans only
full Bad 1859 21 51 147
full Good 85 228 71 72
xcms Bad 1904 98 84 180
xcms Good 40 151 38 39
minimal Bad 1912 79 79 181
minimal Good 32 170 43 38