How does the XCMS-only model compare to the additional metrics we've calculated?

wkumler commented 1 year ago

It would be really nice if we could get a good model from the XCMS parameters only. It's super fast to calculate those params because it's just summarizing across the dataset and there's no need to even access the raw data. If we get reasonably good performance with the XCMS params alone then that might be the only model we need.

wkumler commented 1 year ago

XCMS metrics are:

xcms_params <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw", 
                 "sd_pw", "log_mean_height", "log_sd_height", "sn", "f", "scale", 
                 "lmin", "feat_npeaks", "n_found", "samps_found", "stans_found")

and using only those metrics to train a model on the Falkor dataset

falkor_xcms_model <- FT2040_features %>% 
  select(all_of(xcms_params), feat_class) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good") %>%
  glm(formula=feat_class~., family = binomial)
MS3000_preds <- MS3000_features %>%
  select(-feat_id) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(pred_prob=predict(object=falkor_full_model,newdata = ., type = "response")) %>%
  mutate(pred_class=pred_prob>0.5) %>%
  mutate(pred_class=ifelse(pred_class, "Good", "Bad"))
with(MS3000_preds, table(pred_class, feat_class))

	Bad	Good
pred_bad	1859	21
pred_good	85	228

compared to the "full" model of

	Bad	Good
pred_bad	1904	98
pred_good	40	151

wkumler commented 1 year ago

So with the XCMS-only model we get actually far fewer false positives, but we also get much fewer peaks overall. Our FDR with XCMS-only is about 20% while the full model is closer to 30%, but we only get about half as many "good" peaks in total.

wkumler commented 1 year ago

Slightly cleaner code than above that also compares the "minimal" model of just med_cor, med_SNR, and sd_rt:

library(tidyverse)
xcms_params <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw", 
                 "sd_pw", "log_mean_height", "log_sd_height", "sn", "f", "scale", 
                 "lmin", "feat_npeaks", "n_found", "samps_found", "stans_found")
FT2040_features <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  select(-blank_found, -feat_id) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good")
MS3000_features <- read_csv("made_data_MS3000/features_extracted.csv")

falkor_full_model <- FT2040_features %>%
  glm(formula=feat_class~., family = binomial)
falkor_xcms_model <- FT2040_features %>%
  select(all_of(xcms_params), feat_class) %>%
  glm(formula=feat_class~., family = binomial)
falkor_min_model <- FT2040_features %>%
  select(feat_class, med_cor, med_SNR, sd_rt) %>%
  glm(formula=feat_class~., family = binomial)

list(full=falkor_full_model, xcms=falkor_xcms_model, minimal=falkor_min_model) %>%
  sapply(function(mod){
    MS3000_features %>%
      mutate(pred_prob=predict(object=mod, newdata = ., type = "response")) %>%
      mutate(pred_class=pred_prob>0.5) %>%
      mutate(pred_class=ifelse(pred_class, "Good", "Bad")) %>%
      with(table(pred_class, feat_class)) %>%
      as.data.frame()
  }, simplify = FALSE) %>%
  bind_rows(.id="model") %>%
  pivot_wider(names_from = feat_class, values_from=Freq) %>%
  knitr::kable()

model	pred_class	Bad	Good	Meh	Stans only
full	Bad	1859	21	51	147
full	Good	85	228	71	72
xcms	Bad	1904	98	84	180
xcms	Good	40	151	38	39
minimal	Bad	1912	79	79	181
minimal	Good	32	170	43	38

wkumler / MS_metrics

How does the XCMS-only model compare to the additional metrics we've calculated? #13