Closed wkumler closed 1 year ago
XCMS metrics are:
xcms_params <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw",
"sd_pw", "log_mean_height", "log_sd_height", "sn", "f", "scale",
"lmin", "feat_npeaks", "n_found", "samps_found", "stans_found")
and using only those metrics to train a model on the Falkor dataset
falkor_xcms_model <- FT2040_features %>%
select(all_of(xcms_params), feat_class) %>%
filter(feat_class%in%c("Good", "Bad")) %>%
mutate(feat_class=feat_class=="Good") %>%
glm(formula=feat_class~., family = binomial)
MS3000_preds <- MS3000_features %>%
select(-feat_id) %>%
filter(feat_class%in%c("Good", "Bad")) %>%
mutate(pred_prob=predict(object=falkor_full_model,newdata = ., type = "response")) %>%
mutate(pred_class=pred_prob>0.5) %>%
mutate(pred_class=ifelse(pred_class, "Good", "Bad"))
with(MS3000_preds, table(pred_class, feat_class))
Bad | Good | |
---|---|---|
pred_bad | 1859 | 21 |
pred_good | 85 | 228 |
compared to the "full" model of
Bad | Good | |
---|---|---|
pred_bad | 1904 | 98 |
pred_good | 40 | 151 |
So with the XCMS-only model we get actually far fewer false positives, but we also get much fewer peaks overall. Our FDR with XCMS-only is about 20% while the full model is closer to 30%, but we only get about half as many "good" peaks in total.
Slightly cleaner code than above that also compares the "minimal" model of just med_cor
, med_SNR
, and sd_rt
:
library(tidyverse)
xcms_params <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw",
"sd_pw", "log_mean_height", "log_sd_height", "sn", "f", "scale",
"lmin", "feat_npeaks", "n_found", "samps_found", "stans_found")
FT2040_features <- read_csv("made_data_FT2040/features_extracted.csv") %>%
select(-blank_found, -feat_id) %>%
filter(feat_class%in%c("Good", "Bad")) %>%
mutate(feat_class=feat_class=="Good")
MS3000_features <- read_csv("made_data_MS3000/features_extracted.csv")
falkor_full_model <- FT2040_features %>%
glm(formula=feat_class~., family = binomial)
falkor_xcms_model <- FT2040_features %>%
select(all_of(xcms_params), feat_class) %>%
glm(formula=feat_class~., family = binomial)
falkor_min_model <- FT2040_features %>%
select(feat_class, med_cor, med_SNR, sd_rt) %>%
glm(formula=feat_class~., family = binomial)
list(full=falkor_full_model, xcms=falkor_xcms_model, minimal=falkor_min_model) %>%
sapply(function(mod){
MS3000_features %>%
mutate(pred_prob=predict(object=mod, newdata = ., type = "response")) %>%
mutate(pred_class=pred_prob>0.5) %>%
mutate(pred_class=ifelse(pred_class, "Good", "Bad")) %>%
with(table(pred_class, feat_class)) %>%
as.data.frame()
}, simplify = FALSE) %>%
bind_rows(.id="model") %>%
pivot_wider(names_from = feat_class, values_from=Freq) %>%
knitr::kable()
model | pred_class | Bad | Good | Meh | Stans only |
---|---|---|---|---|---|
full | Bad | 1859 | 21 | 51 | 147 |
full | Good | 85 | 228 | 71 | 72 |
xcms | Bad | 1904 | 98 | 84 | 180 |
xcms | Good | 40 | 151 | 38 | 39 |
minimal | Bad | 1912 | 79 | 79 | 181 |
minimal | Good | 32 | 170 | 43 | 38 |
It would be really nice if we could get a good model from the XCMS parameters only. It's super fast to calculate those params because it's just summarizing across the dataset and there's no need to even access the raw data. If we get reasonably good performance with the XCMS params alone then that might be the only model we need.