wkumler / MS_metrics


Notes on logistic regressions #7

Closed · wkumler closed this 1 year ago

wkumler commented 1 year ago

Logistic regression seems to be doing an excellent job of separating good peaks from bad, and I'm documenting here some of the code and outputs that defend this. The worries with logistic regression are that it 1) performs poorly with highly correlated variables, 2) can only handle a binary outcome in its traditional form, and 3) can only handle linear predictors, so there's no "sweet spot" for peakwidth or other metrics.

1) is less of a concern because we've already looked at the pairs plot of each set of variables and removed the highly correlated ones. I'm not sure how well this technique scales as we add more variables, but it seems to have worked pretty well so far.

2) is more or less fine because in the end we just want a final list of "use this" and "don't use this". Logistic regression produces a probability for each peak, which then needs to be thresholded to control the false positive/false negative rates, ideally by setting an a-priori level of acceptable false discovery rate (see the sketch after point 3 below). This also expands to other classes (like standards only) with multinomial regression, but that's far less intuitive to me.

3) is also less of a concern since we did some scaling to handle the highly skewed predictors and log-scaled things like peak height. That won't get around a true "sweet spot" problem, but it doesn't look like that's the case even for the predictors where we'd expect it, like peakwidth, so I'm not adding nonlinear terms for now.
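As a concrete sketch of the thresholding idea in point 2: once the model has assigned a probability to every labeled peak, a cutoff could be chosen as the loosest one whose empirical false discovery rate stays under a target. The helper below is only an illustration, not part of the pipeline; the `labeled` data frame, its `pred_prob` column, and the 5% target are all hypothetical.

# Illustration only: pick the loosest probability cutoff whose empirical FDR
# (fraction of peaks called "good" that are actually labeled bad) stays below
# a target. Assumes `labeled` has a logical feat_class column (TRUE = good)
# and a pred_prob column of fitted probabilities.
choose_threshold <- function(labeled, target_fdr = 0.05) {
  cutoffs <- seq(0.01, 0.99, by = 0.01)
  fdr_at_cutoff <- sapply(cutoffs, function(cutoff) {
    called_good <- labeled$pred_prob > cutoff
    if (!any(called_good)) return(0)
    mean(!labeled$feat_class[called_good])
  })
  min(cutoffs[fdr_at_cutoff <= target_fdr])
}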

wkumler commented 1 year ago

When we run the full regression model, trained with an 80:20 train/test split on ~1,600 labeled features, we get some excellent separation between "good" and "bad" peaks, good enough that we can choose almost any threshold between 0.1 and 0.9 and it'll do a good job of classifying the test set.

library(tidyverse)  # read_csv, dplyr verbs, and purrr::rbernoulli

dataset_version <- "FT2040"
output_folder <- paste0("made_data_", dataset_version, "/")

features_extracted <- read_csv(paste0(output_folder, "features_extracted.csv")) %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn))  # clean up infinite signal-to-noise values

traintestlist <- features_extracted %>%
  # select(-med_SNR, -med_cor, -med_missed_scans, -shape_cor, -area_cor) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=ifelse(feat_class=="Good", TRUE, FALSE)) %>%
  slice_sample(n = nrow(.)) %>%           # shuffle the rows
  split(rbernoulli(nrow(.), 0.2)) %>%     # ~80:20 split into FALSE/TRUE groups
  setNames(c("train", "test"))

full_model <- traintestlist$train %>% 
  select(-feat_id, -blank_found) %>%
  glm(formula=feat_class~., family = binomial)
summary(full_model)

Model output:

Call:
glm(formula = feat_class ~ ., family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9180  -0.0897  -0.0345  -0.0108   3.5438  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -5.377e-01  3.806e+00  -0.141 0.887644    
mean_mz          -4.747e-03  2.729e-03  -1.740 0.081881 .  
sd_ppm            2.779e+06  4.298e+06   0.647 0.517937    
mean_rt           2.187e-03  1.250e-03   1.750 0.080182 .  
sd_rt            -3.090e-01  8.521e-02  -3.626 0.000288 ***
mean_pw           1.535e-02  3.194e-02   0.481 0.630869    
sd_pw            -7.733e-02  6.617e-02  -1.169 0.242560    
log_mean_height  -1.717e+00  1.339e+00  -1.283 0.199612    
log_sd_height     1.614e+00  1.044e+00   1.546 0.122119    
sn                3.917e-02  2.142e-01   0.183 0.854891    
f                -1.133e-02  6.147e-03  -1.843 0.065386 .  
scale            -2.379e-01  8.447e-02  -2.817 0.004851 ** 
lmin             -2.356e-03  1.399e-03  -1.684 0.092238 .  
feat_npeaks       1.265e-01  5.967e-02   2.120 0.033996 *  
n_found          -5.853e-01  2.388e-01  -2.451 0.014232 *  
samps_found       1.304e+01  6.658e+00   1.959 0.050145 .  
stans_found       4.785e+00  2.522e+00   1.897 0.057768 .  
med_SNR           5.273e-01  1.414e-01   3.728 0.000193 ***
med_cor           4.234e+00  9.709e-01   4.361  1.3e-05 ***
med_missed_scans  6.732e-02  2.553e-02   2.637 0.008375 ** 
t_pval           -1.407e+00  8.510e-01  -1.653 0.098231 .  
smp_to_blk        3.557e-01  2.700e-01   1.317 0.187681    
smp_to_std       -5.511e-01  3.760e-01  -1.466 0.142740    
shape_cor         1.359e+00  9.351e-01   1.453 0.146164    
area_cor          5.629e-01  6.921e-01   0.813 0.416015    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1113.66  on 1418  degrees of freedom
Residual deviance:  172.11  on 1394  degrees of freedom
AIC: 222.11

Number of Fisher Scoring iterations: 9

We get some useful results from this already: the fact that our custom metrics med_cor (a metric of good peak shape) and med_SNR (a metric of peak noisiness) perform best is encouraging, and the third most important feature being the very sensible sd_rt (standard deviation of the peak retention time) is also validating. The mysterious "scale" continues to be surprisingly useful here, but the other good metrics are all pretty sensible.
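As a side note, that ranking can be pulled straight out of the fitted model; a quick sketch using broom::tidy on the full_model object from above, sorting by p-value as a rough proxy for importance:

# Rank predictors by p-value as a rough importance ordering
library(broom)
tidy(full_model) %>%
  filter(term != "(Intercept)") %>%
  arrange(p.value) %>%
  select(term, estimate, p.value) %>%
  head(5)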

Confusion matrix when model is applied to test set with threshold of 0.5:

      Bad Good
Bad   304    2
Good    4   45

Confusion matrix when model is applied to test set with threshold of 0.1 (corresponding to a higher false positive rate because we're looking at anything the model considers more than 10% likely to be a real peak):

      Bad Good
Bad   287   19
Good    2   47

Confusion matrix when model is applied to test set with threshold of 0.9 (corresponding to a higher false negative rate because we're looking only at things the model considers more than 90% likely to be a real peak):

      Bad Good
Bad   306    0
Good    8   41
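For reference, a confusion matrix like the ones above can be produced along these lines (the 0.5 cutoff and the row/column orientation here are just for illustration and may not match the tables exactly):

# Confusion matrix on the held-out test set at a given probability cutoff
test_probs <- predict(full_model, traintestlist$test, type = "response")
table(
  actual    = ifelse(traintestlist$test$feat_class, "Good", "Bad"),
  predicted = ifelse(test_probs > 0.5, "Good", "Bad")
)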

Visualization of the feature distribution across predicted probability:

[Figure: "Full multiple regression model predictions on test set" – counts and proportions of Good/Bad test-set features binned by predicted probability]

Figure above from the code below:

init_gp <- traintestlist$test %>%
  cbind(pred_prob=predict(full_model, ., type="response")) %>%
  # bin the predicted probabilities into ~15 intervals for plotting
  mutate(pred_prob=cut(pred_prob, pretty(pred_prob, n=15), include.lowest=TRUE)) %>%
  mutate(feat_class=factor(feat_class, labels = c("Bad", "Good"), levels=c(FALSE, TRUE))) %>%
  ggplot(aes(x=pred_prob, fill=feat_class)) +
  theme_bw() +
  theme(axis.title.x = element_blank()) +
  labs(fill="Classification") +
  scale_x_discrete(drop=FALSE)
# top panel: raw counts per probability bin
bar_gp <- init_gp + 
  geom_bar() + 
  labs(y="Count") + 
  theme(axis.text.x = element_blank()) +
  ggtitle("Full multiple regression model predictions on test set")
# bottom panel: proportion of Good vs Bad within each bin
filled_gp <- init_gp + 
  geom_bar(position = "fill") + 
  labs(y="Proportion") +
  theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5))
cowplot::plot_grid(bar_gp, filled_gp, ncol = 1)
wkumler commented 1 year ago

One of the other big questions is feature selection: how many of these predictors do we actually need for the model to perform as well as possible? We now know that a lot of these predictors are fairly useless according to their p-values, so we should be able to remove them and improve our AIC score (lower is better). Our full model's AIC score is 231.7 and we're looking to improve on that.

If we start with a model that just includes our best three features, we get an AIC of 338.9:

traintestlist$train %>% 
  select(feat_class, med_cor, med_SNR, sd_rt) %>%
  glm(formula=feat_class~., family = binomial)

Residual Deviance: 330.9    AIC: 338.9

The xcms-only model has an even worse AIC, however:

traintestlist$train %>%
  select(mean_mz, sd_ppm, mean_rt, sd_rt, mean_pw, sd_pw, log_mean_height, log_sd_height, sn, f, scale, lmin, feat_class) %>%
  glm(formula=feat_class~., family = binomial)

Residual Deviance: 345.7    AIC: 371.7
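One tidy way to put these candidates side by side is to fit them to named objects and hand them all to AIC(); a quick sketch (the model names here are mine, not from the pipeline):

# Compare candidate models in a single table (df and AIC per model)
best3_model <- glm(feat_class ~ med_cor + med_SNR + sd_rt,
                   family = binomial, data = traintestlist$train)
xcms_model <- traintestlist$train %>%
  select(mean_mz, sd_ppm, mean_rt, sd_rt, mean_pw, sd_pw, log_mean_height,
         log_sd_height, sn, f, scale, lmin, feat_class) %>%
  glm(formula = feat_class ~ ., family = binomial)
AIC(full_model, best3_model, xcms_model)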

Using the MASS::stepAIC function to perform stepwise feature addition/removal, we end up with a "best" AIC of 220.73, which is barely below the full model's, partly because the stepwise model still includes a whole bunch of features:

all_model_feats <- select(traintestlist$train, -feat_id, -blank_found)
full_model <- glm(formula=feat_class~., family = binomial, data = all_model_feats)
step_model <- MASS::stepAIC(full_model, direction = "both")

It drops only eight of them (feat_class doesn't count since it's the response, not a predictor):

> setdiff(names(all_model_feats), broom::tidy(step_model)$term)
[1] "sd_ppm"        "mean_rt"       "mean_pw"       "sd_pw"         "log_sd_height"
[6] "sn"            "f"             "area_cor"      "feat_class"  
wkumler commented 1 year ago

Whoops, I forgot to set.seed for the above analyses (so the train/test split was randomized differently each run). Things are basically the same but they'll be more consistent moving forward.
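For the record, something like this before building traintestlist should pin the split down (the seed value itself is arbitrary):

# Arbitrary seed; makes slice_sample() and rbernoulli() reproducible run-to-run
set.seed(123)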