Closed wkumler closed 1 year ago
When we run the full regression model, trained on an 80:20 train/test split of roughly 1,600 labeled features, we get some excellent separation between "good" and "bad" peaks - good enough that we can choose almost any threshold between 0.1 and 0.9 and it'll do a good job of classifying the test set into good and bad.
```r
library(tidyverse)  # read_csv, dplyr verbs, and purrr::rbernoulli

dataset_version <- "FT2040"
output_folder <- paste0("made_data_", dataset_version, "/")
features_extracted <- read_csv(paste0(output_folder, "features_extracted.csv")) %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn))
traintestlist <- features_extracted %>%
  # select(-med_SNR, -med_cor, -med_missed_scans, -shape_cor, -area_cor) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=ifelse(feat_class=="Good", TRUE, FALSE)) %>%
  slice_sample(n = nrow(.)) %>%
  split(rbernoulli(nrow(.), 0.2)) %>%
  setNames(c("train", "test"))
full_model <- traintestlist$train %>%
  select(-feat_id, -blank_found) %>%
  glm(formula=feat_class~., family = binomial)
summary(full_model)
```
Model output:
```
Call:
glm(formula = feat_class ~ ., family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9180  -0.0897  -0.0345  -0.0108   3.5438  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -5.377e-01  3.806e+00  -0.141 0.887644    
mean_mz          -4.747e-03  2.729e-03  -1.740 0.081881 .  
sd_ppm            2.779e+06  4.298e+06   0.647 0.517937    
mean_rt           2.187e-03  1.250e-03   1.750 0.080182 .  
sd_rt            -3.090e-01  8.521e-02  -3.626 0.000288 ***
mean_pw           1.535e-02  3.194e-02   0.481 0.630869    
sd_pw            -7.733e-02  6.617e-02  -1.169 0.242560    
log_mean_height  -1.717e+00  1.339e+00  -1.283 0.199612    
log_sd_height     1.614e+00  1.044e+00   1.546 0.122119    
sn                3.917e-02  2.142e-01   0.183 0.854891    
f                -1.133e-02  6.147e-03  -1.843 0.065386 .  
scale            -2.379e-01  8.447e-02  -2.817 0.004851 ** 
lmin             -2.356e-03  1.399e-03  -1.684 0.092238 .  
feat_npeaks       1.265e-01  5.967e-02   2.120 0.033996 *  
n_found          -5.853e-01  2.388e-01  -2.451 0.014232 *  
samps_found       1.304e+01  6.658e+00   1.959 0.050145 .  
stans_found       4.785e+00  2.522e+00   1.897 0.057768 .  
med_SNR           5.273e-01  1.414e-01   3.728 0.000193 ***
med_cor           4.234e+00  9.709e-01   4.361  1.3e-05 ***
med_missed_scans  6.732e-02  2.553e-02   2.637 0.008375 ** 
t_pval           -1.407e+00  8.510e-01  -1.653 0.098231 .  
smp_to_blk        3.557e-01  2.700e-01   1.317 0.187681    
smp_to_std       -5.511e-01  3.760e-01  -1.466 0.142740    
shape_cor         1.359e+00  9.351e-01   1.453 0.146164    
area_cor          5.629e-01  6.921e-01   0.813 0.416015    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1113.66  on 1418  degrees of freedom
Residual deviance:  172.11  on 1394  degrees of freedom
AIC: 222.11

Number of Fisher Scoring iterations: 9
```
We get some useful results from this already. The fact that our custom metrics med_cor (a metric of good peak shape) and med_SNR (a metric of peak noisiness) perform best is encouraging, and it's also validating that our third most important feature is the very sensible sd_rt (standard deviation of the peak retention time). The mysterious "scale" continues to be surprisingly useful here, but the other good metrics are all pretty sensible.
Confusion matrix when model is applied to test set with threshold of 0.5:
| | Bad (predicted) | Good (predicted) |
|---|---|---|
| **Bad (actual)** | 304 | 2 |
| **Good (actual)** | 4 | 45 |
Confusion matrix when model is applied to test set with threshold of 0.1 (corresponding to a higher false positive rate because we're looking at anything the model considers more than 10% likely to be a real peak):
| | Bad (predicted) | Good (predicted) |
|---|---|---|
| **Bad (actual)** | 287 | 19 |
| **Good (actual)** | 2 | 47 |
Confusion matrix when model is applied to test set with threshold of 0.9 (corresponding to a higher false negative rate because we're looking only at things the model considers more than 90% likely to be a real peak):
| | Bad (predicted) | Good (predicted) |
|---|---|---|
| **Bad (actual)** | 306 | 0 |
| **Good (actual)** | 8 | 41 |
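The original thresholding code isn't shown in this thread, but matrices like the ones above can be produced with a small helper along these lines (a sketch, assuming the `full_model` and `traintestlist` objects defined earlier; `conf_mat` is a hypothetical name, not from the analysis):

```r
# Cross-tabulate actual class against the thresholded predicted probability.
# Rows are the actual class, columns the prediction at the given threshold.
conf_mat <- function(model, test_df, threshold) {
  pred_prob <- predict(model, newdata = test_df, type = "response")
  table(actual = test_df$feat_class, predicted = pred_prob > threshold)
}
# e.g. conf_mat(full_model, traintestlist$test, 0.5)
```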
Visualization of the test-set class distribution across predicted probability (figure not reproduced here), generated by the code below:
```r
init_gp <- traintestlist$test %>%
  cbind(pred_prob=predict(full_model, ., type="response")) %>%
  mutate(pred_prob=cut(pred_prob, pretty(pred_prob, n=15), include.lowest=TRUE)) %>%
  mutate(feat_class=factor(feat_class, labels = c("Bad", "Good"), levels=c(FALSE, TRUE))) %>%
  ggplot(aes(x=pred_prob, fill=feat_class)) +
  theme_bw() +
  theme(axis.title.x = element_blank()) +
  labs(fill="Classification") +
  scale_x_discrete(drop=FALSE)
bar_gp <- init_gp +
  geom_bar() +
  labs(y="Count") +
  theme(axis.text.x = element_blank()) +
  ggtitle("Full multiple regression model predictions on test set")
filled_gp <- init_gp +
  geom_bar(position = "fill") +
  labs(y="Proportion") +
  theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5))
cowplot::plot_grid(bar_gp, filled_gp, ncol = 1)
```
One of the other big questions is feature selection: how many of these predictors do we actually need for the model to perform as well as possible? We now know that a lot of these predictors are fairly useless according to their p-values, so we should be able to remove them and improve our AIC score (lower is better). Our full model's AIC score is 231.7 and we're looking to improve on that. (The AIC values quoted here differ slightly from the 222.11 in the summary above because the train/test split was re-randomized between runs; see the note about set.seed at the bottom.)
If we start with a model that just includes our best three features, we get an AIC of 338.9:
```r
traintestlist$train %>%
  select(feat_class, med_cor, med_SNR, sd_rt) %>%
  glm(formula=feat_class~., family = binomial)
```

```
Residual Deviance: 330.9    AIC: 338.9
```
The xcms-only model has an even worse AIC, however:
```r
traintestlist$train %>%
  select(mean_mz, sd_ppm, mean_rt, sd_rt, mean_pw, sd_pw, log_mean_height,
         log_sd_height, sn, f, scale, lmin, feat_class) %>%
  glm(formula=feat_class~., family = binomial)
```

```
Residual Deviance: 345.7    AIC: 371.7
```
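As a side note, base R's `AIC()` accepts several fitted models at once, which makes these head-to-head comparisons a one-liner. A self-contained toy sketch (simulated data, not the real feature table):

```r
# Compare two nested logistic models by AIC (lower is better).
set.seed(42)
toy <- data.frame(y  = rbinom(200, 1, 0.5),
                  x1 = rnorm(200),
                  x2 = rnorm(200))
m_small <- glm(y ~ x1,      family = binomial, data = toy)
m_full  <- glm(y ~ x1 + x2, family = binomial, data = toy)
AIC(m_small, m_full)  # returns a data frame with df and AIC per model
```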
Using the `MASS::stepAIC` function to perform stepwise feature addition/removal, we end up with a "best" AIC of 220.73, which is barely below the full model, partially because it still includes a whole bunch of features:
```r
all_model_feats <- select(traintestlist$train, -feat_id, -blank_found)
full_model <- glm(formula=feat_class~., family = binomial, data = all_model_feats)
step_model <- MASS::stepAIC(full_model, direction = "both")
```
It drops eight of them (feat_class is the response, so it doesn't count):

```
> setdiff(names(all_model_feats), broom::tidy(step_model)$term)
[1] "sd_ppm"        "mean_rt"       "mean_pw"       "sd_pw"         "log_sd_height"
[6] "sn"            "f"             "area_cor"      "feat_class"   
```
Whoops, I forgot to set.seed for the above analyses, so the train/test split was re-randomized on each run. The results are basically the same but will be more consistent moving forward.
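For reference, a minimal base-R sketch of a reproducible 80:20 split (the seed value and data frame here are arbitrary placeholders, not from the analysis above):

```r
set.seed(123)  # any fixed seed makes the random split repeatable across runs
df <- data.frame(feat_class = rep(c(TRUE, FALSE), 50), x = rnorm(100))
in_test <- runif(nrow(df)) < 0.2  # base-R stand-in for purrr::rbernoulli(n, 0.2)
# split() orders the groups FALSE then TRUE, so FALSE (~80%) becomes "train"
traintest <- setNames(split(df, in_test), c("train", "test"))
```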
Logistic regression seems to be doing an excellent job of separating good peaks from bad, and I'm documenting here some of the code and outputs that support this. The worries with logistic regression are that it 1) performs poorly with highly correlated variables, 2) can only handle binary outcomes in its traditional form, and 3) can only handle linear predictors, with no "sweet spot" for peak width or other metrics.
1) is less of a concern because we've already looked at the pairs plot of each set of variables and removed the highly correlated ones. I'm not sure how well this technique scales to additional variables we add later, but it seems to have worked pretty well so far.
2) is more or less fine because in the end we just want a final list of "use this" and "don't use this". Logistic regression produces a probability for each peak, which then needs to be thresholded, and that threshold can be used to control the false positive/false negative tradeoff, ideally by setting an a-priori level of acceptable false discovery rate. This also extends to more than two classes (like standards only) with multinomial regression, but that's far less intuitive to me.
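The a-priori FDR idea could be sketched like this (a hypothetical helper, not part of the analysis above): given predicted probabilities and true labels for a labeled hold-out set, scan thresholds upward until the empirical false discovery rate drops below the target.

```r
# Find the lowest probability threshold whose empirical FDR on a labeled
# hold-out set is at or below target_fdr. Returns NA if none qualifies.
fdr_threshold <- function(pred_prob, is_good, target_fdr = 0.05) {
  for (t in sort(unique(pred_prob))) {
    called_good <- pred_prob >= t
    fdr <- sum(called_good & !is_good) / sum(called_good)
    if (fdr <= target_fdr) return(t)
  }
  NA_real_
}
```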
3) is also less of a concern since we did some scaling to handle the highly skewed predictors and log-scaled things like peak height. That certainly won't get around a genuine "sweet spot", but it doesn't seem like that's happening even for the expected predictors like peak width, so I'm avoiding more complex handling for now.
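If a "sweet spot" ever does turn up, it wouldn't force abandoning glm: a quadratic term on the offending predictor captures a peaked response within the same framework. A toy sketch on simulated data (none of this is from the real feature table):

```r
# Simulate a response whose probability peaks at intermediate x, which a
# linear logit misses but a quadratic term captures.
set.seed(7)
x <- runif(300, 0, 10)
p <- plogis(4 - (x - 5)^2)  # probability is highest near x = 5
y <- rbinom(300, 1, p)
m_lin  <- glm(y ~ x,          family = binomial)
m_quad <- glm(y ~ x + I(x^2), family = binomial)
AIC(m_lin, m_quad)  # the quadratic model should fit substantially better
```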