wkumler / MS_metrics

How does the binomial regression perform on the "Meh" and "Stans only" classes? #10

Closed wkumler closed 1 year ago

wkumler commented 1 year ago
library(tidyverse)

# dataset_version <- "FT350"
dataset_version <- "FT2040"
output_folder <- paste0("made_data_", dataset_version, "/")

features_extracted <- read_csv(paste0(output_folder, "features_extracted.csv")) %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn))

set.seed(123)
traintestlist <- features_extracted %>%
  # select(-med_SNR, -med_cor, -med_missed_scans, -shape_cor, -area_cor) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=ifelse(feat_class=="Good", TRUE, FALSE)) %>%
  slice_sample(n = nrow(.)) %>%
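  # split() on a logical vector yields FALSE first, then TRUE, so
  # FALSE (~80% of rows) is named "train" and TRUE (~20%) is named "test".
  # runif(n) > 0.8 is a base-R stand-in for purrr's deprecated rbernoulli(n, 0.2)
  split(runif(nrow(.)) > 0.8) %>%
  setNames(c("train", "test"))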

full_model <- traintestlist$train %>% 
  select(-feat_id, -blank_found) %>%
  glm(formula=feat_class~., family = binomial)
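# Predict for every feature, including the "Meh" and "Stans only" classes
# that were excluded from training
pred_prob_vec <- features_extracted %>%
  select(-blank_found, -feat_id) %>%
  predict(object=full_model, newdata=., type = "response")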

features_extracted %>%
  # include.lowest keeps a value equal to the lowest break from becoming NA
  mutate(pred_prob=cut(pred_prob_vec, pretty(pred_prob_vec, n = 10), include.lowest = TRUE)) %>%
  ggplot() +
  geom_bar(aes(x=pred_prob, fill=feat_class)) +
  facet_wrap(~feat_class, ncol = 1, scales = "free_y") +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5))

[Figure: bar charts of feature counts per binned predicted probability, one panel per feat_class]

Unsurprisingly, we get a much "flatter" distribution of predicted probabilities among the "Meh" and "Stans only" features. It is a little surprising to see so much separation in the "Stans only" class, but this might just be because some of them are misclassified: sometimes a good peak hides underneath the super-tall standards.
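
One way to put a number on "flatter" is to compute, per class, the share of features whose predicted probability falls in an ambiguous middle band. The same table can be used to pull out "Stans only" features that the model scores as confidently good, as candidates for manual re-inspection. The sketch below is a rough illustration only: it uses simulated data rather than `features_extracted.csv`, and `toy_features`, `metric_a`, `metric_b`, and the 0.1/0.9 cutoffs are all hypothetical choices, not values from this repository.

```r
# Simulated stand-in for features_extracted (NOT real data)
set.seed(123)
n_per_class <- 200
toy_features <- data.frame(
  feat_class = rep(c("Good", "Bad", "Meh", "Stans only"), each = n_per_class),
  # Hypothetical metric: high for Good, low for Bad, in between for the rest
  metric_a = c(rnorm(n_per_class, 1.5), rnorm(n_per_class, -1.5),
               rnorm(n_per_class, 0), rnorm(n_per_class, 1)),
  # Hypothetical uninformative metric
  metric_b = rnorm(4 * n_per_class)
)

# Fit on Good/Bad only, mirroring the model in this issue
toy_train <- toy_features[toy_features$feat_class %in% c("Good", "Bad"), ]
toy_train$feat_class <- toy_train$feat_class == "Good"
toy_model <- glm(feat_class ~ metric_a + metric_b,
                 data = toy_train, family = binomial)

# Predict for every class, including the two the model never saw
toy_features$pred_prob <- predict(toy_model, newdata = toy_features,
                                  type = "response")

# Share of each class with an ambiguous prediction (cutoffs are arbitrary)
is_ambiguous <- toy_features$pred_prob > 0.1 & toy_features$pred_prob < 0.9
ambiguous_share <- tapply(is_ambiguous, toy_features$feat_class, mean)
print(round(ambiguous_share, 2))

# "Stans only" features scored as confidently good, highest first
stans_to_recheck <- toy_features[
  toy_features$feat_class == "Stans only" & toy_features$pred_prob > 0.9, ]
stans_to_recheck <- stans_to_recheck[order(-stans_to_recheck$pred_prob), ]
print(head(stans_to_recheck))
```

On the real data, the same two steps would run on `features_extracted` with `pred_prob_vec` attached as a column; a class with a higher ambiguous share has the flatter distribution.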