wkumler / MS_metrics

6 stars 0 forks source link

Principal component regression to resolve correlated variables problem? #15

Closed wkumler closed 1 year ago

wkumler commented 1 year ago

In theory the "correct" thing to do with correlated variables in a linear regression is to perform a PCA and regress on the principal components. It kinda sucks because then your predictor variables are synthetic combinations of the original variables, but it'll probably come up as a way to solve/help with the correlated predictors problems so I'd at least like to know whether or not it improves things enormously (i.e. worth the interpretation difficulty).

wkumler commented 1 year ago

Initial fitting of the PCA to this data was a little confusing already - I'd expected the largest variance in the dataset to be between "good" and "bad" peaks but instead it seemed to be largest between features for which a lot of peaks were found and those for which the fewest peaks were found:

library(tidyverse)
FT2040_features <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  select(-blank_found)
feat_mat <- FT2040_features %>%
  select(-feat_id, -med_missed_scans, -shape_cor, -area_cor, -feat_class, -n_found) %>%
  data.matrix()
pca_model <- prcomp(feat_mat, scale. = TRUE, center = TRUE)
biplot(pca_model)
pca_model$x %>%
  as.data.frame() %>%
  select(PC1, PC2) %>%
  bind_cols(FT2040_features %>% select(feat_class)) %>%
  ggplot() +
  geom_point(aes(x=PC1, y=PC2, color=feat_class))

image

The first two components (and others) seem to be a super complex mix of actual variables without any clear dominance of one or the other.

pca_model$rotation %>%
  as.data.frame() %>%
  rownames_to_column("param") %>%
  pivot_longer(cols=-param) %>%
  mutate(name=factor(name, levels=paste0("PC", 1:23))) %>%
  ggplot() +
  geom_tile(aes(x=param, y=name, fill=value)) +
  scale_fill_viridis_c() +
  theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

image

wkumler commented 1 year ago

Performance on the test train set WITHIN Falkor was very promising:

set.seed(123)
traintestlist <- FT2040_features %>%
  bind_cols(pca_model$x[,1:3]) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good") %>%
  select(feat_class, PC1, PC2, PC3) %>%
  split(runif(nrow(.))<0.2) %>%
  setNames(c("train", "test"))
pcr_model <- traintestlist$train %>%
  glm(formula=feat_class~., family = binomial)
traintestlist$test %>%
  mutate(pred_prob=predict(pcr_model, newdata=., type="response")) %>%
  mutate(pred_class=ifelse(pred_prob>0.5, "Good", "Bad")) %>%
  with(table(pred_class, feat_class)) %>%
  knitr::kable()
Bad Good
pred_bad 302 5
pred_good 5 46
wkumler commented 1 year ago

But performance completely fell apart when the PCR model was applied to the MESOSCOPE data. I had some difficulty with this because I had to scale the MESOSCOPE features in the same way as the Falkor features (the sweeping below) and had to have Bryna help me do the matrix multiplication but I think it's all functional now.

MS3000_features <- read_csv("made_data_MS3000/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  select(-blank_found)
MS3000_mat <- MS3000_features %>%
  select(all_of(rownames(pca_model$rotation))) %>%
  data.matrix() %>%
  sweep(2, STATS = pca_model$center, FUN = `-`) %>%
  sweep(2, STATS = pca_model$sdev, FUN = `/`)
MS3000_preds <- cbind(MS3000_features, MS3000_mat%*%pca_model$rotation[,1:3])
MS3000_preds %>%
  mutate(pred_prob=predict(pcr_model, newdata=., type="response")) %>%
  mutate(pred_class=ifelse(pred_prob>0.5, "Good", "Bad")) %>%
  with(table(pred_class, feat_class))
Bad Good Meh Stans only
pred_bad 798 12 22 16
pred_good 1146 237 100 203
wkumler commented 1 year ago

The reason for this does, in fact, appear to be the scaling problem. After scaling the mean should be zero and the variance about 1, but we're nowhere close to this if we apply the Falkor scaling to the MESOSCOPE dataset. The average m/z value for MESOSCOPE is about 50 Daltons off from Falkor and the average retention time is nearly 100 seconds different, so the estimates from the predictor are just completely wrong.