wkumler / MS_metrics


How does the regression model perform on other datasets? #11

Closed: wkumler closed this 1 year ago

wkumler commented 1 year ago

One big question is how well this model generalizes. The expected decrease in accuracy is probably as follows, from best-performing to worst-performing:

  1. MESOSCOPE data (also from the Ingalls lab, run at almost the exact same time, same seawater location)
  2. Ingalls culture data (also from the Ingalls lab, but run on cultures which tend to have a lot more biomass than open ocean, especially relative to the salt content)
  3. Culture data from another lab (also run on HILIC/ESI/Orbitrap setup, but samples were a culture of P. tricornutum - see the MetabolomicsWorkbench link here)
  4. Completely different metabolomic data from another lab - working with Edward Kolodziej?
  5. Lipid data? QToF data? Proteins?
wkumler commented 1 year ago

The question, as always, is how we measure performance. See #8, which is kinda tangentially related, but maybe we can get an easy answer to this and #9 by simply counting the number of peaks above the 90% (or 99%) threshold?
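
A minimal sketch of that count, assuming a hypothetical scored_feats data frame with one row per feature and a pred_prob column from the regression:

scored_feats %>%
  summarize(n_above_90=sum(pred_prob > 0.9, na.rm=TRUE),
            n_above_99=sum(pred_prob > 0.99, na.rm=TRUE))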

wkumler commented 1 year ago

The full model performs incredibly poorly on the MESOSCOPE data. I haven't looked at any peaks yet, but we get an implausibly huge number of "good" peaks. They're also distributed across the entire dataset rather than being concentrated in the regions where good peaks are actually likely:

library(tidyverse)

# Fit the full logistic regression on the labeled Falkor (FT2040) features
model_2040 <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=ifelse(feat_class=="Good", TRUE, FALSE)) %>%
  select(-feat_id, -blank_found) %>%
  # select(feat_class, med_cor, med_SNR) %>%
  glm(formula=feat_class~., family = binomial)

# Predict on the MESOSCOPE (MS3000) features
feats_3000 <- read_csv("made_data_MS3000/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  select(-feat_id, -blank_found) %>%
  mutate(pred_prob=predict(object=model_2040, newdata = ., type = "response"))

# Distribution of predicted probabilities and where they fall in RT/mz space
feats_3000 %>%
  ggplot() +
  geom_histogram(aes(x=pred_prob))
feats_3000 %>%
  ggplot() +
  geom_point(aes(x=mean_rt, y=mean_mz, color=pred_prob)) +
  scale_color_viridis_c()

image

image

The only thing this model seems to be doing is identifying the absolute worst peaks (very beginning and very end), and it's wildly miscalibrated as an actual probability.
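
Once a set has manual labels (Falkor now, or a MESOSCOPE subset eventually), one way to put a number on the miscalibration would be a binned reliability table; a sketch, assuming a hypothetical labeled_feats data frame carrying feat_class labels and the pred_prob column from above:

labeled_feats %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(prob_bin=cut(pred_prob, breaks=seq(0, 1, 0.1), include.lowest=TRUE)) %>%
  group_by(prob_bin) %>%
  summarize(n_feats=n(), frac_good=mean(feat_class=="Good"))
# For a well-calibrated model, frac_good should roughly track the bin midpoints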

wkumler commented 1 year ago

A highly reduced model, however, consisting only of med_cor, med_SNR, and sd_rt (our top 3 metrics) appears to perform enormously better. We have a probability distribution much closer to that of the Falkor training set and I actually believe that these are in fact "good" peaks.
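
For reference, the reduced fit is presumably just the same pipeline as above restricted to those three columns; a sketch, not necessarily the exact call used:

model_reduced <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good") %>%
  select(feat_class, med_cor, med_SNR, sd_rt) %>%
  glm(formula=feat_class~., family = binomial)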

image

image

Of course, we won't know without actually looking at them whether or not it's wildly miscalibrated in the other direction (too conservative, with a likelihood of 0.9 actually corresponding to 99/100 good peaks). But this is maybe encouraging for the custom metrics?

This produces 151 features with likelihood 0.9 or higher, which is totally checkable manually. If I look through them all, how many bad ones should I expect to find? This is something we can simulate, and the code/histogram below indicate that more than ~15 bad peaks would be highly suspicious, while values of 0 or 1 bad peaks are not impossible. Looks like a Poisson distribution?

# Simulate how many bad peaks we'd expect among the 151 features with predicted
# probability above 0.9, assuming the model's probabilities are well calibrated
pred_prob_vec <- read_csv("made_data_MS3000/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  select(-feat_id, -blank_found) %>%
  predict(object=model_2040, newdata = ., type = "response")
hist(replicate(10000, {
  # Each feature comes up "bad" with probability 1 - pred_prob
  sum(runif(151) > pred_prob_vec[pred_prob_vec>0.9])
}), main = NULL, xlab = NULL)

image

wkumler commented 1 year ago

The XCMS parameters alone (mean_mz, sd_ppm, mean_rt, sd_rt, mean_pw, sd_pw, log_mean_height, log_sd_height, sn, f, scale, lmin, feat_npeaks, n_found, samps_found, stans_found, plus feat_class as the response) don't seem to be as wildly off as the full model (or at least the model trained on those has fewer "excellent" peaks that are probably just trash):
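
A sketch of that XCMS-parameters-only fit, assuming the column names listed above (predictions on the MS3000 features are then generated exactly as in the first chunk):

xcms_cols <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw", "sd_pw",
               "log_mean_height", "log_sd_height", "sn", "f", "scale", "lmin",
               "feat_npeaks", "n_found", "samps_found", "stans_found", "feat_class")
model_xcms <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good") %>%
  select(any_of(xcms_cols)) %>%
  glm(formula=feat_class~., family = binomial)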

image

image

wkumler commented 1 year ago

Update: the full model was completely borked because I had hard-coded the number of samples, standards, and total files. This is a problem because the model had learned that the fewer files a feature is found in, the more likely it is to be a real peak. MESOSCOPE has more files, so we get weird negative values that make the model incredibly confident a peak is "real" when in reality it's just extremely negative on a metric that shouldn't be negative at all.

When those three features are removed (n_feats, samps_found, stans_found) the model looks much much better - very close to the simplified/reduced model shown above with only med_cor, med_SNR, and sd_rt:

image

image

wkumler commented 1 year ago

Recalculating these metrics now - see fa2858d for the fix

wkumler commented 1 year ago

New updates from today - we ran into more issues with the t_pval feature. This metric measures the biological difference between treatments and uses a statistical test (a t-test for two groups, an ANOVA for 3+, etc.) to return the probability that the observed pattern is due to chance alone. The problem is that we get wildly lower p-values from MESOSCOPE than from Falkor, because the MESOSCOPE samples likely have a larger biological difference (they include samples from 175 meters) and a larger sample size (126 samples instead of 24). So the metric is trained on the Falkor data set, where the (log) values fall in the -10 to 0 range that the logistic regression is fit to, and then returns enormously lower values (around -50) in the "new" set.
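
For context, t_pval is presumably doing something along these lines for each feature; a hedged sketch with a hypothetical helper (not the actual implementation), assuming a vector of per-sample peak areas and a matching treatment/depth grouping vector:

# Hypothetical sketch: log10 p-value for a between-treatment difference in peak
# area, using a t-test for two groups and an ANOVA for three or more
get_t_pval <- function(areas, treatment) {
  treatment <- factor(treatment)
  if (nlevels(treatment) == 2) {
    pval <- t.test(areas ~ treatment)$p.value
  } else {
    pval <- summary(aov(areas ~ treatment))[[1]][["Pr(>F)"]][1]
  }
  log10(pval)
}

The more samples and the bigger the between-depth difference, the more negative this gets, which is exactly the Falkor-to-MESOSCOPE problem described above.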

image

This means that when we apply a cut/threshold to the data based on this metric it'll likely be wildly miscalibrated because log p-values below -10 in Falkor all had likelihood ~1 of being a real peak, but that's no longer the case in MESOSCOPE. When the same (or a similar) cut is applied during the logistic regression, we get an enormously inflated false positive rate (~200 false positives).

I don't understand how there can be so many low p-values for noise peaks, but I suspect it's the same problem we had before - that the "tails" of a significantly different peak are getting integrated and keeping the biological difference without actually being a good peak. Here's what the most-different "bad" peak looks like:

image

which is, in fact, pretty different between the 175 meter samples (in purple) and the surface/DCM samples (in blue/green). When we zoom out we see that it's not even a tail - just a whole bunch more of whatever this compound is, showing up mostly in the 175 meter samples:

image

wkumler commented 1 year ago

One thing I attempted was a simple rescaling of the (log) p-values, dividing by the largest-magnitude (most negative) value to force them all into the 0-1 range. This seemed to work okay-ish - the curves look much more similar and we get a much better-looking confusion matrix:

image

The problem with this metric is that it's directly tied to the effect size of the biological/chemical signal, so it doesn't generalize well. There are undoubtedly datasets out there with much larger biological effects and many that have no effect at all. In either case the metric can't be properly calibrated: it will dramatically underestimate peak quality in the former and dramatically overestimate it in the latter. In general, this makes sense. Unlike most of the other metrics, where peak quality is independent of the actual experiment, this one requires a good amount of researcher/user knowledge and is based more heavily on the metadata than on the feature information itself.

Nonetheless, this could be a very powerful metric and it's certainly one that gives me an enormous amount of confidence that a thing is a real signal, so I'm reluctant to drop it completely. A couple of ideas come to mind. One is to somehow take the log of the already log-scaled values? This is a case where we do in fact have exponents that are orders of magnitude apart, scaling from 10^0 all the way down to 10^-50. Log-scaling a metric is how we've solved this problem (too much spread / highly skewed data) before, but it feels very cursed here. The other thought would be to use permutation testing instead, because permutation tests have a floor on their lowest possible p-value (set by the number of permutations). With 1,000 permutations, the smallest p-value we can get is about 10^-3. This potentially handicaps the metric by losing information, but might be a good way to ensure the values don't fall wildly outside this range. Perhaps non-parametric statistics could be a "softer" version in the same spirit? The t-test/ANOVA is so ridiculously powerful that it's actually causing us problems here.
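
A sketch of the permutation-test idea, reusing the hypothetical areas/treatment inputs from the sketch above, with the p-value floored at 1/(n_perm + 1):

perm_pval <- function(areas, treatment, n_perm = 1000) {
  treatment <- factor(treatment)
  # Observed between-group F statistic
  obs_F <- summary(aov(areas ~ treatment))[[1]][["F value"]][1]
  # F statistics after shuffling the treatment labels
  perm_F <- replicate(n_perm, {
    summary(aov(areas ~ sample(treatment)))[[1]][["F value"]][1]
  })
  # Smallest achievable p-value is 1/(n_perm + 1), no matter how strong the effect
  (sum(perm_F >= obs_F) + 1) / (n_perm + 1)
}

kruskal.test(areas ~ treatment) would be the drop-in non-parametric alternative for 3+ groups (wilcox.test for two), which is presumably the "softer" option mentioned above.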

N.B. I say it could be a very powerful metric, but in train/test splits even within Falkor/MESOSCOPE this metric inflated the false-positive AND false-negative rates. It's super unclear why this is happening, but the model must be overfitting on this metric somehow:

dataset_version <- "FT2040"
output_folder <- paste0("made_data_", dataset_version, "/")
features_extracted <- read_csv(paste0(output_folder, "features_extracted.csv")) %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn))

# 80/20 train/test split of the labeled Falkor features, with t_pval rescaled
# by its most negative value so it falls in the 0-1 range
set.seed(123)
traintestlist <- features_extracted %>%
  mutate(t_pval=t_pval/min(t_pval)) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=ifelse(feat_class=="Good", TRUE, FALSE)) %>%
  slice_sample(n = nrow(.)) %>%
  split(rbernoulli(nrow(.), 0.2)) %>%
  setNames(c("train", "test"))

# Full model including the scaled t_pval
full_model <- traintestlist$train %>% 
  select(-feat_id, -blank_found) %>%
  glm(formula=feat_class~., family = binomial)
traintestlist$test %>%
  cbind(pred_class=predict(full_model, ., type="response")>0.5) %>%
  with(table(feat_class, pred_class)) %>%
  print()

# Same model with t_pval dropped
full_model <- traintestlist$train %>% 
  select(-feat_id, -blank_found, -t_pval) %>%
  glm(formula=feat_class~., family = binomial)
traintestlist$test %>%
  cbind(pred_class=predict(full_model, ., type="response")>0.5) %>%
  with(table(feat_class, pred_class)) %>%
  print()

Without t_pval:

           pred_class
feat_class  Bad Good
      Bad   317    1
      Good    8   46

With scaled t_pval:

           pred_class
feat_class  Bad Good
      Bad   315    3
      Good    8   46

wkumler commented 1 year ago

Code used to detect that t_pval was the one causing problems:

# FT2040_features / MS3000_features are the features_extracted tables for Falkor
# and MESOSCOPE read in as above. Add one feature at a time (in column order) to
# the Falkor model and count false positives on the labeled MESOSCOPE features.
sapply(2:length(FT2040_features), function(i){
  mod_subset <- FT2040_features %>%
    filter(feat_class%in%c("Good", "Bad")) %>%
    mutate(feat_class=feat_class=="Good") %>%
    select(feat_class, any_of(names(FT2040_features)[2:i]))
  falkor_subset_model <- mod_subset %>%
    glm(formula=feat_class~., family = binomial)
  MS3000_preds <- MS3000_features %>%
    filter(feat_class%in%c("Good", "Bad")) %>%
    mutate(pred_class=predict(object=falkor_subset_model, newdata = ., type = "response")>0.5) %>%
    mutate(pred_class=ifelse(pred_class, "Good", "Bad"))
  feat_added <- names(FT2040_features)[i]
  # Padding with one dummy entry per class keeps both levels in the table
  n_false_pos <- table(c(MS3000_preds$pred_class, "Good", "Bad"), c(MS3000_preds$feat_class, "Bad", "Good"))[2,1]
  data.frame(feat_added, n_false_pos)
}) %>% t()
feat_added n_false_pos
mean_mz 1
sd_ppm 1
mean_rt 1
sd_rt 68
mean_pw 54
sd_pw 56
log_mean_height 64
log_sd_height 62
sn 69
f 59
scale 23
lmin 25
feat_npeaks 30
n_found 29
samps_found 28
stans_found 37
med_SNR 37
med_cor 35
log_med_cor 28
med_missed_scans 29
t_pval 240
smp_to_blk 238
smp_to_std 248
shape_cor 248
area_cor 255
feat_class 255