wkumler / MS_metrics


How does the regression model perform on other datasets? #11

Closed: wkumler closed this 1 year ago

wkumler commented 1 year ago

One big question is how well this model generalizes. The expected decrease in accuracy is probably as follows, from best-performing to worst-performing:

  1. MESOSCOPE data (also from the Ingalls lab, run at almost the exact same time, same seawater location)
  2. Ingalls culture data (also from the Ingalls lab, but run on cultures which tend to have a lot more biomass than open ocean, especially relative to the salt content)
  3. Culture data from another lab (also run on HILIC/ESI/Orbitrap setup, but samples were a culture of P. tricornutum - see the MetabolomicsWorkbench link here)
  4. Completely different metabolomic data from another lab - working with Edward Kolodziej?
  5. Lipid data? QToF data? Proteins?
wkumler commented 1 year ago

The question, as always, is how we measure performance. See #8, which is kinda tangentially related, but maybe we can get an easy answer to this and #9 by simply counting the number of peaks above the 90% (or 99%) threshold?
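
A minimal sketch of that count, assuming a hypothetical scored_feats data frame with one row per feature and a pred_prob column from the regression:

scored_feats %>%
  summarize(n_above_90=sum(pred_prob > 0.9, na.rm=TRUE),
            n_above_99=sum(pred_prob > 0.99, na.rm=TRUE))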

wkumler commented 1 year ago

The full model performs incredibly poorly on the MESOSCOPE data. I haven't looked at any peaks yet, but we get an implausibly huge number of "good" peaks. They're also distributed across the entire dataset rather than being concentrated in the regions where good peaks are actually likely:

library(tidyverse)

# Fit the full logistic regression on the labeled Falkor (FT2040) features
model_2040 <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=ifelse(feat_class=="Good", TRUE, FALSE)) %>%
  select(-feat_id, -blank_found) %>%
  # select(feat_class, med_cor, med_SNR) %>%
  glm(formula=feat_class~., family = binomial)

# Predict on the MESOSCOPE (MS3000) features
feats_3000 <- read_csv("made_data_MS3000/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  select(-feat_id, -blank_found) %>%
  mutate(pred_prob=predict(object=model_2040, newdata = ., type = "response"))

# Distribution of predicted probabilities and where they fall in RT/mz space
feats_3000 %>%
  ggplot() +
  geom_histogram(aes(x=pred_prob))
feats_3000 %>%
  ggplot() +
  geom_point(aes(x=mean_rt, y=mean_mz, color=pred_prob)) +
  scale_color_viridis_c()

image

image

The only thing this model seems to be doing is identifying the absolute worst peaks (very beginning and very end), and it's wildly miscalibrated as an actual probability.
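
Once a set has manual labels (Falkor now, or a MESOSCOPE subset eventually), one way to put a number on the miscalibration would be a binned reliability table; a sketch, assuming a hypothetical labeled_feats data frame carrying feat_class labels and the pred_prob column from above:

labeled_feats %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(prob_bin=cut(pred_prob, breaks=seq(0, 1, 0.1), include.lowest=TRUE)) %>%
  group_by(prob_bin) %>%
  summarize(n_feats=n(), frac_good=mean(feat_class=="Good"))
# For a well-calibrated model, frac_good should roughly track the bin midpoints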

wkumler commented 1 year ago

A highly reduced model, however, consisting only of med_cor, med_SNR, and sd_rt (our top 3 metrics) appears to perform enormously better. We have a probability distribution much closer to that of the Falkor training set and I actually believe that these are in fact "good" peaks.
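
For reference, the reduced fit is presumably just the same pipeline as above restricted to those three columns; a sketch, not necessarily the exact call used:

model_reduced <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good") %>%
  select(feat_class, med_cor, med_SNR, sd_rt) %>%
  glm(formula=feat_class~., family = binomial)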

image

image

Of course, we won't know without actually looking at them whether or not it's wildly miscalibrated in the other direction (too conservative, with a likelihood of 0.9 actually corresponding to 99/100 good peaks). But this is maybe encouraging for the custom metrics?

This produces 151 features with likelihood 0.9 or higher, which is totally checkable manually. If I look through them all, how many bad ones should I expect to find? This is something we can simulate, and the code/histogram below indicate that more than ~15 bad peaks would be highly suspicious, while values of 0 or 1 bad peaks are not impossible. Looks like a Poisson distribution?

# Simulate how many bad peaks we'd expect among the 151 features with predicted
# probability above 0.9, assuming the model's probabilities are well calibrated
pred_prob_vec <- read_csv("made_data_MS3000/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  select(-feat_id, -blank_found) %>%
  predict(object=model_2040, newdata = ., type = "response")
hist(replicate(10000, {
  # Each feature comes up "bad" with probability 1 - pred_prob
  sum(runif(151) > pred_prob_vec[pred_prob_vec>0.9])
}), main = NULL, xlab = NULL)

image

wkumler commented 1 year ago

The XCMS parameters alone (mean_mz, sd_ppm, mean_rt, sd_rt, mean_pw, sd_pw, log_mean_height, log_sd_height, sn, f, scale, lmin, feat_npeaks, n_found, samps_found, stans_found, plus feat_class as the response) don't seem to be as wildly off as the full model (or at least the model trained on those has fewer "excellent" peaks that are probably just trash):
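
A sketch of that XCMS-parameters-only fit, assuming the column names listed above (predictions on the MS3000 features are then generated exactly as in the first chunk):

xcms_cols <- c("mean_mz", "sd_ppm", "mean_rt", "sd_rt", "mean_pw", "sd_pw",
               "log_mean_height", "log_sd_height", "sn", "f", "scale", "lmin",
               "feat_npeaks", "n_found", "samps_found", "stans_found", "feat_class")
model_xcms <- read_csv("made_data_FT2040/features_extracted.csv") %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn)) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=feat_class=="Good") %>%
  select(any_of(xcms_cols)) %>%
  glm(formula=feat_class~., family = binomial)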

image

image

wkumler commented 1 year ago

Update: the full model was completely borked because I had hard-coded the number of samples, standards, and total files. This is a problem because the model had learned that the fewer files a feature is found in, the more likely it is to be a real peak. MESOSCOPE has more files, so we get weird negative values that make the model incredibly confident a peak is "real" when in reality it's just extremely negative on a metric that shouldn't be negative at all.

When those three features are removed (n_feats, samps_found, stans_found) the model looks much much better - very close to the simplified/reduced model shown above with only med_cor, med_SNR, and sd_rt:

image

image

wkumler commented 1 year ago

Recalculating these metrics now - see fa2858d for the fix

wkumler commented 1 year ago

New updates from today - we ran into more issues with the t_pval feature. This metric measures the biological difference between treatments and uses a statistical test (a t-test for two groups, an ANOVA for 3+, etc.) to return the probability that the observed pattern is due to chance alone. The problem is that we get wildly lower p-values from MESOSCOPE than from Falkor, because the MESOSCOPE samples likely have a larger biological difference (they include samples from 175 meters) and a larger sample size (126 samples instead of 24). So the metric is trained on the Falkor data set, where the (log) values fall in the -10 to 0 range that the logistic regression is fit to, and then returns enormously lower values (around -50) in the "new" set.
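
For context, t_pval is presumably doing something along these lines for each feature; a hedged sketch with a hypothetical helper (not the actual implementation), assuming a vector of per-sample peak areas and a matching treatment/depth grouping vector:

# Hypothetical sketch: log10 p-value for a between-treatment difference in peak
# area, using a t-test for two groups and an ANOVA for three or more
get_t_pval <- function(areas, treatment) {
  treatment <- factor(treatment)
  if (nlevels(treatment) == 2) {
    pval <- t.test(areas ~ treatment)$p.value
  } else {
    pval <- summary(aov(areas ~ treatment))[[1]][["Pr(>F)"]][1]
  }
  log10(pval)
}

The more samples and the bigger the between-depth difference, the more negative this gets, which is exactly the Falkor-to-MESOSCOPE problem described above.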

image

This means that when we apply a cut/threshold to the data based on this metric it'll likely be wildly miscalibrated because log p-values below -10 in Falkor all had likelihood ~1 of being a real peak, but that's no longer the case in MESOSCOPE. When the same (or a similar) cut is applied during the logistic regression, we get an enormously inflated false positive rate (~200 false positives).

I don't understand how there can be so many low p-values for noise peaks, but I suspect it's the same problem we had before - that the "tails" of a significantly different peak are getting integrated and keeping the biological difference without actually being a good peak. Here's what the most-different "bad" peak looks like:

image

which is, in fact, pretty different between the 175 meter samples (in purple) and the surface/DCM samples (in blue/green). When we zoom out we see that it's not even a tail - just a whole bunch more of whatever this compound is, showing up mostly in the 175 meter samples:

image

wkumler commented 1 year ago

One thing I attempted was a simple rescaling of the (log) p-values, dividing by the largest-magnitude (most negative) value to force them all into the 0-1 range. This seemed to work okay-ish - the curves look much more similar and we get a much better-looking confusion matrix:

image

The problem with this metric is that it's directly tied to the effect size of the biological/chemical signal, so it doesn't generalize well. There are undoubtedly datasets out there with much larger biological effects and many that have no effect at all. In either case the metric can't be properly calibrated: it will dramatically underestimate peak quality in the former and dramatically overestimate it in the latter. In general, this makes sense. Unlike most of the other metrics, where peak quality is independent of the actual experiment, this one requires a good amount of researcher/user knowledge and is based more heavily on the metadata than on the feature information itself.

Nonetheless, this could be a very powerful metric and it's certainly one that gives me an enormous amount of confidence that a thing is a real signal, so I'm reluctant to drop it completely. A couple of ideas come to mind. One is to somehow take the log of the already log-scaled values? This is a case where we do in fact have exponents that are orders of magnitude apart, scaling from 10^0 all the way down to 10^-50. Log-scaling a metric is how we've solved this problem (too much spread / highly skewed data) before, but it feels very cursed here. The other thought would be to use permutation testing instead, because permutation tests have a floor on their lowest possible p-value (set by the number of permutations). With 1,000 permutations, the smallest p-value we can get is about 10^-3. This potentially handicaps the metric by losing information, but might be a good way to ensure the values don't fall wildly outside this range. Perhaps non-parametric statistics could be a "softer" version in the same spirit? The t-test/ANOVA is so ridiculously powerful that it's actually causing us problems here.
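
A sketch of the permutation-test idea, reusing the hypothetical areas/treatment inputs from the sketch above, with the p-value floored at 1/(n_perm + 1):

perm_pval <- function(areas, treatment, n_perm = 1000) {
  treatment <- factor(treatment)
  # Observed between-group F statistic
  obs_F <- summary(aov(areas ~ treatment))[[1]][["F value"]][1]
  # F statistics after shuffling the treatment labels
  perm_F <- replicate(n_perm, {
    summary(aov(areas ~ sample(treatment)))[[1]][["F value"]][1]
  })
  # Smallest achievable p-value is 1/(n_perm + 1), no matter how strong the effect
  (sum(perm_F >= obs_F) + 1) / (n_perm + 1)
}

kruskal.test(areas ~ treatment) would be the drop-in non-parametric alternative for 3+ groups (wilcox.test for two), which is presumably the "softer" option mentioned above.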

N.B. I say it could be a very powerful metric, but in train/test splits even within Falkor/MESOSCOPE this metric inflated the false-positive AND false-negative rates. It's super unclear why this is happening, but the model must be overfitting on this metric somehow:

dataset_version <- "FT2040"
output_folder <- paste0("made_data_", dataset_version, "/")
features_extracted <- read_csv(paste0(output_folder, "features_extracted.csv")) %>%
  mutate(sn=ifelse(is.infinite(sn), 0, sn))

# 80/20 train/test split of the labeled Falkor features, with t_pval rescaled
# by its most negative value so it falls in the 0-1 range
set.seed(123)
traintestlist <- features_extracted %>%
  mutate(t_pval=t_pval/min(t_pval)) %>%
  filter(feat_class%in%c("Good", "Bad")) %>%
  mutate(feat_class=ifelse(feat_class=="Good", TRUE, FALSE)) %>%
  slice_sample(n = nrow(.)) %>%
  split(rbernoulli(nrow(.), 0.2)) %>%
  setNames(c("train", "test"))

# Full model including the scaled t_pval
full_model <- traintestlist$train %>% 
  select(-feat_id, -blank_found) %>%
  glm(formula=feat_class~., family = binomial)
traintestlist$test %>%
  cbind(pred_class=predict(full_model, ., type="response")>0.5) %>%
  with(table(feat_class, pred_class)) %>%
  print()

# Same model with t_pval dropped
full_model <- traintestlist$train %>% 
  select(-feat_id, -blank_found, -t_pval) %>%
  glm(formula=feat_class~., family = binomial)
traintestlist$test %>%
  cbind(pred_class=predict(full_model, ., type="response")>0.5) %>%
  with(table(feat_class, pred_class)) %>%
  print()

Without t_pval:

           pred_class
feat_class  Bad Good
      Bad   317    1
      Good    8   46

With scaled t_pval:

           pred_class
feat_class  Bad Good
      Bad   315    3
      Good    8   46

wkumler commented 1 year ago

Code used to detect that t_pval was the one causing problems:

# FT2040_features / MS3000_features are the features_extracted tables for Falkor
# and MESOSCOPE read in as above. Add one feature at a time (in column order) to
# the Falkor model and count false positives on the labeled MESOSCOPE features.
sapply(2:length(FT2040_features), function(i){
  mod_subset <- FT2040_features %>%
    filter(feat_class%in%c("Good", "Bad")) %>%
    mutate(feat_class=feat_class=="Good") %>%
    select(feat_class, any_of(names(FT2040_features)[2:i]))
  falkor_subset_model <- mod_subset %>%
    glm(formula=feat_class~., family = binomial)
  MS3000_preds <- MS3000_features %>%
    filter(feat_class%in%c("Good", "Bad")) %>%
    mutate(pred_class=predict(object=falkor_subset_model, newdata = ., type = "response")>0.5) %>%
    mutate(pred_class=ifelse(pred_class, "Good", "Bad"))
  feat_added <- names(FT2040_features)[i]
  # Padding with one dummy entry per class keeps both levels in the table
  n_false_pos <- table(c(MS3000_preds$pred_class, "Good", "Bad"), c(MS3000_preds$feat_class, "Bad", "Good"))[2,1]
  data.frame(feat_added, n_false_pos)
}) %>% t()
feat_added n_false_pos
mean_mz 1
sd_ppm 1
mean_rt 1
sd_rt 68
mean_pw 54
sd_pw 56
log_mean_height 64
log_sd_height 62
sn 69
f 59
scale 23
lmin 25
feat_npeaks 30
n_found 29
samps_found 28
stans_found 37
med_SNR 37
med_cor 35
log_med_cor 28
med_missed_scans 29
t_pval 240
smp_to_blk 238
smp_to_std 248
shape_cor 248
area_cor 255
feat_class 255