wvictor14 / team_Methylation-Badassays

STAT 540 Spring 2017 team repository

Nested Cross Validation #6

Open wvictor14 opened 7 years ago

wvictor14 commented 7 years ago

Hi @rbalshaw @farnushfarhadi

Thanks for the discussion today on cross validation. I'm finding the concepts in machine learning difficult to understand and the sheer depth that you guys go into can sometimes be overwhelming! However, your comments are helpful and always appreciated. I think we're learning things at a steady pace and even though the material is challenging, I'm very interested in learning more.

I wanted to summarize your comments and ask a question about choosing the number of folds when your dataset is small (like ours).

If we stop here then what we're left with is a model trained on the entire data. This doesn't give us an idea of how it will generalize (even though we accounted for overfitting with our elastic net penalizations?). So even though we did cross validation already, that was only to optimize lambda and alpha. To get an idea of its generalizability, we need to cross validate the entire model building process (folds within folds). So this is what you were saying @rbalshaw about writing a 'for' loop on that whole bit of code.

I'm wondering now what an appropriate number of folds is for our dataset.

Since we have 45 samples (with 33 Caucasians, 11 Asians), if we took k = 5 for the first cross-validation sampling (for measuring test error), then we get folds of n = 9. Within these folds, if we create more folds with k' = 5, then we would get a sample size of n' ≈ 2 for each fold. This seems to mean that each time a model's accuracy is calculated, it wouldn't be very reliable, because the error is only measured on about two predictions. Would k = 3 and k' = 3 be more appropriate for our dataset?

Thanks, Victor

rbalshaw commented 7 years ago

Hi Victor,

... and the sheer depth that you guys go into can sometimes be overwhelming! However, your comments are helpful and always appreciated…

Sorry if it gets overwhelming. There are so many moving parts to the terrific projects you guys tackle in this class, it’s difficult to know which piece each team needs help with.

I wanted to summarize your comments and ask a question about choosing the number of folds when your dataset is small (like ours).

We want to build our model (choosing the standard model parameters) using our entire training dataset because this will give the most generalizable model. The standard model parameters are determined (learned) from the data (these are the weighted CpG sites, where coefficients = weights).

Sounds good.

  • Hyperparameters are parameters that cannot be directly learned from the data. These are things that need to be varied and tested in order to determine the optimal values. This is the 'alpha' and 'lambda' in the glmnet model, which determine how much L1 vs. L2 penalty to apply in the model

Two things here:

First, I would have called these “tuning parameters”. Hyperparameters often mean something else in the stats literature.

Second, these tuning parameters, alpha and lambda in this case, can very easily be “learned” from the data. This is very often what is done. We fit the models with a range of alpha and lambda values and choose one that optimizes some performance criteria (accuracy, model fit, etc.). The danger is that this induces overfitting.

That’s why the usual recipe calls for some sort of cross-validation of this process. Rather than optimizing the in-sample performance criteria, we choose tuning parameters that optimize the out-of-sample performance criteria.

E.g., assume 10-fold cross-validation in a lasso model.

We create a partitioning of the full data, each “fold” comprising 10% of the data. In the first CV run, we use 9 of the folds (90% of the data) to fit the model for a range of tuning parameter values.

We choose the values of the tuning parameter that make the model fit best when it’s tested against the held-back fold (the other 10%). That gives us one “unbiased” estimate of the “best” value for the tuning parameter.

Then we repeat this 10 times, leaving out a different fold each time.

One common strategy is then to average the 10 estimates of the tuning parameter, and then fit our “final model” using the “full data” with the tuning parameter set equal to that average value.

In this way, we’ve only considered values of the tuning parameter that appear to make the model fit well when it is tested against “fresh” data. This is a pretty reliable way to reduce the overfitting that we’d see if we just picked the tuning parameter value that fit our full data the best.
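For concreteness, here's a minimal R sketch of that recipe for a fixed alpha, using glmnet and caret directly. The names x (a matrix of CpG predictors) and y (a two-level ethnicity factor) are hypothetical placeholders, not objects from your repo:

```r
library(glmnet)
library(caret)

## x: matrix of CpG predictors, y: two-level factor (hypothetical names)
set.seed(540)
k       <- 10
folds   <- createFolds(y, k = k)            # held-out indices, one set per fold
lambdas <- 10^seq(0, -4, length.out = 100)  # shared lambda path
alpha   <- 0.5                              # elastic-net mixing, fixed for this sketch

best_lambda <- numeric(k)
for (i in seq_len(k)) {
  test <- folds[[i]]                        # the held-back 10%
  fit  <- glmnet(x[-test, ], y[-test], family = "binomial",
                 alpha = alpha, lambda = lambdas)
  ## misclassification rate on the held-back fold, one value per lambda
  pred <- predict(fit, newx = x[test, ], type = "class")
  err  <- colMeans(pred != as.character(y[test]))
  best_lambda[i] <- fit$lambda[which.min(err)]
}

## one common strategy: average the 10 per-fold choices and fit the
## "final model" on the full data at that average value
final_fit <- glmnet(x, y, family = "binomial",
                    alpha = alpha, lambda = mean(best_lambda))
```

cv.glmnet does essentially this bookkeeping for you, with the slightly different (but related) convention of averaging the error curves across folds and then picking the lambda that minimizes that average.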

  • These penalization terms are used to reduce complexity, in order to avoid overfitting (regularization)

The lambda term is often thought of in exactly this way. They can have other effects that focus on improved model performance, but you’ve got the right idea - and certainly the one we were focused on when we talked today.

  • So 'tuning the parameters' of our model specifically means determining the values of the hyperparameters to set in our model, and we do this by choosing the values that result in the highest performance.

Careful! How about “… choosing the values that result in the highest performance” in some form of hold-out or test set.

  • This process is done in the bit of code that I showed you today (not nested) -> basically the first level of cross validation.

If we stop here then what we're left with is a model trained on the entire data. This doesn't give us an idea of how it will generalize (even though we accounted for overfitting with our elastic net penalizations?). So even though we did cross validation already, that was only to optimize lambda and alpha.

Correct. The CV you’ve done by calling cv.glmnet (or via the comparable options in caret) will have tried to minimize overfitting due to optimization of the tuning parameters.

To get an idea of its generalizability, we need to cross validate the entire model building process (folds within folds). So this is what you were saying @rbalshaw about writing a 'for' loop on that whole bit of code.

Yep. Please treat that “writing a for loop” as a hand-waving example, not as an instruction. This outer loop should involve another layer of resampling/cross-validation, but this time to get an estimate of the degree of overfitting associated with the fitting of the final model to the full data.

And, recall that we should be using the same “estimation pipeline” for each of the repeats of this outer loop. This is where we estimate the degree of overfitting that our pipeline tends to induce when we use it for datasets that look just like our original full dataset (i.e., the datasets generated in the resampling/cross-validation in the outer loop).
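A rough sketch of what that outer loop could look like in R, again with the hypothetical x and y from the sketch above, and with cv.glmnet standing in for the whole inner tuning step (alpha held fixed for brevity):

```r
library(glmnet)
library(caret)

set.seed(540)
outer_folds <- createFolds(y, k = 5)   # 5 outer test folds, stratified on y

outer_acc <- sapply(outer_folds, function(test) {
  ## inner loop: tune lambda by 5-fold CV using only the outer training data
  cvfit <- cv.glmnet(x[-test, ], y[-test], family = "binomial",
                     alpha = 0.5, nfolds = 5, type.measure = "class")
  ## evaluate the tuned model on the untouched outer test fold
  pred <- predict(cvfit, newx = x[test, ], s = "lambda.min", type = "class")
  mean(pred == as.character(y[test]))
})

outer_acc        # one accuracy estimate per outer fold
mean(outer_acc)  # estimate of how the *pipeline* generalizes
```

The model you actually report is still the one fit to the full data; the outer loop only estimates how well that whole recipe tends to generalize.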

I'm wondering now what an appropriate number of folds is for our dataset.

Since we have 45 samples (with 33 Caucasians, 11 Asians), if we took k = 5 for the first cross-validation sampling (for measuring test error), then we get folds of n = 9. Within these folds, if we create more folds with k' = 5, then we would get a sample size of n' ≈ 2 for each fold. This seems to mean that each time a model's accuracy is calculated, it wouldn't be very reliable, because the error is only measured on about two predictions. Would k = 3 and k' = 3 be more appropriate for our dataset?

Let’s think this through again. And, I’ll start by saying this sometimes gets to be a real pain when you have smaller sample sizes.

But, working with CV at each stage, you might want to use 5-fold validation for the outer loop. Your folds would each have 9 samples. But, that means that each time you start on the inner loop, you’ll be handing it the other 4 folds = 36 samples.

Thus, the inner cross-validation, where you are optimizing the tuning parameters, will be done with fold-sizes of 36/5 ≈ 7. And, each of these model fits will involve 80% of the 36 samples, testing against the held-out fold with either 7 or 8 samples.

Yep. Getting a bit sparse on data - but feasible I think.
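A quick way to sanity-check those fold sizes with caret, using the hypothetical y (length 45) from the sketches above:

```r
library(caret)
set.seed(540)
outer <- createFolds(y, k = 5)
lengths(outer)                               # outer test folds of ~9 samples each

inner <- createFolds(y[-outer[[1]]], k = 5)  # built from the remaining ~36 samples
lengths(inner)                               # inner test folds of ~7-8 samples each
```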

One extra trick, since we’re down into the details here. Most software will permit you to specify that the cross-validation (or whatever resampling method is used) takes into account the relative numbers of “cases” and “controls”. This can be important when you get into some of these schemes; otherwise you’ll end up with some folds that have no cases (or no controls) and the modelling might tip over on you.

Certainly, if you are rolling your own resampling loops, using a resampling strategy that is stratified by the phenotype, you will be avoiding the question of “how good is a model that the software can’t fit because we don’t have enough events in this training set?”
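On that stratification point: caret's createFolds() already stratifies on the levels of a factor outcome, so each fold keeps roughly the same Caucasian-to-Asian balance. Continuing from the outer folds created above, a quick check:

```r
sapply(outer, function(idx) table(y[idx]))   # class counts within each outer fold
```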

Thanks, Victor

Awesome questions! Hope this helps.

Rob

rbalshaw commented 7 years ago

I had to re-write that last comment as the email reply I first tried was not properly formatted. Sorry for the extra noise.

wvictor14 commented 7 years ago

Hey, thanks for the comment, Rob. Always appreciate your very timely and thoughtful responses!