Closed: wvictor14 closed this 7 years ago
Hi @wvictor14
The reason it might be taking so long is that your dataset is very large (methylation data, right?). One way to speed things up is to filter out CpGs with little variance (and keep, say, the top 5000?). Remember, you are interested in a biomarker panel of a few CpGs to predict response.
Since your dataset is small (few samples), don't use very large panels. Panel size is controlled by alpha: alpha = 0 keeps all features to predict response, and alpha = 1 gives the smallest number of features. In the lectures I used alpha = 1 to get the smallest panel. (See the lecture slides for how to extract the CpGs in the panels.)
Play around with those two parameters and let me know how it goes. Looks good so far! Best, A
Hey, thanks for the advice.
If we wanted to do prefiltering, then we would need another layer of cross-validation for this, right? I think you can use 'sbf()' and 'sbfControl()' for this, like how you did in the code for testing the knn model in lecture 19.
Victor
@wvictor14 As Rob said, pre-filtering is done independently of class labels (see snippet from lecture notes):
The filter I did using sbf is called feature selection using univariate filtering, where the class labels are used. I used this for knn, which, unlike glmnet, does not perform feature selection on its own. Because it uses the class labels, it is of course performed within cross-validation. In contrast, you can pre-filter (remove) CpGs without looking at the class labels of the samples (e.g. by variance across all samples), so that step does not need its own layer of cross-validation. Let me know if this makes sense.
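A minimal sketch of the label-free variance pre-filter described above, in base R. The matrix `betas`, its dimensions, the CpG names, and the helper `keep_top_variance()` are all made up for illustration; they are not from the lecture code.

```r
# Hypothetical data: rows are samples, columns are CpGs (names are made up).
set.seed(1)
n_samples <- 44
n_cpgs <- 200
betas <- matrix(runif(n_samples * n_cpgs), nrow = n_samples,
                dimnames = list(NULL, paste0("cpg", seq_len(n_cpgs))))

# Unsupervised filter: rank CpGs by their variance across all samples.
# No class labels are used anywhere in this step.
keep_top_variance <- function(x, n_keep) {
  v <- apply(x, 2, var)
  top <- order(v, decreasing = TRUE)[seq_len(min(n_keep, ncol(x)))]
  x[, top, drop = FALSE]
}

# Keep the 50 most variable CpGs (the thread suggests something like 5000
# for real methylation data).
filtered <- keep_top_variance(betas, n_keep = 50)
dim(filtered)
```

The filtered matrix would then be passed to the model-fitting code in place of the full one.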
Yup the prefiltering stuff makes sense. Thanks!
@singha53
Hey Amit, I ran your code to do the nested CV. See .md file here. It took about 5 hours to run. Would you mind taking a look at it? Scroll to the bottom.
A few questions:
The 'list of lists' that's printed (I called the object 'netTesterror'): each number corresponds to the probability that that particular sample is in class 'Asian', correct?
I'm just wondering why this AUC is so high (99.5 +/- 0.9%) and whether I should be suspicious of it.
I used folds = 5 and repeats = 3.
Also, I would like to make sure the distribution of samples in each fold is proportional to the original dataset (11 Asians vs 33 Caucasians). Do you have advice on how to implement this in the code you provided?
Thanks for the talk yesterday, it was really helpful,
Victor
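On the question of keeping the 11:33 class ratio in every fold, one option is to assign folds separately within each class (stratified folds). Below is a minimal base-R sketch; `ethnicity` and `stratified_folds()` are hypothetical names, and this is not the code Amit provided.

```r
# Hypothetical labels matching the class counts mentioned in the thread.
ethnicity <- factor(c(rep("Asian", 11), rep("Caucasian", 33)))

# Assign each sample to one of k folds, separately within each class,
# so every fold gets roughly the same class proportions.
stratified_folds <- function(y, k) {
  fold <- integer(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    # Fold ids 1..k, recycled to the class size, then shuffled.
    fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold
}

set.seed(1)
folds <- stratified_folds(ethnicity, k = 5)

# Cross-tabulate fold membership against class to check the balance.
table(folds, ethnicity)
```

For repeated cross-validation, the function could be called once per repeat with a different seed. If the provided code builds its folds with caret's `createFolds()` or `createMultiFolds()`, those already sample within the levels of a factor outcome, so it may be worth checking whether the folds are stratified before changing anything.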