wvictor14 / team_Methylation-Badassays

STAT 540 Spring 2017 team repository

Project proposal of team_Methylation-Badassays #3

Open annizubc opened 7 years ago

annizubc commented 7 years ago

@rbalshaw @farnushfarhadi
The last commit: 7ce221a00db3b079f614041d2bf61afa89c415df
The link to the proposal: https://github.com/STAT540-UBC/team_Methylation-Badassays/blob/master/project_proposal.md

farnushfarhadi commented 7 years ago

Hi @STAT540-UBC/team-badassays

Thank you for writing up the final proposal. You have made great progress from your initial proposal to this final version. However, there are still many parts that need more clarification, which I believe will resolve as you study and learn more about your project and methodology. Also, I believe today's lecture was beneficial for your team, since it addressed key points in DNA methylation analysis :)

@rbalshaw and I reviewed your proposal. Here is what you need to think more about:

More guidelines from Rob:

We would be happy to meet with the team and discuss further :) Good luck with your interesting project!

wvictor14 commented 7 years ago

Thanks for the comments and helpful suggestions @farnushfarhadi and @rbalshaw !

We have some thoughts on your comments, to which I will respond inline:

  1. Please note that your second dataset has other genetic ancestries besides Asian and Caucasian. This will be challenging. Think about how markers from the first dataset would help you determine the genetic ancestries (Asian, Caucasian, plus other ancestries) in the second dataset. Also, what are you addressing by true positives?

That's a good point: we don't know the ethnicities in our second dataset, so those samples might have genetic ancestries other than those in our first dataset. However, maybe we can adjust our goal from "predict the ethnicities based on DNAm data" to asking the question "Are the samples from this other dataset more epigenetically 'Asian' or 'Caucasian'?". That is, maybe these samples aren't strictly 'Asian' or 'Caucasian', but can we use our identified CpGs to say which samples are more Asian-like and which are more Caucasian-like?

The goal, then, is to build a tool that lets researchers estimate the ethnic heterogeneity in their dataset (not necessarily predict the exact ethnicity).

  1. For the cross-validation part: I would suggest that you revise the project plan to start with the first dataset, finding the sites that differ using that entire dataset. You could then choose one of many methods for finding a multi-feature classifier. Then, you could use cross-validation within this first dataset to investigate the optimism of the performance of your proposed classifier.

I'm not sure exactly what you mean by this. Could you please clarify why adding the second dataset to the first and then identifying the differentially methylated sites would be useful? Considering that the ethnicities are unknown in the second dataset, wouldn't those samples be of little use in building a classifier that predicts ethnicity?


Currently, we are still preprocessing the data. By the project proposal deadline, we plan to accomplish the following:

  1. QC and preprocessing of the raw data, and adjustment for batch effects (sketched below)
  2. Preliminary analysis: clustering / sample-sample correlation heatmaps
  3. A more detailed plan for the analysis
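
For what it's worth, here is a rough sketch of how step 1 might look in R with the minfi and sva packages. The object `rgset` and the `Ethnicity`/`Batch` column names are placeholders; the real names will depend on how we load the data, and functional normalization is just one option.

```r
library(minfi)  # Illumina 450k preprocessing
library(sva)    # ComBat for batch correction

# QC: flag probes with poor detection p-values across samples
detP <- detectionP(rgset)                    # rgset: raw RGChannelSet (placeholder name)
keep <- rowSums(detP < 0.01) == ncol(detP)   # keep probes detected in every sample

# Normalization (functional normalization here) and M-values for downstream modelling
gset <- preprocessFunnorm(rgset)
M    <- getM(gset)
M    <- M[rownames(M) %in% names(which(keep)), ]

# Adjust for batch (e.g. chip) with ComBat, protecting the ethnicity effect
pheno    <- as.data.frame(pData(gset))
mod      <- model.matrix(~ Ethnicity, data = pheno)           # assumes an 'Ethnicity' column
M_combat <- ComBat(dat = M, batch = pheno$Batch, mod = mod)   # assumes a 'Batch' column
```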

In general, I'm unsure what the workflow looks like after this. How should we build the 'classifier'? Here is what I think we could possibly do (a rough sketch of step ii follows below):

  i. Separate into training and test datasets
  ii. Linear modeling - identify CpG sites affected by ethnicity (do we model all covariates and then pull out the effects of ethnicity, or model just ethnicity?)
  iii. Clustering based on the identified CpG sites - see if our samples separate out in the training set
  iv. Clustering based on the identified CpG sites - in the test set
  v. Clustering based on the identified CpG sites - in the external dataset
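
To give an idea of what step ii could look like, a hedged sketch with limma, reusing the placeholder `M_combat` and `pheno` objects from the preprocessing sketch above. The covariates `GestationalAge` and `Sex` are only guesses at what our sample sheet might contain.

```r
library(limma)

# Design matrix: ethnicity plus covariates we may want to adjust for
design <- model.matrix(~ Ethnicity + GestationalAge + Sex, data = pheno)

# Fit one linear model per CpG on the batch-corrected M-values, then moderate the variances
fit <- eBayes(lmFit(M_combat, design))

# Top CpG sites associated with ethnicity
# (the coefficient name depends on the factor's reference level)
topCpGs <- topTable(fit, coef = "EthnicityCaucasian", number = 1000)
head(topCpGs)
```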

Thanks, Victor

rbalshaw commented 7 years ago

Associated with point 1:

I think you're on track here. Your first dataset allows you to assess whether CpG sites can differentiate between self-reported Asian vs. Caucasian. There are several steps here. Then, cross-validation in this dataset would allow you to assess how effectively your CpGs can do that (as measured by sensitivity and specificity, AUC, etc.). This is largely a supervised learning problem (supervised vs. unsupervised in the sense the machine learning folks use those terms) where you are trying to predict which group each sample belongs to.
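
For illustration, once cross-validated predictions are in hand, that kind of performance summary could be computed with the pROC package. Here `obs` and `prob_asian` are placeholders for the held-out labels and the predicted probabilities of being Asian.

```r
library(pROC)

# obs: held-out ethnicity labels; prob_asian: predicted P(ethnicity = "Asian")
roc_obj <- roc(response = obs, predictor = prob_asian,
               levels = c("Caucasian", "Asian"))   # controls first, cases second
auc(roc_obj)

# Sensitivity and specificity at the threshold maximizing Youden's J
coords(roc_obj, x = "best", ret = c("threshold", "sensitivity", "specificity"))
```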

Next, you take the CpGs and the classification rule you have developed in dataset 1 and check whether the CpGs you decided are relevant still appear relevant in this second dataset. This is trickier and will require a bit more imagination -- but I think your goal here could be to demonstrate that dataset 1 taught you which CpGs to look at (and actually proposed a way to combine them into an Asian vs. Caucasian "score"), and that this should permit you to adjust for at least this type of heterogeneity in a dataset where self-reported ethnicity is not collected.

Associated with Point 2:

I didn't intend that point 2 would involve data set 2 at all. You're right - you can't really confirm anything using this dataset, as you have no "known labels".

Rather, as I suggested above, data set 2 could perhaps be used to demonstrate the potential usefulness of what you were able to learn from data set 1.

But, as you've heard, one of the challenges in many of these studies is the risk of over-fitting -- too many possible parameters to estimate and not enough data. Cross-validation is one statistical technique that can be used to control for and assess the degree of over-fitting. Try googling "correcting for optimism in statistical modeling". (Statsgeek had a nice little example...)

Your "future workflow" looks heavily reliant on linear regression and then clustering.

Hope this helps.

MingWan10 commented 7 years ago

@rbalshaw @farnushfarhadi Would you have a look at our draft analysis plan? Maybe we can discuss it tomorrow during seminar time:

One way we can think of to identify CpG sites that are helpful in predicting ethnicity is to use classification methods like logistic regression:

  1. Using only the training dataset, fit a logistic regression with Prob(ethnicity = "Asian") as the response and all CpG sites as predictors, use regularization techniques (LASSO?) to choose which CpG sites should be kept, then cross-validate our model (see the glmnet sketch after this list). One concern I have: if we throw in all 400k predictors at once, would it cause computational problems? Should we first fit a logistic regression model with one CpG site at a time, so we can trim the 400k sites down to, say, the top 1k candidates?

  2. Apply our trained logistic classifier to the test dataset, although we cannot fully "test" our results there without known labels.
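
A minimal sketch of what this might look like with glmnet (LASSO via alpha = 1). Here `x_train`, `y_train`, and `x_test` are placeholders for the training M-value matrix, the ethnicity labels, and the held-out or external samples.

```r
library(glmnet)

# LASSO-penalized logistic regression; cv.glmnet tunes lambda by internal cross-validation
# x_train: samples x CpGs matrix of M-values, y_train: two-level factor of ethnicities
cvfit <- cv.glmnet(x = x_train, y = y_train, family = "binomial", alpha = 1)

# CpG sites retained at the chosen lambda (non-zero coefficients; first row is the intercept)
coefs    <- coef(cvfit, s = "lambda.min")
selected <- rownames(coefs)[as.vector(coefs != 0)]

# Predicted probability (of the second factor level) for held-out or external samples
prob_test <- predict(cvfit, newx = x_test, s = "lambda.min", type = "response")
```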

Or, we could also experiment with unsupervised methods like PCA: merge the training and test datasets, use PCA to visualize which of the PCs separate the ethnicities given the labels we have, and then use the identified PC(s) as classifiers for samples without labels (i.e., samples in the test dataset). Would you consider this to be a rigorous method? If we identify some PCs as classifiers, should we expect one of them to be similar to the logistic regression classifier?
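
A quick sketch of the PCA idea, assuming a merged samples x CpGs matrix `x_all` and a label vector `ethnicity_all` with NAs for the unlabeled samples (names are illustrative; in practice we would probably restrict to the most variable CpGs first).

```r
# PCA on the merged (labeled + unlabeled) M-values; samples in rows, CpGs in columns
pca <- prcomp(x_all, center = TRUE, scale. = FALSE)

# Plot the first two PCs, colouring by known ethnicity (grey = unlabeled samples)
cols <- ifelse(is.na(ethnicity_all), "grey",
               ifelse(ethnicity_all == "Asian", "red", "blue"))
plot(pca$x[, 1], pca$x[, 2], col = cols, pch = 19, xlab = "PC1", ylab = "PC2")
```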

wvictor14 commented 7 years ago

Hi @MingWan10

Would just like to add one thing.

With regards to (2.), Rob mentioned, and I think it is a good idea, that at the beginning we randomly separate our first dataset (with known ethnicity) into 'training' and 'testing' subsets (we would probably need to keep the proportions of Asians vs. Caucasians the same, though). We can use the training subset to build the classifier and then test on the testing subset to get an idea of the classifier's accuracy.
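
A small sketch of that stratified split with caret's createDataPartition, which preserves the class proportions (`x1` and `ethnicity` stand in for dataset 1's M-value matrix and its labels).

```r
library(caret)
set.seed(540)

# 75/25 split that keeps the Asian vs. Caucasian proportions roughly equal in both subsets
train_idx <- createDataPartition(ethnicity, p = 0.75, list = FALSE)

x_train <- x1[train_idx, ];  y_train <- ethnicity[train_idx]
x_test  <- x1[-train_idx, ]; y_test  <- ethnicity[-train_idx]
```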

Victor

rbalshaw commented 7 years ago

Sorry I cannot attend the seminar this afternoon.

Your plans are sounding sensible. @wvictor14 describes what amounts to one "fold" in the cross-validation strategy that @MingWan10 is describing in his step 1.

As for including all 400k predictors in one run of a penalized regression, I'll have to leave that to you. But -- just be careful that if you do some form of "preselection" that uses the ethnicity labels, you must include that step in the cross-validation process.

For example, say someone had done 100 single-predictor regressions and then used only the 15 that had p-values < 0.20 in a stepwise regression analysis. If they only cross-validate the stepwise-regression part of their process, they will vastly overestimate how good their final model really is.

They would need to cross-validate both the single-predictor regression analyses and the subsequent stepwise regression analysis to be sure that their estimates of performance are not overly contaminated by the optimism that you get by "testing on the training set".

(Of course, I'm not recommending one-predictor regression followed by stepwise regression... just the cross-validation approach.)
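
To make that concrete, here is a toy sketch of cross-validating the whole pipeline, with a purely hypothetical univariate preselection step repeated inside each fold rather than done once on the full training set (`x_train`/`y_train` as in the earlier placeholder sketches).

```r
library(caret)
set.seed(540)

folds <- createFolds(y_train, k = 5)   # held-out indices for each fold

fold_acc <- sapply(folds, function(test_idx) {
  x_tr <- x_train[-test_idx, ]; y_tr <- y_train[-test_idx]
  x_te <- x_train[test_idx, ];  y_te <- y_train[test_idx]

  # Preselection done INSIDE the fold, using only that fold's training samples
  pvals <- apply(x_tr, 2, function(cpg) t.test(cpg ~ y_tr)$p.value)
  top   <- order(pvals)[1:1000]

  # Fit the classifier on the preselected CpGs, then evaluate on the held-out fold
  fit  <- glmnet::cv.glmnet(x_tr[, top], y_tr, family = "binomial")
  pred <- predict(fit, newx = x_te[, top], s = "lambda.min", type = "class")
  mean(pred == as.character(y_te))
})

mean(fold_acc)   # honest accuracy estimate for the filter-then-fit pipeline
```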

rbalshaw commented 7 years ago

FYI - I just replied to the GitHub issue. I have lost confidence that everyone will get that update.

Rob


farnushfarhadi commented 7 years ago

Hi team,

I am excited to learn more about your project! You are the first group I will be talking to today!

See ya

farnushfarhadi commented 7 years ago

@rbalshaw THANK YOU very much for your great and helpful comments.

MingWan10 commented 7 years ago

Thanks @rbalshaw for your suggestions! After today's lecture on CV, we also realized that initial screening of predictors isn't really a great idea if we go with logistic regression + regularization, so we will put in all predictors at once. There are other classification models, though; as @farnushfarhadi pointed out during our discussions today, we could also try KNN, SVM, or linear discriminant analysis, etc. Do you have insights into which method we should try out? (+ @singha53)

singha53-zz commented 7 years ago

@MingWan10 Good point re: which method to try out. I will incorporate that into the next lecture, after I teach regularization. I will compare the methods we have learned to date (e.g. KNN, penalized logistic regression, and SVM, using the caret package). Note: you can include the screening of predictors (i.e. feature selection) in building your classifier as long as you also include it in the cross-validation folds. For now, I recommend trying out the code I put in today's lecture on your dataset.
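
In that spirit, a rough sketch of what comparing a few of those classifiers with caret might look like (defaults throughout; `x_train`/`y_train` as in the earlier placeholder sketches, and the factor levels must be valid R names for twoClassSummary to work).

```r
library(caret)
set.seed(540)

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fits <- list(
  knn   = train(x_train, y_train, method = "knn",       trControl = ctrl, metric = "ROC"),
  lasso = train(x_train, y_train, method = "glmnet",    trControl = ctrl, metric = "ROC"),
  svm   = train(x_train, y_train, method = "svmLinear", trControl = ctrl, metric = "ROC")
)

# Compare cross-validated ROC / sensitivity / specificity across methods
summary(resamples(fits))
```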