Open nivretta opened 7 years ago
Hi @STAT540-UBC/team-badassays
Thank you for submitting the progress report.
A few comments:
I can see that you have separate folders in your repo for the different parts of the analysis. Make sure you keep the repo clean and organized as you add more scripts and files.
Overall, your main progress was in the preprocessing part. I understand this took longer than you expected but you want to make sure that your project is doable within the timeline of the class. The good side is that you are now sure about the quality control part of your project.
Since logistic regression is not fully covered in the class, you might need to spend some time learning about it. Implementing cross-validation (CV) might be tricky as well. So, I suggest that you put as much time as you can into building your classifiers, as soon as possible. I can see great analytical power in your group! Make sure you put in enough time every day! Go ahead!
I guess it was not clear to you how to write the progress report. The questions in the rubric were meant to give you an idea of what you mainly need to include in the progress report. It is okay that you wrote it this way, but we were mainly asking for separate sections on preprocessing, progress on methodology, and some results. We did not mean for you to give specific answers to each of the questions in the rubric.
Thanks for the references.
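For logistic regression, you can use the glm function with family = binomial. Your response is the binary ethnicity label, and the model describes the odds of a sample being Asian, that is, the ratio of the probability of being Asian to the probability of not being Asian (which is Caucasian in your case), on the log scale.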
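As a rough illustration of the glm suggestion, here is a minimal sketch on simulated data. The data frame `toy`, the feature `probe1`, and the 0.5 cutoff are made up for the example and are not taken from the team's data; with a factor response, glm treats the first level as the reference and models the probability of the second level.

```r
# Hypothetical toy data: 40 samples and one numeric feature (not the team's data).
set.seed(1)
n <- 40
ethnicity <- factor(rep(c("Caucasian", "Asian"), each = n / 2),
                    levels = c("Caucasian", "Asian"))  # 2nd level is the modelled outcome
probe1 <- rnorm(n, mean = ifelse(ethnicity == "Asian", 0.6, 0.4), sd = 0.15)
toy <- data.frame(ethnicity, probe1)

# Logistic regression: log-odds of Asian (vs. Caucasian) as a linear function of probe1.
fit <- glm(ethnicity ~ probe1, data = toy, family = binomial)

# Fitted probability of being Asian for each sample, then a class call at an assumed 0.5 cutoff.
p_asian <- predict(fit, type = "response")
pred <- ifelse(p_asian > 0.5, "Asian", "Caucasian")

# Confusion table on the training data (optimistic; see cross-validation for a fairer estimate).
print(table(observed = toy$ethnicity, predicted = pred))
```

With many more features than samples, a plain glm fit like this will not be stable, which is one reason a regularized model is worth considering.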
Please make sure you stay on track so that you keep progressing. Ask your questions, and feel free to ask for a meeting. Rob is at the BC Centre for Disease Control, so you can visit him there.
@rbalshaw your thoughtful comments are very welcome.
Good luck team! :)
It looks like you have made good progress getting your data into R, reviewing it for quality, and conducting normalization, etc.
Your plan, laid out in S.1.2, looks pretty solid. A regularized logistic regression model seems a sensible thing to try. Cross-validation as you describe (and as packages like caret should make fairly straightforward) will help you understand how well the model identifies the Caucasian vs. Asian samples, and will help reduce overfitting.
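To make the cross-validation idea concrete, here is a small sketch of k-fold CV written by hand in base R. It is only an illustration under assumptions: the data are simulated, the features `x1` and `x2` are hypothetical, and it uses an unregularized glm rather than the regularized model in the plan; caret or a similar package would wrap the same loop for you.

```r
# Hypothetical toy data: 60 samples, one informative feature (x1) and one noise feature (x2).
set.seed(2)
n <- 60
y <- factor(rep(c("Caucasian", "Asian"), each = n / 2),
            levels = c("Caucasian", "Asian"))
x1 <- rnorm(n, mean = ifelse(y == "Asian", 1, 0))
x2 <- rnorm(n)
toy <- data.frame(y, x1, x2)

# Randomly assign each sample to one of k folds of equal size.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, predict the held-out fold, record accuracy.
acc <- numeric(k)
for (i in seq_len(k)) {
  train <- toy[fold != i, ]
  held  <- toy[fold == i, ]
  fit   <- glm(y ~ x1 + x2, data = train, family = binomial)
  p     <- predict(fit, newdata = held, type = "response")
  pred  <- ifelse(p > 0.5, "Asian", "Caucasian")
  acc[i] <- mean(pred == held$y)
}

# Cross-validated accuracy estimate: the mean over the k held-out folds.
cv_accuracy <- mean(acc)
print(cv_accuracy)
```

Any feature selection or tuning of the penalty should happen inside the loop, using the training folds only; otherwise the held-out accuracy will look better than it really is.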
Your next plan (step 3) is to do unsupervised analyses of these data (PCA), hoping to see that some of the PCs are associated with self-reported ethnicity in the training data. This is a sensible idea, but I tend to think of this type of analysis as a precursor to the logistic regression (a supervised technique). Not a big deal, though. Plotting these PC values for the test data -- where you cannot confirm the ethnicity -- will be very interesting.
I would suggest that you could also do a PCA using only the features selected by the regularized logistic regression. This plot will almost certainly show some differences between the ethnicities in the training data (you should think about this and make sure it's clear why this is so) -- and if you are lucky, and your hypothesis is valid, you may see similar structures when you plot these PCs for the test data.
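One way this could look in base R is sketched below, on simulated data. Everything here is an assumption made for illustration: the matrices `train_x` and `test_x` (samples in rows), the feature names, and especially `selected`, which stands in for whatever features the regularized model actually keeps. The PCA is fitted on the training samples only, and the test samples are projected onto the same axes with predict().

```r
# Hypothetical data: samples in rows, features in columns (not the team's data).
set.seed(3)
n_train <- 30; n_test <- 10; n_feat <- 5
feat_names <- paste0("feat", seq_len(n_feat))
train_x <- matrix(rnorm(n_train * n_feat), nrow = n_train,
                  dimnames = list(NULL, feat_names))
test_x  <- matrix(rnorm(n_test * n_feat), nrow = n_test,
                  dimnames = list(NULL, feat_names))
train_eth <- factor(rep(c("Caucasian", "Asian"), each = n_train / 2))

# Build in a group difference on the first two features of the training data.
is_asian <- train_eth == "Asian"
train_x[is_asian, 1:2] <- train_x[is_asian, 1:2] + 2

# Placeholder for the features kept by the regularized logistic regression.
selected <- c("feat1", "feat2")

# PCA on the selected features, fitted on the training samples only.
pca <- prcomp(train_x[, selected], center = TRUE, scale. = TRUE)
train_pcs <- pca$x

# Project the test samples using the training centering, scaling, and rotation.
test_pcs <- predict(pca, newdata = test_x[, selected])

# Training samples coloured by self-reported ethnicity; test samples drawn as crosses.
plot(train_pcs[, 1], train_pcs[, 2], col = as.integer(train_eth), pch = 19,
     xlab = "PC1", ylab = "PC2")
points(test_pcs[, 1], test_pcs[, 2], pch = 4)
```

If your data have features in rows, transpose with t() before calling prcomp.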
You have a bit of a hurdle to clear with getting your processed data back into R, but that seems like something we might be able to look at over the phone and with screen sharing (Webex or Skype?).
Please let me know if anyone on the team would like to chat. Best would be to contact me by email: robert.balshaw@bccdc.ca
Hey @STAT540-UBC/team-badassays
Please make sure most of your team members come to the seminar tomorrow :) Rob will be there as well! We can discuss your project together.
@rbalshaw @farnushfarhadi