Switching Data but keeping approach same

wvictor14 commented 7 years ago

@santina @singha53

Hey guys, I talked to my lab at lab meeting today and they suggested we use different data for the STAT540 project than what I initially proposed.

(For reference, here is the initial project info issue in the Discussion repo: https://github.com/STAT540-UBC/Discussion/issues/132)

Instead of looking for differences in DNAm between ethnic groups in placental samples, we can look for differences between placental cell types. The main argument is that there won't be many differences between ethnic groups (on the order of 10s, based on a previous student's preliminary analysis), but between cell types/tissues we should see on the order of 10,000s (cell heterogeneity in DNAm is well known). It'll basically be the same project (detecting differentially methylated CpG sites between two groups), just using different data.

The only caveat is that the sample size is n = 4 for the tissues they suggested (4 samples per tissue, 2 tissues), which has obvious implications for the stats. They argue that it can still be done, though, because the differences will be large enough.
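
To make the n = 4 caveat a bit more concrete, here is a minimal sketch of an exact permutation test on a single CpG site with 4 samples per tissue. This is only an illustration, not necessarily the test the project would use: the beta values are made up, and `permutation_p_value` is a hypothetical helper. With 4 vs 4 samples there are only 70 ways to split the samples, so even perfectly separated groups cannot give a two-sided permutation p-value below 2/70.

```python
from itertools import combinations
from math import comb


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(indices):
        # Absolute difference in means when `indices` form the first group.
        sum_a = sum(pooled[i] for i in indices)
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(range(n_a))
    splits = list(combinations(range(len(pooled)), n_a))
    # Count relabellings at least as extreme as the observed one
    # (small tolerance guards against floating-point ties).
    extreme = sum(1 for s in splits if abs_mean_diff(s) >= observed - 1e-12)
    return extreme / len(splits)


# Made-up methylation beta values for one CpG site (illustrative only).
tissue_a = [0.81, 0.85, 0.79, 0.83]
tissue_b = [0.22, 0.25, 0.19, 0.27]

n_splits = comb(8, 4)  # number of ways to pick 4 of the 8 samples
p = permutation_p_value(tissue_a, tissue_b)
print(n_splits, p)
```

So with this kind of test the per-site p-value has a hard floor no matter how big the difference is, which matters once thousands of sites are tested at the same time. A parametric test does not have the same floor, but it leans on distributional assumptions that are hard to check with 4 samples per group.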

What do you guys think?

Thanks, Victor

ppavlidis commented 7 years ago

@rbalshaw is your designated instructor advisor, and @farnushfarhadi is your designated TA (at least according to the spreadsheet we made). Make sure you engage them in this discussion.

Since I'm here:

First: you should copy your project description to this repo's readme, not refer back to that long thread.

More importantly: I'm down-voting (not quite ready to kill outright) the tissue comparison idea. Your entire proposal now boils down to "identify a set of top-ranked CpG sites that associate with tissue". As you suggest, that's going to be too easy. You barely need to take our course to analyze tissue differences :)

The ethnicity data set seems much richer in analysis possibilities since it's much larger and because the differences are small: you presumably have covariates like age (and what else?), and more complex QC, and will generally have a bigger challenge finding convincing signals. Maybe you can extract age (or whatever) effects "for free" and look at them too. And then you had a second part about using ethnicity to help interpret the placental pathology data set. Sounded pretty good to me.

There's a lot of literature on ethnicity (or at least ancestry) differences in methylation, so you'd have to deal with that (it's a positive: maybe you're looking at a novel tissue, maybe you can compare results, etc.). It sure sounds more interesting than the tissue data set.

I could be wrong - if there's no signal at all, then it will just be frustrating - and if so I'd want you to find yet something else, or greatly beef up the "tissue comparison". Please discuss!

singha53-zz commented 7 years ago

I'll have to second Paul and add that a limited sample size will prevent you from really appreciating the tools applicable for high-dimensional data taught in this course (PCA, clustering, classification, cross-validation, etc). A

rbalshaw commented 7 years ago

Paul points out that if there is no signal at all, then it will just be frustrating... but somehow, your earlier write up made it sound like you think there is signal. Perhaps your sample size may be a little small to have high power - but generating information about the possible size of the signals could be a valuable contribution, even if the signals are small enough that these data cannot confirm them.

In other words, don't be afraid to think of datasets like these in the context of "pilot studies". It costs little to do the analyses (beyond your time and energy!) and you may help future researchers understand the challenges/potentials in this area.

(But don't disregard Paul's thought. There is little sense in repeating work that already established there isn't a signal of interest.)

wvictor14 / team_Methylation-Badassays
