xihaoli / STAARpipeline-Tutorial

The tutorial for performing single-/multi-trait association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using FAVORannotator, STAARpipeline and STAARpipelineSummary
GNU General Public License v3.0
26 stars 17 forks source link

Controls / cases counts inverted when using binary model #63

Open GACGAMA opened 5 months ago

GACGAMA commented 5 months ago

Hello!

I've finished analyzing a large cohort using 0/1 numerical values for control/case status respectively. Then I went to my vcf to count the distributions of variants for each group. But what I saw was the contrary of what I expected, I found things that seemed enriched in controls instead of case. Is that the expected order for the enrichment?

I observed that when looking for the percentage comparing controls and cases, the results are both ways - some seemed enriched for controls (most of them) and some for cases, so I'm not sure how to interpret this

Should I repeat by inversing 0/1 as case/control respectively to find things enriched in cases?

One example:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

gene_step | gene | Cases_Sum_nHet | Control_Sum_nHet | Total X more in controls -- | -- | -- | -- | -- MUC4_ODMS_WGS_WES_MUC4_synonymous | MUC4 | 261 | 2829 | 10.83908046 METTL1_ODMS_WGS_METTL1_promoter_CAGE | METTL1 | 4 | 0 | 0

To count, I summed the genotype calls for each group (case and control) and summarized it for each gene-category pair. As you can see, I have a lot more total Heterozygous calls for controls in one example and cases in another

xihaoli commented 5 months ago

Hi @GACGAMA,

Since you are working on variant-set analysis, is your Cases_Sum_nHet defined as the sum of heterozygous calls across all case samples, or is it the total number of heterozygous carriers among case samples?

Overall, this is possible, as the effect sizes of different variants in the variant set could be in various directions- some variants have a protective effect, while others may have a deleterious impact on the outcome. As such, the heterozygous counts may not always be enriched in the case samples.

GACGAMA commented 5 months ago

Hi @xihaoli In this case, the Cases_Sum_nHet is defined as the sum of the heterozygous carriers among case samples for each specific gene-function pair, such as gene-plof for example. Is there any better way to find the direction of enrichment for each category of enriched genes?

kwdoyle commented 3 months ago

To jump in on this, @xihaoli, it seems the Score_Stat metric obtained for the variant sets after running the STAARpipelineSummary_Gene_Centric_[Coding/Noncoding]_Annotation.r scripts relate to the magnitude and direction of the effect for a given variant. In my analysis, variants that have a higher percentage in cases vs controls also have a larger Score_Stat. Am I interpreting this correctly?

GACGAMA commented 3 months ago

@kwdoyle As a follow-up: I agree with you. Score_Stat is related to the direction, but only when looking at individual variants. I counted the variants in each of the binary outcomes and score_stat direction is how many in each group. Individual analysis significant variants are enriching the signal for the genes, but the model also include multiple non-significant variants (the same as including a non-significant variable in a regression model to increase fit, as I understand). We should have a way to visualize those things for the variants within STAAR results, the simplest one is to include the number of samples in each cohort for binary outcomes, but I'm not sure what would be usefull for quantitative outcomes

xihaoli commented 3 months ago

Thank you both!

@kwdoyle: Score_Stat reflect the variant effect direction and is for individual variants (@GACGAMA is correct). Regarding magnitude, you may look at effect size, which is currently provided in the output of Individual_Analysis.R from the STAARpipeline package.

@GACGAMA: Thanks for your patience. I agree that the simplest thing to look at is to include the number of samples (carriers) in each cohort for binary outcomes. Have you figured out a way to do it now?

Best, Xihao

GACGAMA commented 3 months ago

@xihaoli I have figured a way to do that by using R, but I've used the VCF data which is not the best, because the .GDS files are available. Im not sure how to manipulate the GDS files but my scripts should be easily usable in this context if it is adapted I'm available to help if needed

xihaoli commented 3 months ago

Thank you @GACGAMA. I have added you as a collaborator of the STAARpipeline-Tutorial repo. Please feel free to contribute if there is any improvements that you wish to make.