xihaoli / STAARpipeline

An R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline
GNU General Public License v3.0

What are the differences between the individual part of the STAARpipeline and the traditional GWAS using the plink method? #20

Open jingydz opened 2 months ago

jingydz commented 2 months ago

Hello, what are the differences between the individual analysis in STAARpipeline (single-variant analysis, one locus at a time) and traditional GWAS with PLINK?

xihaoli commented 2 months ago

Hi @jingydz,

They are essentially the same. As an all-in-one pipeline for phenotype-genotype association analyses, we have implemented individual (single-variant) analysis as part of STAARpipeline.

Best, Xihao

jingydz commented 2 months ago

Thank you very much.

They are essentially the same, but traditional GWAS with PLINK typically keeps only variants with MAF (minor allele frequency) > 0.01 or MAF > 0.05 for analysis, right? By default, STAARpipeline does not filter by MAF, correct?

My input to STAARpipeline includes both common and rare variants (no MAF filtering). So the significant sites from traditional PLINK GWAS are mostly common variants, while the significant sites from the individual (single-variant) analysis in STAARpipeline include both common and rare variants. Is this understanding correct?

(Only about 12% of the significant sites from my traditional PLINK GWAS are also significant in the STAARpipeline individual analysis, and none of them are recovered by the gene-centric coding, gene-centric noncoding, ncRNA, or 2 kb sliding window analyses. Is this normal?)

What I ultimately want to know is this: having already identified significant common variants with traditional PLINK GWAS, I now want to identify significant rare variants with STAARpipeline. Which of the methods you provide should I trust most?

Using the variant-centric method in STAARpipeline, I obtained 247 significant sites (the QQplot looks reliable). Using the gene-centric coding region method in STAARpipeline, I obtained 2 significant genes (the QQplot looks reliable). Using the gene-centric non-coding region method in STAARpipeline, I obtained 3 significant genes (the QQplot looks reliable). Using the gene-centric ncRNA method in STAARpipeline, I obtained 8 significant genes (the QQplot does not look reliable). Using the 2kb sliding window method in STAARpipeline, I obtained 82 significant windows (the QQplot for sliding_window_qq_skat looks reliable), and I have annotated them, finding that there are 48 sites completely consistent with the 247 significant sites from the univariate analysis.

With so many results, I don't know which to report as the significant rare-variant signals. Can you give me some advice?

Thank you again in advance for your help!

xihaoli commented 2 months ago

Hi @jingydz,

> They are essentially the same, but traditional GWAS with PLINK typically keeps only variants with MAF (minor allele frequency) > 0.01 or MAF > 0.05 for analysis, right? By default, STAARpipeline does not filter by MAF, correct?

In the individual analysis of STAARpipeline, the default filter is a minor allele count (MAC) cutoff, mac_cutoff=20, rather than a MAF cutoff; all variants that pass it are kept. You can apply further filters (for example, by MAF) to the results once the individual analysis is completed.
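As a rough illustration of what a MAC threshold means in MAF terms: for N diploid samples at an autosomal site, a minor allele count of c corresponds to a MAF of c / (2N). The sketch below is not STAARpipeline code; the function name and the sample sizes are made up for illustration.

```python
def mac_to_maf_threshold(mac_cutoff: int, n_samples: int) -> float:
    """Return the MAF implied by a minor allele count cutoff.

    Assumes diploid genotypes at an autosomal site, so there are
    2 * n_samples alleles in total.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return mac_cutoff / (2 * n_samples)


# Hypothetical cohort sizes, only to show how the implied MAF scales.
for n in (1_000, 10_000, 100_000):
    print(n, mac_to_maf_threshold(20, n))
```

Under these assumptions, a cutoff of 20 implies a MAF of 0.01 in a cohort of 1,000 samples but only 0.0001 in a cohort of 100,000, so how many rare variants reach the individual analysis depends strongly on sample size.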

> My input to STAARpipeline includes both common and rare variants (no MAF filtering). So the significant sites from traditional PLINK GWAS are mostly common variants, while the significant sites from the individual (single-variant) analysis in STAARpipeline include both common and rare variants. Is this understanding correct?

This is correct. Again, for the individual analysis in STAARpipeline, you can also keep only the results for variants with MAF > 0.01 or MAF > 0.05.
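A minimal sketch of that post-hoc filtering, assuming the individual analysis results have been exported to rows with a MAF and a p-value per variant. The field names (`variant`, `maf`, `pvalue`) and the example rows are hypothetical, not the actual STAARpipeline output format.

```python
# Hypothetical result rows; real column names depend on your output table.
results = [
    {"variant": "1:1000:A:G", "maf": 0.20, "pvalue": 3e-9},
    {"variant": "1:2000:C:T", "maf": 0.004, "pvalue": 1e-8},
    {"variant": "2:3000:G:A", "maf": 0.03, "pvalue": 0.2},
]


def split_by_maf(rows, maf_cutoff=0.01):
    """Split result rows into (common, rare) lists at the given MAF cutoff."""
    common = [r for r in rows if r["maf"] > maf_cutoff]
    rare = [r for r in rows if r["maf"] <= maf_cutoff]
    return common, rare


common, rare = split_by_maf(results, maf_cutoff=0.01)
print("common:", [r["variant"] for r in common])
print("rare:", [r["variant"] for r in rare])
```

Keeping the two groups separate makes it easier to compare the common-variant results against a PLINK GWAS run with the same MAF cutoff.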

> (Only about 12% of the significant sites from my traditional PLINK GWAS are also significant in the STAARpipeline individual analysis, and none of them are recovered by the gene-centric coding, gene-centric noncoding, ncRNA, or 2 kb sliding window analyses. Is this normal?)

Are you using the same dataset for PLINK GWAS and STAARpipeline? If so, then you are not "replicating" the signal from one study to another.
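When both analyses are run on the same dataset, the comparison is a concordance check rather than a replication. One way to compute it is sketched below, keying each variant on (chromosome, position, ref, alt) so that differences in variant IDs between tools do not hide matches. The function name and the example hits are hypothetical.

```python
def overlap_fraction(reference_hits, other_hits):
    """Fraction of reference_hits that also appear in other_hits."""
    reference = set(reference_hits)
    if not reference:
        return 0.0
    return len(reference & set(other_hits)) / len(reference)


# Hypothetical significant variants, keyed as (chrom, pos, ref, alt).
plink_hits = {
    ("1", 1000, "A", "G"),
    ("1", 2000, "C", "T"),
    ("2", 3000, "G", "A"),
    ("3", 4000, "T", "C"),
}
staar_hits = {
    ("1", 1000, "A", "G"),
    ("5", 9000, "G", "C"),
}

print(overlap_fraction(plink_hits, staar_hits))  # 1 of 4 PLINK hits shared
```

If the concordance is low, it may be worth checking that both runs used the same samples, covariates, and significance threshold before interpreting the difference.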

> What I ultimately want to know is this: having already identified significant common variants with traditional PLINK GWAS, I now want to identify significant rare variants with STAARpipeline. Which of the methods you provide should I trust most?

All the methods/modules provided in STAARpipeline are statistically valid, but the QQ plots depend on a list of practical considerations (whether the trait is continuous or binary, whether you have included all plausible covariates when fitting the null model, the maximum MAF cutoff used to define rare variants, the minimum number of variants or cMAC cutoff for rare variant set results, etc.). I would recommend discussing with @yuxinyuanqt if you have any specific questions about your analysis results.
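To see how such choices change a QQ plot, the plotted points can be recomputed after each change to the filters or the null model. The sketch below computes the (expected, observed) -log10 p-value pairs that a QQ plot displays, using the common (i - 0.5) / n plotting positions. It is a generic illustration and not how STAARpipeline draws its plots; the function name and the example p-values are hypothetical.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs for a QQ plot.

    Expected values use the plotting positions (i - 0.5) / n for the
    i-th smallest p-value, i = 1..n, under a uniform null distribution.
    """
    observed = sorted(p for p in pvalues if 0 < p <= 1)
    n = len(observed)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]


# Hypothetical p-values from one analysis module.
example = [0.5, 0.05, 0.005, 0.8, 0.3]
for expected, obs in qq_points(example):
    print(f"{expected:.3f}\t{obs:.3f}")
```

Points that sit well above the diagonal across the whole range, rather than only in the tail, would suggest that one of the considerations listed above needs another look.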

Hope this helps.

Thanks, Xihao