xihaoli / STAARpipeline

An R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline
GNU General Public License v3.0
58 stars 18 forks source link

Does this pipeline support sex chromosomes? #19

Open jingydz opened 1 month ago

jingydz commented 1 month ago

When I am running the process to the step of STAARpipeline/0.2.1Varinfo_gds.R, if I provide the chromosome as a sex chromosome, it will automatically recognize it as chrNA. I would like to know if this procedure supports calculations for sex chromosomes.

xihaoli commented 1 month ago

Hi @jingydz,

Thanks for your question. 0.2.1Varinfo_gds.R is part of FAVORannotator, where some functional annotations that are available for autosomes are not available for sex chromosomes. The FAVORannotator team is currently expanding the collection of the annotation database to include more information for sex chromosomes.

On the other hand, if you have generated the GDS file for sex chromosomes, you may skip the FAVORannotator step and perform association analysis for sex chromosomes.

Best, Xihao

jingydz commented 3 weeks ago

Okay, thank you. I am able to generate the GDS file for the sex chromosomes, but when running Varinfo_gds.R for the sex chromosomes, it does not recognize X and Y, treating them as NA. The same issue occurs in the Annotate.R step. If I still want to include the sex chromosomes, when calculating the PCs, should I use the results from all chromosomes or just the autosomes? Or, can I run the process once for the autosomes and then run it separately just for the sex chromosomes?

yuxinyuanqt commented 3 weeks ago

Hi @jingydz ,

Thanks for your question. STAARpipeline currently does not take special care of sex chromosomes, so you can calculate PCs and GRM using only the autosomes. Due to the complex biological mechanisms of the X chromosome, standard analysis methods for autosomes may not be directly applicable to sex chromosomes (Sun et al. Am J Hum Genet, 2023). Therefore, additional adjustments to the input genotype files may be necessary before conducting association tests for chrX and chrY. We plan to expand STAARpipeline's functionality in future versions to better support the analysis of sex chromosomes.

Hope this helps.

Best, Yuxin

jingydz commented 3 weeks ago

Thank you for your answers. I have now removed the calculations for the sex chromosomes.

I have another question: the QQ plot for the non-coding region that I obtained is very erratic. What could be causing this, and does this mean that most of the results from the non-coding region are not reliable? image

Furthermore, in a gene-centric Manhattan plot, does each dot represent a gene? And in an individual-centric Manhattan plot, does each dot represent a variant locus? Is that the correct way to interpret it?

jingydz commented 3 weeks ago

I am a bit puzzled about your script. https://github.com/xihaoli/STAARpipeline-Tutorial/blob/main/STAARpipelineSummary_Gene_Centric_Noncoding.r#L46 The alpha value is set at 3.57E-07, which means the -log10 of this p-value should be around 6.4472, correct? Yet, the red threshold line in my plot is a number below 6, is that supposed to be the case?

yuxinyuanqt commented 3 weeks ago

Hi @jingydz ,

It seems that the gene-centric noncoding analysis results of ncRNA genes may be inflated. Did you control for covariates such as age, sex, and the top 10 principal components (PCs)? Also, is there any relatedness among your samples? How do the results from the Single Variant Analysis and Gene-Centric Coding Analysis look? Do they appear reasonable?

Yes, you are correct. In a gene-centric Manhattan plot, each dot represents a functional category of a gene. In an Single Variant Analysis Manhattan plot, each dot represents a variant locus. Specifically, the gene-centric noncoding analysis includes eight genetic categories of SNVs, comprising seven noncoding functional categories of protein-coding genes and one noncoding functional category of ncRNA genes.

You need to specify the 'alpha' and 'alpha_ncRNA' parameters in the _Gene_Centric_Noncoding_ResultsSummary function, which represent the p-value thresholds of significant results of protein-coding genes and ncRNA genes, respectively. The default values for both are set to 2.5E-06, which is why the red threshold line in your plot is below 6 (5.60206). The default significant threshold is defined by multiple comparisons using the Bonferroni correction (0.05/20,000). You can use this default significant threshold to filter significant results.

Best, Yuxin

jingydz commented 3 weeks ago

I have only accounted for the first ten principal components and have not controlled for age and gender. Furthermore, genetic relationships between the samples have been ruled out, so there is no correlation. The results of the univariate analysis and the gene-centric coding analysis appear to be good; the univariate results start to drift around an expected -log10(p) value of approximately 3.7, and the gene-centric analysis begins to drift around 0.8, but overall the drift is not significant, and it seems generally reasonable. In the gene-centric non-coding analysis, except for enhancer_DHS, everything else seems quite reasonable.

image image image

The threshold for individual analysis is set at alpha < -5E-08 (https://github.com/xihaoli/STAARpipeline-Tutorial/blob/main/STAARpipelineSummary_Individual_Analysis.r#L44). For gene-centric coding analysis, the threshold is alpha < -5E-07 (https://github.com/xihaoli/STAARpipeline-Tutorial/blob/main/STAARpipelineSummary_Gene_Centric_Coding.r#L46). For gene-centric non-coding analysis, the threshold is alpha < -3.57E-07 (https://github.com/xihaoli/STAARpipeline-Tutorial/blob/main/STAARpipelineSummary_Gene_Centric_Noncoding.r#L46). You mentioned that the default value for alpha_ncRNA is 2.5E-06, which is not customized in the original script and is not specified within the Gene_Centric_Noncoding_Results_Summary function. If I wish to maintain the same level of significance for gene-centric non-coding analysis (referred to as alpha_Noncoding) and alpha_ncRNA, should I define both alpha <- 3.57E-07 and alpha_ncRNA <- 3.57E-07 in the STAARpipelineSummary_Gene_Centric_Noncoding.r script, as well as include them in the function call as Gene_Centric_Noncoding_Results_Summary(..., alpha=alpha, alpha_ncRNA=alpha_ncRNA...), is that correct?

yuxinyuanqt commented 2 weeks ago

Hi @jingydz

It seems that the results from the Gene-Centric Coding Analysis and Gene-Centric Noncoding Analysis show some inflation. I suspect the main issue could be related to fitting the null model, which depends on a list of practical considerations.

What is your total sample size? Is your phenotype continuous or binary? If it is continuous, does it follow a normal distribution? If not, you might need to perform a rank-based inverse normal transformation. If the phenotype is binary, is there an imbalanced case-control setting? If so, you should specify use_SPA=TRUE in the fit_nullmodel function. Additionally, I suggest considering age, sex, and age^2 as covariates, along with the top 10 PCs.

Have you checked the cumulative Minor Allele Count (cMAC) for the significant results in both the Gene-Centric Coding Analysis and Gene-Centric Noncoding Analysis? In the _Gene_Centric_Coding_ResultsSummary (https://rdrr.io/github/xihaoli/STAARpipelineSummary/man/Gene_Centric_Coding_Results_Summary.html) and _Gene_Centric_Noncoding_ResultsSummary (https://rdrr.io/github/xihaoli/STAARpipelineSummary/src/R/Gene_Centric_Noncoding_Results_Summary.R) functions, you can specify the cMAC_cutoff parameter, which sets the minimum number of cumulative minor alleles in the masks when summarizing results (default is 0). For example, you could set cMAC_cutoff to 20. In summary, there are many potential causes for inflation, so I recommend verifying whether the null model is correctly fitted.

In the _Gene_Centric_Coding_ResultsSummary (https://rdrr.io/github/xihaoli/STAARpipelineSummary/man/Gene_Centric_Coding_Results_Summary.html) and _Gene_Centric_Noncoding_ResultsSummary (https://rdrr.io/github/xihaoli/STAARpipelineSummary/src/R/Gene_Centric_Noncoding_Results_Summary.R) functions, you can specify the p-value thresholds for significant results. Specifically, the alpha parameter in _Gene_Centric_Coding_ResultsSummary, and the alpha and alpha_ncRNA parameters in _Gene_Centric_Noncoding_ResultsSummary. In the gene-centric noncoding analysis of protein-coding genes, since there are seven noncoding functional categories, alpha is set to 0.05/(20000*7) = 3.57E-07 in the _STAARpipeline-Tutorial /STAARpipelineSummary_Gene_CentricNoncoding.r. For the gene-centric noncoding analysis of ncRNA genes, as there is only one functional category, alpha_ncRNA is 0.05/20000 = 2.5E-06. For more details, please refer to Li Z, Li X, Zhou H, et al. (Nat Methods, 2022).

When summarizing the results from the Gene-Centric Coding Analysis and Gene-Centric Noncoding Analysis, you can choose to use a more relaxed p-value threshold, such as 0.05/20000 = 2.5E-06, or select an appropriate threshold based on the specific results. This choice depends on your needs and the actual analysis outcomes.

Best, Yuxin