Conditional analysis - Summary Gene Centric Noncoding not running to completion

xihaoli / STAARpipeline-Tutorial

The tutorial for performing single-/multi-trait association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using FAVORannotator, STAARpipeline and STAARpipelineSummary

GNU General Public License v3.0

26 stars 17 forks source link

Conditional analysis - Summary Gene Centric Noncoding not running to completion #58

Open samreenzafer opened 7 months ago

samreenzafer commented 7 months ago

Thanks for a great easy to use tool. I've finally reached the step where I would like to do a Gene Centric Noncoding conditional analysis on some select rare variants I identified in my data set. As per the tutorial, I am running the STAARpipelineSummary_Gene_Centric_Noncoding.r script, and this time give a nonNULL input to the known_loci variable.

known_loci CHR POS REF ALT 1 2 168944600 C T 2 2 168976689 T G 3 2 168973831 C A 4 2 168976617 T C 5 2 168979945 A G 6 2 168990819 T C 7 2 168993859 G A 8 2 168996703 C T 9 2 168996703 C T 10 2 169013329 G T 11 2 168970065 C T 12 2 168990800 C T 13 2 168995461 C T

The script first generates the combined summary stats as before `

UTR.Rdata upstream.Rdata downstream.Rdata promoter_CAGE.Rdata promoter_DHS.Rdata enhancer_CAGE.Rdata enhancer_DHS.Rdata results_ncRNA_genome.Rdata UTR_sig.csv upstream_sig.csv downstream_sig.csv promoter_CAGE_sig.csv promoter_DHS_sig.csv enhancer_CAGE_sig.csv enhancer_DHS_sig.csv ncRNA_sig.csv noncoding_sig.csv

`

and then went on running for over 24hours before being killed, and never produced any of the*_cond_sig.* files. I'm wondering if this is indeed a very long step, and expected behavior ?

xihaoli commented 7 months ago

Hi @samreenzafer,

Thanks for the question. If the input to known_loci is particularly long (e.g. thousands of variants directly extracted from GWAS Catalog), then you should perform LD pruning to select the subset of independent variants from the full list for conditional analysis.

Have you performed the LD pruning step already?

Best, Xihao

samreenzafer commented 7 months ago

Hi Xihao, I have only these 13 variants in the known_loci . They are rare variants, and I don't want to prune them, since I specifically want to identify if there are other samples in my cohort with other variants that could be contributing to my phenotype, after conditioning on these 13 variants.

Thanks.

samreenzafer commented 7 months ago

I did try to run the STAARpipelineSummary_Known_Loci_Pruning.r script, but it gave me a NULL LDpruned file at the end. Maybe this is what is leading the STAARpipelineSummary_Gene_Centric_Noncoding.r script also to just keep on running? Although I don't understand why no LDpruned variants are being identified .

> 
> Rscript  scripts/STAARpipelineSummary_Known_Loci_Pruning.r 2
>          used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 272200 14.6     665751 35.6   411493 22.0
> Vcells 456006  3.5    8388608 64.0  1820273 13.9
> [1] 2
> [1] "Input variants"
>    CHR       POS REF ALT
> 1    2 168944600   C   T
> 2    2 168976689   T   G
> 3    2 168973831   C   A
> 4    2 168976617   T   C
> 5    2 168979945   A   G
> 6    2 168990819   T   C
> 7    2 168993859   G   A
> 8    2 168996703   C   T
> 9    2 168996703   C   T
> 10   2 169013329   G   T
> 11   2 168970065   C   T
> 12   2 168990800   C   T
> 13   2 168995461   C   T
> # of selected samples: 483
> # of selected variants: 10
> [1] "LD pruned variants"
> NULL
> 
>

I modified the script to print both data before and after calling the LD_pruning function as below.

print("Input variants") print(variants_list) known_loci <- LD_pruning(chr=chr,genofile=genofile,obj_nullmodel=obj_nullmodel,variants_list=variants_list,maf_cutoff=maf_cutoff, method_cond=method_cond,QC_label=QC_label, variant_type=variant_type,geno_missing_imputation=geno_missing_imputation)

known_loci_chr <- rbind(known_loci_chr,known_loci) print("LD pruned variants") print(known_loci_chr)

xihaoli commented 7 months ago

Hi @samreenzafer,

Thanks for sharing more details. There are two relevant but separate things to consider:

LD_pruning: In LD pruning procedure, we have included a parameter maf_cutoff to only keep the variants with a given minimum minor allele frequency. Typically, this parameter is set to be 0.01 (by default, to keep common or low-frequency variants) or 19.5/(2 * n) (which is equivalent to MAC >= 20). As you mentioned, these 13 variants are rare, but if you are using a maf_cutoff of 0.01, then none of the variants will remain as part of the output of LD_pruning. Another reason for having an output of LD_pruning being NULL is the cond_p_thresh parameter (default is 1e-04), where you may want to check the manual as well. As a minor note, I've recently updated the LD_pruning function to the STAARpipeline GitHub repository. Please make sure to install the latest version of the STAARpipeline package (although I don't think this is the cause of the results being NULL).
STAARpipelineSummary_Known_Loci_Pruning.r: I agree that if you only have 13 variants, then in general there is no need to prune them further. But based on the rule of thumb in (1), please check whether any of the 13 variants are extremely rare (i.e. MAC < 20), and in that case, you may not want to adjust for them.

Hope this helps.

Best, Xihao

samreenzafer commented 7 months ago

Thanks for the detailed response. 1) I do see that after pruning, by adjusting the maf_cutoff parameters, I get only 1 selected variant. Relaxing these parameters still give me the same variant.


MAFcutoff=1e-05 cond_p_thresh=1e-04
[1] "LD pruned variants"
  CHR       POS REF ALT
   2 168996703   C   T

MAFcutoff=1e-10 cond_p_thresh=1
[1] "LD pruned variants"
  CHR       POS REF ALT
   2 168996703   C   T

2) I went through the LD_pruning code and looks like the only difference in code are the left_join lines example; Old code that I had downloaded few weeks back. left_join(variants_list_chr,Gene_info,by=c(POS="POS",REF="REF",ALT="ALT") New code from your GitHub page left_join(variants_list_chr,Gene_info,by=c("POS"="POS","REF"="REF","ALT"="ALT") So maybe this is not the issue. And I only need to focus on the variants list. Going back to the variants list, my variants are super rare, with MAC=1 or 2, would you expect STAAR to not reach end of computation ?

xihaoli commented 7 months ago

Hi @samreenzafer,

Thanks for the input.

In this case, other variants in the input list can all be perfectly explained by this variant. You may perform a quick check by looking at the genotypes for all these 13 variants.
I am referring to line 175 of the LD_pruning function, where we additionally included QC_label parameter in Individual_Analysis_cond last week and is not the change you mentioned above.
I am still not sure why you would want to adjust these extremely rare variants (MAC = 1 or 2) in conditional analysis. Please refer to the STAARpipeline paper for the best practices of rare variant analysis.

Best, Xihao