Open samreenzafer opened 7 months ago
Hi @samreenzafer,
Thanks for the question. If the input to known_loci
is particularly long (e.g. thousands of variants directly extracted from GWAS Catalog), then you should perform LD pruning to select the subset of independent variants from the full list for conditional analysis.
Have you performed the LD pruning step already?
Best, Xihao
Hi Xihao,
I have only these 13 variants in the known_loci
. They are rare variants, and I don't want to prune them, since I specifically want to identify if there are other samples in my cohort with other variants that could be contributing to my phenotype, after conditioning on these 13 variants.
Thanks.
I did try to run the STAARpipelineSummary_Known_Loci_Pruning.r
script, but it gave me a NULL LDpruned file at the end. Maybe this is what is leading the STAARpipelineSummary_Gene_Centric_Noncoding.r
script also to just keep on running?
Although I don't understand why no LDpruned variants are being identified .
>
> Rscript scripts/STAARpipelineSummary_Known_Loci_Pruning.r 2
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 272200 14.6 665751 35.6 411493 22.0
> Vcells 456006 3.5 8388608 64.0 1820273 13.9
> [1] 2
> [1] "Input variants"
> CHR POS REF ALT
> 1 2 168944600 C T
> 2 2 168976689 T G
> 3 2 168973831 C A
> 4 2 168976617 T C
> 5 2 168979945 A G
> 6 2 168990819 T C
> 7 2 168993859 G A
> 8 2 168996703 C T
> 9 2 168996703 C T
> 10 2 169013329 G T
> 11 2 168970065 C T
> 12 2 168990800 C T
> 13 2 168995461 C T
> # of selected samples: 483
> # of selected variants: 10
> [1] "LD pruned variants"
> NULL
>
>
I modified the script to print both data before and after calling the LD_pruning
function as below.
`
print("Input variants") print(variants_list) known_loci <- LD_pruning(chr=chr,genofile=genofile,obj_nullmodel=obj_nullmodel,variants_list=variants_list,maf_cutoff=maf_cutoff, method_cond=method_cond,QC_label=QC_label, variant_type=variant_type,geno_missing_imputation=geno_missing_imputation)
known_loci_chr <- rbind(known_loci_chr,known_loci) print("LD pruned variants") print(known_loci_chr)
`
Hi @samreenzafer,
Thanks for sharing more details. There are two relevant but separate things to consider:
LD_pruning
: In LD pruning procedure, we have included a parameter maf_cutoff
to only keep the variants with a given minimum minor allele frequency. Typically, this parameter is set to be 0.01 (by default, to keep common or low-frequency variants) or 19.5/(2 * n)
(which is equivalent to MAC >= 20
). As you mentioned, these 13 variants are rare, but if you are using a maf_cutoff
of 0.01, then none of the variants will remain as part of the output of LD_pruning
. Another reason for having an output of LD_pruning
being NULL
is the cond_p_thresh
parameter (default is 1e-04
), where you may want to check the manual as well. As a minor note, I've recently updated the LD_pruning
function to the STAARpipeline GitHub repository. Please make sure to install the latest version of the STAARpipeline package (although I don't think this is the cause of the results being NULL
).
STAARpipelineSummary_Known_Loci_Pruning.r
: I agree that if you only have 13 variants, then in general there is no need to prune them further. But based on the rule of thumb in (1), please check whether any of the 13 variants are extremely rare (i.e. MAC < 20
), and in that case, you may not want to adjust for them.
Hope this helps.
Best, Xihao
Thanks for the detailed response. 1) I do see that after pruning, by adjusting the maf_cutoff parameters, I get only 1 selected variant. Relaxing these parameters still give me the same variant.
MAFcutoff=1e-05 cond_p_thresh=1e-04
[1] "LD pruned variants"
CHR POS REF ALT
2 168996703 C T
MAFcutoff=1e-10 cond_p_thresh=1
[1] "LD pruned variants"
CHR POS REF ALT
2 168996703 C T
2) I went through the LD_pruning code and looks like the only difference in code are the left_join
lines example;
Old code that I had downloaded few weeks back.
left_join(variants_list_chr,Gene_info,by=c(POS="POS",REF="REF",ALT="ALT")
New code from your GitHub page
left_join(variants_list_chr,Gene_info,by=c("POS"="POS","REF"="REF","ALT"="ALT")
So maybe this is not the issue. And I only need to focus on the variants list.
Going back to the variants list, my variants are super rare, with MAC=1 or 2, would you expect STAAR to not reach end of computation ?
Hi @samreenzafer,
Thanks for the input.
In this case, other variants in the input list can all be perfectly explained by this variant. You may perform a quick check by looking at the genotypes for all these 13 variants.
I am referring to line 175 of the LD_pruning
function, where we additionally included QC_label
parameter in Individual_Analysis_cond
last week and is not the change you mentioned above.
I am still not sure why you would want to adjust these extremely rare variants (MAC = 1 or 2) in conditional analysis. Please refer to the STAARpipeline paper for the best practices of rare variant analysis.
Best, Xihao
Thanks for a great easy to use tool. I've finally reached the step where I would like to do a Gene Centric Noncoding conditional analysis on some select rare variants I identified in my data set. As per the tutorial, I am running the
STAARpipelineSummary_Gene_Centric_Noncoding.r
script, and this time give a nonNULL input to theknown_loci
variable.`
`
The script first generates the combined summary stats as before `
and then went on running for over 24hours before being killed, and never produced any of the
*_cond_sig.*
files. I'm wondering if this is indeed a very long step, and expected behavior ?