xihaoli / STAARpipeline-Tutorial

The tutorial for performing single-/multi-trait association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using FAVORannotator, STAARpipeline and STAARpipelineSummary
GNU General Public License v3.0
21 stars 17 forks source link

Step 3.2: Gene-centric noncoding analysis #16

Closed daniel-hui closed 1 year ago

daniel-hui commented 1 year ago

Hi Xihao,

Sorry to keep bothering you. I'm trying to run the association analyses now, specifically the rare non-coding analyses. I had a couple questions/issues:

1) As an example I ran the following: Rscript STAARpipeline_Gene_Centric_Noncoding.r 1

which the full output/error is:

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 379
[1] 1
# of selected samples: 0
# of selected variants: 0
# of selected samples: 6,280
# of selected variants: 9,400,898
# of selected samples: 0
# of selected variants: 0
Error in data.frame(id.genotype, index = seq(1, length(id.genotype))) :
  arguments imply differing number of rows: 0, 2
Calls: Gene_Centric_Noncoding -> noncoding -> data.frame
Execution halted

Would you know what the issue is? If it's any help, 6,280 is the number of samples with genetic data, and 9,400,898 is the number of variants on chromosome 1.

2) I saw the script STAARpipeline_Gene_Centric_Noncoding.sh to run STAARpipeline on a cluster. However this seems to be for SLURM, and our cluster uses LSF. Is essentially what is being done in this script and in STAARpipeline_Gene_Centric_Noncoding.r is it runs 50 genes each for 379 jobs? So if I just run STAARpipeline_Gene_Centric_Noncoding.r in a loop from {1..379} this should do the same thing? And likewise for the rest of the analyses in 3.2 for however many "SLURM_ARRAY_TASK_ID"s there are. Thanks.

xihaoli commented 1 year ago

Hi Daniel,

Apology not necessary and happy to help.

For 1), it seems like your id.genotype is a NULL object in R to cause the error (lines 73-79). You may need to check whether the sample id in your genotype data

id.genotype <- seqGetData(genofile,"sample.id")

where genofile is your opened aGDS file and the sample id in your phenotype data

phenotype.id <- as.character(obj_nullmodel$id_include)

are aligned well.

For 2), I am not that familiar with LSF cluster, but essentially you need to run

Rscript STAARpipeline_Gene_Centric_Noncoding.r <int number>

in a loop from {1..379} and the results will include all ~20,000 genes across 22 chromosomes. This is also the case for the rest of the analyses in 3.2.

Hope this helps.

Best, Xihao

daniel-hui commented 1 year ago

Thank you Xihao. I was able to run STAARpipeline_Gene_Centric_Noncoding.r in a loop to 379. However, the output Rdata files are all very small, 213 bytes or less, and the output for result fields seem to be "NULL". They are attached and uploaded here https://drive.google.com/file/d/1TXES3vdgXOvYGLbs-XNmbe92dtiL8jUT/view?usp=share_link if it's helpful -- if the issue isn't immediately clear please don't feel the need to take a close look, I may try just running everything cleanly from the beginning to see if that resolves things. out_1.txt

xihaoli commented 1 year ago

Hi Daniel,

Thanks for sharing and I'll take a look. One thing to confirm- have you checked the sample id in both your genotype and phenotype data and made sure they are aligned? Also, is there an output log file (like the one you pasted above) that is available to share?

Best, Xihao

daniel-hui commented 1 year ago

The sample IDs seem to be matched (there was some small issue before I was able to run STAARpipeline_Gene_Centric_Noncoding.r but it ran afterward). I saved the stderr/stdout when I ran STAARpipeline_Gene_Centric_Noncoding.r to out_1.txt (couldn't upload the Rdata file to GitHub). Thanks.

xihaoli commented 1 year ago

Hi Daniel,

Upon checking the output you shared, it seems like the annotation/filter channel in your aGDS file are all NA's. As such, if you set QC_label <- "annotation/filter" in the STAARpipeline_Gene_Centric_Noncoding.r script, you won't end up with any variant being the "PASS" variant and thus your noncoding analysis results will be empty (NULL). To address this issue, may I ask if all of the variants in your aGDS file have passed QC?

Best, Xihao

daniel-hui commented 1 year ago

Are you saying that the SNPs should have the value "PASS" for the "FILTER" field in their VCFs? All the variants I'm using are post-QC but in the specific VCF files I input these values are all just ".". Should I change the value of QC_label? I tried changing it to "" but it gave this error:

Error in seqGetData(genofile, QC_label) :
  '' is not a standard variable name, and the standard format:
    sample.id, variant.id, position, chromosome, allele, genotype
    annotation/id, annotation/qual, annotation/filter
    annotation/info/VARIABLE_NAME, annotation/format/VARIABLE_NAME
    sample.annotation/VARIABLE_NAME, etc
Calls: Gene_Centric_Noncoding -> noncoding -> seqGetData
Execution halted
xihaoli commented 1 year ago

Hi Daniel,

The post-QC variants should have the value "PASS" in the final GDS/aGDS file for analysis. In your case, given all variants are post-QC, you can create a new channel in your GDS/aGDS file by setting the value of all variants as "PASS". To do this, below is an example script (I will include this note in the STAARpipeline-Tutorial):

library(gdsfmt)
library(SeqArray)

dir_geno <- "/path_to_the_GDS_file/"
gds_file_name_1 <- "freeze.5.chr"
gds_file_name_2 <- ".pass_and_fail.gtonly.minDP0.gds"

for (chr in 1:22){
  print(paste("Chromosome:",chr))
  gds.path <- paste0(dir_geno,gds_file_name_1,chr,gds_file_name_2)
  genofile<-seqOpen(gds.path, readonly = FALSE)
  #genofile

  position <- as.integer(seqGetData(genofile, "position"))
  length(position)
  Anno.folder <- index.gdsn(genofile, "annotation/info")
  add.gdsn(Anno.folder, "QC_label", val=factor(rep("PASS", length(position))), compress="LZMA_ra", closezip=TRUE)
  #genofile

  seqClose(genofile)
}

Then, in the STAARpipeline_Gene_Centric_Noncoding.r script, you can update it to

QC_label <- "annotation/info/QC_label"

Hope this is clear.

Best, Xihao

daniel-hui commented 1 year ago

Ah great, thanks it seems to be working now. Do you have any recommendation on how to output the results to a nice tabular text format? For instance if I read in the Rdata file like this:

get(load("BMI_noncoding_1.Rdata"))

if I enter results_noncoding I can see the results but don't seem to be able to write them to file using write.csv() or write.table().

I can see the different annotation categories like results_noncoding$ but if I try to look at any individual one e.g., like results_noncoding$downstream it outputs NULL. Happy holidays.

xihaoli commented 1 year ago

Hi Daniel,

You're very welcome. After you have finished running the gene-centric noncoding analysis for the whole genome, you may follow the tutorial on Step 2.2: Summarize gene-centric noncoding analysis results, which provides necessary functions to automatically format results and create Manhattan/QQ plots, etc.

Happy holidays!

Best, Xihao