Closed daniel-hui closed 1 year ago
Hi Daniel,
Apology not necessary and happy to help.
For 1), it seems like your id.genotype
is a NULL
object in R
to cause the error (lines 73-79). You may need to check whether the sample id in your genotype data
id.genotype <- seqGetData(genofile,"sample.id")
where genofile
is your opened aGDS file and the sample id in your phenotype data
phenotype.id <- as.character(obj_nullmodel$id_include)
are aligned well.
For 2), I am not that familiar with LSF cluster, but essentially you need to run
Rscript STAARpipeline_Gene_Centric_Noncoding.r <int number>
in a loop from {1..379} and the results will include all ~20,000 genes across 22 chromosomes. This is also the case for the rest of the analyses in 3.2.
Hope this helps.
Best, Xihao
Thank you Xihao. I was able to run STAARpipeline_Gene_Centric_Noncoding.r
in a loop to 379. However, the output Rdata files are all very small, 213 bytes or less, and the output for result fields seem to be "NULL". They are attached and uploaded here https://drive.google.com/file/d/1TXES3vdgXOvYGLbs-XNmbe92dtiL8jUT/view?usp=share_link if it's helpful -- if the issue isn't immediately clear please don't feel the need to take a close look, I may try just running everything cleanly from the beginning to see if that resolves things.
out_1.txt
Hi Daniel,
Thanks for sharing and I'll take a look. One thing to confirm- have you checked the sample id in both your genotype and phenotype data and made sure they are aligned? Also, is there an output log file (like the one you pasted above) that is available to share?
Best, Xihao
The sample IDs seem to be matched (there was some small issue before I was able to run STAARpipeline_Gene_Centric_Noncoding.r
but it ran afterward). I saved the stderr/stdout when I ran STAARpipeline_Gene_Centric_Noncoding.r
to out_1.txt
(couldn't upload the Rdata file to GitHub). Thanks.
Hi Daniel,
Upon checking the output you shared, it seems like the annotation/filter
channel in your aGDS file are all NA
's. As such, if you set QC_label <- "annotation/filter"
in the STAARpipeline_Gene_Centric_Noncoding.r script, you won't end up with any variant being the "PASS"
variant and thus your noncoding analysis results will be empty (NULL
). To address this issue, may I ask if all of the variants in your aGDS file have passed QC?
Best, Xihao
Are you saying that the SNPs should have the value "PASS" for the "FILTER" field in their VCFs? All the variants I'm using are post-QC but in the specific VCF files I input these values are all just ".". Should I change the value of QC_label
? I tried changing it to "" but it gave this error:
Error in seqGetData(genofile, QC_label) :
'' is not a standard variable name, and the standard format:
sample.id, variant.id, position, chromosome, allele, genotype
annotation/id, annotation/qual, annotation/filter
annotation/info/VARIABLE_NAME, annotation/format/VARIABLE_NAME
sample.annotation/VARIABLE_NAME, etc
Calls: Gene_Centric_Noncoding -> noncoding -> seqGetData
Execution halted
Hi Daniel,
The post-QC variants should have the value "PASS" in the final GDS/aGDS file for analysis. In your case, given all variants are post-QC, you can create a new channel in your GDS/aGDS file by setting the value of all variants as "PASS". To do this, below is an example script (I will include this note in the STAARpipeline-Tutorial):
library(gdsfmt)
library(SeqArray)
dir_geno <- "/path_to_the_GDS_file/"
gds_file_name_1 <- "freeze.5.chr"
gds_file_name_2 <- ".pass_and_fail.gtonly.minDP0.gds"
for (chr in 1:22){
print(paste("Chromosome:",chr))
gds.path <- paste0(dir_geno,gds_file_name_1,chr,gds_file_name_2)
genofile<-seqOpen(gds.path, readonly = FALSE)
#genofile
position <- as.integer(seqGetData(genofile, "position"))
length(position)
Anno.folder <- index.gdsn(genofile, "annotation/info")
add.gdsn(Anno.folder, "QC_label", val=factor(rep("PASS", length(position))), compress="LZMA_ra", closezip=TRUE)
#genofile
seqClose(genofile)
}
Then, in the STAARpipeline_Gene_Centric_Noncoding.r script, you can update it to
QC_label <- "annotation/info/QC_label"
Hope this is clear.
Best, Xihao
Ah great, thanks it seems to be working now. Do you have any recommendation on how to output the results to a nice tabular text format? For instance if I read in the Rdata file like this:
get(load("BMI_noncoding_1.Rdata"))
if I enter results_noncoding
I can see the results but don't seem to be able to write them to file using write.csv()
or write.table()
.
I can see the different annotation categories like results_noncoding$
but if I try to look at any individual one e.g., like results_noncoding$downstream
it outputs NULL
. Happy holidays.
Hi Daniel,
You're very welcome. After you have finished running the gene-centric noncoding analysis for the whole genome, you may follow the tutorial on Step 2.2: Summarize gene-centric noncoding analysis results, which provides necessary functions to automatically format results and create Manhattan/QQ plots, etc.
Happy holidays!
Best, Xihao
Hi Xihao,
Sorry to keep bothering you. I'm trying to run the association analyses now, specifically the rare non-coding analyses. I had a couple questions/issues:
1) As an example I ran the following:
Rscript STAARpipeline_Gene_Centric_Noncoding.r 1
which the full output/error is:
Would you know what the issue is? If it's any help, 6,280 is the number of samples with genetic data, and 9,400,898 is the number of variants on chromosome 1.
2) I saw the script
STAARpipeline_Gene_Centric_Noncoding.sh
to run STAARpipeline on a cluster. However this seems to be for SLURM, and our cluster uses LSF. Is essentially what is being done in this script and inSTAARpipeline_Gene_Centric_Noncoding.r
is it runs 50 genes each for 379 jobs? So if I just runSTAARpipeline_Gene_Centric_Noncoding.r
in a loop from {1..379} this should do the same thing? And likewise for the rest of the analyses in 3.2 for however many "SLURM_ARRAY_TASK_ID"s there are. Thanks.