Error in seqGetData: Invalid dimension 'Start' and 'Length'

wbs-whuer commented 6 months ago

Hi Xihao,

Thanks for designing such a useful tool. I'm trying to run _STAARpipeline_Gene_CentricNoncoding.r in _Step 3.2: Gene-centric noncoding analysis_in my way, and the error below showed up:

i=1
Rscript STAARpipeline_Gene_Centric_Noncoding.r ${i}
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 273677 14.7     660845 35.3   451833 24.2
Vcells 459327  3.6    8388608 64.0  1800859 13.8
[1] 379
[1] 401
Error in seqGetData(genofile, paste0(Annotation_dir, Annotation_name_catalog$dir[which(Annotation_name_catalog$name ==  : 
  Invalid dimension 'Start' and 'Length'.
Calls: Gene_Centric_Noncoding -> noncoding -> seqGetData
Execution halted

And I run the code in noncoding.R line by line and it seems to report an error when I run

GENCODE.Category <- seqGetData(genofile, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")])).

In this code, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")]) refers to "annotation/info/FunctionalAnnotation/genecode_comprehensive_category". And I double check the information in my genofile, and "annotation/info/FunctionalAnnotation/genecode_comprehensive_category" is indeed in genofile.

r$> genofile
Object of class "SeqVarGDSClass"
File: ADNI.808_indiv.minGQ_21.pass.ADNI_ID.chr1.gds (108.7M)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 808 LZMA_ra(29.9%), 2.6K } *
|--+ variant.id   { Int32 2949744 LZMA_ra(3.99%), 459.6K } *
|--+ position   { Int32 2949744 LZMA_ra(34.4%), 3.9M } *
|--+ chromosome   { Str8 2949744 LZMA_ra(0.02%), 1009B } *
|--+ allele   { Str8 2949744 LZMA_ra(11.3%), 1.3M } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 2x808x2949744 LZMA_ra(3.23%), 36.8M } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Int16 0 LZMA_ra, 18B }
|--+ phase   [  ]
|  |--+ data   { Bit1 808x2949744 LZMA_ra(0.01%), 42.5K } *
|  |--+ extra.index   { Int32 3x0 LZMA_ra, 18B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 18B }
|--+ annotation   [  ]
|  |--+ id   { Str8 2949744 LZMA_ra(33.6%), 6.1M } *
|  |--+ qual   { Float32 2949744 LZMA_ra(60.5%), 6.8M } *
|  |--+ filter   { Int32,factor 2949744 LZMA_ra(0.02%), 1.8K } *
|  |--+ info   [  ]
|  |  |--+ AC   { Int32 0 LZMA_ra, 18B } *
|  |  |--+ AF   { Float32 0 LZMA_ra, 18B } *
|  |  |--+ AN   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ BaseQRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ CCC   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ ClippingRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ DB   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ DP   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ DS   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ END   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ FS   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ GQ_MEAN   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ GQ_STDDEV   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ HWP   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ HaplotypeScore   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ InbreedingCoeff   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MLEAC   { Int32 0 LZMA_ra, 18B } *
|  |  |--+ MLEAF   { Float32 0 LZMA_ra, 18B } *
|  |  |--+ MQ   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MQ0   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ MQRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ NCC   { Int32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ NEGATIVE_TRAIN_SITE   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ POSITIVE_TRAIN_SITE   { Bit1 2949744 LZMA_ra(0.05%), 209B } *
|  |  |--+ QD   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ ReadPosRankSum   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ VQSLOD   { Float32 2949744 LZMA_ra(0.02%), 1.8K } *
|  |  |--+ culprit   { Str8 2949744 LZMA_ra(0.02%), 581B } *
|  |  \--+ FunctionalAnnotation   [ spec_tbl_df,tbl_df,tbl,data.frame,list ] *
|  |     |--+ VarInfo   { Str8 2946875 LZMA_ra(14.2%), 6.2M }
|  |     |--+ apc_conservation   { Float64 2946875 LZMA_ra(24.6%), 5.5M }
|  |     |--+ apc_epigenetics   { Float64 2946875 LZMA_ra(24.8%), 5.6M }
|  |     |--+ apc_epigenetics_active   { Float64 2946875 LZMA_ra(24.0%), 5.4M }
|  |     |--+ apc_epigenetics_repressed   { Float64 2946875 LZMA_ra(19.1%), 4.3M }
|  |     |--+ apc_epigenetics_transcription   { Float64 2946875 LZMA_ra(17.4%), 3.9M }
|  |     |--+ apc_local_nucleotide_diversity   { Float64 2946875 LZMA_ra(24.5%), 5.5M }
|  |     |--+ apc_mappability   { Float64 2946875 LZMA_ra(10.3%), 2.3M }
|  |     |--+ apc_protein_function   { Float64 2946875 LZMA_ra(2.23%), 514.1K }
|  |     |--+ apc_transcription_factor   { Float64 2946875 LZMA_ra(4.59%), 1.0M }
|  |     |--+ cage_tc   { Str8 2946875 LZMA_ra(3.13%), 100.6K }
|  |     |--+ metasvm_pred   { Str8 2946875 LZMA_ra(0.60%), 17.2K }
|  |     |--+ rsid   { Str8 2946875 LZMA_ra(17.3%), 764.3K }
|  |     |--+ fathmm_xf   { Float64 2946875 LZMA_ra(18.9%), 4.3M }
|  |     |--+ genecode_comprehensive_category   { Str8 2946875 LZMA_ra(5.23%), 503.0K }
|  |     |--+ genecode_comprehensive_info   { Str8 2946875 LZMA_ra(11.1%), 1.9M }
|  |     |--+ genecode_comprehensive_exonic_category   { Str8 2946875 LZMA_ra(0.86%), 26.3K }
|  |     |--+ genecode_comprehensive_exonic_info   { Str8 2946875 LZMA_ra(5.72%), 249.5K }
|  |     |--+ genehancer   { Str8 2946875 LZMA_ra(2.04%), 649.1K }
|  |     |--+ linsight   { Float64 2946875 LZMA_ra(10.7%), 2.4M }
|  |     |--+ cadd_phred   { Float64 2946875 LZMA_ra(8.62%), 1.9M }
|  |     \--+ rdhs   { Str8 2946875 LZMA_ra(7.30%), 346.1K }
|  \--+ format   [  ]
\--+ sample.annotation   [  ]

As I can run _STAARpipeline_Gene_CentricCoding.r without any error and I find there is a similar code in coding.R(line 64), which filter variants and samples before seqGetData(). And the code GENCODE.Category <- seqGetData(genofile, paste0(Annotation_dir,Annotation_name_catalog$dir[which(Annotation_name_catalog$name=="GENCODE.Category")])) can run successfully after restrict variants. There were 2,949,744 variants before filter in my genofile and I guess is it possible there was too much variants in my genofile which cause this error? Should I filter some variants (such as MAF or other quality control indicators) before run this code?

Thank you very much for your time and help, Bangsheng

xihaoli commented 6 months ago

Hi @wbs-whuer,

Thanks for your question. Based on your question, I have a quick comment for you to check:

In your aGDS file, thevariant.id, position, chromosome, allele, genotype and QC parameters (e.g. annotation/filter) show that there are 2,949,744 variants in the chromosome 21 file. However, all of your functional annotation fields (i.e. under annotation/info/FunctionalAnnotation) show 2,946,875 variants in the chromosome 21 file. As a sanity check before running STAARpipeline, these two numbers and the order of variants should be exactly the same. The current discrepancy between the two numbers (2,949,744 vs 2,946,875) could be the reason for the issue you mentioned above, and I would recommend you double-check on the FAVORannotator step once again.

Hope this helps.

Best, Xihao

wbs-whuer commented 6 months ago

Thanks for your reply! This error disappear after I restrict variants in genofile to those annotated in FAVORannotator. However, in your source code (such as noncoding.R), filtering in gds file (seqSetFilter(genofile, variant.id=variant.id.keep)) is reset many times (seqResetFilter(genofile)). Maybe I can extract variants annotated in FAVORannotator in my original vcf file and then convert it to gds file and annotate it once again. But I wonder is there any efficient way to avoid this error? Such as any method to modify the gds file and save it as a new gds file? Thank you!

xihaoli commented 6 months ago

Hi @wbs-whuer,

Thank you for your reply. Glad to hear that this is relevant to the cause of the issue. My suggestion is to double-check the FAVORannotator step since it is expected the number of variants in variant.id (and genotype etc.) will be the same as the functional annotations (e.g., genecode_comprehensive_category) after running FAVORannotator (even if there are variants in your original vcf file but not annotated in FAVOR database) such that all downstream STAARpipeline steps should work well without any issue.

If possible, could you please provide some examples of variants in your original vcf file but not annotated in FAVORannotator (the difference between 2,949,744 and 2,946,875)?

Best, Xihao

wbs-whuer commented 6 months ago

Thanks for your reply! I upload a file which contains variants that failed in the annotation step. And is seems that those alles are on the opposite strand. chr22_notmatch.txt

xihaoli commented 6 months ago

Hi @wbs-whuer,

Thanks for your updates. These allele matching issues should have been addressed once the GDS files are created (before FAVORannotator step), so you should consider flipping them as necessary in the original file format (e.g. VCF). Also, please double check your original file format and see if there are any specific issues (e.g. certain variants do not belong to a chromosome but are stored in the corresponding VCF files, etc.).

Best, Xihao

xihaoli / STAARpipeline-Tutorial

Error in seqGetData: Invalid dimension 'Start' and 'Length' #44