xihaoli / STAARpipeline-Tutorial

The tutorial for performing single-/multi-trait association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using FAVORannotator, STAARpipeline and STAARpipelineSummary
GNU General Public License v3.0

Error in "Step 3: Generate the annotated GDS (aGDS) file." #11

Status: Closed (closed by daniel-hui 1 year ago)

daniel-hui commented 1 year ago

Hi Xihao, we're trying to run STAARpipeline but are running into an issue in "Step 3: Generate the annotated GDS (aGDS) file". Below is the command, tested on chromosome 22 with paths changed, followed by the error:

Rscript gds2agds.R 22

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 1641932      22
Error in add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  The GDS node "apc_protein_function" exists.
Calls: add.gdsn -> add.gdsn
In addition: Warning messages:
1: One or more parsing issues, see `problems()` for details
2: In add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  Missing characters are converted to "".
3: In add.gdsn(ans, nm[i], val[[i]], compress = compress, closezip = closezip,  :
  Missing characters are converted to "".
Execution halted

If I run the same command again, I get a different error, and every later run fails the same way (the runtime also dropped from a couple of minutes to ~5 seconds):

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 262458 14.1     641551 34.3   431252 23.1
Vcells 443021  3.4    8388608 64.0  1754506 13.4
[1] 1641932      22
Error in add.gdsn(Anno.folder, "FunctionalAnnotation", val = FunctionalAnnotation,  :
  The GDS node "FunctionalAnnotation" exists.
Execution halted

Would you know what the problem is? Thanks.

Daniel

xihaoli commented 1 year ago

Hi Daniel,

Thank you for your question. It looks like some commands were run more than once, so your aGDS file already contains the apc_protein_function channel, which caused the error. Could you paste the information on your chromosome 22 GDS file as it was before running "Step 3: Generate the annotated GDS (aGDS) file"?

Best, Xihao
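
A quick way to check for this condition is to ask the GDS file whether a node already exists before adding it. The sketch below is illustrative only: it builds a throwaway GDS file in a temporary directory with the gdsfmt package, and the folder layout annotation/info is an assumption based on the node paths printed in the conversion log, not a confirmed excerpt of the Step 3 script. The helper name node_exists is hypothetical.

```r
# Minimal sketch: test whether a GDS node exists before calling add.gdsn().
# The file is a throwaway stand-in created in tempdir(), not a real aGDS file.
library(gdsfmt)

gds_path <- tempfile(fileext = ".gds")
gfile <- createfn.gds(gds_path)

# Assumed folder layout: annotation/info
anno <- addfolder.gdsn(gfile, "annotation")
info <- addfolder.gdsn(anno, "info")

# Hypothetical helper: index.gdsn(..., silent = TRUE) returns NULL
# instead of raising an error when the node is missing.
node_exists <- function(parent, name) {
  !is.null(index.gdsn(parent, name, silent = TRUE))
}

before <- node_exists(info, "FunctionalAnnotation")

# The first write succeeds.
invisible(add.gdsn(info, "FunctionalAnnotation", val = c(0.1, 0.2, 0.3)))
after <- node_exists(info, "FunctionalAnnotation")

# A second write under the same name fails because the node is already there.
second_try <- tryCatch({
  add.gdsn(info, "FunctionalAnnotation", val = c(0.4, 0.5, 0.6))
  "added"
}, error = function(e) conditionMessage(e))

closefn.gds(gfile)
unlink(gds_path)

print(c(before = before, after = after))
print(second_try)
```

If the check returns TRUE on a file that Step 3 has not yet completed on, the file was probably modified by an earlier, failed run.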

daniel-hui commented 1 year ago

Thanks for getting back to me. I remade the chr22 GDS file and uploaded it here https://drive.google.com/file/d/19YOYDN7A7Fodyrce_IkH2e6uG5ByKQKX/view?usp=share_link (it is different from the chr22 GDS file I had after trying to run Step 3). This is the command and output from remaking the GDS file:

Rscript /project/ritchie07/personal/daniel/tools/STAARpipeline/convertVCF2GDS.R NULL vcf chr22_mac1_GDS 1 /project/ritchie07/personal/daniel/A6K/chr22_mac1.vcf.gz

[1] "NULL"
[2] "vcf"
[3] "chr22_mac1_GDS"
[4] "1"
[5] "/project/ritchie07/personal/daniel/A6K/chr22_mac1.vcf.gz"
[1] "/project/ritchie07/personal/daniel/A6K/chr22_mac1.vcf.gz"
Loading required package: gdsfmt
Running with 28 thread(s).
converting VCF
Tue Dec  6 11:52:49 2022
Variant Call Format (VCF) Import:
    file(s):
        chr22_mac1.vcf.gz (442.8M)
    file format: VCFv4.2
    the number of sets of chromosomes (ploidy): 2
    the number of samples: 6,280
    genotype storage: bit2
    compression method: LZMA_RA
    # of samples: 6280
    calculating the total number of variants ...
    the total number of variants for import: 1,641,932
    Writing to 28 files:
        chr22_mac1_GDS_tmp01_79b76e0809c [1..58,640]
        chr22_mac1_GDS_tmp02_79b743259777 [58,641..117,282]
        chr22_mac1_GDS_tmp03_79b739469439 [117,283..175,922]
        chr22_mac1_GDS_tmp04_79b773c1248 [175,923..234,564]
        chr22_mac1_GDS_tmp05_79b760d4a4e4 [234,565..293,204]
        chr22_mac1_GDS_tmp06_79b718ae12bc [293,205..351,846]
        chr22_mac1_GDS_tmp07_79b7250da3bb [351,847..410,486]
        chr22_mac1_GDS_tmp08_79b7393bbfa8 [410,487..469,126]
        chr22_mac1_GDS_tmp09_79b76daf5ccb [469,127..527,768]
        chr22_mac1_GDS_tmp10_79b7320d43c1 [527,769..586,408]
        chr22_mac1_GDS_tmp11_79b732eec028 [586,409..645,050]
        chr22_mac1_GDS_tmp12_79b72971da6b [645,051..703,690]
        chr22_mac1_GDS_tmp13_79b762beae63 [703,691..762,332]
        chr22_mac1_GDS_tmp14_79b76832b9ca [762,333..820,972]
        chr22_mac1_GDS_tmp15_79b712c800a5 [820,973..879,612]
        chr22_mac1_GDS_tmp16_79b7670dd1a3 [879,613..938,254]
        chr22_mac1_GDS_tmp17_79b713ae2e68 [938,255..996,894]
        chr22_mac1_GDS_tmp18_79b74ffdc65c [996,895..1,055,536]
        chr22_mac1_GDS_tmp19_79b749b31d96 [1,055,537..1,114,176]
        chr22_mac1_GDS_tmp20_79b73144c505 [1,114,177..1,172,818]
        chr22_mac1_GDS_tmp21_79b743cb1ae1 [1,172,819..1,231,458]
        chr22_mac1_GDS_tmp22_79b72aa7f3ff [1,231,459..1,290,098]
        chr22_mac1_GDS_tmp23_79b77a12170c [1,290,099..1,348,740]
        chr22_mac1_GDS_tmp24_79b751c949b4 [1,348,741..1,407,380]
        chr22_mac1_GDS_tmp25_79b72d0e378c [1,407,381..1,466,022]
        chr22_mac1_GDS_tmp26_79b79b35265 [1,466,023..1,524,662]
        chr22_mac1_GDS_tmp27_79b717fa32a2 [1,524,663..1,583,304]
        chr22_mac1_GDS_tmp28_79b77536717a [1,583,305..1,641,932]
    Done (Tue Dec  6 11:55:49 2022).
Output:
    chr22_mac1_GDS.gds
Merging:
    opening 'chr22_mac1_GDS_tmp01_79b76e0809c' ... [done]
    opening 'chr22_mac1_GDS_tmp02_79b743259777' ... [done]
    opening 'chr22_mac1_GDS_tmp03_79b739469439' ... [done]
    opening 'chr22_mac1_GDS_tmp04_79b773c1248' ... [done]
    opening 'chr22_mac1_GDS_tmp05_79b760d4a4e4' ... [done]
    opening 'chr22_mac1_GDS_tmp06_79b718ae12bc' ... [done]
    opening 'chr22_mac1_GDS_tmp07_79b7250da3bb' ... [done]
    opening 'chr22_mac1_GDS_tmp08_79b7393bbfa8' ... [done]
    opening 'chr22_mac1_GDS_tmp09_79b76daf5ccb' ... [done]
    opening 'chr22_mac1_GDS_tmp10_79b7320d43c1' ... [done]
    opening 'chr22_mac1_GDS_tmp11_79b732eec028' ... [done]
    opening 'chr22_mac1_GDS_tmp12_79b72971da6b' ... [done]
    opening 'chr22_mac1_GDS_tmp13_79b762beae63' ... [done]
    opening 'chr22_mac1_GDS_tmp14_79b76832b9ca' ... [done]
    opening 'chr22_mac1_GDS_tmp15_79b712c800a5' ... [done]
    opening 'chr22_mac1_GDS_tmp16_79b7670dd1a3' ... [done]
    opening 'chr22_mac1_GDS_tmp17_79b713ae2e68' ... [done]
    opening 'chr22_mac1_GDS_tmp18_79b74ffdc65c' ... [done]
    opening 'chr22_mac1_GDS_tmp19_79b749b31d96' ... [done]
    opening 'chr22_mac1_GDS_tmp20_79b73144c505' ... [done]
    opening 'chr22_mac1_GDS_tmp21_79b743cb1ae1' ... [done]
    opening 'chr22_mac1_GDS_tmp22_79b72aa7f3ff' ... [done]
    opening 'chr22_mac1_GDS_tmp23_79b77a12170c' ... [done]
    opening 'chr22_mac1_GDS_tmp24_79b751c949b4' ... [done]
    opening 'chr22_mac1_GDS_tmp25_79b72d0e378c' ... [done]
    opening 'chr22_mac1_GDS_tmp26_79b79b35265' ... [done]
    opening 'chr22_mac1_GDS_tmp27_79b717fa32a2' ... [done]
    opening 'chr22_mac1_GDS_tmp28_79b77536717a' ... [done]
Digests:
    sample.id  [md5: a761962496b6b317bf251960be9c76b7]
    variant.id  [md5: 819a750296c70995fba8b9748ceec990]
    position  [md5: 950041008e64c71f6f9187d2c86da0e0]
    chromosome  [md5: b78a494dc5be8a12482aaacfa00b65c0]
    allele  [md5: 495a3512d3c6c197209ad91c86564c2e]
    genotype  [md5: 507c9f68d3039161f84c086de22588c3]
    phase  [md5: 13706a839e623a3b95e55afef017faec]
    annotation/id  [md5: 47b0eafc0f027da5320cfdc0a7efd78d]
    annotation/qual  [md5: 9d8f45b58e47bd77724a8b8cfde5a0a6]
    annotation/filter  [md5: 518197a19b03713e21a5fc174926226d]
    annotation/info/PR  [md5: b63f542998b4e725f47060b84b2cb3e8]
Done.
Tue Dec  6 11:56:56 2022
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'chr22_mac1_GDS.gds' (114.5M)
    # of fragments: 269
    save to 'chr22_mac1_GDS.gds.tmp'
    rename 'chr22_mac1_GDS.gds.tmp' (114.5M, reduced: 2.5K)
    # of fragments: 56
Tue Dec  6 11:56:58 2022
File: /project/ritchie07/personal/daniel/A6K/STAARpipeline/chr22_mac1_GDS.gds
Format Version: v1.0
Reference: unknown
Ploidy: 2
Number of samples: 6,280
Number of variants: 1,641,932
Chromosomes:
    Chr22: 1641932
Contigs:
    22, 50808250
Alleles:
    ALT: <None>
    tabulation: 2, 1641932(100.0%)
Annotation, Quality:
    Min: NA, 1st Qu: NA, Median: NA, Mean: NaN, 3rd Qu: NA, Max: NA, NA's: 1641932
Annotation, FILTER:
    <None>
Annotation, INFO variable(s):
    PR, 0, Flag, Provisional reference allele, may not be based on real reference genome
Annotation, FORMAT variable(s):
    GT, 1, String, Genotype
Annotation, sample variable(s):
    <None>

xihaoli commented 1 year ago

Hi Daniel,

Thanks for including the output log from generating the GDS files. These GDS files should be the Step 1 input of the FAVORannotator program. Now that you have run Step 1 and Step 2 of FAVORannotator successfully, could you please make a copy of these GDS files and rerun Step 3 of FAVORannotator on the copy?

Please let us know if you encounter this same issue (i.e., The GDS node "apc_protein_function" exists) again.

Best, Xihao
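
Working on a copy matters because a failed Step 3 run can leave nodes behind in the file it was writing to. A minimal sketch of the copy step in base R follows; the small temporary file only stands in for chr22_mac1_GDS.gds, and the name of the working copy is hypothetical.

```r
# Minimal sketch: run Step 3 on a copy so the original GDS file stays untouched.
# A small temporary file stands in for chr22_mac1_GDS.gds.
work_dir <- tempfile("gds_demo_")
dir.create(work_dir)
original <- file.path(work_dir, "chr22_mac1_GDS.gds")
writeBin(as.raw(1:16), original)  # placeholder bytes, not a real GDS file

# Hypothetical name for the working copy that Step 3 would modify.
working_copy <- file.path(work_dir, "chr22_mac1_GDS_step3.gds")
copied <- file.copy(original, working_copy, overwrite = FALSE)

# Confirm the copy is byte-identical before annotating it.
same_bytes <- unname(tools::md5sum(original) == tools::md5sum(working_copy))
print(c(copied = copied, same_bytes = same_bytes))
```

With overwrite = FALSE, file.copy refuses to replace an existing working copy, so a leftover file from a previous attempt is noticed rather than silently reused.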

daniel-hui commented 1 year ago

I just tried re-running Step 3 using the new chr22 GDS file but unfortunately hit the same error: The GDS node "apc_protein_function" exists.

xihaoli commented 1 year ago

Hi Daniel,

Thanks for letting me know. In this case, could you please paste the output of head(FunctionalAnnotation), dim(FunctionalAnnotation), and colnames(FunctionalAnnotation) when running through this line of the Step 3 script?

Best, Xihao

daniel-hui commented 1 year ago

Thanks again for the help -- below are the commands and their outputs:

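[Three screenshots, not preserved: the output of head(FunctionalAnnotation), dim(FunctionalAnnotation), and colnames(FunctionalAnnotation)]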

xihaoli commented 1 year ago

Hi Daniel,

This is very helpful. You seem to be using the FAVOR Full Database to annotate the GDS file. Step 2 of FAVORannotator should instead use the FAVOR Essential Database.

Hope this helps, and please let me know how it goes. Thank you.

Best, Xihao
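
The first error in this thread was raised inside the per-column call add.gdsn(ans, nm[i], val[[i]], ...), which suggests that one column name in the annotation table reached the GDS file twice. The thread does not preserve the actual column list, so that reading is an assumption. The sketch below shows a generic base R check for repeated column names; the data frame is fabricated for illustration and is not the FAVOR schema.

```r
# Minimal sketch: look for repeated column names before writing a table to GDS.
# This data frame is fabricated for illustration; it is not the FAVOR schema.
FunctionalAnnotation <- data.frame(
  variant = c("v1", "v2"),
  apc_protein_function = c(1.5, 2.5),
  apc_protein_function = c(3.5, 4.5),
  check.names = FALSE  # keep the duplicated name instead of auto-renaming it
)

col_names <- colnames(FunctionalAnnotation)
dup_names <- unique(col_names[duplicated(col_names)])

if (length(dup_names) > 0) {
  message("Duplicated column names: ", paste(dup_names, collapse = ", "))
}
```

A non-empty dup_names is a hint that the annotation table does not have the layout the Step 3 script expects. The fix described above, annotating with the FAVOR Essential Database, still applies.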

daniel-hui commented 1 year ago

Hi Xihao,

Thanks a lot, it seems to be working now. I'll check back if I run into other issues.

xihaoli commented 1 year ago

Hi Daniel,

Thanks so much for letting me know.

Best, Xihao