Open gtollefson opened 1 year ago
Could you please show me the structure of your gds file? like,
f <- seqOpen(gds_filename)
f
Hi @zhengxwen,
Sorry for my delay - I needed to update my email alerts to be sent to my new institutional email address. I should be responsive more quickly now.
Here is the output from gdsfile <- seqOpen(gds_filename)
gdsfile Object of class "SeqVarGDSClass" File: /Users/george/Bailey_lab/miplicorn_dev_materials/variants.gds (1.1M)
- [ ] |--+ description [ ] |--+ sample.id { Str8 51 LZMA_ra(36.5%), 209B } |--+ variant.id { Int32 49763 LZMA_ra(7.10%), 13.8K } |--+ position { Int32 49763 LZMA_ra(6.89%), 13.4K } |--+ chromosome { Str8 49763 LZMA_ra(0.16%), 213B } |--+ allele { Str8 49763 LZMA_ra(5.29%), 15.5K } |--+ genotype [ ] | |--+ data { Bit2 1x51x132 LZMA_ra(49.1%), 833B } | |--+ extra.index { Int32 3x4482 LZMA_ra(4.08%), 2.1K } | --+ extra { Int16 4482 LZMA_ra(7.92%), 717B } |--+ phase [ ] |--+ annotation [ ] | |--+ id { Str8 49763 LZMA_ra(0.30%), 157B } | |--+ qual { Float32 49763 LZMA_ra(0.45%), 909B } | |--+ filter { Int32,factor 49763 LZMA_ra(0.09%), 189B } | |--+ info [ ] | | |--+ NS { Int32 49763 LZMA_ra(0.28%), 573B } | | |--+ DP { Int32 49763 LZMA_ra(0.79%), 1.5K } | | |--+ DPB { Float32 49763 LZMA_ra(0.35%), 705B } | | |--+ AC { Int32 159 LZMA_ra(40.6%), 265B } | | |--+ AN { Int32 49763 LZMA_ra(0.29%), 589B } | | |--+ AF { Float32 159 LZMA_ra(75.8%), 489B } | | |--+ RO { Int32 49763 LZMA_ra(0.37%), 745B } | | |--+ AO { Int32 159 LZMA_ra(49.4%), 321B } | | |--+ PRO { Float32 49763 LZMA_ra(0.25%), 513B } | | |--+ PAO { Float32 159 LZMA_ra(14.2%), 97B } | | |--+ QR { Float32 49763 LZMA_ra(0.43%), 869B } | | |--+ QA { Float32 159 LZMA_ra(71.4%), 461B } | | |--+ PQR { Float32 49763 LZMA_ra(0.25%), 513B } | | |--+ PQA { Float32 159 LZMA_ra(14.2%), 97B } | | |--+ SRF { Int32 49763 LZMA_ra(0.31%), 621B } | | |--+ SRR { Int32 49763 LZMA_ra(0.34%), 685B } | | |--+ SAF { Int32 159 LZMA_ra(32.4%), 213B } | | |--+ SAR { Int32 159 LZMA_ra(42.5%), 277B } | | |--+ SRP { Float32 49763 LZMA_ra(0.46%), 913B } | | |--+ SAP { Float32 159 LZMA_ra(79.6%), 513B } | | |--+ AB { Float32 159 LZMA_ra(84.0%), 541B } | | |--+ ABP { Float32 159 LZMA_ra(89.0%), 573B } | | |--+ RUN { Int32 159 LZMA_ra(14.2%), 97B } | | |--+ RPP { Float32 159 LZMA_ra(79.6%), 513B } | | |--+ RPPR { Float32 49763 LZMA_ra(0.46%), 917B } | | |--+ RPL { Float32 159 LZMA_ra(41.8%), 273B } | | |--+ RPR { Float32 159 LZMA_ra(41.8%), 273B } | | |--+ EPP { Float32 159 LZMA_ra(78.3%), 505B } | | |--+ EPPR { Float32 49763 LZMA_ra(0.46%), 917B } | | |--+ DPRA { Float32 159 LZMA_ra(94.7%), 609B } | | |--+ ODDS { Float32 49763 LZMA_ra(0.47%), 937B } | | |--+ GTI { Int32 49763 LZMA_ra(0.30%), 605B } | | |--+ TYPE { Str8 159 LZMA_ra(22.8%), 161B } | | |--+ CIGAR { Str8 159 LZMA_ra(43.1%), 317B } | | |--+ NUMALT { Int32 49763 LZMA_ra(0.27%), 549B } | | |--+ MEANALT { Float32 159 LZMA_ra(50.0%), 325B } | | |--+ LEN { Int32 159 LZMA_ra(24.2%), 161B } | | |--+ MQM { Float32 159 LZMA_ra(18.6%), 125B } | | |--+ MQMR { Float32 49763 LZMA_ra(0.26%), 525B } | | |--+ PAIRED { Float32 159 LZMA_ra(14.2%), 97B } | | |--+ PAIREDR { Float32 49763 LZMA_ra(0.25%), 513B } | | |--+ MIN_DP { Int32 49763 LZMA_ra(0.26%), 521B } | | |--+ END { Int32 49763 LZMA_ra(7.11%), 13.8K } | | --+ technology.ILLUMINA { Float32 159 LZMA_ra(18.6%), 125B } | --+ format [ ] | |--+ GQ [ ] | | --+ data { Float32 51x49763 LZMA_ra(4.42%), 437.9K } | |--+ GL [ ] | | --+ data { Float32 51x512 LZMA_ra(31.1%), 31.7K } | |--+ DP [ ] | | --+ data { VL_Int 51x49763 LZMA_ra(0.29%), 7.9K } | |--+ AD [ ] | | --+ data { VL_Int 51x284 LZMA_ra(19.7%), 6.3K } | |--+ RO [ ] | | --+ data { VL_Int 51x49763 LZMA_ra(0.45%), 12.1K } | |--+ QR [ ] | | --+ data { Float32 51x49763 LZMA_ra(4.39%), 435.4K } | |--+ AO [ ] | | --+ data { VL_Int 51x49797 LZMA_ra(0.27%), 7.0K } | |--+ QA [ ] | | --+ data { Float32 51x49797 LZMA_ra(0.12%), 11.6K } | --+ MIN_DP [ ] | --+ data { VL_Int 51x49638 LZMA_ra(0.24%), 6.4K } * --+ sample.annotation [ ]
@zhengxwen just following up on this issue. I’ve tried examining the source code but haven’t been able to identify the root of the error.
See:
| |--+ data { Bit2 1x51x132 LZMA_ra(49.1%), 833B } *
It is expected to be Bit2 1x51x49763
(or at least 49763). Here 1
is the ploidy, 51
is # of samples, and 49763
is # of variants (if more than 3 alleles at a site, this number could be larger than 49763).
There are somethings wrong when you convert VCF=>GDS (via seqVCF2GDS()
).
I might suggest checking the VCF file, whether it includes the field GT
correctly.
For haploid, the number of alleles can be more than 1.
Ok I see, thank you very much! I think I see the reason. Our analysis pipeline outputs a gVCF with every genomic position from our molecular inversion probe target region included in the VCF record lines even if there is no variant at each position. In our VCF, at positions without variants, there is no GT field. Each record row which has a variant allele at that position appears to have a properly formatted GT field. Only a small subset of the VCF records have a variant allele and the necessary GT field. Have you come across this before? Is there a way to specify in the seqVCF2GDS()
function call to only include VCF lines which have the GT command?
If not, perhaps we can get around this by including a simple grep script in the pipeline to remove VCF lines without variants (those lines without the GT field).
Here are two screenshots illustrating the different formatting between variant and non-variant record lines:
Removing VCF lines without variants (those lines without the GT field) would be a good way before importing to GDS. Another possible issue is the ploidy. From your data, it seems diploidy instead of haploid.
GT: GQ: DP: AD: RO: QR: AO: QA:GL 0/0:127.042:454:454,0:454:15248:0:0:0,-136.668,-1372.15
So I will expect a "corrected" VCF should give you a diploidy GDS file.
Ok I will proceed with removing the VCF lines without variants. It sounds like SeqArray does not support gVCF to GDS conversion. Is this correct?
In other words, SeqArray exclusively supports conversion of regular VCF files and not gVCF files which contain the homozygous reference blocks lacking GT calls which are added by various variant callers (GATK, Freebayes, etc.) during gVCF generation (which contain a line for every position in the genome including numerous non-variant sites)?
I think we have a fix to adjust the ploidy in the genotype calls - thank you for pointing that out.
Not sure how you define gVCF, but the gVCF files I received and processed did have GT for every position (even for the site without variant). Another option might be you add the GT field to every site in your pipeline.
Hi,
I've used SeqArray::seqVCF2GDS to convert a haploid species VCF v4.2 file to gds format which has variants with multiple alternative alleles due to the nature of our sequencing which involves sequencing combined mixed clonal samples. We require storing any number of alleles per variant position (2+) even though our reference genome is a haploid genome (showing more than 2 alleles is useful to us because it indicates mixed clones of one species in one sample).
I get an error when running various SeqArray functions and I'm wondering if the root cause is that the Ploidy is 1 but there are variants in the GDS file with more than 2 alternative alleles.
When I run
seqSummary(gds_filename)
I get the following output including a warning message which I believe explains why many of the SeqArray functions don't work on our data.Running
seqGetData(gdsfile, "genotype")
outputs:Running
seqGetData(gdsfile, "annotation/format/DS")
One more thing I noticed from the summary output is that under Alleles:
Why are there no ALT alleles recorded? and for tabulation, why do the alleles with 3, 4, 6, 7, 8, 1 show 0.0%?
Could you help me to identify the cause of these errors and help develop a fix? Do you think the GDS summary: Ploidy=1 while there are greater than 2 alleles is the issue? We would like to implement SeqArray into our R tool suite for molecular inversion probe analysis of mixed polyclonal parasite samples and hope we can get it working.
Thank you very much for your time and help, George