Open linsson opened 4 years ago
You will need the SeqArray package to import the VCF file,
Here is the example for using snpgdsLDpruning()
and seqExport()
to export to a GDS file:
http://www.bioconductor.org/packages/release/bioc/vignettes/SAIGEgds/inst/doc/SAIGEgds.html#preparing-snp-data-for-genetic-relationship-matrix
Thank you very much for your answer. Unfortunately I am receiving the following error
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘seqSetFilter’ for signature ‘"SNPGDSFileClass", "missing"’
Calls: seqSetFilter ->
my code is:
input_GDS <- snpgdsOpen(input_GDS_name) residualSNPset=snpgdsLDpruning(input_GDS, ld.threshold=opt$pruning ) residualSNPset.id=unlist(residualSNPset, use.names=FALSE) # get the variant IDs of a LD-pruned set snpgdsClose(input_GDS) output_VCF_name = opt$output_VCF output_GDS_name = sub(".vcf", ".gds", output_VCF_name) seqSetFilter(input_GDS, variant.id=residualSNPset.id) seqExport(input_GDS, output_GDS_name, info.var=character(), fmt.var=character(), samp.var=character())
I usually use snpgdsOpen to load gds files with snprelate functions but then seqarray functions always complain about inheritance problems. What am I missing? can I open a gds file with seqOpen and then use snprelate funcions on it?
I had to use a sort of workaround:
`
input_VCF_name = opt$input_VCF input_GDS_name = sub(".vcf", ".gds", basename(input_VCF_name)) if (!file.exists(input_GDS_name)) { snpgdsVCF2GDS(input_VCF_name, input_GDS_name, method="copy.num.of.ref", ignore.chr.prefix="SL4.0ch") } input_GDS <- snpgdsOpen(input_GDS_name)
residualSNPset=snpgdsLDpruning(input_GDS, ld.threshold=opt$pruning ) residualSNPset.id=unlist(residualSNPset, use.names=FALSE) # get the variant IDs of a LD-pruned set snpgdsClose(input_GDS)
input_SeqArray_GDS_name = sub(".vcf", ".seqArray.gds", basename(input_VCF_name)) if (!file.exists(input_SeqArray_GDS_name)) { seqVCF2GDS(input_VCF_name, input_SeqArray_GDS_name) } input_SeqArray_GDS = seqOpen(input_SeqArray_GDS_name)
output_VCF_name = opt$output_VCF output_GDS_name = sub(".vcf", ".gds", output_VCF_name) seqSetFilter(input_SeqArray_GDS, variant.id=residualSNPset.id)
seqExport(input_SeqArray_GDS, output_GDS_name, info.var=character(), fmt.var=character(), samp.var=character())
seqClose(input_SeqArray_GDS)
output_GDS=seqOpen(output_GDS_name) seqGDS2VCF(output_GDS,output_VCF_name) seqClose(output_GDS) ` Basically I need to convert the initial vcf both in a gds file suitable for snprelate (I need the pruning and it only works if I have a snp gds) and seqarray. Then I take the snp ids from snpgdsLDpruning and can use seqArray functions (seqSetFilter, seqExport, seqGDS2VCF). What do you think? I hope both snpgdsVCF2GDS and seqVCF2GDS give the variants the same ids...
Thanks in advance
You should just use SeqArray GDS.
The GDS file defined in SeqArray is an extension of SNPRelate GDS.
snpgdsLDpruning()
can work on SeqArray GDS.
Best wishes,
Xiuwen
I can't use just SeqArray GDS. This is what I get:
gds=seqOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds")
residualSNPset=snpgdsLDpruning(gds, ld.threshold=0.1 ) SNV pruning based on LD: Excluding 173,502 SNVs on non-autosomes Calculating allele counts/frequencies ... Error in seqApply(f, "genotype", margin = "by.variant", as.is = "list", : There is no selected variant.`
While if I try to open a snp gds with seqOpen I get:
gds=snpgdsOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds") Error in snpgdsOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds") : Invalid SNP GDS file: please open the file using 'seqOpen()' in the SeqArray package, since 'FileFormat = SEQ_ARRAY'.
Anyhow seqArray gds and snprelate gds look not very similar (even if they are related):
` seqOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds")
Object of class "SeqVarGDSClass" File: /gporq2/scratch_0/usr/vfain/2019-09-24_shared_folder/analysis/2020-08-21_g_Rscripts_experimental/vcf2prunedVcf/G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds (2.9G)
- [ ] |--+ description [ ] |--+ sample.id { Str8 15161 LZMA_ra(6.43%), 6.8K } |--+ variant.id { Int32 173502 LZMA_ra(6.47%), 43.9K } |--+ position { Int32 173502 LZMA_ra(16.6%), 112.6K } |--+ chromosome { Str8 173502 LZMA_ra(0.03%), 461B } |--+ allele { Str8 173502 LZMA_ra(16.1%), 123.0K } |--+ genotype [ ] | |--+ data { Bit2 2x15161x178434 LZMA_ra(1.06%), 13.7M } | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } | --+ extra { Int16 0 LZMA_ra, 18B } |--+ phase [ ] | |--+ data { Bit1 15161x173502 LZMA_ra(0.01%), 46.9K } | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } | --+ extra { Bit1 0 LZMA_ra, 18B } |--+ annotation [ ] | |--+ id { Str8 173502 LZMA_ra(0.10%), 181B } | |--+ qual { Float32 173502 LZMA_ra(77.5%), 525.0K } | |--+ filter { Int32,factor 173502 LZMA_ra(3.44%), 23.3K } | |--+ info [ ] | | |--+ AC { Int32 0 LZMA_ra, 18B } | | |--+ AF { Float32 0 LZMA_ra, 18B } | | |--+ AN { Int32 173502 LZMA_ra(0.04%), 257B } | | |--+ BaseQRankSum { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ DP { Int32 173502 LZMA_ra(0.04%), 257B } | | |--+ DS { Bit1 173502 LZMA_ra(0.64%), 145B } | | |--+ END { Int32 173502 LZMA_ra(0.04%), 257B } | | |--+ ExcessHet { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ FS { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ InbreedingCoeff { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ MLEAC { Int32 0 LZMA_ra, 18B } | | |--+ MLEAF { Float32 0 LZMA_ra, 18B } | | |--+ MQ { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ MQRankSum { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ QD { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ RAW_MQ { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ ReadPosRankSum { Float32 173502 LZMA_ra(0.04%), 257B } | | --+ SOR { Float32 173502 LZMA_ra(0.04%), 257B } | --+ format [ ] | |--+ AD [ ] | | --+ data { VL_Int 15161x391392 LZMA_ra(5.95%), 337.4M } | |--+ DP [ ] | | --+ data { VL_Int 15161x173502 LZMA_ra(12.2%), 305.8M } | |--+ GQ [ ] | | --+ data { VL_Int 15161x173502 LZMA_ra(11.4%), 535.3M } | |--+ MIN_DP [ ] | | --+ data { VL_Int 15161x0 LZMA_ra, 18B } | |--+ PGT [ ] | | --+ data { Str8 15161x125053 LZMA_ra(0.16%), 5.9M } | |--+ PID [ ] | | --+ data { Str8 15161x125053 LZMA_ra(0.21%), 7.7M } | |--+ PL [ ] | | --+ data { VL_Int 15161x659184 LZMA_ra(11.2%), 1.8G } | |--+ RGQ [ ] | | --+ data { VL_Int 15161x0 LZMA_ra, 18B } | --+ SB [ ] | --+ data { VL_Int 15161x0 LZMA_ra, 18B } * --+ sample.annotation [ ] `
` snpgdsOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.gds") File: /gporq2/scratch_0/usr/vfain/2019-09-24_shared_folder/analysis/2020-08-21_g_Rscripts_experimental/vcf2prunedVcf/G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.gds (628.0M)
- [ ] |--+ sample.id { Str8 15161 LZMA_ra(6.43%), 6.8K } |--+ snp.id { Int32 173502 LZMA_ra(6.47%), 43.9K } |--+ snp.rs.id { Str8 173502 LZMA_ra(0.10%), 181B } |--+ snp.position { Int32 173502 LZMA_ra(16.6%), 112.6K } |--+ snp.chromosome { Str8 173502 LZMA_ra(0.05%), 281B } |--+ snp.allele { Str8 173502 LZMA_ra(16.3%), 124.6K } |--+ genotype { Bit2 15161x173502, 627.2M } --+ snp.annot [ ] |--+ qual { Float32 173502 LZMA_ra(77.5%), 525.0K } --+ filter { Str8 173502 LZMA_ra(2.74%), 34.0K } `
In the first case you have variant.id node while in the second case you have snp.id (even if I am not sure that makes the difference). In your vignette it is not clear if you are using a gds or a snp gds. Maybe my problem is how I create the 2 gds:
seqarray gds: seqVCF2GDS(input_VCF_name, input_SeqArray_GDS_name) snp gds: snpgdsVCF2GDS(input_VCF_name, input_GDS_name, method="copy.num.of.ref", ignore.chr.prefix="SL4.0ch")
I can't see any "method" option for seqVCF2GDS; I will try to add ignore.chr.prefix="SL4.0ch" and see if I can use snpgdsLDpruning on it.
SeqArray GDS is the second generation of SNPRelate GDS.
Please use snpgdsLDpruning(..., autosome.only=FALSE)
, since the chromosome coding in your file is not numeric (not identified as autosomes).
Dear developer(s),
I wonder if there is any chance to get a gds (or vcf) file once a snp dataset (with multiple samples inside) has been pruned. Unfortunately I am not an R expert but it seems snpgdsLDpruning result is not a gds object but a (SNPs?) list.
Thanks in advance.