zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate
102 stars 25 forks source link

snpgdsLDpruning result to gds/vcf #69

Open linsson opened 4 years ago

linsson commented 4 years ago

Dear developer(s),

I wonder if there is any chance to get a gds (or vcf) file once a snp dataset (with multiple samples inside) has been pruned. Unfortunately I am not an R expert but it seems snpgdsLDpruning result is not a gds object but a (SNPs?) list.

Thanks in advance.

zhengxwen commented 4 years ago

You will need the SeqArray package to import the VCF file,

Here is the example for using snpgdsLDpruning() and seqExport() to export to a GDS file: http://www.bioconductor.org/packages/release/bioc/vignettes/SAIGEgds/inst/doc/SAIGEgds.html#preparing-snp-data-for-genetic-relationship-matrix

linsson commented 4 years ago

Thank you very much for your answer. Unfortunately I am receiving the following error

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘seqSetFilter’ for signature ‘"SNPGDSFileClass", "missing"’ Calls: seqSetFilter -> Execution halted

my code is:

input_GDS <- snpgdsOpen(input_GDS_name) residualSNPset=snpgdsLDpruning(input_GDS, ld.threshold=opt$pruning ) residualSNPset.id=unlist(residualSNPset, use.names=FALSE) # get the variant IDs of a LD-pruned set snpgdsClose(input_GDS) output_VCF_name = opt$output_VCF output_GDS_name = sub(".vcf", ".gds", output_VCF_name) seqSetFilter(input_GDS, variant.id=residualSNPset.id) seqExport(input_GDS, output_GDS_name, info.var=character(), fmt.var=character(), samp.var=character())

I usually use snpgdsOpen to load gds files with snprelate functions but then seqarray functions always complain about inheritance problems. What am I missing? can I open a gds file with seqOpen and then use snprelate funcions on it?

linsson commented 4 years ago

I had to use a sort of workaround:

`

SNPRelate GDS

input_VCF_name = opt$input_VCF input_GDS_name = sub(".vcf", ".gds", basename(input_VCF_name)) if (!file.exists(input_GDS_name)) { snpgdsVCF2GDS(input_VCF_name, input_GDS_name, method="copy.num.of.ref", ignore.chr.prefix="SL4.0ch") } input_GDS <- snpgdsOpen(input_GDS_name)

residualSNPset=snpgdsLDpruning(input_GDS, ld.threshold=opt$pruning ) residualSNPset.id=unlist(residualSNPset, use.names=FALSE) # get the variant IDs of a LD-pruned set snpgdsClose(input_GDS)

seqArray GDS

input_SeqArray_GDS_name = sub(".vcf", ".seqArray.gds", basename(input_VCF_name)) if (!file.exists(input_SeqArray_GDS_name)) { seqVCF2GDS(input_VCF_name, input_SeqArray_GDS_name) } input_SeqArray_GDS = seqOpen(input_SeqArray_GDS_name)

Create a genotype file for genetic relationship matrix (GRM) using the LD-pruned SNP set:

output_VCF_name = opt$output_VCF output_GDS_name = sub(".vcf", ".gds", output_VCF_name) seqSetFilter(input_SeqArray_GDS, variant.id=residualSNPset.id)

export to a GDS genotype file without annotation data

seqExport(input_SeqArray_GDS, output_GDS_name, info.var=character(), fmt.var=character(), samp.var=character())

close the file

seqClose(input_SeqArray_GDS)

output_GDS=seqOpen(output_GDS_name) seqGDS2VCF(output_GDS,output_VCF_name) seqClose(output_GDS) ` Basically I need to convert the initial vcf both in a gds file suitable for snprelate (I need the pruning and it only works if I have a snp gds) and seqarray. Then I take the snp ids from snpgdsLDpruning and can use seqArray functions (seqSetFilter, seqExport, seqGDS2VCF). What do you think? I hope both snpgdsVCF2GDS and seqVCF2GDS give the variants the same ids...

Thanks in advance

zhengxwen commented 4 years ago

You should just use SeqArray GDS. The GDS file defined in SeqArray is an extension of SNPRelate GDS. snpgdsLDpruning() can work on SeqArray GDS.

Best wishes,

Xiuwen

zhengxwen commented 4 years ago

See: http://www.bioconductor.org/packages/release/bioc/vignettes/SeqArray/inst/doc/SeqArray.html#integration-with-snprelate

linsson commented 4 years ago

I can't use just SeqArray GDS. This is what I get:

gds=seqOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds") residualSNPset=snpgdsLDpruning(gds, ld.threshold=0.1 ) SNV pruning based on LD: Excluding 173,502 SNVs on non-autosomes Calculating allele counts/frequencies ... Error in seqApply(f, "genotype", margin = "by.variant", as.is = "list", : There is no selected variant.`

While if I try to open a snp gds with seqOpen I get:

gds=snpgdsOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds") Error in snpgdsOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds") : Invalid SNP GDS file: please open the file using 'seqOpen()' in the SeqArray package, since 'FileFormat = SEQ_ARRAY'.

Anyhow seqArray gds and snprelate gds look not very similar (even if they are related):

` seqOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds")
Object of class "SeqVarGDSClass" File: /gporq2/scratch_0/usr/vfain/2019-09-24_shared_folder/analysis/2020-08-21_g_Rscripts_experimental/vcf2prunedVcf/G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.seqArray.gds (2.9G)

  • [ ] |--+ description [ ] |--+ sample.id { Str8 15161 LZMA_ra(6.43%), 6.8K } |--+ variant.id { Int32 173502 LZMA_ra(6.47%), 43.9K } |--+ position { Int32 173502 LZMA_ra(16.6%), 112.6K } |--+ chromosome { Str8 173502 LZMA_ra(0.03%), 461B } |--+ allele { Str8 173502 LZMA_ra(16.1%), 123.0K } |--+ genotype [ ] | |--+ data { Bit2 2x15161x178434 LZMA_ra(1.06%), 13.7M } | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } | --+ extra { Int16 0 LZMA_ra, 18B } |--+ phase [ ] | |--+ data { Bit1 15161x173502 LZMA_ra(0.01%), 46.9K } | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } | --+ extra { Bit1 0 LZMA_ra, 18B } |--+ annotation [ ] | |--+ id { Str8 173502 LZMA_ra(0.10%), 181B } | |--+ qual { Float32 173502 LZMA_ra(77.5%), 525.0K } | |--+ filter { Int32,factor 173502 LZMA_ra(3.44%), 23.3K } | |--+ info [ ] | | |--+ AC { Int32 0 LZMA_ra, 18B } | | |--+ AF { Float32 0 LZMA_ra, 18B } | | |--+ AN { Int32 173502 LZMA_ra(0.04%), 257B } | | |--+ BaseQRankSum { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ DP { Int32 173502 LZMA_ra(0.04%), 257B } | | |--+ DS { Bit1 173502 LZMA_ra(0.64%), 145B } | | |--+ END { Int32 173502 LZMA_ra(0.04%), 257B } | | |--+ ExcessHet { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ FS { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ InbreedingCoeff { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ MLEAC { Int32 0 LZMA_ra, 18B } | | |--+ MLEAF { Float32 0 LZMA_ra, 18B } | | |--+ MQ { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ MQRankSum { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ QD { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ RAW_MQ { Float32 173502 LZMA_ra(0.04%), 257B } | | |--+ ReadPosRankSum { Float32 173502 LZMA_ra(0.04%), 257B } | | --+ SOR { Float32 173502 LZMA_ra(0.04%), 257B } | --+ format [ ] | |--+ AD [ ] | | --+ data { VL_Int 15161x391392 LZMA_ra(5.95%), 337.4M } | |--+ DP [ ] | | --+ data { VL_Int 15161x173502 LZMA_ra(12.2%), 305.8M } | |--+ GQ [ ] | | --+ data { VL_Int 15161x173502 LZMA_ra(11.4%), 535.3M } | |--+ MIN_DP [ ] | | --+ data { VL_Int 15161x0 LZMA_ra, 18B } | |--+ PGT [ ] | | --+ data { Str8 15161x125053 LZMA_ra(0.16%), 5.9M } | |--+ PID [ ] | | --+ data { Str8 15161x125053 LZMA_ra(0.21%), 7.7M } | |--+ PL [ ] | | --+ data { VL_Int 15161x659184 LZMA_ra(11.2%), 1.8G } | |--+ RGQ [ ] | | --+ data { VL_Int 15161x0 LZMA_ra, 18B } | --+ SB [ ] | --+ data { VL_Int 15161x0 LZMA_ra, 18B } * --+ sample.annotation [ ] `

` snpgdsOpen("G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.gds") File: /gporq2/scratch_0/usr/vfain/2019-09-24_shared_folder/analysis/2020-08-21_g_Rscripts_experimental/vcf2prunedVcf/G2PSOLMerge.final.keep.recode.renamed.keep15161acc.recode.MinMeanDP15.MaxMiss0.95.recode.gds (628.0M)

  • [ ] |--+ sample.id { Str8 15161 LZMA_ra(6.43%), 6.8K } |--+ snp.id { Int32 173502 LZMA_ra(6.47%), 43.9K } |--+ snp.rs.id { Str8 173502 LZMA_ra(0.10%), 181B } |--+ snp.position { Int32 173502 LZMA_ra(16.6%), 112.6K } |--+ snp.chromosome { Str8 173502 LZMA_ra(0.05%), 281B } |--+ snp.allele { Str8 173502 LZMA_ra(16.3%), 124.6K } |--+ genotype { Bit2 15161x173502, 627.2M } --+ snp.annot [ ] |--+ qual { Float32 173502 LZMA_ra(77.5%), 525.0K } --+ filter { Str8 173502 LZMA_ra(2.74%), 34.0K } `

In the first case you have variant.id node while in the second case you have snp.id (even if I am not sure that makes the difference). In your vignette it is not clear if you are using a gds or a snp gds. Maybe my problem is how I create the 2 gds:

seqarray gds: seqVCF2GDS(input_VCF_name, input_SeqArray_GDS_name) snp gds: snpgdsVCF2GDS(input_VCF_name, input_GDS_name, method="copy.num.of.ref", ignore.chr.prefix="SL4.0ch")

I can't see any "method" option for seqVCF2GDS; I will try to add ignore.chr.prefix="SL4.0ch" and see if I can use snpgdsLDpruning on it.

zhengxwen commented 4 years ago

SeqArray GDS is the second generation of SNPRelate GDS. Please use snpgdsLDpruning(..., autosome.only=FALSE), since the chromosome coding in your file is not numeric (not identified as autosomes).