zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate
101 stars 25 forks source link

remove.monosnp in snpgdsLDpruning() ignoring heterozygotes #53

Closed AAvalos82 closed 5 years ago

AAvalos82 commented 6 years ago

Noticed this issue downstream of a GWAS analysis using the GENESIS package, but the remove.monosnp option of snpgdsLDpruning() seems to be ignoring heterozygote calls. Post LD prunning I am not encountering any monomorphic SNPS that are homozygote REF or ALT alleles but still ~500 monomorphic heterozygotes remain.

zhengxwen commented 6 years ago

What are "monomorphic heterozygotes"? Are you saying one REF and one ALT in a genotype? In SNPRelate, monomorphic SNPs refer to the SNPs with MAF=0.

AAvalos82 commented 6 years ago

Pardon the confusion, I just realized the oxymoron in "monomorphic heterozygote".

I see now what the issue is. As defined, snpgdsLDpruning() will observe MAF=0, excluding SNPs that have a genotype call of "0" or "2" across the entire sample set. However I have instances where the SNP is heterozygote across all the samples. Thus the site is polymorphic with a MAF=0.5, but the genotype is homogeneous across the entire set, and thus non-informative.

I retract the issue as these instances fall outside the scope of snpgdsLDpruning() utility, and will just have to be mindful that this is an exemption when using the tool.

thierrygosselin commented 5 years ago

Use SeqArray::seqSNP2GDS to convert your SNPRelate dataset

require(SeqArray)
# x is your new converted seqarray dataset
variants <- SeqArray::seqGetData(gdsfile = x, var.name = "variant.id")
mono <- SeqArray::seqBlockApply(
      gdsfile = x,
      var.name = "$dosage",
      FUN = function(x) apply(
        X = x, MARGIN = 2 ,
        FUN = function(x) length(unique(x[!is.na(x)])) == 1
      ),
      margin = "by.variant",
      as.is = "unlist",
      parallel = 2,
      .progress = FALSE)
variants.mono <- variants[mono]
zhengxwen commented 5 years ago

Why not use seqApply():

library(SeqArray)

x <- seqOpen(seqExampleFileName("gds"))

variants <- seqGetData(gdsfile = x, var.name = "variant.id")
mono <- seqApply(x, "$dosage", function(g) all(g==1L, na.rm=T), parallel=2)
variants.mono <- variants[mono]
thierrygosselin commented 5 years ago

even better ! thanks

FloriaanDD commented 5 years ago

When I run the function above with SeqArray I receive this error.

Error in .DynamicClusterCall(cl, length(cl), .fun = function(.proc_idx,  : 
 One of the nodes produced an error: Can not open file. The process cannot access the file because it is being used by another process.

I don't receive this error if I use parallel = 1 I am using a Windows 10 PC, with R 3.5.1 in Rstudio 1.1.447