zhengxwen / SNPRelate

R package: parallel computing toolset for relatedness and principal component analysis of SNP data (Development version only)
http://www.bioconductor.org/packages/SNPRelate
98 stars 25 forks source link

SNPRelate snpgdsLDMat and snpgdsLDpruning problem #55

Closed thierrygosselin closed 5 years ago

thierrygosselin commented 5 years ago

with the test dataset sent by email:

require(SNPRelate)
require(magrittr)
require(rlang)
data <- SeqArray::seqOpen(gds.fn = "thierry_test_ld.gds")

1. Generating the LD mat

ld.res <- SNPRelate::snpgdsLDMat(
  data,
  slide = -1,
  method = "r",
  mat.trim = FALSE,
  num.thread = 1,
  with.id = TRUE,
  verbose = TRUE)$LD
ld.res[lower.tri(ld.res, diag = TRUE)] <- rlang::na_dbl
ld.res <- abs(ld.res)
ld.res <- ld.res^2
ld.res <- as.vector(ld.res) %>% magrittr::extract(!is.na(.))
boxplot(ld.res)
LD_fig

2. LD Pruning

test.2 <- SNPRelate::snpgdsLDpruning(
  gdsobj = data,
  autosome.only = FALSE,
  remove.monosnp = TRUE,
  maf = NaN,
  missing.rate = NaN,
  method = "r",
  ld.threshold = sqrt(0.4),
  num.thread = 1L,
  verbose = TRUE) %>%
  unlist(.)

SNV pruning based on LD: Calculating allele counts/frequencies ... [==================================================] 100%, completed in 0s Excluding 218 SNVs (monomorphic: TRUE, MAF: NaN, missing rate: NaN) Working space: 50 samples, 782 SNVs using 1 (CPU) core sliding window: 500,000 basepairs, Inf SNPs |LD| threshold: 0.632456 method: R Chromosome chrom_1: 13.00%, 130/1,000 130 markers are selected in total.

Question 1:

With the threshold selected above: r2 = 0.4, we should not prune almost all markers. Based on the box plot and LD mat, it's all outliers above 0.4, the core (the IQR) of the data is below 0.1.

Question 2:

I am doing something wrong or there's a bug in SNPRelate::snpgdsLDMat or SNPRelate::snpgdsLDpruning

zhengxwen commented 5 years ago

snpgdsLDpruning() uses a sliding window method to prune SNPs, but snpgdsLDMat() returns LD of all pairs of SNPs.

thierrygosselin commented 5 years ago

Okay, I understand this, but then, how do you decide the threshold to prune your SNPs? Is there a way to visualize accurately the LD ?

Or the only solution is to prune SNPs using thresholds from 0 to 1, by = 0.1 and look at the whitelisted and blacklisted SNPs?

I know it's kind of a no brainer for human and model species, it's easier if you have a species with a reference genome and more difficult with denovo species...