zhanxw / seqminer

Query sequence data (VCF/BCF1/BCF2, Tabix, BGEN, PLINK) in R
http://zhanxw.github.io/seqminer/

seqminer is missing some SNPs in range compared to PLINK #17

Open garyzhubc opened 3 years ago

garyzhubc commented 3 years ago

I compared the PLINK2 --glm result for a range against the dosage matrix read by seqminer for the same range. The seqminer result is missing some SNPs that appear in the PLINK result.

PLINK

Extract the region from the pruned fileset with PLINK2, export it as BGEN 1.2, and run --glm.

./plink2 --bfile ukb_imp_chr1_v3_pruned \
  --chr 1 --from-bp 54505148 --to-bp 56530526 \
  --export bgen-1.2 bits=8 \
  --out ukb_imp_chr1_v3_pruned_trimmed_2m
./plink2 --bgen ukb_imp_chr1_v3_pruned_trimmed_2m.bgen ref-first \
  --sample ukb_imp_chr1_v3_pruned_trimmed_2m.sample \
  --glm \
  --pheno ldl_pcs_joined_with_geno.tsv --pheno-name LDL \
  --covar ldl_pcs_joined_with_geno.tsv \
  --covar-name PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20,PC21,PC22,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31,PC32,PC33,PC34,PC35,PC36,PC37,PC38,PC39,PC40 \
  --out ukb_imp_chr1_v3_pruned_trimmed_2m

Inspect the PLINK --glm result to see which positions were tested.

> ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear <- read_tsv("ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear.tsv") %>% filter(TEST == "ADD") %>% mutate(CHROM = as.integer(CHROM), POS = as.integer(POS))
> length(ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear$POS)
[1] 11984
> ukb_imp_chr1_v3_pruned_trimmed_2m.LDL.glm.linear$POS
   [1] 54505201 54505304 54505437 54505703 54505756 54505887 54506600 54507000 54507273 54507506 54507734 54507907 54507996
  [14] 54508051 54508231 54508394 54508564 54509028 54509259 54509316 54509594 54509616 54509786 54509957 54509994 54510577
  [27] 54510860 54511200 54511248 54511349 54512553 54512583 54513132 54513323 54513367 54513825 54513867 54513921 54514065
  [40] 54514261 54514341 54514570 54514675 54515041 54515365 54516228 54516280 54516334 54516526 54516601 54516792 54516899
  [53] 54517324 54517633 54517874 54518434 54518725 54519045 54519107 54519119 54519158 54519292 54519295 54519731 54519800
  [66] 54520603 54520921 54521284 54521434 54521855 54522000 54522240 54522280 54522307 54522331 54522355 54522645 54522830
  [79] 54523285 54523304 54523344 54523630 54523639 54523689 54523881 54524519 54524575 54524608 54524940 54525150 54525207

seqminer

Load the same .bgen file as a dosage matrix with seqminer and check its dimensions. It contains 5766 variants, far fewer than the 11984 in the PLINK result, so roughly half the SNPs in the range are missed by seqminer.

library(tidyverse)
library(seqminer)
ukb_imp_chr1_v3_pruned.bgen <- "ukb_imp_chr1_v3_pruned.bgen"
start_pos  <- 54505148
end_pos <- 56530526
geno <- t(readBGENToMatrixByRange(ukb_imp_chr1_v3_pruned.bgen, paste0("1:", start_pos,"-",end_pos))[[1]])
> print(dim(geno))
[1] 487409   5766
> colnames(geno)
   [1] "1:54505201" "1:54506600" "1:54507273" "1:54507506" "1:54507734"
   [6] "1:54507996" "1:54508051" "1:54508394" "1:54508564" "1:54509028"
  [11] "1:54509316" "1:54509594" "1:54509957" "1:54510860" "1:54511248"
  [16] "1:54511349" "1:54513132" "1:54513323" "1:54513367" "1:54513825"
  [21] "1:54513921" "1:54514261" "1:54514341" "1:54514675" "1:54515365"
  [26] "1:54516228" "1:54516280" "1:54516526" "1:54516792" "1:54516899"
  [31] "1:54517633" "1:54517874" "1:54518434" "1:54518725" "1:54519045"
Sofietb-work commented 1 year ago

Problem: I have a similar problem to OP with the seqminer build from CRAN on Windows: seqminer misses SNPs from large VCF files. I'm not sure whether the cause is the same.

Observed: I'm importing ranges from VCF files (see range below) using readVCFToListByRange. It works fine on smaller VCF files (300 MB), but on the larger ones (9 GB and 35 GB) it only finds the first few SNPs, while bcftools query finds more. seqminer raises no error or warning. I wondered whether readVCFToListByRange stops reading after a certain amount of data, or whether the larger VCF files contained errors. After repeatedly merging the known-good VCF file with itself (using bcftools merge --force-samples) to grow it, everything kept working up to and including a file size of 2.0 GB, but in the 2.4 GB file it found only the first 43 of the 49 SNPs present.

Since the issue seems to occur from a file size of approx. 2 GB, I'm wondering if there is some 32-bit component imposing the limit. Does readVCFToListByRange try to read the entire file into memory before filtering by range?
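The ~2 GB threshold is the classic signature of a file offset stored in a signed 32-bit integer, whose maximum is 2^31 - 1 bytes (~2.147 GB). A small sketch illustrating why a 2.0 GB file would work while a 2.4 GB file truncates; this is only an illustration of the hypothesis, not a confirmed diagnosis of seqminer's internals:

```python
# A signed 32-bit file offset tops out at 2**31 - 1 bytes (~2.147 GB).
# Offsets past that wrap negative, which would make seeks silently misbehave --
# consistent with a 2.0 GB file working and a 2.4 GB file being truncated.
import ctypes

INT32_MAX = 2**31 - 1           # 2147483647 bytes, ~2.147 GB

ok_offset = 2_000_000_000       # an offset inside the 2.0 GB file: representable
bad_offset = 2_400_000_000      # an offset inside the 2.4 GB file: overflows int32

print(ctypes.c_int32(ok_offset).value)   # 2000000000 (unchanged)
print(ctypes.c_int32(bad_offset).value)  # -1894967296 (wrapped negative)
```

If this is the cause, the fix on the C/C++ side is to use a 64-bit offset type (e.g. fseeko/ftello with off_t, or the zlib gzseek64-style APIs) rather than int/long on Windows, where long is 32-bit.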

Expected: I would expect readVCFToListByRange to read larger VCF files as long as memory can accommodate the filtered result, and to raise a warning if it cannot.

Appendix Range: "chr1:196704632-196704632,chr1:196657064-196657064,chr1:196716375-196716375,chr1:196613173-196613173,chr1:196380158-196380158,chr1:196815450-196815450,chr1:196706642-196706642,chr1:196958651-196958651,chr2:228086920-228086920,chr3:64715155-64715155,chr3:99180668-99180668,chr3:99419853-99419853,chr4:110659067-110659067,chr4:110685820-110685820,chr5:39327888-39327888,chr5:35494448-35494448,chr6:31930462-31930462,chr6:31946792-31946792,chr6:32155581-32155581,chr6:31947027-31947027,chr6:43826627-43826627,chr7:104756326-104756326,chr7:99991548-99991548,chr8:23082971-23082971,chr9:76617720-76617720,chr9:73438605-73438605,chr9:101923372-101923372,chr9:107661742-107661742,chr10:24999593-24999593,chr10:124215565-124215565,chr12:56115778-56115778,chr12:112132610-112132610,chr13:31821240-31821240,chr14:68769199-68769199,chr14:68986999-68986999,chr15:58680954-58680954,chr15:58723939-58723939,chr16:56997349-56997349,chr16:56994528-56994528,chr16:75234872-75234872,chr17:26649724-26649724,chr17:79526821-79526821,chr19:6718387-6718387,chr19:6718146-6718146,chr19:5835677-5835677,chr19:1031438-1031438,chr19:45411941-45411941,chr19:45748362-45748362,chr20:44614991-44614991,chr20:56653724-56653724,chr22:33105817-33105817,chr22:38476276-38476276"