Open garyzhubc opened 3 years ago
Problem
I have a similar problem to OP, using the seqminer build from CRAN on Windows: seqminer misses SNPs from large VCF files. I'm not sure whether the cause is the same.
Observed
I'm importing ranges from VCF files (see the range below) using readVCFToListByRange. It works fine on smaller VCF files (300 MB), but for the larger VCF files (9 GB and 35 GB) it only finds the first couple of SNPs, whereas bcftools query finds more. seqminer does not raise an error or warning.
I wondered whether readVCFToListByRange stops reading after a certain amount of data, or whether there were errors in the larger VCF files. To check, I concatenated the good VCF file with itself a couple of times (using bcftools merge --force-samples). Everything kept working up to and including a file size of 2.0 GB, but in the 2.4 GB file it only found the first 43 of the 49 SNPs present in the file.
Since the issue seems to start at a file size of approximately 2 GB, I'm wondering whether some 32-bit component limits how much of the file can be addressed. Does readVCFToListByRange try to read the entire file into memory before filtering by range?
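To illustrate why a 2 GB threshold points towards a signed 32-bit limit, here is a minimal Python sketch of the arithmetic. It is only a check of the hypothesis, not a description of seqminer's internals: the helper name `fits_in_int32` is made up for illustration, and the file sizes are assumed to be decimal gigabytes.

```python
# Largest value a signed 32-bit integer can hold (e.g. a 32-bit file offset).
INT32_MAX = 2**31 - 1  # 2,147,483,647 bytes, roughly 2.15 GB

GB = 10**9  # assumption: the sizes reported above are decimal gigabytes


def fits_in_int32(size_bytes: int) -> bool:
    """Return True if every byte offset in a file of this size fits in int32."""
    return size_bytes <= INT32_MAX


# File sizes mentioned in this report.
for size_gb in (0.3, 2.0, 2.4, 9, 35):
    size_bytes = round(size_gb * GB)
    status = "fits" if fits_in_int32(size_bytes) else "overflows"
    print(f"{size_gb:>5} GB: {status}")
```

Under these assumptions the 0.3 GB and 2.0 GB files fit, while the 2.4 GB, 9 GB and 35 GB files do not, which matches the sizes at which SNPs start going missing.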
Expected
I would expect readVCFToListByRange to read larger VCF files as long as memory can accommodate the resulting filtered dataset. If it is unable to, I would expect it to raise a warning.
Appendix Range: "chr1:196704632-196704632,chr1:196657064-196657064,chr1:196716375-196716375,chr1:196613173-196613173,chr1:196380158-196380158,chr1:196815450-196815450,chr1:196706642-196706642,chr1:196958651-196958651,chr2:228086920-228086920,chr3:64715155-64715155,chr3:99180668-99180668,chr3:99419853-99419853,chr4:110659067-110659067,chr4:110685820-110685820,chr5:39327888-39327888,chr5:35494448-35494448,chr6:31930462-31930462,chr6:31946792-31946792,chr6:32155581-32155581,chr6:31947027-31947027,chr6:43826627-43826627,chr7:104756326-104756326,chr7:99991548-99991548,chr8:23082971-23082971,chr9:76617720-76617720,chr9:73438605-73438605,chr9:101923372-101923372,chr9:107661742-107661742,chr10:24999593-24999593,chr10:124215565-124215565,chr12:56115778-56115778,chr12:112132610-112132610,chr13:31821240-31821240,chr14:68769199-68769199,chr14:68986999-68986999,chr15:58680954-58680954,chr15:58723939-58723939,chr16:56997349-56997349,chr16:56994528-56994528,chr16:75234872-75234872,chr17:26649724-26649724,chr17:79526821-79526821,chr19:6718387-6718387,chr19:6718146-6718146,chr19:5835677-5835677,chr19:1031438-1031438,chr19:45411941-45411941,chr19:45748362-45748362,chr20:44614991-44614991,chr20:56653724-56653724,chr22:33105817-33105817,chr22:38476276-38476276"
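One way to pin down which requested regions come back empty is to parse the range string and compare it against the positions a reader actually returned. The Python sketch below is a hypothetical helper, not part of seqminer or bcftools. The names `parse_ranges` and `regions_without_hits` are invented here, and the `found` set stands in for whatever positions were extracted from the reader's output.

```python
import re

# Matches a single region such as "chr1:196704632-196704632".
REGION_RE = re.compile(r"^(chr\w+):(\d+)-(\d+)$")


def parse_ranges(range_str):
    """Split 'chr1:100-100,chr2:5-9' into (chrom, start, end) tuples."""
    regions = []
    for item in range_str.split(","):
        match = REGION_RE.match(item.strip())
        if match is None:
            raise ValueError(f"malformed region: {item!r}")
        chrom, start, end = match.groups()
        regions.append((chrom, int(start), int(end)))
    return regions


def regions_without_hits(regions, found):
    """Return the requested regions that contain none of the found (chrom, pos) pairs."""
    return [
        (chrom, start, end)
        for chrom, start, end in regions
        if not any(c == chrom and start <= p <= end for c, p in found)
    ]


# First three regions from the range string above.
requested = parse_ranges(
    "chr1:196704632-196704632,chr1:196657064-196657064,chr1:196716375-196716375"
)
# Hypothetical reader output: only the first two positions were returned.
found = {("chr1", 196704632), ("chr1", 196657064)}

print(regions_without_hits(requested, found))
```

Running the same comparison on the positions reported by bcftools query and by readVCFToListByRange would show whether the missing SNPs are always the ones located furthest into the file.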
I compared the PLINK2 glm result for the same range with the dosage matrix read by seqminer. The seqminer result is missing some SNPs that appear in the PLINK result for that range.
PLINK
Prune the .bgen file with PLINK and do glm.
Inspect PLINK glm result and see which positions are used.
seqminer
Load the .bgen file as a dosage matrix with seqminer and inspect the data size. 5766 is far smaller than 11984, so seqminer misses about half of the SNPs (6218 of 11984).