zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
44 stars 12 forks source link

seqMerge failed because file2 is unsorted #41

Closed baderd closed 5 years ago

baderd commented 6 years ago

Dear zhengxwen,

Thanks a lot for your great packages on variants. They are super fast and bring the world of PLINK into the R/Bioc universe.

I tried to seqMerge() two GDS files (origninally created from VCF files). It failed because file2 "$chrom_pos" is not sorted. I looked into this and wondered, if this is the desired behavior:

Link to linke that throws the error: https://github.com/zhengxwen/SeqArray/blob/master/R/UtilsMerge.R#L121

"$chrom_pos" returns characters like "2:99", "10:10", "10:2". Since they are characters base::sort() would put them in this order: "10:10", "10:2", "2:99". Usually VCF files are not sorted that way.

Could you implement either a sorting function that fulfills the criteria required for merging or change the "is sorted" criteria?

Cheers, Daniel

zhengxwen commented 5 years ago

It is not a bug. The first GDS and the second GDS files should have the same variant order. is.unsorted checks the matching indices between these two files.

varidx[[i]] <- match(seqGetData(flist[[i]], "$chrom_pos"),
                variant.id)
if (is.unsorted(varidx[[i]], strictly=TRUE))
annaquaglieri16 commented 5 years ago

Thanks Zhengxwen! Do you suggest sorting all VCF outside of R/SeqArray? Or do you have a suggestion for dealing with this directly with SeqArray?

Thank you!

Anna

zhengxwen commented 5 years ago

I suggest sorting VCF outside of R/SeqArray using bcftools.

annaquaglieri16 commented 5 years ago

Thank you!!

I used bcftools to sort the VCF files and this got rid of the sorting problem.

Anna