zhengxwen / SeqArray

Data management of large-scale whole-genome sequence variant calls (Development version only)
http://www.bioconductor.org/packages/SeqArray
43 stars 12 forks source link

seqBED2GDS switches REF and ALT alleles #64

Closed smgogarten closed 2 years ago

smgogarten commented 4 years ago

The default format in PLINK2 is to list the ALT allele followed by the REF allele: https://www.cog-genomics.org/plink/2.0/formats

However, when seqBED2GDS reads in a set of plink files, it assumes the opposite, so REF and ALT are switched in the resulting GDS file.

I suggest the default is to follow the PLINK2 specification, with an argument to reverse the order of ref/alt (such as if the user has an older PLINK file, where there is no consistent allele order).

zhengxwen commented 3 years ago

seqBED2GDS() does not support PLINK2 file format. The PLINK2 format uses .pgen, .psam and .pvar as the input files, while PLINK1 uses .bim, .fam and .bed as the input. .pgen is totally different from .bed.

If you use the (very) old version of seqBED2GDS() (cannot remember), there may be an issue of allele flip. The latest seqBED2GDS() guarantees SNP import correctly.

smgogarten commented 3 years ago

According to the PLINK2 documentation, it also supports bed/bim/fam, and interprets column 5 in the .bim file as the ALT allele: https://www.cog-genomics.org/plink/2.0/formats#bim

seqBED2GDS() reads column 5 as the REF allele and column 6 as the ALT allele.

Example VCF file:

> bcftools query -f '%CHROM  %POS  %REF  %ALT\n' CEU_Exon.vcf.gz
1 1105366  T  C
1  1105411  G  A
1  1110294  G  A
1  3537996  T  C
1  3538692  G  C
1  3541597  C  T

Using PLINK1.9 to convert VCF to bed/bim/fam:

> plink --vcf CEU_Exon.vcf.gz --make-bed --out CEU_Exon
> head CEU_Exon.bim
1   rs111751804 0   1105366 C   T
1   rs114390380 0   1105411 A   G
1   rs1320571   0   1110294 A   G
1   rs2760321   0   3537996 T   C
1   rs2760320   0   3538692 C   G
1   rs116230480 0   3541597 T   C

For most variants, ALT is column 5 and REF is column 6. For variant 4 they are switched, presumably PLINK is coding column 5 as the minor allele?

Converting with seqBED2GDS():

> seqBED2GDS(bed.fn="CEU_Exon.bed", bim.fn="CEU_Exon.bim", fam.fn="CEU_Exon.fam", out.gdsfn="CEU_Exon.gds")
> seqSetFilter(gds, variant.sel=1:6)
> seqGetData(gds, "$ref")
[1] "C" "A" "A" "T" "C" "T"
> seqGetData(gds, "$alt")
[1] "T" "G" "G" "C" "G" "C"

REF and ALT are switched from the original VCF (except for variant 4).

The issue was prompted by this question on Bioconductor support: https://support.bioconductor.org/p/133249/

zhengxwen commented 3 years ago

SeqArray_1.28.0 has no such problem.

I modified seqBED2GDS() a lot in SeqArray_1.28.0, to add the parallel file conversion. I might have corrected this problem in this modification.

Best wishes,

Xiuwen